THE DISPATCHER PROCESSING PARADIGM
----------------------------------

1. Overview (and a short rant).

Event-reporting parsing interfaces such as SAX and Expat have the
shortcoming that a considerable amount of essentially similar state
maintenance code gets reproduced from each application to the next.
Any kind of context dependent processing requires state maintenance
until the context terminates; the usual "answer" for this recurring
need is to stuff a bunch of variables and stacks into the "handler
object" and hope mightily that the resulting code doesn't look too
much like spaghetti.

The common extension of the basic interface to a set of specific
routines keyed on element types may not suffice, again for the same
basic reason: a bunch of independent start_this and end_this and
start_that and end_that subroutines are not *related* to each other in
any essential way. This makes sense for "streamed" or "tag soup"
processing only - see a tag, do something, and forget - and offers
nothing for substantial processing of deep structures. (In fact, the
very circumstance that these subroutines have names based on munging
is a giveaway that it's all kludgery.)

2. Shift-Reduce Parsing as an analogy.

The surprising thing is that a relatively obvious fact seems to have
been systematically ignored: tree structures are the natural products
of shift-reduce parsing, so the natural parsing interface should be
one that duplicates this mechanism. An element start event is thus a
shift to a new parsing state, while an element end event is a reduce,
where the element is reduced to some semantic value *relevant to its
parent*, normally handled in "user code" before the goto action.

These three basic components - shift, reduce, and goto - are the
motivation for the Dispatcher paradigm, which is organized around
three basic callback triggers - #start, #end, and #cont - for a
protocol between the Dispatcher and element-specific event handlers.

  a. Corresponding to "states" that are "shift"-ed into, the
     Dispatcher maintains a stack of hashes. Each hash is a dispatch
     table, keying names (typically of element types) to handler
     subroutines for "start element" events.

  b. A start-element subroutine is expected to return a hash - the
     dispatch table for its own sub-context - which the Dispatcher
     will stack as the "current" table while this element is the
     immediately open element. This is the way in which each element
     type gets - in fact, is expected - to control its own subcontext.

  c. A start handler is expected to set its own end handler in the
     dispatch table it returns (this is part of the analogue of a user
     callback at the point of a "reduce"). The end handler is keyed by
     '#end': the initial '#' is to prevent clashes with the normal
     naming conventions for element types. Such an end handler is
     expected to return a semantic value, which the Dispatcher will
     undertake to pass to the parent.

  d. Since a "reduce" may require processing by the parent (which has
     become the current open element again), there is a need for a
     "child element finished" event - *this* piece is missing in other
     processing systems - and again, the start handler can provide a
     handler for this in its dispatch table. The key is '#cont', for
     "continuation". This handler will receive the semantic value
     returned by the #end handler of the immediately completed child
     element.

  e. A start handler can also set a specific handler for processing of
     character data in its subcontext (another very important context
     dependency that standard systems short-change with merely a
     common handler for *all* character data). The key is '#char'.

3. Advantages with Perl: using closures.

Perl closures offer an excellent means to capture local state while
subcontexts are processed, leading to a natural idiom, recursive or
embeddable, like this, where continuations and end-element handlers
are bound - by lexical scoping - to the *relevant* instance of a
start-element event:

    sub parent {
        my ($handler_object, %attributes) = @_ ;
        my $child_values = [] ;
        return {
            '#end' => sub {
                my ($handler_object, $parent_name) = @_ ;
                return $child_values ;
            },
            '#cont' => sub {
                my ($handler_object, $child_name, $child_value ) = @_ ;
                push @$child_values, $child_value ;
            },
            child_name => \&child
        }
    }

    sub child {
        my ($handler_object, %attributes) = @_ ;
        my $child_value = '' ;
        return {
            '#end' => sub {
                my ($handler_object, $child_name) = @_ ;
                return $child_value ;
            },
            '#char' => sub {
                my ($handler_object, $text) = @_ ;
                $child_value .= $text ;
            }
        }
    }

The crucial feature is that the Dispatcher will pass the $child_value
returned by the #end handler of the child as an argument of the #cont
handler of the parent.
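
To make the protocol concrete, here is a minimal sketch of how a
dispatcher of this kind could drive the parent/child handlers above.
It is not the actual Dispatcher code. The package name MiniDispatcher,
its methods start_element, characters and end_element, the '#root'
frame label, the choice to pass the dispatcher itself as the handler
object, the choice to ignore unknown elements, and the '#cont' entry
in the root table (used only to capture the final value) are all
assumptions made for illustration. Events are fed in by hand rather
than by a real parser.

```perl
use strict;
use warnings;

{
    package MiniDispatcher;    # hypothetical name, for illustration only

    sub new {
        my ($class, $root_table) = @_;
        # Each stack frame pairs an element name with its dispatch table.
        return bless { stack => [ [ '#root', $root_table ] ] }, $class;
    }

    # "shift": find the start handler in the current table, call it,
    # and stack the dispatch table it returns.
    sub start_element {
        my ($self, $name, %attributes) = @_;
        my $start = $self->{stack}[-1][1]{$name};
        # Assumption: an element with no start handler gets an empty table.
        my $table = $start ? $start->($self, %attributes) : {};
        push @{ $self->{stack} }, [ $name, $table || {} ];
    }

    # Character data goes to the '#char' handler of the open element, if any.
    sub characters {
        my ($self, $text) = @_;
        my $char = $self->{stack}[-1][1]{'#char'};
        $char->($self, $text) if $char;
    }

    # "reduce" and "goto": pop the table, call '#end' to obtain the
    # semantic value, then hand that value to the parent's '#cont'.
    sub end_element {
        my ($self) = @_;
        my ($name, $table) = @{ pop @{ $self->{stack} } };
        my $end   = $table->{'#end'};
        my $value = $end ? $end->($self, $name) : undef;
        my $cont  = $self->{stack}[-1][1]{'#cont'};
        $cont->($self, $name, $value) if $cont;
    }
}

# The handlers from the idiom above, repeated so the sketch runs alone.
sub parent {
    my ($handler_object, %attributes) = @_;
    my $child_values = [];
    return {
        '#end'  => sub { return $child_values },
        '#cont' => sub {
            my ($handler_object, $child_name, $child_value) = @_;
            push @$child_values, $child_value;
        },
        child_name => \&child,
    };
}

sub child {
    my ($handler_object, %attributes) = @_;
    my $child_value = '';
    return {
        '#end'  => sub { return $child_value },
        '#char' => sub {
            my ($handler_object, $text) = @_;
            $child_value .= $text;
        },
    };
}

# Root table: a '#cont' here receives the value of the top-level element.
my $seen;
my $root = {
    parent  => \&parent,
    '#cont' => sub { my ($d, $name, $value) = @_; $seen = $value },
};

# Simulated events for:
#   <parent><child_name>one</child_name><child_name>two</child_name></parent>
my $d = MiniDispatcher->new($root);
$d->start_element('parent');
$d->start_element('child_name');
$d->characters('one');
$d->end_element;
$d->start_element('child_name');
$d->characters('tw');
$d->characters('o');
$d->end_element;
$d->end_element;

print join(',', @$seen), "\n";    # prints: one,two
```

Each call to child creates a fresh $child_value, so the two child
elements accumulate their text separately, and no variable outside the
closures is needed to track which element is open.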