Subject: FYI: YML: A Grand Unification of SAX and DOM? (fwd) From: "Clark C. Evans" <clark.evans@xxxxxxxxxxxxxxxxxxxx> Date: Fri, 3 Dec 1999 02:16:20 -0500 (EST) |
Hello everyone, I'd like to carry on a sub-discussion of SML called YML on the SML list. It is of particular importance to the interaction between XML and XSL. Clark ---------- Forwarded message ---------- Date: Fri, 3 Dec 1999 02:13:25 -0500 (EST) From: Clark C. Evans <clark.evans@xxxxxxxxxxxxxxxxxxxx> To: sml-dev@xxxxxxxxxxx Subject: YML: A Grand Unification of SAX and DOM? Paul, I didn't want to speak up a few days ago -- claiming that I was going to do a grand unfication of SAX and DOM (even though this was exactly what I was thinking may be the case). On Thu, 2 Dec 1999, Paul Tchistopolskii wrote: > > Isn't it the end of long discussion of Elements vs Attributes? > > Now when I see the question: "Should I use attributes or > > elements?" - I know the answer: > > > > "If you want it to be processed by current APIs not keeping > > the entire docuemnt in the memory - use attributes everywhere > > you can." Exactly... It is a matter of what type of access you would like to have when processing the information -- you have two choices: Random Access (DOM) or Sequential Access (SAX) The trick, however, is subtle. You don't want _all_ random access nor do you want _all_ sequential access. You want a ballance. And this is where the binary doubly recursive pattern comes in. On Thu, 2 Dec 1999, G. Ken Holman wrote: > I posit that expressing properties of hierarchical components as > sub-elements of ancestry does not work well in information design > for the following reasons: > > (1) - I claim that the information in <b> has no (and should have no) > direct relation to the information in <e>, but that the information in > attr1= may or may not have direct relation to the information in <e> > because of descendent scope (<e> is a descendant of <a> but not of <b> > so how could <b>'s "influence" be construed as impacting on <e>?) > > (2) - when I am processing <e> (say with XSLT and XPath) I can very > easily determine properties of <e> by examining ascendent places in > the hierarchy (<a> is an ascendant of <e> therefore <a> and its > attributes are easily addressed via the ancestor:: axis) > - if I didn't have attributes and I was obliged to use sub-elements, > the extra processing involved to examine all child elements of all > ancestors for possible applicable properties would be both unwieldy > and ambiguous This is a great summary. The last day I jotted down a rough sketch of the Idea I've had running thorugh my head the last few days. It is posted at http://clarkevans.com/yml.html With a text version below. It's *far* from perfect, but I wanted to get the idea out as a cohesive unit so we can work on it as a community. I'd like to hear what you all think... Clark --------------------------------- YML - The Why Markup Language ---------------------------------- Authors: Clark Evans History: Version .1, 03-DEC-1999 Summary YML is currently an assembly of thoughts regarding the creation of a doubly recursive markup language and parser description. YML is an extension of the simple markup language ("SML"), which is a strict subset of the extensible markup language ("XML"). Further, YML is a unification of the XML document object model ("DOM") and the simple application programming interface for XML ("SAX"). Motivation YML was motivated from two reoccurring debates on the XML list, under the titles "SAX vs DOM" and "element vs attribute". It is interesting how they are interwoven. The SAX vs DOM debate often centers around which is better for processing information: random access method (RAM) or a sequential access method (SAM). Those from the DOM camp state that having the entire document in memory makes things easy to program; while those from the SAX camp point to efficiencies of stream processing. The element vs attribute debate is concerned with the distinction between an element's content and its attributes. One camp believes that the difference reflects an clear contextual binary decomposition, while the other camp views the distinction as syntax sugar. These debates are subtly linked since SAX provides an accepted, de-facto interpretation -- included with sequential access for each element, is a random access collection of its attributes. This interpretation is of huge value, as it ties the element vs attribute debate to a more tangeable processing concern, sequential access or random access. SAX is of interest for one other reason. It does not notify the processor individually for each attribute -- instead, it waits until it has collected each and every attribute before providing them as a single collection. This is in sharp contrast to its treatment of elements, which are handled individually, one by one. Another Motivation: Transformation Languages The real value in XML isn't just data representation or ease of parsing, it is the promise of a transformation language expressed in XML itself. The XML style sheet language ("XSL") is one approach to markup language transformations. XSL is the composite of many wonderful constructs, lessened by a few particularly bad restrictions. The delightful recursive template matching system is XSL's claim to fame. XSL is a collection of such templates, where a match clause identifies an expression which will trigger particular elements (and not attributes) to be processed according to the rules provided. These 'xpath' expressions define multiple axis. The ancestor axis is the most important, it is the current element stack. Of secondary importance is the attribute axis, which allows access to an element's attributes. These axis together allow for a very powerful way to identify and process elements. One disturbing aspect of XSL is the inclusion of the forward and previous axes in this xpath expression syntax. Furthermore, loop constructs and the ability to re-visit elements was also included in the language. For an XSL processor to reasonably support these features, random access to the information is a requirement. This is a problem for large-size information sets or low-memory processing devices. There are a few individuals who are contemplating a stream based alternative to XSL which will work without these large memory restrictions. Assuming that SAX was the underlying basis for such a processor; the only items available in random-access memory at any given time would be an element stack, including each element's attributes. This is hardly good enough to be efficient. An extension to a minimal processor build on top of SAX, could enable those elements on the "previous" axis -- as long as they are mentioned somewhere in the stylesheet. A smart collector could identify an element which must be used later, pinning it for random access in the future. Unfortunately, this method would not work well with dynamically generated xpath expressions. There are many other concerns as well, such as how to accomplish sorting, repeat performances, and other clear benifits that the random access brings. However, so far, there has not been a clear approach. Direction The goal of YML is to be a building block upon which an alternative to XSL can be built. One which is more space efficient than XSL, yet one does not sacrifice time as a pure stream based alternative appears to do. To accomplish this, memory must be managed differently at the parser level; thus a new parser description ("PD") must be provided -- one that ballances the constraints of SAX with the power of DOM. And, to accomplish this, the syntax of the markup language ("ML") itself must be substantially altered. Strictly speaking, the ML could easily be an XML extension, however, the data model presented here would be too hard to grok with all of XML's subtleties. These are serious changes, however, if it is possible to unify SAX and DOM, and perhaps enable the generation of a better transformation processor, it may be worth it. Background Consider these included by reference: http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Nov-1999/1120.html http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Nov-1999/1136.html http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Nov-1999/1205.html http://www.egroups.com/group/sml-dev/31.html http://www.egroups.com/group/sml-dev/89.html Development Consider an enhanced SAX parser with an element stack enabling the new XSL processor random access to the entire ancestor and attribute axis, with sequential access otherwise. Consider further: <root r="x" > <s1/> <s2/> </root> Here, both sequential access nodes s1 and s2 have random access to the node r. However, these nodes cannot access each other since they are provided sequentially: When the s1 is visited, s2 has not yet been provided. Also, when s2 is visited, s1 has already been dropped from memory. Note, that this is recursive: <root r="x"> <s sr="a"> <ss/> </s> </root> Here it is clear that the node ss can access node s, sr, and r. So far so good. Lets enumerate the possible node types: s, r, sr, ss, ssr, sss, sssr, ... Notice that given the current XML syntax, and this processing model, random access nodes with children are not allowed. In other words, a given xpath may only consist of sequentially accessed nodes followed by an optional, random access tail node. Meat It is shown below how a change in XML's syntax to permit recursion on the attribute axis would allow a parser to be built having random access nodes allowing children. It is hypothesized that this syntax change would allow a construction of a parser that could be used in lieu of both DOM and SAX, giving random access or sequential access in a context sensitive manner, as a function of the source information. It is further hypothesized that this parser could be used to build a processor that has most of XSL's advantages without sacrificing performance. The Change Consider the following syntax (due to John Cowan): <root <r <rr/> > <rs/> </r> > <s <sr/> > <ss/> </s> </root> With this change, it is possible to generate all of the possible node types: r s rr rs sr ss rrr rrs rsr rss srr srs ssr sss ... This may not be the prettiest syntax; however, XML becomes a sub-set of this new syntax -- where the following definition is used for backwards compatability. <el att="val" /> <=> <el <att>val</att> /> And perhaps allowing the following syntax sugar is used for nested attributes (due to Colas Nahaboo): <el att=<ch>val</ch> /> <=> <el <att <ch>val</ch>/>/> Further, there should be no problem for XML parsers to enable the recognition of the new syntax since the above are since neither of the above expressions are well-formed. Thus, this is the basis for a completely different parser behavior that alternates between random access or serial access depending upon the type of node which is encountered, according to something like the following: interface yml-node { boolean is-random(); } interface yml-branch extends yml-node { String name(); } interface yml-leaf extends yml-node { String text(); } interface yml-stack { yml-node current(); yml-stack parent(); // list of random children int count(); yml-stack child(int i); // private yml-stack(yml-stack parent, yml-node current); add( yml-stack child ); complete(); } interface yml-output { void push( yml-stack element ); void leaf( yml-stack element ); void pop( yml-stack element ); } interface yml-input { // to be defined } void yml-process( yml-input in, yml-output out, yml-stack stack, boolean is-random ) { if(in.peek-is-leaf()) { yml-stack top = new yml-stack( stack , new yml-leaf(is-random,in.next()); if(is-random) stack.add(top); else out.leaf(top); return; } // it's a branch yml-stack top = new yml-stack( stack, new yml-branch(is-random, in.next()); while(in.inside-the-tag()) yml-process(in,out,top,true); top.complete(); if(!is-random) out.push(top); while(in.outside-the-tag()) yml-process(in,out,top,false); if(is-random) stack.add(top); else out.pop(top); return; } Thus, if the entire output uses the "random recursion" extreme, with only a single node (the top one) being sequential, then this method looks very much like DOM. In the other extreme, if "sequential recursion" is used, with an occasional attribute, then this method looks very much like SAX. However, if the input stream is a unique mixture then the result is surprising: the parser configures its memory usage subject to the structure of the information being processed. Thus, a unified parser is created. For an transformation processor built on top of this type of parser, it motivates an additional 'random' axis. Define a sequential node as one visited by the procesor's interface, and is dropped from memory afterwords. Define a random node as one not visited by the processor's interface, but made available through a random access method. Access of random nodes by sequential nodes is provided by the following rules: (a) Sequential nodes may reference its or any of its sequential ancestors's random siblings. (b) If a sequential node may reference a random node, then it may also reference any random children of the random node. Furthermore, if XML syntax compliance is absolutely needed -- the attribution notion could be used to mark random nodes: <root> <r random-access="yes"> <rr random-access="yes"/> <rs/> <r> <s> <sr random-access="yes" /> <ss/> </s> </root> Alternatively, the distinction between sequential or random nodes could be detailed in a DTD or some other schema document. However, I feel all of these are kluges and that John Cowan's syntax is the best expression of the idea. So the syntax becomes a bit more complicated... maybye. I say make the lower level a bit more complicated... to simplify everything else. It may be a ways off, however, I believe that this binary recursive method provides an novel and unexpected approach to making information processors more efficient. Best Wishes, Clark Evans Credits Too many to mention, first the xml-dev and sgml-dev and xsl lists; filled with smart people. Second, to Dan Palanza for introducing me to binary recursive models. Further, to the huge amount of philosophical, and technical literature out there regarding programming and systems theory that has shaped the manner in which I approach problems. Thanks! XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
Current Thread |
---|
|
<- Previous | Index | Next -> |
---|---|---|
RE: Xpath expression: foo[-bar], Kay Michael | Thread | SAXON 5.0 is available, Kay Michael |
IE 5.5: XSLT and XPath?, Jelks Cabaniss | Date | Re: New twist: eliminating nodes wi, Clark C. Evans |
Month |