RE: Efficient Stylesheets for reordering

Subject: RE: Efficient Stylesheets for reordering
From: Jose Alberto Fernandez <JFernandez@xxxxxxxxxxx>
Date: Wed, 8 Nov 2000 13:54:33 -0800
> From: Jeni Tennison [mailto:mail@xxxxxxxxxxxxxxxx]
> 
> Jose,
> 
> > Am I completely wrong on my understanding of XSLT?
> 
> Almost ;) The big gap seems to be that you think that the XSLT
> processor is working on a stream of information that comes into it
> from the XML, and is writing out a stream of information concurrently.
> This is inaccurate.
> 
> What actually happens is that the XSLT processor parses the XML
> document and creates from it a tree representation of the document
> (the Document Object Model or DOM).  It stores this DOM in memory and
> operates over it to create another DOM (the result).  Then it
> (usually) serialises this DOM into an output file.
> 

Jeni, I understand the semantic model defined by XSLT, which is what you 
describe here. But as far as I understand, it is up to the implementations
to process the tree in some order, top-down in parallel, bottom-up, using a
DFS algorithm or whatever. 

It could be argued that streamming the transformation is nothing else than
being able to use a pure DFS algorith for the transformation. Granted, this
is
not always possible.

> One of the points about XSLT is that processors don't *have* to do all
> the processing sequentially: there is nothing in the language that
> prevents them from applying templates to s1 nodes, s2 nodes, s3 nodes
> and so on, all concurrently, and then constructing the result from
> that directly.
> 

I think one of the problems of XSLT is this idea that anything can fit in
memory
or "big documents are very rare, and can be very slow". In my case we need
to process very large documents which may grow in an order of magnitude by
the 
transformation. So you can imagine that the memory-in/memory-out paradigm
becomes a real problem. At the same time this is not a once a day batch file
kind of operation but it must happened concurrently with other requests of
the 
same type, which means memory utilization is even more important.

> Some processors allow asynchronous and partial transformation.  You
> could have a look at MSXSL (XSLProcessor) and Saxon (saxon:preview)
> which may be helpful.
> 

I would like to have some discussion, at the semantic level, on identifying
a subset of the XSLT language that can be process using a DFS approach.
Can a useful subset of the language be identified. By this I mean that,
unless indicated to be cached, once all template is applied to an element, 
such element node can be removed discarded.

Can engines use such analysis to optimize memory utilization?
Does XSLT provides enough information to be able to acomplish this.

> In your case, where you're just copying nodes across from one tree
> into a different structure, you can use xsl:copy-of rather than
> xsl:apply-templates:
> 
> <xsl:template match="doc1" >
   <doc2>
>   <xsl:copy-of select="s1[1]" />
>   <xsl:copy-of select="s3" />
>   <xsl:copy-of select="s4[1]" />
>   <xsl:copy-of select="s2[1]" />
>   <xsl:copy-of select="s5" />
   </doc2>
> </xsl:template>
> 

Well, the example was just to showcase the issues I am refering to. In
reality
each of the "s?" nodes will be further transformed.  

One of the things to notice here is that expressions like select="s3" do not

seem to be DFS friendly since you need to collect all "s3" nodes and hence
need to wait 
for all the "doc1" node to be in memory before you can continue to process
anything. 
In particular if there is no "s3" nodes you will not know even after finding

"s4" because XSLT has no idea of the input DTD. Which I think is one of the 
more STUPID things on the spec.

Now, assumming that we know the input DTD, can we write a better template
set
to obtain the expected result? 

   <!ELEMENT doc1 (s1, s2, s3*, s4?, s5+) >

Is there anything missing?

<xsl:template match="doc1" >
  <doc2>
  <xsl:apply-templates/>
  </doc2>
</xsl:template>

<xsl:template match="s1" >
  <xsl:copy-of select="." />
</xsl:template>

<xsl:template match="s2" > 
  <!-- variable 's2var' must have scope 'doc1' somehow -->
  <xsl:variable name="s2var" select="." />
</xsl:template>

<xsl:template match="s3" >
  <xsl:copy-of select="." />
</xsl:template>

<xsl:template match="s4" >
  <xsl:copy-of select="." />
</xsl:template>

<xsl:template match="s5[1]" >
  <xsl:copy-of select="$s2var" />
  <xsl:copy-of select="." />
</xsl:template>

<xsl:template match="s5[????]" > <!-- Some way to say position larger than 1
-->
  <xsl:copy-of select="$s2var" />
  <xsl:copy-of select="." />
</xsl:template>

I do not think it is possible to use s2var the way I am doing, 
but you can see that this set of templates can be easily process
in an streamming fashion. The only memory needed is that to store
the subtree "s2".

Now, just think that the actual XML document being process is a 'doc'
defined by:

  <!ELEMENT doc doc1* >

and the file contains 10 million records. This is the kind of issues people
would like to be able to solve using XSLT in a very efficient manner.

If XSLT is the wrong tool, are there any other XML transformation languages,

in the works, that may be better suited for this kind of work?

Jose Alberto


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


Current Thread