Re: [xsl] Design of XML so that it may be efficiently stream-processed

From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Fri, 22 Nov 2013 11:40:51 +0000
Firstly, I question the premise that XML should be designed to enable streamed
transformation. One could equally well argue that you should design it so it
doesn't need to be transformed at all. Transformation is only necessary
because the data isn't in the form you want it; designing it so that it can
easily be transformed into the form you want it seems a little odd. Unless
perhaps you are thinking of designing the intermediate formats in a processing
pipeline.

>
> 1. Use lots of attributes. Store in them the data needed for processing the
> node.

Certainly for data that can conveniently be represented as attributes, this
will make streamed processing easier. But don't overdo it.
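
To illustrate why attributes help, here is a minimal sketch in Python rather than XSLT. It uses the standard library's iterparse as a stand-in for a streaming processor, and the element and attribute names are invented for the example. The point is that attributes are already known when the start tag is seen, whereas the content of child elements only becomes available later.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: 'id' and 'currency' are attributes, 'amount' is a child element.
DOC = (b'<orders>'
       b'<order id="1" currency="GBP"><amount>10</amount></order>'
       b'<order id="2" currency="EUR"><amount>20</amount></order>'
       b'</orders>')

def attributes_at_start(data):
    """Collect (id, currency) as soon as each <order> start tag is seen."""
    seen = []
    for event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
        if elem.tag == "order":
            # Attributes are complete at the start event;
            # child content is not guaranteed to have been read yet.
            seen.append((elem.get("id"), elem.get("currency")))
    return seen

result = attributes_at_start(DOC)
print(result)
```

Anything that depends on the amount child would have to wait for a later event, which is the extra bookkeeping that attributes avoid.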
>
> 2. Have one child element only.

No, if there are two things that should naturally be represented as child
elements, then represent them that way. There are plenty of techniques still
available for streamed processing: accumulators, xsl:iterate, fold-left(),
xsl:fork.
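
As a rough analogy only (this is Python, not XSLT, and the element names are made up), the sketch below shows the general idea behind accumulator-style processing: several child elements per record are fine as long as each record is folded into a running value and then discarded.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical data: each <item> has two child elements, <qty> and <price>.
DOC = (b"<items>"
       b"<item><qty>2</qty><price>1.50</price></item>"
       b"<item><qty>3</qty><price>2.00</price></item>"
       b"</items>")

def total_value(data):
    """Sum qty * price over all items in a single forward pass."""
    total = 0.0
    for event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "item":
            # Both children have been read by the time the item's end tag arrives.
            total += int(elem.findtext("qty")) * float(elem.findtext("price"))
            elem.clear()  # drop the item's children once folded into the total
    return total

total = total_value(DOC)
print(total)
```

Only one item's worth of data is needed at a time, so the number of child elements per record is not what limits streaming.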
>
>
> So, to enable efficient stream processing, design XML like this:
>
> <root a="..." b="..." c="...">
>      <node d="..." e="..." f="...">
>            <node g="..." h="..." i="...">
>                  <node j="..." k="..." l="...">
>                        <node m="..." n="..." o="...">
>                              <node p="..." q="..." r="...">
>                                  ...
>                             </node>
>                        </node>
>                  </node>
>            </node>
>      </node>
> </root>
>
> This results in a massively deep tree. For Gigabyte-sized XML files, the
> nesting could be a billion levels deep (or more).
>
No, such a design is completely bizarre and defeats the whole purpose of
streaming, which is to reduce memory use.
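
A small Python sketch of the problem, assuming a SAX-style parser as a stand-in for a streaming processor (the class and function names are hypothetical). Any such processor has to remember which elements are currently open, so the state it holds grows with nesting depth, not with the number of siblings.

```python
import xml.sax

class DepthTracker(xml.sax.ContentHandler):
    """Records the largest number of simultaneously open elements."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.max_open = 0

    def startElement(self, name, attrs):
        self.open_elements.append(name)
        self.max_open = max(self.max_open, len(self.open_elements))

    def endElement(self, name):
        self.open_elements.pop()

def max_open_elements(data):
    handler = DepthTracker()
    xml.sax.parseString(data, handler)
    return handler.max_open

N = 500  # kept small for the example
nested = ("<node>" * N + "</node>" * N).encode()      # one chain, N levels deep
flat = ("<root>" + "<node/>" * N + "</root>").encode()  # N siblings under one root

nested_depth = max_open_elements(nested)
flat_depth = max_open_elements(flat)
print(nested_depth, flat_depth)
```

With the nested layout the stack of open elements is as large as the document itself; with the flat layout it stays at two entries however many nodes there are.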

I would add some more important design criteria. Put metadata and reference
information (stuff that's needed for reference throughout document processing)
at the start of the document rather than the end, or in a separate document.
Use hierarchic nesting for relationships rather than id/idref style pointers
(even perhaps if it means holding the data redundantly).
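
A minimal Python sketch of the first of these criteria, with invented element and attribute names: when the reference data (here a table of exchange rates) comes first, every later record can be processed and discarded the moment it is read.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document: the <rates> reference data precedes the <order> records.
DOC = (b'<catalog>'
       b'<rates><rate currency="EUR" toGBP="0.5"/></rates>'
       b'<order currency="EUR" amount="10"/>'
       b'<order currency="EUR" amount="4"/>'
       b'</catalog>')

def totals_in_gbp(data):
    """Convert each order to GBP in one forward pass."""
    rates = {}   # small lookup table, retained for the whole pass
    out = []
    for event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "rate":
            rates[elem.get("currency")] = float(elem.get("toGBP"))
        elif elem.tag == "order":
            # The rate is already known because <rates> came first.
            out.append(float(elem.get("amount")) * rates[elem.get("currency")])
            elem.clear()
    return out

converted = totals_in_gbp(DOC)
print(converted)
```

If the rates block sat at the end of the document instead, the lookup in this sketch would fail on the first order, and the only fix would be to buffer every order until the rates arrived.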

Michael Kay
Saxonica
