Re: [xsl] Applying Streaming To DITA Processing: Looking for Guidance

Subject: Re: [xsl] Applying Streaming To DITA Processing: Looking for Guidance
From: "Eliot Kimber ekimber@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 9 Oct 2014 16:17:05 -0000
I think this is Jirka's most interesting statement:

<quote>
I don't know how large your documentation set is, but I would be
surprised if it couldn't fit into memory (who would read it then? :-).
Streaming is generally useful when it's impossible to load documents
into memory -- which on current machines means processing gigabytes
large XML files.
</quote>

There are DITA documents that comprise essentially the entire
documentation set for a complex product or set of related products and
these are processed as a unit, e.g., to produce a documentation Web site
or online help set or similar. Documents with tens of thousands of topics
are not uncommon. Typically a topic is about 2-4 K characters, so not big
individually, but large in the aggregate -- though still unlikely to be gigabytes of
source. The current OT PDF process does merge the map and topics into a
single XML instance and I suppose that could be quite large, but like
Jirka said, who would read that? (As a baseline, the DITA spec itself,
which includes both the architectural spec and the language reference
consists of about 3200 topics measuring 135 MB on disk and producing a
1000+ page PDF document. That's large but not extreme by DITA standards.)

Using the DITA Open Toolkit, these documents can easily consume 2+ GB of
RAM, which I realize is not that large generally but can be large for
individual DITA users trying to run these processes on their personal
laptops.

So while the typical case is probably not too extreme, there is the
potential for very large scale at the upper end. I want to make sure I've
considered that case appropriately and at least know how it could be
addressed even if doing so proactively is not warranted.


In general I agree that memory consumption should not be a primary concern
at normal scales, and that's one reason that I implemented the DITA
process as described: with all the topics in memory, many things
become much easier to do. The original DITA Open Toolkit approach,
reflected in the current base code, is to do some map-based preprocessing
but then process each topic as a separate XSLT process. This minimizes
memory usage but makes certain types of processing (e.g., numbering across
a publication) difficult or impossible.
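
For example (a rough sketch, not actual Open Toolkit code, and with the
merged-publication markup assumed), publication-wide figure numbering
becomes almost trivial once all the topics are in one in-memory document:

  <xsl:template match="fig/title">
    <title>
      <!-- level="any" counts every fig that precedes this one anywhere in
           the merged publication, which per-topic processing cannot see. -->
      <xsl:text>Figure </xsl:text>
      <xsl:number count="fig" level="any"/>
      <xsl:text>. </xsl:text>
      <xsl:apply-templates/>
    </title>
  </xsl:template>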

Thus my questions about the potential benefits of streaming are in part to
determine when it would actually make sense relative to the potential
implementation cost. One wants one's code to be as efficient as it needs
to be, without over-optimizing at the cost of code simplicity or ease of
extension.

Jirka's feedback suggests that trying to apply streaming is probably not
that compelling, especially as long as the output-generation phase cannot
itself be streamed.
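
For the record, the streamable first (data-collection) phase Jirka describes
would look roughly like this in XSLT 3.0 -- an untested sketch, with the
merged-document element names and the auxiliary format assumed rather than
what the Open Toolkit actually produces:

  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                  version="3.0">

    <!-- Streamed pass over the merged publication: keep only the data
         needed later (topic ids and titles) in a small auxiliary doc. -->
    <xsl:mode streamable="yes" on-no-match="shallow-skip"/>

    <xsl:template match="/">
      <collected-data>
        <xsl:apply-templates/>
      </collected-data>
    </xsl:template>

    <xsl:template match="topic/title">
      <!-- ../@id is motionless (ancestor attributes stay available while
           streaming); string(.) consumes the title element exactly once. -->
      <topic-entry id="{../@id}" title="{string(.)}"/>
    </xsl:template>

  </xsl:stylesheet>

The second phase could then stay un-streamed and resolve cross-references
with an xsl:key over that small auxiliary document, e.g.
key('topic-by-id', $target-id, doc('collected-data.xml')) -- key name,
variable, and file name all illustrative.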

Cheers,

Eliot

Eliot Kimber, Owner
Contrext, LLC
http://contrext.com




On 10/9/14, 10:47 AM, "Jirka Kosek jirka@xxxxxxxx"
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

>On 9.10.2014 16:16, Eliot Kimber ekimber@xxxxxxxxxxxx wrote:
>> Can streaming help, either with overall processing efficiency or
>> with memory usage?
>
>Yes, the typical motivation for streaming is saving memory
>consumption; in your case it's very unlikely that you can gain any
>performance benefits.
>
>> Where would I go today or in the near future to gain the
>> understanding of streaming required to answer these questions
>> (other than the XSLT 3 spec itself, obviously)?
>
>There were several talks and papers presented in past years both at
>XML Prague and Balisage conferences. For example:
>
>https://www.youtube.com/watch?v=OeSQ4ompB1g&index=6&list=PLQpqh98e9RgXPGvJaNsE3b1Sqncz6MGvr
>
>https://www.youtube.com/watch?v=kzGZvh-FbNw&list=PLQpqh98e9RgXPGvJaNsE3b1Sqncz6MGvr&index=7
>
>If there is enough interest I can try to organize a streaming workshop
>or something like that as part of XML Prague 2015 (http://xmlprague.cz).
>
>> Because my data collection process is copying data to a new result,
>> I'm pretty sure it's inherently streamable: I'm just processing
>> documents in an order determined by a normal depth-first tree walk
>> of the map structure (a hierarchy of hyperlinks to topics) and
>> grabbing relevant data (e.g., division titles, figure titles, index
>> entries, etc.). If this was all I was doing, then for sure
>> streaming would help memory usage.
>>
>> But because I must then process each topic again to generate the
>> final result, and that process is not directly streamable, would
>> streaming the first phase help overall?
>
>You can split your transformation into two steps -- the first will be
>streamable and the second will not. Compared to the current situation you
>will save around 50% of the memory.
>
>> Taken a step further: are there implementation techniques I could
>> apply in order to make the second phase streamable (e.g.,
>> collecting the information needed to render cross references
>> without having to fetch the target elements) and could I expect
>> that to then provide enough performance improvement to justify the
>> implementation cost?
>
>You can do this. You can process the "compiled grand-source document" in
>streaming mode and make lookups in a smaller document with
>cross-referencing data in non-streaming mode.
>
>> The current code is both mature and relatively naive in its
>> implementation. Reworking it to be streamable could entail a
>> significant refactoring (maybe, that's part of what I'm trying to
>> determine).
>>
>> The actual data processing cost is more or less fixed, so unless
>> streaming makes the XSLT operations faster, I wouldn't expect
>> streaming by itself to reduce processing time.
>
>It's very unlikely that a streaming rewrite will make your code faster.
>Of course lookups in a small cross-ref auxiliary file will be faster
>than in a large document, but if you use keys today, it shouldn't make a
>very big difference.
>
>> However, the primary concern in this use case is memory usage:
>> currently, memory required is proportional to the number of topics
>> in the publication, whereas it could be limited to simply the
>> largest topic plus the size of the collected data itself (which is
>> obviously much smaller than the size of the topics as it includes
>> the minimum data needed to enable numbering and such).
>
>I don't know how large your documentation set is, but I would be
>surprised if it couldn't fit into memory (who would read it then? :-).
>Streaming is generally useful when it's impossible to load documents
>into memory -- which on current machines means processing gigabytes
>large XML files.
>
>					Jirka
>
>
>--
>------------------------------------------------------------------
>  Jirka Kosek      e-mail: jirka@xxxxxxxx      http://xmlguru.cz
>------------------------------------------------------------------
>       Professional XML consulting and training services
>  DocBook customization, custom XSLT/XSL-FO document processing
>------------------------------------------------------------------
> OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
>------------------------------------------------------------------
>    Bringing you XML Prague conference    http://xmlprague.cz
>------------------------------------------------------------------
