Re: [xsl] Applying Streaming To DITA Processing: Looking for Guidance

Subject: Re: [xsl] Applying Streaming To DITA Processing: Looking for Guidance
From: "Jirka Kosek jirka@xxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 9 Oct 2014 15:47:11 -0000

On 9.10.2014 16:16, Eliot Kimber ekimber@xxxxxxxxxxxx wrote:
> Can streaming help, either with overall processing efficiency or
> with memory usage?

Yes. The typical motivation for streaming is to reduce memory
consumption; in your case it's very unlikely that you will gain any
performance benefit.

> Where would I go today or in the near future to gain the
> understanding of streaming required to answer these questions
> (other than the XSLT 3 spec itself, obviously)?

Several talks and papers on streaming have been presented in past
years at both the XML Prague and Balisage conferences. For example:

https://www.youtube.com/watch?v=OeSQ4ompB1g&index=6&list=PLQpqh98e9RgXPGvJaNsE3b1Sqncz6MGvr

https://www.youtube.com/watch?v=kzGZvh-FbNw&list=PLQpqh98e9RgXPGvJaNsE3b1Sqncz6MGvr&index=7

If there is enough interest, I can try to organize a streaming workshop
or something similar as part of XML Prague 2015 (http://xmlprague.cz).

> Because my data collection process is copying data to a new result,
> I'm pretty sure it's inherently streamable: I'm just processing
> documents in an order determined by a normal depth-first tree walk
> of the map structure (a hierarchy of hyperlinks to topics) and
> grabbing relevant data (e.g., division titles, figure titles, index
> entries, etc.). If this was all I was doing, then for sure
> streaming would help memory usage.
> 
> But because I must then process each topic again to generate the
> final result, and that process is not directly streamable, would
> streaming the first phase help overall?

You can split your transformation into two steps -- the first will be
streamable and the second will not. Compared to the current situation,
you will save around 50% of the memory.
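
A minimal sketch of what such a streamable first step might look like,
assuming a processor that implements the XSLT 3.0 streaming feature.
The element and attribute names (topicref, title, @href, @id) are
simplified DITA names, the output vocabulary (collected-data, topic)
is hypothetical, and @href is assumed to carry no fragment identifier.
Real DITA code would match on @class instead. The small map is read as
an ordinary tree, while each topic is only streamed through:

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The map is small, so it is processed as a normal in-memory tree. -->
  <xsl:mode on-no-match="shallow-skip"/>

  <!-- Topics are processed in a streamable mode; nodes without a
       matching template are skipped. -->
  <xsl:mode name="collect" streamable="yes" on-no-match="shallow-skip"/>

  <xsl:template match="/">
    <collected-data>
      <xsl:apply-templates/>
    </collected-data>
  </xsl:template>

  <!-- Depth-first walk of the map: stream each referenced topic,
       then continue with nested topicrefs. -->
  <xsl:template match="topicref[@href]">
    <topic href="{@href}">
      <xsl:source-document streamable="yes"
          href="{resolve-uri(@href, base-uri(.))}">
        <xsl:apply-templates select="." mode="collect"/>
      </xsl:source-document>
    </topic>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Copy only the title text, plus the id of the enclosing element.
       Attributes of ancestors remain accessible while streaming. -->
  <xsl:template match="title" mode="collect">
    <title parent-id="{../@id}">
      <xsl:value-of select="."/>
    </title>
  </xsl:template>

</xsl:stylesheet>
```

With this shape, no topic has to be held in memory as a complete tree
during the first step; only the collected data is kept.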

> Taken a step further: are there implementation techniques I could
> apply in order to make the second phase streamable (e.g.,
> collecting the information needed to render cross references
> without having to fetch the target elements) and could I expect
> that to then provide enough performance improvement to justify the
> implementation cost?

You can do this. You can process the "compiled grand-source document"
in streaming mode and make lookups in a smaller document holding the
cross-referencing data in non-streaming mode.
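
A rough sketch of that combination, again assuming an XSLT 3.0
streaming processor. The auxiliary file name (xref-data.xml), its
structure (target elements with @id and @title), and the key name are
all hypothetical; the xref/@href form "file#id" is a simplification:

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The large main document is streamed and copied by default. -->
  <xsl:mode streamable="yes" on-no-match="shallow-copy"/>

  <!-- Hypothetical auxiliary file written by an earlier pass, e.g.
       <targets><target id="fig-1" title="..."/></targets>.
       It is loaded with doc(), so it is an ordinary in-memory tree. -->
  <xsl:param name="xref-data-uri" select="'xref-data.xml'"/>
  <xsl:variable name="xref-data" select="doc($xref-data-uri)"/>

  <xsl:key name="target-by-id" match="target" use="@id"/>

  <!-- Only the attributes of the streamed xref are read; the link
       text comes from the in-memory auxiliary document, selected via
       the third argument of key(). -->
  <xsl:template match="xref">
    <xsl:variable name="target-id" select="substring-after(@href, '#')"/>
    <xref href="{@href}">
      <xsl:value-of
          select="key('target-by-id', $target-id, $xref-data)/@title"/>
    </xref>
  </xsl:template>

</xsl:stylesheet>
```

The point of the design is that the target elements themselves are
never fetched; everything needed to render a cross reference has to be
present in the small auxiliary document.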

> The current code is both mature and relatively naive in its
> implementation. Reworking it to be streamable could entail a
> significant refactoring (maybe, that's part of what I'm trying to
> determine).
> 
> The actual data processing cost is more or less fixed, so unless
> streaming makes the XSLT operations faster, I wouldn't expect
> streaming by itself to reduce processing time.

It's very unlikely that a streaming rewrite will make your code faster.
Of course, lookups in a small auxiliary cross-reference file will be
faster than lookups in a large document, but if you use keys today, the
difference shouldn't be very big.

> However, the primary concern in this use case is memory usage:
> currently, memory required is proportional to the number of topics
> in the publication, whereas it could be limited to simply the
> largest topic plus the size of the collected data itself (which is
> obviously much smaller than the size of the topics as it includes
> the minimum data needed to enable numbering and such).

I don't know how large your documentation set is, but I would be
surprised if it couldn't fit into memory (who would read it then? :-).
Streaming is generally useful when it's impossible to load documents
into memory -- which on current machines means processing XML files
that are gigabytes in size.

					Jirka


-- 
------------------------------------------------------------------
  Jirka Kosek      e-mail: jirka@xxxxxxxx      http://xmlguru.cz
------------------------------------------------------------------
       Professional XML consulting and training services
  DocBook customization, custom XSLT/XSL-FO document processing
------------------------------------------------------------------
 OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
------------------------------------------------------------------
    Bringing you XML Prague conference    http://xmlprague.cz
------------------------------------------------------------------
