Subject: RE: [xsl] Efficient recursive grouping in XSLT 1.0? From: "Michael Kay" <mhk@xxxxxxxxx> Date: Fri, 19 Mar 2004 00:34:09 -0000 |
Interesting problem. The thing that seems to make a dramatic difference to Saxon's speed on this transformation is to introduce a variable for the expression $parents/*, thus: <xsl:template name="subtree"> <xsl:param name="parents"/> <xsl:variable name="children" select="$parents/*"/> <xsl:for-each select="$children"> <xsl:variable name="current-group" select="$children[name() = name(current())]"/> <xsl:if test="count($current-group[1] | .) = 1"> The surprising thing in this stylesheet is that the number of recursive calls is not actually very high - it's the same as the number of elements output to the result tree, which doesn't increase greatly as the input file grows. But the size of $parents is linear with the source file size, which means that calculating $parents/* within the for-each loop is really bad news. I imagine that your rearrangement of the code somehow enabled MSXML to spot that it could move the subexpression out of the for-each loop; why it should do so on the rearranged code and not on the original is a mystery. Saxon is currently doing this kind of optimization at the XPath (and XQuery) level but not at the XSLT level. Michael Kay # -----Original Message----- # From: owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx # [mailto:owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx] On Behalf Of Ian Young # Sent: 18 March 2004 22:30 # To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx # Subject: [xsl] Efficient recursive grouping in XSLT 1.0? # # I have some fairly sizeable XML files (about 5MB - 15MB) that # I want to extract the element name structure from. By that, I # mean that I'll feed an arbitrary XML document in one end, and # get out an XML document that contains one element for every # distinct list of name() values in the ancestor-or-self axis, # retains the position in document order of the first # occurrence. I'd also like a count for each element output. # For the moment, I'm only interested in the elements: # attributes and text content are not important. # # For example, if I have this input document: # <root> # <a> # <b/><c/><d/><b/> # </a> # <b> # <a> # <a/> # </a> # </b> # <a> # <a/> # <b> # <a/> # <d/> # <a/> # <a/> # </b> # </a> # </root> # # I would want output that looks like this: # <root ct="1"> # <a ct="2"> # <b ct="3"> # <a ct="3"/> # <d ct="1"/> # </b> # <c ct="1"/> # <d ct="1"/> # <a ct="1"/> # </a> # <b ct="1"> # <a ct="1"> # <a ct="1"/> # </a> # </b> # </root> # # In XSLT 2.0 this can be readily achieved by using # for-each-group recursively. With Saxon 7.9 this stylesheet # will generate a result for one of my 7MB files (400k # elements) in 2 or 3 seconds (mostly parsing the input # document) to produce the output of about 1k elements. # # <xsl:stylesheet version="2.0" # xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> # <xsl:output method="xml" version="1.0" encoding="UTF-8" # indent="yes"/> # # <xsl:template match="/"> # <xsl:call-template name="subtree"> # <xsl:with-param name="parents" select="/"/> # </xsl:call-template> # </xsl:template> # # <xsl:template name="subtree"> # <xsl:param name="parents"/> # <xsl:for-each-group select="$parents/*" group-by="name()"> # <xsl:copy> # <xsl:attribute name="ct"> # <xsl:value-of select="count(current-group())"/> # </xsl:attribute> # <xsl:call-template name="subtree"> # <xsl:with-param name="parents" select="current-group()"/> # </xsl:call-template> # </xsl:copy> # </xsl:for-each-group> # </xsl:template> # </xsl:stylesheet> # # This is great, except that the system I want this to run in # is using libxml and libxslt (consequently, XSLT 1.0). # Now, transforming this into an XSLT 1.0 stylesheet can be # done by just replacing the for-each-group with a for-each and # checking if the context node is the first in $parents/* with # that name: # # <xsl:stylesheet version="1.0" # xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> # <xsl:output method="xml" version="1.0" encoding="UTF-8" # indent="yes"/> # # <xsl:template match="/"> # <xsl:call-template name="subtree"> # <xsl:with-param name="parents" select="/"/> # </xsl:call-template> # </xsl:template> # # <xsl:template name="subtree"> # <xsl:param name="parents"/> # <xsl:for-each select="$parents/*"> # <xsl:variable name="current-group" # select="$parents/*[name() = name(current())]"/> # <xsl:if test="count($current-group[1] | .) = 1"> # <xsl:copy> # <xsl:attribute name="ct"> # <xsl:value-of select="count($current-group)"/> # </xsl:attribute> # <xsl:call-template name="subtree"> # <xsl:with-param name="parents" select="$current-group"/> # </xsl:call-template> # </xsl:copy> # </xsl:if> # </xsl:for-each> # </xsl:template> # </xsl:stylesheet> # # But the performance of this on large files is hopeless. I # tried a smaller 1.3MB, 75k elements input file that generates # a 5kB, 125 element output. msxml took 2 minutes (memory usage # displaying odd sawtooth pattern # -- GC?), Saxon 7.9 over 5 minutes and libxslt 1.0.4 I gave up # on after 10 minutes. # # [Aside: I wondered whether it might be better to expanded # $current-group in the test and only created the variable # within the if element. The rationale being that the variable # might be created strictly whereas the combined xpath # expression might only be evaluated up to the first element. # # ... # <xsl:if test="count(($parents/*[name() = # name(current())])[1] | .) = 1"> # <xsl:variable name="current-group" # select="$parents/*[name() = name(current())]"/> # ... # # For msxsl makes an enormous difference: it's now faster than # Saxon 7 running the XSLT 2.0 stylesheet -- that's for the 7MB # file! How does _that_ work?! It doesn't make any great # difference to Saxon (or libxslt?).] # # Am I right that there's no way of using Muenchian grouping # for this because there's no possible XPath 1.0 expression for # the list of ancestors' node names? # # Are there any other approaches with XSLT 1 that might get a # decent performance with libxslt? # # One thing I thought of is to do the transform in two steps: # the first generates output isomorphic to the input elements, # but annotates each with its ancestor name list (as a string); # the second traverses that file using Muenchian grouping. This # works -- albeit with rather variable performance # -- in msxml. For the 1.3MB file step 2 takes 71 seconds in # Saxon and 75 in libxslt (and 1 second in msxml). This confuses me! # # TODO: Saxon gets java.lang.OutOfMemoryError on the second # stage reading the output of stage 1 for larger files. How do # I increase Java's memory limit? # # <!-- step 1 --> # <xsl:stylesheet version="1.0" # xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> # <xsl:output method="xml" version="1.0" encoding="UTF-8" # indent="yes"/> # # <xsl:template match="*"> # <xsl:param name="parent-path" select="''"/> # <xsl:variable name="p" select="concat($parent-path,'/',name())"/> # <xsl:copy> # <xsl:attribute name="p"> # <xsl:value-of select="$p"/> # </xsl:attribute> # <xsl:apply-templates> # <xsl:with-param name="parent-path" select="$p"/> # </xsl:apply-templates> # </xsl:copy> # </xsl:template> # # <xsl:template match="text()"/> # </xsl:stylesheet> # # <!-- step 2 --> # <xsl:stylesheet version="1.0" # xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> # <xsl:output method="xml" version="1.0" encoding="UTF-8" # indent="yes"/> # # <xsl:key name="p" match="*" use="@p"/> # # <xsl:template match="*"> # <xsl:if test="generate-id() = generate-id(key('p', @p)[1])"> # <xsl:copy> # <xsl:attribute name="ct"> # <xsl:value-of select="count(key('p', @p))"/> # </xsl:attribute> # <xsl:for-each select="key('p', @p)"> # <xsl:apply-templates/> # </xsl:for-each> # </xsl:copy> # </xsl:if> # </xsl:template> # # <xsl:template match="text()"/> # </xsl:stylesheet> # # Any comments, pointers, code suggestions gratefully received! # # Cheers, # # Ian. # # XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list # # XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
Current Thread |
---|
|
<- Previous | Index | Next -> |
---|---|---|
[xsl] Efficient recursive grouping , Ian Young | Thread | RE: [xsl] Efficient recursive group, Ian Young |
RE: [xsl] For-each iteration proble, Pieter Reint Siegers | Date | RE: [xsl] Efficient recursive group, Andreas L. Delmelle |
Month |