RE: [xsl] Efficient recursive grouping in XSLT 1.0?

Subject: RE: [xsl] Efficient recursive grouping in XSLT 1.0?
From: "Michael Kay" <mhk@xxxxxxxxx>
Date: Fri, 19 Mar 2004 00:34:09 -0000
Interesting problem.

The thing that seems to make a dramatic difference to Saxon's speed on this
transformation is to introduce a variable for the expression $parents/*,
thus:

  <xsl:template name="subtree">
    <xsl:param name="parents"/>
    <xsl:variable name="children" select="$parents/*"/>
    <xsl:for-each select="$children">
      <xsl:variable name="current-group"
          select="$children[name() = name(current())]"/>
      <xsl:if test="count($current-group[1] | .) = 1">

The surprising thing in this stylesheet is that the number of recursive
calls is not actually very high - it's the same as the number of elements
output to the result tree, which doesn't increase greatly as the input file
grows. But the size of $parents is linear with the source file size, which
means that calculating $parents/* within the for-each loop is really bad
news. I imagine that your rearrangement of the code somehow enabled MSXML to
spot that it could move the subexpression out of the for-each loop; why it
should do so on the rearranged code and not on the original is a mystery.
Saxon is currently doing this kind of optimization at the XPath (and XQuery)
level but not at the XSLT level.

Michael Kay

# -----Original Message-----
# From: owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx 
# [mailto:owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx] On Behalf Of Ian Young
# Sent: 18 March 2004 22:30
# To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
# Subject: [xsl] Efficient recursive grouping in XSLT 1.0?
# 
# I have some fairly sizeable XML files (about 5MB - 15MB) that 
# I want to extract the element name structure from. By that, I 
# mean that I'll feed an arbitrary XML document in one end, and 
# get out an XML document that contains one element for every 
# distinct list of name() values in the ancestor-or-self axis, 
# retains the position in document order of the first 
# occurrence. I'd also like a count for each element output.
# For the moment, I'm only interested in the elements: 
# attributes and text content are not important.
# 
# For example, if I have this input document:
# <root>
#   <a>
#     <b/><c/><d/><b/>
#   </a>
#   <b>
#     <a>
#       <a/>
#     </a>
#   </b>
#   <a>
#     <a/>
#     <b>
#       <a/>
#       <d/>
#       <a/>
#       <a/>
#     </b>
#   </a>
# </root>
# 
# I would want output that looks like this:
# <root ct="1">
#   <a ct="2">
#     <b ct="3">
#       <a ct="3"/>
#       <d ct="1"/>
#     </b>
#     <c ct="1"/>
#     <d ct="1"/>
#     <a ct="1"/>
#   </a>
#   <b ct="1">
#     <a ct="1">
#       <a ct="1"/>
#     </a>
#   </b>
# </root>
# 
# In XSLT 2.0 this can be readily achieved by using 
# for-each-group recursively. With Saxon 7.9 this stylesheet 
# will generate a result for one of my 7MB files (400k 
# elements) in 2 or 3 seconds (mostly parsing the input
# document) to produce the output of about 1k elements.
# 
# <xsl:stylesheet version="2.0"
#     xmlns:xsl="http://www.w3.org/1999/XSL/Transform";>
#   <xsl:output method="xml" version="1.0" encoding="UTF-8" 
# indent="yes"/>
# 
#   <xsl:template match="/">
#     <xsl:call-template name="subtree">
#       <xsl:with-param name="parents" select="/"/>
#     </xsl:call-template>
#   </xsl:template>
#   
#   <xsl:template name="subtree">
#     <xsl:param name="parents"/>
#     <xsl:for-each-group select="$parents/*" group-by="name()">
#       <xsl:copy>
#         <xsl:attribute name="ct">
#           <xsl:value-of select="count(current-group())"/>
#         </xsl:attribute>
#         <xsl:call-template name="subtree">
#           <xsl:with-param name="parents" select="current-group()"/>
#         </xsl:call-template>
#       </xsl:copy>
#     </xsl:for-each-group>
#   </xsl:template>
# </xsl:stylesheet>
# 
# This is great, except that the system I want this to run in 
# is using libxml and libxslt (consequently, XSLT 1.0).
# Now, transforming this into an XSLT 1.0 stylesheet can be 
# done by just replacing the for-each-group with a for-each and 
# checking if the context node is the first in $parents/* with 
# that name:
# 
# <xsl:stylesheet version="1.0"
#     xmlns:xsl="http://www.w3.org/1999/XSL/Transform";>
#   <xsl:output method="xml" version="1.0" encoding="UTF-8" 
# indent="yes"/>
# 
#   <xsl:template match="/">
#     <xsl:call-template name="subtree">
#       <xsl:with-param name="parents" select="/"/>
#     </xsl:call-template>
#   </xsl:template>
#   
#   <xsl:template name="subtree">
#     <xsl:param name="parents"/>
#     <xsl:for-each select="$parents/*">
#       <xsl:variable name="current-group"
#           select="$parents/*[name() = name(current())]"/>
#       <xsl:if test="count($current-group[1] | .) = 1">
#         <xsl:copy>
#           <xsl:attribute name="ct">
#             <xsl:value-of select="count($current-group)"/>
#           </xsl:attribute>
#           <xsl:call-template name="subtree">
#             <xsl:with-param name="parents" select="$current-group"/>
#           </xsl:call-template>
#         </xsl:copy>
#       </xsl:if>
#     </xsl:for-each>
#   </xsl:template>
# </xsl:stylesheet>
# 
# But the performance of this on large files is hopeless. I 
# tried a smaller 1.3MB, 75k elements input file that generates 
# a 5kB, 125 element output. msxml took 2 minutes (memory usage 
# displaying odd sawtooth pattern
# -- GC?), Saxon 7.9 over 5 minutes and libxslt 1.0.4 I gave up 
# on after 10 minutes.
# 
# [Aside: I wondered whether it might be better to expanded 
# $current-group in the test and only created the variable 
# within the if element. The rationale being that the variable 
# might be created strictly whereas the combined xpath 
# expression might only be evaluated up to the first element.
# 
#   ...
#   <xsl:if test="count(($parents/*[name() = 
# name(current())])[1] | .) = 1">
#     <xsl:variable name="current-group"
#         select="$parents/*[name() = name(current())]"/>
#     ...
# 
# For msxsl makes an enormous difference: it's now faster than 
# Saxon 7 running the XSLT 2.0 stylesheet -- that's for the 7MB 
# file!  How does _that_ work?! It doesn't make any great 
# difference to Saxon (or libxslt?).]
# 
# Am I right that there's no way of using Muenchian grouping 
# for this because there's no possible XPath 1.0 expression for 
# the list of ancestors' node names?
# 
# Are there any other approaches with XSLT 1 that might get a 
# decent performance with libxslt?
# 
# One thing I thought of is to do the transform in two steps: 
# the first generates output isomorphic to the input elements, 
# but annotates each with its ancestor name list (as a string); 
# the second traverses that file using Muenchian grouping. This 
# works -- albeit with rather variable performance
# -- in msxml. For the 1.3MB file step 2 takes 71 seconds in 
# Saxon and 75 in libxslt (and 1 second in msxml). This confuses me!
# 
# TODO: Saxon gets java.lang.OutOfMemoryError on the second 
# stage reading the output of stage 1 for larger files. How do 
# I increase Java's memory limit?
# 
# <!-- step 1 -->
# <xsl:stylesheet version="1.0"
#     xmlns:xsl="http://www.w3.org/1999/XSL/Transform";>
#   <xsl:output method="xml" version="1.0" encoding="UTF-8" 
# indent="yes"/>
# 
#   <xsl:template match="*">
#     <xsl:param name="parent-path" select="''"/>
#     <xsl:variable name="p" select="concat($parent-path,'/',name())"/>
#     <xsl:copy>
#       <xsl:attribute name="p">
#         <xsl:value-of select="$p"/>
#       </xsl:attribute>
#       <xsl:apply-templates>
#         <xsl:with-param name="parent-path" select="$p"/>
#       </xsl:apply-templates>
#     </xsl:copy>
#   </xsl:template>
# 
#   <xsl:template match="text()"/>
# </xsl:stylesheet>
# 
# <!-- step 2 -->
# <xsl:stylesheet version="1.0"
#     xmlns:xsl="http://www.w3.org/1999/XSL/Transform";>
#   <xsl:output method="xml" version="1.0" encoding="UTF-8" 
# indent="yes"/>
# 
#   <xsl:key name="p" match="*" use="@p"/>
# 
#   <xsl:template match="*">
#     <xsl:if test="generate-id() = generate-id(key('p', @p)[1])">
#       <xsl:copy>
#         <xsl:attribute name="ct">
#           <xsl:value-of select="count(key('p', @p))"/>
#         </xsl:attribute>
#         <xsl:for-each select="key('p', @p)">
#           <xsl:apply-templates/>
#         </xsl:for-each>
#       </xsl:copy>
#     </xsl:if>
#   </xsl:template>
# 
#   <xsl:template match="text()"/>
# </xsl:stylesheet>
# 
# Any comments, pointers, code suggestions gratefully received!
# 
# Cheers,
# 
# Ian.
# 
#  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
# 
# 


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


Current Thread