RE: [xsl] distinct-values() optimization, sorting by frequency

Subject: RE: [xsl] distinct-values() optimization, sorting by frequency
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Fri, 8 Feb 2008 14:48:28 -0000
In the alphabetical list,

count($persNames[normalize-space(lower-case(.)) = $current-name])

could be optimized by:

(a) using keys

(b) using Saxon-SA which will optimize it to use a key automatically

(c) using xsl:for-each-group rather than distinct-values(), though that will
require some restructuring of your code.
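
A minimal sketch of option (a), offered as an illustration rather than a tested drop-in. The key name persName-by-norm is made up for this example, and the TEI namespace URI is an assumption (the usual TEI P5 one), since the original post does not show its namespace declarations. The key indexes each persName once per document, so the count for each distinct value becomes a lookup instead of a scan of $persNames.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">
  <!-- Assumption: tei is bound to the TEI P5 namespace. -->

  <!-- Hypothetical key name: index every persName inside tei:text
       by the same normalized value used for distinct-values(). -->
  <xsl:key name="persName-by-norm"
           match="tei:text//tei:persName"
           use="normalize-space(lower-case(.))"/>

  <xsl:variable name="docs"
      select="collection('../../working/xml/files.xml')"/>

  <xsl:template name="main">
    <xsl:variable name="persNames" select="$docs//tei:text//tei:persName"/>
    <xsl:variable name="distinct-persNames"
        select="distinct-values($persNames/normalize-space(lower-case(.)))"/>
    <list type="unordered">
      <xsl:for-each select="$distinct-persNames">
        <xsl:sort select="."/>
        <xsl:variable name="current-name" select="."/>
        <!-- The context item here is a string, so key() is called
             once per document via $docs/key(...). -->
        <item>
          <xsl:value-of select="concat($current-name, '  --  ',
              count($docs/key('persName-by-norm', $current-name)))"/>
        </item>
      </xsl:for-each>
    </list>
  </xsl:template>
</xsl:stylesheet>
```

The $docs/key(...) step is there because the two-argument form of key() searches the document containing the context node, and inside a for-each over distinct-values() there is no context node.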

In the frequency-sorted list, I think for-each-group would definitely be
better:

<xsl:for-each-group select="$persNames" group-by="lower-case(.)">
  <xsl:sort select="count(current-group())"/>
  ...

(Note also that you could use a case-blind collation rather than
lower-case(), as discussed in another thread today.)
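
Spelled out a little further, the for-each-group approach might look like the sketch below. It is an illustration under assumptions, not tested against the original data: the TEI namespace URI is assumed, the grouping key keeps the normalize-space() call from the original question, and order="descending" is a guess that the most frequent names should come first. It also avoids a subtlety in the quoted sort key, where the "." inside the predicate refers to the persName being tested rather than to the distinct value.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">
  <!-- Assumption: tei is bound to the TEI P5 namespace. -->

  <xsl:variable name="docs"
      select="collection('../../working/xml/files.xml')"/>

  <xsl:template name="main">
    <xsl:variable name="persNames" select="$docs//tei:text//tei:persName"/>
    <div>
      <head>Frequency List</head>
      <list type="unordered">
        <!-- One pass over $persNames builds all the groups. -->
        <xsl:for-each-group select="$persNames"
            group-by="normalize-space(lower-case(.))">
          <!-- Assumption: most frequent first; ties sorted by name. -->
          <xsl:sort select="count(current-group())" order="descending"/>
          <xsl:sort select="current-grouping-key()"/>
          <item>
            <xsl:value-of select="concat(count(current-group()),
                '  --  ', current-grouping-key())"/>
          </item>
        </xsl:for-each-group>
      </list>
    </div>
  </xsl:template>
</xsl:stylesheet>
```

The count is taken from current-group(), so nothing is recomputed for the output. The alphabetical list can reuse the same structure with a single sort on current-grouping-key(). xsl:for-each-group also accepts a collation attribute, but collation URIs are processor-specific, so none is shown here.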

Michael Kay
http://www.saxonica.com/


 

> -----Original Message-----
> From: James Cummings [mailto:cummings.james@xxxxxxxxx] 
> Sent: 08 February 2008 14:28
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] distinct-values() optimization, sorting by frequency
> 
> Hiya,
> 
> I'm wondering the best way to optimize a distinct-values() 
> based transformation.  What I'm basically doing is:
> ======
> <xsl:variable name="docs"  
> select="collection('../../working/xml/files.xml')"/>
> 
> <xsl:template name="main" >
>  <xsl:variable name="persNames" 
> select="$docs//tei:text//tei:persName"/>
>  <xsl:variable name="norm-persNames"
> select="$persNames/normalize-space(lower-case(.))"/>
>  <xsl:variable name="distinct-persNames"
> select="distinct-values($norm-persNames)"/>
> <!-- I realize that I could be more specific on the
> $persNames variable, but doing so doesn't seem to affect
> speed much at all. -->
> <div type="main">
> 
> <!-- Some overall counts -->
> <div><head>Overall Counts</head>
> <list type="unordered">
>   <item>Number of <gi>persName</gi> elements total:
>     <xsl:value-of select="count($persNames)"/></item>
>   <item>Number of <gi>persName</gi> elements which have a  
> @key attribute total: <xsl:value-of 
> select="count($persNames[@key])"/></item>
> <item>Number of distinct-value <gi>persName</gi> elements total:
> <xsl:value-of select="count($distinct-persNames)"/></item>
> </list></div>
> 
> <!-- An Alphabetical List -->
> <div><head>Alphabetical List</head>
>   <list type="unordered">
>     <xsl:for-each select="$distinct-persNames">
>       <xsl:sort select="."/>
>       <xsl:variable name="current-name" select="."/>
>       <xsl:variable name="count-distinct-current-name"
>      select="count($persNames[normalize-space(lower-case(.)) 
> =$current-name])"/>
>       <item><xsl:value-of select="concat($current-name,
>           '  --  ', $count-distinct-current-name)"/></item>
>       </xsl:for-each>
>    </list>
> </div>
> 
> <!-- A Frequency Sorted List  -->
> <div>
>   <head>Frequency List</head>
>   <list type="unordered">
>     <xsl:for-each select="$distinct-persNames">
>       <xsl:sort 
> select="count($persNames[normalize-space(lower-case(.))
>         = .])"/>
> <!-- I think it is this sort statement which slows things 
> down, since I have to repeat it twice. -->
>       <xsl:variable name="current-name" select="."/>
>       <xsl:variable name="count-distinct-current-name"
>         select="count($persNames[normalize-space(lower-case(.))
>         = $current-name])"/>
>       <item><xsl:value-of select="concat($count-distinct-current-name,
>           '  --  ', $current-name)"/> </item>
>     </xsl:for-each>
>   </list>
> </div>
> </div>
> ======
> 
> I think the real slow-down comes in the second xsl:for-each
> where I want to sort by frequency of distinct-value by doing:
>
> <xsl:sort
> select="count($persNames[normalize-space(lower-case(.)) = .])"/>
>
> I have to have it for the sort, and then I have to
> re-do it for the output inside the <item> element.  I'm
> obviously not allowed a variable between the for-each and the
> sort... but I have a feeling I'm missing some clever
> optimization here.
> 
> Although this is for a pre-generated transformation, it 
> currently takes a *hugely* long time, and I'm thinking I must 
> be able to optimize it somehow.
> 
> Any suggestions appreciated,
> 
> -James
