Grouping problem with large files in .Net

Subject: Grouping problem with large files in .Net
From: "Frederik Willaert" <f.w@xxxxxxxxxxx>
Date: Mon, 7 Jun 2004 01:46:35 +0200 (Romance Daylight Time)
Hi,
 
I have a problem with grouping large record-style XML documents using the
.Net XslTransform class.
 
My source document has the following structure:
 
<REPORT>
    <ROW>
        <CUSTOMER>XXX</CUSTOMER>
        <ACCOUNT>YYY</ACCOUNT>
        <HOURNUMBER>1</HOURNUMBER>
        <VALUE1>...</VALUE1>
        <VALUE2>...</VALUE2>
        <VALUE3>...</VALUE3>
        <!-- ... -->
    </ROW>
    <ROW>
            <!-- ... -->
    </ROW>
    <!-- ... -->
</REPORT>
 
 
The stylesheet I'm executing is the following:
 
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:key name="rows-by-customer" match="/REPORT/ROW" use="CUSTOMER"/>
<xsl:key name="rows-by-customer-and-account" match="/REPORT/ROW"
         use="concat(CUSTOMER,'+',ACCOUNT)"/>
<xsl:template match="/REPORT">
    <Report>
        <xsl:for-each select="ROW[generate-id() = generate-id(key('rows-by-customer', CUSTOMER)[1])]">
            <xsl:variable name="customer" select="CUSTOMER" />
            <Customer Name="{$customer}">
                <xsl:for-each select="key('rows-by-customer', $customer)[generate-id() =
                        generate-id(key('rows-by-customer-and-account', concat(CUSTOMER,'+',ACCOUNT))[1])]">
                    <xsl:variable name="account" select="ACCOUNT" />
                    <Account Name="{$account}">
                        <xsl:for-each select="key('rows-by-customer-and-account',
                                concat(CUSTOMER,'+',$account))">
                            <xsl:copy-of select="." />
                        </xsl:for-each>
                    </Account>
                </xsl:for-each>
            </Customer>
        </xsl:for-each>
    </Report>
</xsl:template>
</xsl:stylesheet>
 
This performs a two-level grouping: by Customer, then by Account.
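
For reference, here is a rough sketch of the output shape this stylesheet should produce. It assumes an input containing a single customer XXX with a single account YYY; these are the placeholder values from the sample structure above, not real data.

```xml
<Report>
    <Customer Name="XXX">
        <Account Name="YYY">
            <!-- every ROW of this customer/account pair, copied unchanged -->
            <ROW>
                <CUSTOMER>XXX</CUSTOMER>
                <ACCOUNT>YYY</ACCOUNT>
                <HOURNUMBER>1</HOURNUMBER>
                <!-- ... -->
            </ROW>
            <!-- ... -->
        </Account>
        <!-- one Account element per distinct ACCOUNT of this customer -->
    </Customer>
    <!-- one Customer element per distinct CUSTOMER -->
</Report>
```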
 
The source document can contain several tens of thousands of rows.
 
 
=> When performing this transformation using MSXML, performance is very
acceptable: less than 1 minute for a file with 60000 records.
=> However, the same transformation using .Net (1.1) XslTransform seems to
take forever - I haven't been able to have it processed completely so far...
Unfortunately, .Net is the intended platform.
 
==> Am I doing something wrong, is this a known problem, and/or can
something be done about this?
 
Remarks:
- I have also tried with the count(. | key('rows-by-customer', CUSTOMER)[1])
= 1 approach, same problem.
- I've found a document on MSDN mentioning that the xsl:key implementation
had a performance problem. However, this seems to apply to .Net v1.0 (?)
- Following recommendations, I'm using XPathDocument for the input file, and
a stream for the output - or would there be better options?
- I've included the source code for the transformation, and the timings of
several transformations (using MSXSL and XslTransform) below.
 
Any help would be greatly appreciated...
 
Thanks in advance,
Frederik
 
*****************
C# code to do transformation:
 
string folder = @"D:\Test\grouping\";
string inputUri = folder + "FlatInput.xml";
string stylesheet1uri = folder + "FlatInput2Grouped.xslt";
 
string outputUri = folder + "groupedOutput_XslTransform.xml";
 
DateTime beforeStart = DateTime.Now;
DateTime afterLoadingInput, afterLoadingStylesheet, afterTransform;
using(FileStream output = new FileStream(outputUri,FileMode.Create,
FileAccess.Write,FileShare.Read))
{
XPathDocument inputDocument = new XPathDocument(inputUri);
afterLoadingInput = DateTime.Now;
 
XslTransform transform = new XslTransform();
 
transform.Load(
new XPathDocument(stylesheet1uri), 
null,
this.GetType().Assembly.Evidence);
afterLoadingStylesheet = DateTime.Now;
 
transform.Transform(inputDocument,null,output,null);
afterTransform = DateTime.Now;
}
 
******************
Timings:
 
MSXSL:
 
groupedOutput_verysmall_msxsl.xml (approx. 48 records)
---------------------------------
Source document load time: 27.68 milliseconds
Stylesheet document load time: 1.810 milliseconds
Stylesheet compile time: 1.266 milliseconds
Stylesheet execution time: 6.178 milliseconds
 
groupedOutput_small_msxsl.xml (144 records)
-----------------------------
Source document load time: 45.77 milliseconds
Stylesheet document load time: 2.145 milliseconds
Stylesheet compile time: 1.297 milliseconds
Stylesheet execution time: 48.66 milliseconds
 
groupedOutput_medium_msxsl.xml (approx. 10000 records)
------------------------------
Source document load time: 1507 milliseconds
Stylesheet document load time: 11.85 milliseconds
Stylesheet compile time: .648 milliseconds
Stylesheet execution time: 1634 milliseconds
 
groupedOutput_msxsl.xml (approx. 60000 records, 30MB file size)
-----------------------
Source document load time: 11276 milliseconds
Stylesheet document load time: 3.053 milliseconds
Stylesheet compile time: .652 milliseconds
Stylesheet execution time: 40403 milliseconds
 
============
 
XSLTRANSFORM:
(timings of second transformation, to exclude JIT compilation time)
 
groupedOutput_verysmall_XslTransform.xml (48 records)
----------------------------------------
Source document load time: 30 milliseconds
Stylesheet document load time: 10 milliseconds
Stylesheet execution time: 130 milliseconds
 
groupedOutput_small_XslTransform.xml (144 records)
------------------------------------
Source document load time: 50 milliseconds
Stylesheet document load time: 10 milliseconds
Stylesheet execution time: 270 milliseconds
 
groupedOutput_medium_XslTransform.xml (approx. 10000 records)
-------------------------------------
[SEVERAL HOURS]
 
groupedOutput_XslTransform.xml (approx. 60000 records, 30MB file size)
------------------------------
[FOREVER ?]
