Subject: RE: [xsl] Over 300 MB XML file and XSLT or XQuery
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 13 Jan 2005 15:45:01 -0000

> To Michael Kay. Performance is not an issue. I am very
> new to XQuery. I would like to get my hands dirty with
> XQuery to learn a new trick of the trade but would
> like to follow a technically correct approach to solve
> this kind of problem.
> Let's assume I have solved the big XML file problem and
> now given a text node, I need to search for this text
> in the tens of thousands of small xml or html files,
> generate stats like where it was found, how many times
> etc. and if not found generate meaningful logs. I can
> write Java classes if necessary.
>
> I would want to avoid converting small files into one
> large file. I was thinking about treating collection
> of all small files as an XML database and use Xquery.
>

In Saxon, if you use the doc() or document() function, then the file will be
loaded into memory, and will stay in memory until the end of the run, just in
case it's referenced again. So you will hit the same memory problem with lots
of small files as with one large file - worse, in fact, since there is a
significant per-document overhead.

However, there's a workaround: an extension function saxon:discard-document()
that causes a document to be discarded from memory by the garbage collector as
soon as there are no more references to it. So you should be able to do a
serial search of a large collection of documents something like this (let's
assume $uris is a sequence of strings holding the document URIs):

XQuery:

    for $u in $uris
    let $doc := saxon:discard-document(doc($u))
    return
      if (my:condition($doc))
      then <match uri="{$u}"/>
      else <no-match uri="{$u}"/>

XSLT 2.0:

    <xsl:for-each select="for $u in $uris return saxon:discard-document(doc($u))">
      <xsl:choose>
        <xsl:when test="my:condition(.)">
          <match uri="{document-uri(.)}"/>
        </xsl:when>
        <xsl:otherwise>
          <no-match uri="{document-uri(.)}"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>

There's no real difference between the XSLT and XQuery solutions; it's just a
different surface syntax.

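
One way the $uris sequence assumed above might be produced is with a small Java
helper that walks a directory tree and returns the matching files as URI
strings. The sketch below is only an illustration: the class name UriLister,
the method listUris, and the choice of .xml and .html suffixes are all
hypothetical, and how the method is bound as an extension function and how its
result is converted to a sequence of strings is left to Saxon's own
documentation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of Saxon): collects document URIs from a directory tree.
// Declare it public in its own file if it must be visible outside its package.
final class UriLister {

    private UriLister() {
        // static utility, no instances
    }

    // Returns the file: URIs of all .xml and .html files under rootDir, sorted.
    // The suffix filter is an assumption; adjust it to the files being searched.
    static List<String> listUris(String rootDir) throws IOException {
        // Files.walk holds directory handles open, so close the stream when done.
        try (Stream<Path> paths = Files.walk(Paths.get(rootDir))) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> {
                    String name = p.getFileName().toString().toLowerCase();
                    return name.endsWith(".xml") || name.endsWith(".html");
                })
                .map(p -> p.toUri().toString())
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

Only the URI strings are collected here, not the documents themselves, so the
list stays small even for tens of thousands of files; each document is still
loaded and discarded one at a time by the query or stylesheet.
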
If the files are in a directory structure, then you should be able to read the
directory structure directly by calling the relevant Java methods from your
XSLT or XQuery code.

See also:
http://www.saxonica.com/documentation/extensions/functions/discarddocument.html

Michael Kay
http://www.saxonica.com/