Subject: [xsl] Spelling Othello (Was: Re: [xsl] Text processing on XSLT 2.0)
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Tue, 5 Apr 2005 07:00:07 +1000

I didn't mention that the text I was spelling was the play:

"Othello" by William Shakespeare

On Apr 5, 2005 6:56 AM, Dimitre Novatchev <dnovatchev@xxxxxxxxx> wrote:
> On Apr 5, 2005 6:41 AM, M. David Peterson <m.david.x2x2x@xxxxxxxxx> wrote:
> > Working on projects such as XBiblio/Citeproc led by Bruce D'Arcus
> > I have realized that even as far as the XSLT 2.0 working draft goes in
> > regards to bringing Perl'esque type text processing to the XML
> > developer it is still up to the developer to fine-tune these
> > capabilities to cover their specific needs. For example, a spell
> > checker.
> >
> > Can anyone who may have extended experience in regards to the
> > development of such capabilities using XSLT share with the rest of us
> > your experience?
>
> Hi Mark,
>
> These days I had fun with an f:binSearch() function and then,
> logically, with f:spell().
>
> I have a dictionary of about 47000 English wordforms, on which I
> search with f:binSearch()
>
> I had to produce a faster fn than the current quadratic
> str-split-to-words template -- this is the f:getWords() function.
>
> All these functions can be downloaded from the FXSL CVS (just let me
> know if you'd want me to send you the zip archive).
>
> The combination of these functions works quite well.
>
> This transformation (test-FuncSpell.xsl):
>
> <xsl:stylesheet version="2.0"
>  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>  xmlns:xs="http://www.w3.org/2001/XMLSchema"
>  xmlns:f="http://fxsl.sf.net/"
>  exclude-result-prefixes="f xs"
>  >
>
>   <xsl:import href="../f/func-getWords.xsl"/>
>   <xsl:import href="../f/func-spell.xsl"/>
>
>   <xsl:output omit-xml-declaration="yes"/>
>
>   <xsl:variable name="vDelim" as="xs:string">
> ,:.-	 '!?;</xsl:variable>
>
>   <!-- To be applied on ../data/othello.xml -->
>   <xsl:template match="/">
>     <xsl:variable name="vwordNodes" as="element()*">
>       <xsl:for-each select="//text()/lower-case(.)">
>         <xsl:sequence select="f:getWords(., $vDelim, 1)"/>
>       </xsl:for-each>
>     </xsl:variable>
>
>     <xsl:variable name="vUnique" as="xs:string+">
>       <xsl:perform-sort select="distinct-values($vwordNodes)">
>         <xsl:sort select="."/>
>       </xsl:perform-sort>
>     </xsl:variable>
>
>     <xsl:variable name="vnotFound" as="xs:string*"
>       select="$vUnique[not(f:spell(.))]"/>
>
>     <xsl:value-of separator="&#xA;"
>       select="$vnotFound"/>
>
>     A total of <xsl:value-of select="count($vwordNodes)"/> words
>     were spelt, (<xsl:value-of select="count($vUnique)"/>) distinct.
>
>     <xsl:value-of select="count($vnotFound)"/> not found.
>   </xsl:template>
> </xsl:stylesheet>
>
> when applied on othello.xml (around 29000 words)
>
> produces this result:
>
> Saxon 8.3 from Saxonica
> Java version 1.5.0_01
> Stylesheet compilation time: 1140 milliseconds
> Processing file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml
> Building tree for
> file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml using
> class net.sf.saxon.tinytree.TinyBuilder
> Tree built in 94 milliseconds
> Tree size: 18539 nodes, 154557 characters, 0 attributes
> Building tree for file:/C:/CVS-DDN/fxsl-xslt2/f/func-getWords.xsl
> using class net.sf.saxon.tinytree.TinyBuilder
> Tree built in 0 milliseconds
> Tree size: 43 nodes, 143 characters, 22 attributes
> Building tree for file:/C:/CVS-DDN/fxsl-xslt2/data/dictEnglish.xml
> using class net.sf.saxon.tinytree.TinyBuilder
> Tree built in 188 milliseconds
> Tree size: 139140 nodes, 528397 characters, 0 attributes
> Execution time: 7015 milliseconds
>
> <a-list-of-567-unknown-words-omitted/>
>
> A total of 28622 words
> were spelt, (3669) distinct.
>
> 567 not found.
>
> So, checking 3669 distinct words in 7015 milliseconds makes
>
> 523.02 words/sec.
>
> The actual speed is faster, as the total time includes splitting up
> the words and finding the distinct words.
>
> Among the unknown words are such nice words as:
>
> affordeth
> affrighted
> ariseth
> arithmetician
> arrivance
> bethink
> betimes
> bewhored
>
> :o)
>
> Cheers,
>
> Dimitre
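
For readers who want to try the idea outside XSLT, the pipeline described in the message (split the text into lower-cased words, reduce them to sorted distinct words, then look each one up in a sorted dictionary by binary search) can be sketched in a few lines. This is a Python illustration under stated assumptions, not the FXSL code: the names get_words, bin_search, spell_report and words_per_second, the delimiter string, and the tiny dictionary are all hypothetical stand-ins for f:getWords(), f:binSearch(), f:spell(), $vDelim and dictEnglish.xml.

```python
import re
from bisect import bisect_left

# Hypothetical delimiter set, loosely modelled on $vDelim in the stylesheet.
DELIMS = " ,:.-\t\n'!?;"


def get_words(text, delims=DELIMS):
    """Split text into lower-cased words in a single linear pass."""
    pattern = "[" + re.escape(delims) + "]+"
    return [w for w in re.split(pattern, text.lower()) if w]


def bin_search(sorted_words, word):
    """Return True if word occurs in the sorted list (binary search)."""
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


def spell_report(text, dictionary):
    """Return (all words, sorted distinct words, distinct words not found)."""
    words = get_words(text)
    unique = sorted(set(words))
    sorted_dict = sorted(dictionary)  # binary search needs sorted input
    not_found = [w for w in unique if not bin_search(sorted_dict, w)]
    return words, unique, not_found


def words_per_second(n_words, millis):
    """Throughput for n_words processed in millis milliseconds."""
    return n_words / (millis / 1000.0)


if __name__ == "__main__":
    # Toy dictionary and text, invented for this example only.
    toy_dict = ["he", "moor", "othello", "the"]
    words, unique, not_found = spell_report(
        "Othello, the Moor: he ariseth betimes!", toy_dict
    )
    print(len(words), len(unique))                 # 6 6
    print(not_found)                               # ['ariseth', 'betimes']
    print(round(words_per_second(3669, 7015), 2))  # 523.02
```

The last line simply redoes the arithmetic quoted in the message: 3669 distinct words in 7015 milliseconds comes to about 523.02 words/sec. As the message notes, that figure includes the time for splitting and de-duplicating, so the lookup step alone is faster.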