[xsl] Spelling Othello (Was: Re: [xsl] Text processing on XSLT 2.0)

Subject: [xsl] Spelling Othello (Was: Re: [xsl] Text processing on XSLT 2.0)
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Tue, 5 Apr 2005 07:00:07 +1000
I didn't mention that the text I was spell-checking was the play:

  "Othello"

by William Shakespeare

On Apr 5, 2005 6:56 AM, Dimitre Novatchev <dnovatchev@xxxxxxxxx> wrote:
> On Apr 5, 2005 6:41 AM, M. David Peterson <m.david.x2x2x@xxxxxxxxx> wrote:
> > Working on projects such as XBiblio/Citeproc, led by Bruce D'Arcus,
> > I have realized that however far the XSLT 2.0 working draft goes in
> > bringing Perl-esque text processing to the XML developer, it is
> > still up to the developer to fine-tune these capabilities to cover
> > their specific needs.  For example, a spell checker.
> >
> > Can anyone with extensive experience developing such capabilities
> > in XSLT share that experience with the rest of us?
>
> Hi Mark,
>
> Lately I have been having fun with an f:binSearch() function and
> then, logically, with f:spell().
>
> I have a dictionary of about 47000 English wordforms, which I
> search with f:binSearch().
>
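
For readers without the FXSL sources at hand, here is a minimal sketch
of what a binary search over a sorted word list can look like in XSLT
2.0. It is not FXSL's f:binSearch(), whose signature is not shown in
this thread. The my: namespace, the function name my:in-sorted and the
four-word stand-in dictionary are all hypothetical, and the collation
URI assumes a processor that implements the final XSLT 2.0
Recommendation.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Codepoint collation; the word list must be sorted the same way. -->
  <xsl:variable name="vCodepoint" as="xs:string"
    select="'http://www.w3.org/2005/xpath-functions/collation/codepoint'"/>

  <!-- Hypothetical helper (NOT FXSL's f:binSearch): true if $pWord
       occurs in the ascending sequence $pWords somewhere between
       positions $pLow and $pHigh, both inclusive. -->
  <xsl:function name="my:in-sorted" as="xs:boolean">
    <xsl:param name="pWords" as="xs:string*"/>
    <xsl:param name="pWord"  as="xs:string"/>
    <xsl:param name="pLow"   as="xs:integer"/>
    <xsl:param name="pHigh"  as="xs:integer"/>
    <xsl:choose>
      <!-- Empty range: the word is not in the list. -->
      <xsl:when test="$pLow gt $pHigh">
        <xsl:sequence select="false()"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="vMid" as="xs:integer"
          select="($pLow + $pHigh) idiv 2"/>
        <!-- compare() yields -1, 0 or 1. -->
        <xsl:variable name="vCmp" as="xs:integer"
          select="compare($pWord, $pWords[$vMid], $vCodepoint)"/>
        <!-- Recurse into the half that can still hold the word. -->
        <xsl:sequence select="
          if ($vCmp eq 0) then true()
          else if ($vCmp lt 0)
            then my:in-sorted($pWords, $pWord, $pLow, $vMid - 1)
            else my:in-sorted($pWords, $pWord, $vMid + 1, $pHigh)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <!-- Runs against any input document; the input itself is ignored. -->
  <xsl:template match="/">
    <!-- Tiny stand-in dictionary, already in ascending order. -->
    <xsl:variable name="vDict" as="xs:string+"
      select="('bethink', 'betimes', 'moor', 'venice')"/>
    <xsl:value-of
      select="my:in-sorted($vDict, 'moor', 1, count($vDict))"/>
  </xsl:template>
</xsl:stylesheet>
```

Each call halves the remaining range, so a list of about 47000
wordforms needs at most about 16 comparisons per lookup, and the
recursion stays correspondingly shallow.
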
> I had to produce a faster function than the current quadratic
> str-split-to-words template -- this is the f:getWords() function.
>
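
As a point of comparison only, word splitting can also be sketched with
the built-in tokenize() function of XPath 2.0, which hands the scanning
to the processor's regex engine. This is not how f:getWords() is
implemented (its source is not shown in this thread), and unlike
f:getWords() as used in the stylesheet, it returns strings rather than
elements. The my: namespace, my:words and $vDelimRegex are hypothetical
names; the character class covers roughly the delimiters listed in
$vDelim in the stylesheet, plus the space.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- One or more delimiter characters; \s covers space, tab, LF, CR.
       The trailing + keeps the pattern from matching the empty string,
       which tokenize() does not allow. -->
  <xsl:variable name="vDelimRegex"
    as="xs:string">[\s,:.\-'!?;]+</xsl:variable>

  <!-- Hypothetical helper (NOT FXSL's f:getWords): lower-cases the
       text, splits it on the delimiters, and drops the empty tokens
       that leading or trailing delimiters leave behind. -->
  <xsl:function name="my:words" as="xs:string*">
    <xsl:param name="pText" as="xs:string"/>
    <xsl:sequence
      select="tokenize(lower-case($pText), $vDelimRegex)[. ne '']"/>
  </xsl:function>

  <!-- Runs against any input document; the input itself is ignored. -->
  <xsl:template match="/">
    <xsl:value-of separator="|"
      select="my:words('O, beware, my lord, of jealousy!')"/>
  </xsl:template>
</xsl:stylesheet>
```

For the sample sentence this yields the six tokens o, beware, my, lord,
of and jealousy, joined by "|".
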
> All these functions can be downloaded from the FXSL CVS (just let me
> know if you'd want me to send you the zip archive).
>
> The combination of these functions works quite well.
>
> This transformation (test-FuncSpell.xsl):
>
> <xsl:stylesheet version="2.0"
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> xmlns:xs="http://www.w3.org/2001/XMLSchema"
> xmlns:f="http://fxsl.sf.net/"
> exclude-result-prefixes="f xs"
> >
>  <xsl:import href="../f/func-getWords.xsl"/>
>  <xsl:import href="../f/func-spell.xsl"/>
>
>  <xsl:output omit-xml-declaration="yes"/>
>
> <xsl:variable name="vDelim" as="xs:string">
> ,:.-&#9;&#10;&#13;'!?;</xsl:variable>
>
> <!-- To be applied on ../data/othello.xml -->
>  <xsl:template match="/">
>    <xsl:variable name="vwordNodes" as="element()*">
>       <xsl:for-each select="//text()/lower-case(.)">
>         <xsl:sequence select="f:getWords(., $vDelim, 1)"/>
>       </xsl:for-each>
>    </xsl:variable>
>
>    <xsl:variable name="vUnique" as="xs:string+">
>      <xsl:perform-sort select="distinct-values($vwordNodes)">
>        <xsl:sort select="."/>
>      </xsl:perform-sort>
>    </xsl:variable>
>
>    <xsl:variable name="vnotFound" as="xs:string*"
>     select="$vUnique[not(f:spell(.))]"/>
>
>    <xsl:value-of separator="&#xA;"
>     select="$vnotFound"/>
>
>    A total of <xsl:value-of select="count($vwordNodes)"/> words
>    were spelt, (<xsl:value-of select="count($vUnique)"/>) distinct.
>
>    <xsl:value-of select="count($vnotFound)"/> not found.
> </xsl:template>
> </xsl:stylesheet>
>
> when applied on othello.xml (around 29000 words)
>
> produces this result:
>
> Saxon 8.3 from Saxonica
> Java version 1.5.0_01
> Stylesheet compilation time: 1140 milliseconds
> Processing file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml
> Building tree for
> file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml using
> class net.sf.saxon.tinytree.TinyBuilder
> Tree built in 94 milliseconds
> Tree size: 18539 nodes, 154557 characters, 0 attributes
> Building tree for file:/C:/CVS-DDN/fxsl-xslt2/f/func-getWords.xsl
> using class net.sf.saxon.tinytree.TinyBuilder
> Tree built in 0 milliseconds
> Tree size: 43 nodes, 143 characters, 22 attributes
> Building tree for file:/C:/CVS-DDN/fxsl-xslt2/data/dictEnglish.xml
> using class net.sf.saxon.tinytree.TinyBuilder
> Tree built in 188 milliseconds
> Tree size: 139140 nodes, 528397 characters, 0 attributes
> Execution time: 7015 milliseconds
>
> <a-list-of-567-unknown-words-omitted/>
>
>    A total of 28622 words
>    were spelt, (3669) distinct.
>
>    567 not found.
>
> So, checking 3669 distinct words in 7015 milliseconds makes
>
>  523.02 words/sec.
>
> The actual speed is faster, as the total time includes splitting up
> the words and finding the distinct words.
>
> Among the unknown words are such nice words as:
>
> affordeth
> affrighted
> ariseth
> arithmetician
> arrivance
> bethink
> betimes
> bewhored
>
> :o)
>
> Cheers,
>
> Dimitre
