Re: [xsl] Spelling Othello (Was: Re: [xsl] Text processing on XSLT 2.0)

Subject: Re: [xsl] Spelling Othello (Was: Re: [xsl] Text processing on XSLT 2.0)
From: "M. David Peterson" <m.david.x2x2x@xxxxxxxxx>
Date: Mon, 4 Apr 2005 15:10:37 -0600
Well, I think that about covers it... FXSL it is then :)

Please see http://www.xsltblog.com/archives/2005/04/my_reaction_a_r.html
for a slightly extended reaction...

Thank you Dimitre!!!  As always, the capabilities of FXSL have proven to
be flat out amazing.

Cheers :)

On Apr 4, 2005 3:00 PM, Dimitre Novatchev <dnovatchev@xxxxxxxxx> wrote:
> I didn't mention that the text I was spelling was the play:
>
>  "Othello"
>
> by William Shakespeare
>
> On Apr 5, 2005 6:56 AM, Dimitre Novatchev <dnovatchev@xxxxxxxxx> wrote:
> > On Apr 5, 2005 6:41 AM, M. David Peterson <m.david.x2x2x@xxxxxxxxx> wrote:
> > > Working on projects such as XBiblio/Citeproc, led by Bruce D'Arcus,
> > > I have realized that, however far the XSLT 2.0 working draft goes in
> > > bringing Perl-esque text processing to the XML
> > > developer, it is still up to the developer to fine-tune these
> > > capabilities to cover their specific needs.  For example, a spell
> > > checker.
> > >
> > > Can anyone with extensive experience developing such
> > > capabilities in XSLT share that experience with the rest of
> > > us?
> >
> > Hi Mark,
> >
> > These days I had fun with an f:binSearch() function and then,
> > logically, with f:spell().
> >
> > I have a dictionary of about 47000 English wordforms, on which I
> > search with f:binSearch()
> >
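
For anyone following along, here is a minimal sketch of what a binary
search over a sorted word list can look like in XSLT 2.0. This is not
FXSL's f:binSearch(), whose signature and source are not shown in this
thread. The my: namespace, the function name my:bin-search, the named
template "main" and the four-word sample list are all made up for the
illustration. It assumes the list is sorted in the same collation that
compare() uses by default.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Hypothetical helper (not part of FXSL): true if $word occurs in
       $sorted[$lo to $hi]. $sorted must be in ascending order under the
       default collation, because compare() is used for the ordering test. -->
  <xsl:function name="my:bin-search" as="xs:boolean">
    <xsl:param name="sorted" as="xs:string*"/>
    <xsl:param name="word" as="xs:string"/>
    <xsl:param name="lo" as="xs:integer"/>
    <xsl:param name="hi" as="xs:integer"/>
    <xsl:choose>
      <!-- Empty range: the word is not in the list -->
      <xsl:when test="$lo gt $hi">
        <xsl:sequence select="false()"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="mid" as="xs:integer"
                      select="($lo + $hi) idiv 2"/>
        <xsl:variable name="cmp" as="xs:integer"
                      select="compare($word, $sorted[$mid])"/>
        <!-- Halve the range on every call, so the recursion depth is
             about log2 of the list size -->
        <xsl:sequence select="
          if ($cmp eq 0) then true()
          else if ($cmp lt 0)
            then my:bin-search($sorted, $word, $lo, $mid - 1)
            else my:bin-search($sorted, $word, $mid + 1, $hi)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <!-- Illustrative entry point with a tiny, already sorted sample list -->
  <xsl:template name="main">
    <xsl:variable name="dict" as="xs:string+"
                  select="('bethink', 'betimes', 'cat', 'dog')"/>
    <xsl:value-of select="my:bin-search($dict, 'cat', 1, count($dict))"/>
  </xsl:template>
</xsl:stylesheet>
```

With a dictionary of about 47000 entries, each lookup of this kind needs
roughly 16 comparisons at most, which is why a sorted list plus binary
search is attractive for a spell checker.
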
> > I had to produce a faster function than the current quadratic
> > str-split-to-words template -- this is the f:getWords() function.
> >
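
As a point of comparison only, word splitting can also be sketched with
the built-in tokenize() function from XPath 2.0. This is not how
f:getWords() is implemented (its source is not shown in this thread).
The my: namespace and the function name my:words are hypothetical, and
the delimiter set is an assumption that approximates the $vDelim
variable used further down, plus whitespace in general.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Hypothetical helper (not part of FXSL): lower-cases $text and
       splits it on runs of whitespace and punctuation. The doubled
       apostrophe is the XPath escape for a single apostrophe. -->
  <xsl:function name="my:words" as="xs:string*">
    <xsl:param name="text" as="xs:string"/>
    <!-- The trailing predicate drops the empty tokens that tokenize()
         returns when the text starts or ends with a delimiter -->
    <xsl:sequence select="
      tokenize(lower-case($text), '[\s,:.\-''!?;]+')[. ne '']"/>
  </xsl:function>

  <!-- Print each distinct word of the input document on its own line -->
  <xsl:template match="/">
    <xsl:value-of separator="&#10;"
      select="distinct-values(//text()/my:words(.))"/>
  </xsl:template>
</xsl:stylesheet>
```

Unlike f:getWords(), which the transformation below declares as
returning elements, this sketch returns plain strings.
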
> > All these functions can be downloaded from the FXSL CVS (just let me
> > know if you'd want me to send you the zip archive).
> >
> > The combination of these functions works quite well.
> >
> > This transformation (test-FuncSpell.xsl):
> >
> > <xsl:stylesheet version="2.0"
> > xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> > xmlns:xs="http://www.w3.org/2001/XMLSchema"
> > xmlns:f="http://fxsl.sf.net/"
> > exclude-result-prefixes="f xs"
> > >
> >  <xsl:import href="../f/func-getWords.xsl"/>
> >  <xsl:import href="../f/func-spell.xsl"/>
> >
> >  <xsl:output omit-xml-declaration="yes"/>
> >
> > <xsl:variable name="vDelim" as="xs:string">
> > ,:.-&#9;&#10;&#13;'!?;</xsl:variable>
> >
> > <!-- To be applied on ../data/othello.xml -->
> >  <xsl:template match="/">
> >    <xsl:variable name="vwordNodes" as="element()*">
> >       <xsl:for-each select="//text()/lower-case(.)">
> >         <xsl:sequence select="f:getWords(., $vDelim, 1)"/>
> >       </xsl:for-each>
> >    </xsl:variable>
> >
> >    <xsl:variable name="vUnique" as="xs:string+">
> >      <xsl:perform-sort select="distinct-values($vwordNodes)">
> >        <xsl:sort select="."/>
> >      </xsl:perform-sort>
> >    </xsl:variable>
> >
> >    <xsl:variable name="vnotFound" as="xs:string*"
> >     select="$vUnique[not(f:spell(.))]"/>
> >
> >    <xsl:value-of separator="&#xA;"
> >     select="$vnotFound"/>
> >
> >    A total of <xsl:value-of select="count($vwordNodes)"/> words
> >    were spelt, (<xsl:value-of select="count($vUnique)"/>) distinct.
> >
> >    <xsl:value-of select="count($vnotFound)"/> not found.
> > </xsl:template>
> > </xsl:stylesheet>
> >
> > when applied on othello.xml (around 29000 words)
> >
> > produces this result:
> >
> > Saxon 8.3 from Saxonica
> > Java version 1.5.0_01
> > Stylesheet compilation time: 1140 milliseconds
> > Processing file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml
> > Building tree for
> > file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml using
> > class net.sf.saxon.tinytree.TinyBuilder
> > Tree built in 94 milliseconds
> > Tree size: 18539 nodes, 154557 characters, 0 attributes
> > Building tree for file:/C:/CVS-DDN/fxsl-xslt2/f/func-getWords.xsl
> > using class net.sf.saxon.tinytree.TinyBuilder
> > Tree built in 0 milliseconds
> > Tree size: 43 nodes, 143 characters, 22 attributes
> > Building tree for file:/C:/CVS-DDN/fxsl-xslt2/data/dictEnglish.xml
> > using class net.sf.saxon.tinytree.TinyBuilder
> > Tree built in 188 milliseconds
> > Tree size: 139140 nodes, 528397 characters, 0 attributes
> > Execution time: 7015 milliseconds
> >
> > <a-list-of-567-unknown-words-omitted/>
> >
> >    A total of 28622 words
> >    were spelt, (3669) distinct.
> >
> >    567 not found.
> >
> > So, checking 3669 distinct words in 7015 milliseconds (3669 / 7.015 s) makes
> >
> >  523.02 words/sec.
> >
> > The actual lookup speed is higher, as the total time also includes
> > splitting the text into words and finding the distinct words.
> >
> > Among the unknown words are such nice words as:
> >
> > affordeth
> > affrighted
> > ariseth
> > arithmetician
> > arrivance
> > bethink
> > betimes
> > bewhored
> >
> > :o)
> >
> > Cheers,
> >
> > Dimitre
>
>


--
<M:D/>

:: M. David Peterson ::
XML & XML Transformations, C#, .NET, and Functional Languages Specialist
