Subject: Re: [xsl] Spelling Othello (Was: Re: [xsl] Text processing on XSLT 2.0)
From: "M. David Peterson" <m.david.x2x2x@xxxxxxxxx>
Date: Mon, 4 Apr 2005 15:10:37 -0600

Well, I think that about covers it... FXSL it is then :)

Please see http://www.xsltblog.com/archives/2005/04/my_reaction_a_r.html
for a slightly extended reaction...

Thank you Dimitre!!! As always the capabilities of FXSL have proven to
be flat out amazing.

Cheers :)

On Apr 4, 2005 3:00 PM, Dimitre Novatchev <dnovatchev@xxxxxxxxx> wrote:
> I didn't mention that the text I was spelling was the play:
>
> "Othello"
>
> by William Shakespeare
>
> On Apr 5, 2005 6:56 AM, Dimitre Novatchev <dnovatchev@xxxxxxxxx> wrote:
> > On Apr 5, 2005 6:41 AM, M. David Peterson <m.david.x2x2x@xxxxxxxxx> wrote:
> > > Working on projects such as XBiblio/Citeproc led by Bruce D'Arcus
> > > I have realized that even as far as the XSLT 2.0 working draft goes in
> > > regards to bringing Perl-esque type text processing to the XML
> > > developer it is still up to the developer to fine-tune these
> > > capabilities to cover their specific needs. For example, a spell
> > > checker.
> > >
> > > Can anyone who may have extended experience in regards to the
> > > development of such capabilities using XSLT share with the rest of us
> > > your experience?
> >
> > Hi Mark,
> >
> > These days I had fun with an f:binSearch() function and then,
> > logically, with f:spell().
> >
> > I have a dictionary of about 47000 English wordforms, on which I
> > search with f:binSearch()
> >
> > I had to produce a faster fn than the current quadratic
> > str-split-to-words template -- this is the f:getWords() function.
> >
> > All these functions can be downloaded from the FXSL CVS (just let me
> > know if you'd want me to send you the zip archive).
> >
> > The combination of these functions works quite well.
> >
> > This transformation (test-FuncSpell.xsl):
> >
> > <xsl:stylesheet version="2.0"
> >  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> >  xmlns:xs="http://www.w3.org/2001/XMLSchema"
> >  xmlns:f="http://fxsl.sf.net/"
> >  exclude-result-prefixes="f xs">
> >
> >   <xsl:import href="../f/func-getWords.xsl"/>
> >   <xsl:import href="../f/func-spell.xsl"/>
> >
> >   <xsl:output omit-xml-declaration="yes"/>
> >
> >   <xsl:variable name="vDelim" as="xs:string">
> > ,:.-&#9; '!?;</xsl:variable>
> >
> >   <!-- To be applied on ../data/othello.xml -->
> >   <xsl:template match="/">
> >     <xsl:variable name="vwordNodes" as="element()*">
> >       <xsl:for-each select="//text()/lower-case(.)">
> >         <xsl:sequence select="f:getWords(., $vDelim, 1)"/>
> >       </xsl:for-each>
> >     </xsl:variable>
> >
> >     <xsl:variable name="vUnique" as="xs:string+">
> >       <xsl:perform-sort select="distinct-values($vwordNodes)">
> >         <xsl:sort select="."/>
> >       </xsl:perform-sort>
> >     </xsl:variable>
> >
> >     <xsl:variable name="vnotFound" as="xs:string*"
> >       select="$vUnique[not(f:spell(.))]"/>
> >
> >     <xsl:value-of separator="&#xA;"
> >       select="$vnotFound"/>
> >
> >     A total of <xsl:value-of select="count($vwordNodes)"/> words
> >     were spelt, (<xsl:value-of select="count($vUnique)"/>) distinct.
> >
> >     <xsl:value-of select="count($vnotFound)"/> not found.
> >   </xsl:template>
> > </xsl:stylesheet>
> >
> > when applied on othello.xml (around 29000 words)
> >
> > produces this result:
> >
> > Saxon 8.3 from Saxonica
> > Java version 1.5.0_01
> > Stylesheet compilation time: 1140 milliseconds
> > Processing file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml
> > Building tree for
> > file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml using
> > class net.sf.saxon.tinytree.TinyBuilder
> > Tree built in 94 milliseconds
> > Tree size: 18539 nodes, 154557 characters, 0 attributes
> > Building tree for file:/C:/CVS-DDN/fxsl-xslt2/f/func-getWords.xsl
> > using class net.sf.saxon.tinytree.TinyBuilder
> > Tree built in 0 milliseconds
> > Tree size: 43 nodes, 143 characters, 22 attributes
> > Building tree for file:/C:/CVS-DDN/fxsl-xslt2/data/dictEnglish.xml
> > using class net.sf.saxon.tinytree.TinyBuilder
> > Tree built in 188 milliseconds
> > Tree size: 139140 nodes, 528397 characters, 0 attributes
> > Execution time: 7015 milliseconds
> >
> > <a-list-of-567-unknown-words-ommitted/>
> >
> > A total of 28622 words
> > were spelt, (3669) distinct.
> >
> > 567 not found.
> >
> > So, checking 3669 distinct words in 7015 milliseconds makes
> >
> > 523.02 words/sec.
> >
> > The actual speed is faster, as the total time includes splitting up
> > the words and finding the distinct words.
> >
> > Among the unknown words are such nice words as:
> >
> > affordeth
> > affrighted
> > ariseth
> > arithmetician
> > arrivance
> > bethink
> > betimes
> > bewhored
> >
> > :o)
> >
> > Cheers,
> >
> > Dimitre
> >

--
<M:D/>

:: M. David Peterson ::
XML & XML Transformations, C#, .NET, and Functional Languages Specialist
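
For readers who want to see the idea behind a binary-search spell check without pulling FXSL from CVS, here is a minimal sketch in XSLT 2.0. It is not the FXSL implementation of f:binSearch() or f:spell(), whose source is not shown in this thread. The my: namespace, the function names my:bin-search and my:known, and the five-word dictionary are all assumptions made up for illustration. The sketch assumes the dictionary is already sorted in codepoint order, and it compares with the codepoint collation so that sorting and searching agree.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/sketch"
    exclude-result-prefixes="my xs">

  <xsl:output method="text"/>

  <!-- Codepoint collation URI from XPath 2.0 Functions and Operators. -->
  <xsl:variable name="vCodepoint" as="xs:string"
    select="'http://www.w3.org/2005/xpath-functions/collation/codepoint'"/>

  <!-- Hypothetical stand-in dictionary, already in codepoint order.
       A real run would load a large sorted word list instead. -->
  <xsl:variable name="vDict" as="xs:string*"
    select="('ancient', 'handkerchief', 'honest', 'jealousy', 'moor')"/>

  <!-- Hypothetical helper: true if $pWord occurs in the sorted sequence
       $pDict between positions $pLo and $pHi (1-based, inclusive). -->
  <xsl:function name="my:bin-search" as="xs:boolean">
    <xsl:param name="pDict" as="xs:string*"/>
    <xsl:param name="pWord" as="xs:string"/>
    <xsl:param name="pLo" as="xs:integer"/>
    <xsl:param name="pHi" as="xs:integer"/>
    <xsl:choose>
      <!-- Empty range: the word is not in the dictionary. -->
      <xsl:when test="$pLo gt $pHi">
        <xsl:sequence select="false()"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="vMid" as="xs:integer"
          select="($pLo + $pHi) idiv 2"/>
        <xsl:variable name="vCmp" as="xs:integer"
          select="compare($pWord, $pDict[$vMid], $vCodepoint)"/>
        <!-- Halve the range on each step, so depth is about log2(n). -->
        <xsl:sequence select="
          if ($vCmp eq 0) then true()
          else if ($vCmp lt 0)
            then my:bin-search($pDict, $pWord, $pLo, $vMid - 1)
            else my:bin-search($pDict, $pWord, $vMid + 1, $pHi)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <!-- Hypothetical one-argument wrapper, playing the role of a spell test. -->
  <xsl:function name="my:known" as="xs:boolean">
    <xsl:param name="pWord" as="xs:string"/>
    <xsl:sequence
      select="my:bin-search($vDict, $pWord, 1, count($vDict))"/>
  </xsl:function>

  <xsl:template name="main" match="/">
    <xsl:variable name="vWords" as="xs:string*"
      select="('honest', 'bewhored', 'moor')"/>
    <!-- List the words that the dictionary does not contain. -->
    <xsl:value-of select="$vWords[not(my:known(.))]" separator="&#xA;"/>
  </xsl:template>
</xsl:stylesheet>
```

Run against any input document (or with main as the initial template), this should print only bewhored, since the other two sample words are in the stand-in dictionary. For a dictionary of about 47000 wordforms, as in the post, each lookup would need roughly 16 comparisons, which is why the recursion depth is not a concern here.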