Subject: Re: [xsl] Tokenize question: tokenize on words, spaces and punctuation
From: Brandon Ibach <brandon.ibach@xxxxxxxxxxxxxxxxxxx>
Date: Thu, 17 Mar 2011 00:19:44 -0400
The main trick here seems to be simply constructing an appropriate
character class for each type of token and then matching sequences of
one or more of each. The following does just that, though it also
tosses in a twist to handle words with embedded dashes, so that the
dash won't break the word into three separate tokens. Further
adjustments along those lines may be needed, depending on your
requirements.

The use of Unicode character categories for the character classes
should ensure that this works for most languages, I think, though
non-English languages aren't my strong suit, so I make no guarantees.
:)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:stylesheet-func"
    exclude-result-prefixes="xs f">

  <xsl:output method="text"/>

  <xsl:param name="s" select="'Oh, what a fun-filled day!'"/>

  <xsl:function name="f:tokens" as="xs:string*">
    <xsl:param name="string"/>
    <xsl:analyze-string select="$string"
        regex="{'\w[-\w]*|[\p{P}\p{C}]+|\p{Z}+'}">
      <xsl:matching-substring><xsl:sequence select="."/></xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <xsl:template match="/">
    <xsl:text>('</xsl:text>
    <xsl:value-of select="f:tokens($s)" separator="', '"/>
    <xsl:text>')</xsl:text>
  </xsl:template>

</xsl:stylesheet>

-Brandon :)

On Wed, Mar 16, 2011 at 8:33 PM, Martin Holmes <mholmes@xxxxxxx> wrote:
> Hi there,
>
> This is really a question for XPath regex gurus:
>
> I need to tokenize a string of text such that words, punctuation and spaces
> are split. So from this:
>
> Oh, what a great day!
>
> I need to get:
>
> ('Oh', ',', ' ', 'what', ' ', 'a', ' ', 'great', ' ', 'day', '!')
>
> I've been hacking away at this for a while, but regexps aren't my strong
> suit. Can anyone help?
>
> Cheers,
> Martin
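
Since each token starts with a character from exactly one of the three classes used in the regex, the type of a token can be recovered from its first character. The following is only a sketch along those lines, not part of the original stylesheet: f:token-type and its labels 'word', 'space' and 'punct' are hypothetical names introduced for illustration, and f:tokens is repeated here so the stylesheet stands on its own. Like the original, it needs an XSLT 2.0 processor and some (any) input document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:stylesheet-func"
    exclude-result-prefixes="xs f">

  <xsl:output method="text"/>

  <xsl:param name="s" select="'Oh, what a fun-filled day!'"/>

  <!-- Same tokenizer as in the message above: words (with embedded
       dashes), runs of punctuation/other, runs of separators. -->
  <xsl:function name="f:tokens" as="xs:string*">
    <xsl:param name="string"/>
    <xsl:analyze-string select="$string"
        regex="{'\w[-\w]*|[\p{P}\p{C}]+|\p{Z}+'}">
      <xsl:matching-substring><xsl:sequence select="."/></xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <!-- Hypothetical helper (an assumption, not from the original post):
       labels a token by the character class of its first character. -->
  <xsl:function name="f:token-type" as="xs:string">
    <xsl:param name="token" as="xs:string"/>
    <xsl:sequence select="if (matches($token, '^\w')) then 'word'
                          else if (matches($token, '^\p{Z}')) then 'space'
                          else 'punct'"/>
  </xsl:function>

  <!-- Emit one line per token: its label, then the token in brackets. -->
  <xsl:template match="/">
    <xsl:for-each select="f:tokens($s)">
      <xsl:value-of select="concat(f:token-type(.), ': [', ., ']', '&#10;')"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Testing only the first character is enough here because the three alternatives in the regex begin with disjoint classes; if the word pattern is later adjusted to allow other leading characters, the helper would need the same adjustment.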