Re: [xsl] Tokenization - Thai language

Subject: Re: [xsl] Tokenization - Thai language
From: "Tony Graham" <tgraham@xxxxxxxxxx>
Date: Wed, 15 Jun 2011 12:22:00 +0100 (IST)
On Wed, June 15, 2011 11:24 am, Jan Pour wrote:
> I would like to tokenize Thai text at every place where it can be
> broken onto a new line.
> How could I do it in XSLT? Using extensions in Java?

My first thought would be to build an extension function on top of the
International Components for Unicode (ICU) [1].  Its BreakIterator does
line-break analysis and uses a dictionary for Thai; see, e.g., the
documentation on boundary analysis [2].
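
As a rough illustration of the boundary-analysis approach, here is a
minimal Java sketch.  It uses the JDK's own java.text.BreakIterator rather
than ICU4J so that it runs without extra jars; ICU4J's
com.ibm.icu.text.BreakIterator exposes the same style of API.  The class
and method names (LineBreakTokenizer, tokenize) are made up for the
example, and whether the plain JDK finds good Thai break positions depends
on the locale data shipped with the runtime.  Wiring the method into a
stylesheet as an extension function is processor-specific and not shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits text at every position where a line break
// is allowed, according to the break iterator for the given locale.
public class LineBreakTokenizer {

    public static List<String> tokenize(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = it.first();
        // Each boundary returned by next() ends one token and starts the next.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            tokens.add(text.substring(start, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Thai greeting "sawatdi khrap", written with escapes so the source
        // file compiles regardless of its encoding.
        String thai = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35\u0E04\u0E23\u0E31\u0E1A";
        for (String token : tokenize(thai, Locale.forLanguageTag("th"))) {
            System.out.println("[" + token + "]");
        }
    }
}
```

The tokens keep any trailing whitespace, so concatenating them gives back
the original string.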

You wouldn't get very far tokenizing using regular expressions based on
'\w' or '\W' since, as you probably know, Thai ordinarily doesn't have
separators between words.
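
To make the regular-expression point concrete, here is a small Java
sketch.  It is only an analogy for what an XSLT tokenize() on '\W+' would
do: Java's regex dialect is not the XSLT one, and the
UNICODE_CHARACTER_CLASS flag is used to make '\w' Unicode-aware.  The
class name RegexSplitDemo is made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo: splitting on runs of non-word characters works for
// space-separated text but leaves an unspaced Thai phrase in one piece.
public class RegexSplitDemo {

    // Unicode-aware \W, so Thai letters and marks count as word characters.
    private static final Pattern NON_WORD =
            Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    public static String[] split(String text) {
        return NON_WORD.split(text);
    }

    public static void main(String[] args) {
        // Two words separated by a space: two tokens.
        System.out.println(split("hello world").length);

        // Thai greeting "sawatdi khrap" with no separators: a single token.
        String thai = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35\u0E04\u0E23\u0E31\u0E1A";
        System.out.println(split(thai).length);
    }
}
```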

Regards,


Tony Graham                                   tgraham@xxxxxxxxxx
Consultant                                 http://www.mentea.net
Mentea       13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
 --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
    XML, XSL FO and XSLT consulting, training and programming

[1] http://site.icu-project.org/
[2] http://userguide.icu-project.org/boundaryanalysis
