Subject: Re: [xsl] Tokenization - Thai language
From: "Tony Graham" <tgraham@xxxxxxxxxx>
Date: Wed, 15 Jun 2011 12:22:00 +0100 (IST)
On Wed, June 15, 2011 11:24 am, Jan Pour wrote:
> I would like to tokenize Thai text on all places, where it can be
> broken to new line.
> How could I do it in XSLT? Using extensions in java??

My first thought would be to build an extension based on the
International Components for Unicode [1]. See, e.g., the documentation
on boundary analysis [2].

You wouldn't get very far tokenizing using regular expressions based on
'\w' or '\W' since, as you probably know, Thai ordinarily doesn't have
separators between words.

Regards,

Tony Graham                                   tgraham@xxxxxxxxxx
Consultant                                 http://www.mentea.net
Mentea       13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
 --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
    XML, XSL FO and XSLT consulting, training and programming

[1] http://site.icu-project.org/
[2] http://userguide.icu-project.org/boundaryanalysis
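As a rough illustration of the boundary-analysis approach, here is a minimal sketch of the Java side of such an extension. It is not a tested XSLT extension: the class name LineBreakTokenizer and the method tokenize are made up for this example, and it uses the JDK's own java.text.BreakIterator so that it compiles without extra jars. ICU4J's com.ibm.icu.text.BreakIterator offers the same getLineInstance/setText/first/next calls, so swapping the import should be enough to use ICU instead. Whether the JDK version finds sensible break opportunities inside Thai text depends on the runtime shipping its Thai dictionary data, which is an assumption here. How the method is registered as an extension function is processor-specific and is left out.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; the name is illustrative only.
public class LineBreakTokenizer {

    // Splits text at every position the line-break iterator reports as a
    // permissible break. Joining the returned segments gives back the input.
    public static List<String> tokenize(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);

        List<String> segments = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            segments.add(text.substring(start, end));
        }
        return segments;
    }

    public static void main(String[] args) {
        // A short Thai greeting, written with Unicode escapes so that the
        // source file compiles regardless of its encoding.
        String thai = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35\u0E04\u0E23\u0E31\u0E1A";
        for (String segment : tokenize(thai, Locale.forLanguageTag("th"))) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

From a stylesheet, the returned segments could then be exposed as a sequence of strings or joined with a marker such as U+200B (zero width space), whichever suits the later processing.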