Re: [xsl] Katakana substitution regex

Subject: Re: [xsl] Katakana substitution regex
From: Lars Huttar <lars_huttar@xxxxxxx>
Date: Fri, 06 Aug 2010 15:57:07 -0500
On 8/6/2010 3:14 PM, Hoskins & Gretton wrote:
> Hi, I have to convert some Katakana strings from "original" to "new"
> by adding &#12540; (#x30fc;), a pronunciation character (see
> http://www.fileformat.info/info/unicode/char/30fc/index.htm).
> In Japanese, there aren't any word boundaries, so essentially all of
> my search strings are substrings of the text of the current element.
> When substring "a" is followed by the character &#12540; I do not want
> to make the replacement.
>
> example:        &#12502;&#12521;&#12454;&#12470; is a search string
> but it is followed by &#12540; already -- do nothing
>
> When substring "a" is not followed by the character &#12540; I want to
> make the replacement to create "a" followed by &#12540;.
>
> example:        &#12502;&#12521;&#12454;&#12470; is a search string
> but it is not followed by #x30fc; already
>                 add to the end to make it
>                 &#12502;&#12521;&#12454;&#12470;&#12540;
>
> If I just wanted to add the &#12540; unconditionally, I could do that
> with analyze-string and a regex, where $regexSearch contains all of
> my search Katakana strings:
>
>                 <xsl:analyze-string select="." regex="({$regexSearch})">
>                     <xsl:matching-substring>
>                         <xsl:value-of select="regex-group(1)"/>
>                         <xsl:text>&#12540;</xsl:text>
>                     </xsl:matching-substring>
>                     <xsl:non-matching-substring>
>                         <xsl:value-of select="."/>
>                     </xsl:non-matching-substring>
>                 </xsl:analyze-string>
> However, I can't figure out how to fit this into an overall XSLT,
> where I need to look ahead in the element text before I decide to
> make the substitution. Currently, if there is a
> string:                &#12502;&#12521;&#12454;&#12470;&#12540;
> it becomes:     &#12502;&#12521;&#12454;&#12470;&#12540;&#12540;
> (doubling the last character).
>
> If someone has some experience with this type of search and replace
> problem, I would appreciate some guidance.
> Regards, Dorothy
>
>

How about
   select="replace(., '&#12470;([^&#12540;])', '&#12470;&#12540;$1')"
?

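The first pattern needs a character after &#12470;, so it will not catch
&#12470; at the very end of a text node. To cover that case, wrap the
result in
    replace(., '&#12470;$', '&#12470;&#12540;')
with the first replace() call in place of the dot.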
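
A variation that may be easier to extend to the full list of search
strings: let the pattern swallow an optional trailing &#12540; and always
write exactly one back. That needs no lookahead (which XPath 2.0 regexes
don't have), and it should also cover the end of a text node and two
search strings in a row, where a pattern that consumes the following
character would skip the second one. A minimal sketch, assuming XSLT 2.0;
the regexSearch parameter here is a stand-in holding only the example
word from the question, and the identity template is just scaffolding:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: stand-in for the real alternation of Katakana search
       strings (e.g. 'word1|word2|...'); here only the example word. -->
  <xsl:param name="regexSearch"
      select="'&#12502;&#12521;&#12454;&#12470;'"/>

  <!-- Identity template: copy everything else unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Match a search string plus an optional following &#12540;, then
       emit the search string followed by exactly one &#12540;.
       The explicit priority avoids a conflict with node() above. -->
  <xsl:template match="text()" priority="1">
    <xsl:value-of
        select="replace(.,
                        concat('(', $regexSearch, ')&#12540;?'),
                        '$1&#12540;')"/>
  </xsl:template>

</xsl:stylesheet>
```

With this, &#12502;&#12521;&#12454;&#12470; becomes
&#12502;&#12521;&#12454;&#12470;&#12540;, and
&#12502;&#12521;&#12454;&#12470;&#12540; keeps its single &#12540;.
One caveat: replace() raises an error if the pattern can match the empty
string, so the alternation should contain only non-empty strings.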

HTH,
Lars
