Re: [xsl] Does 'Lecœur' occur in $text? Do you have a multi-factor XPath solution?

Subject: Re: [xsl] Does 'Lecœur' occur in $text? Do you have a multi-factor XPath solution?
From: Louis-Dominique Dubeau <ldd@xxxxxxxxxxxx>
Date: Fri, 18 Jan 2013 18:54:22 -0500
On Fri, 2013-01-18 at 17:17 -0500, G. Ken Holman wrote:
> >2. Perhaps instead of the 'œ' ligature, $text uses 'oe'
> 
> Use normalize-unicode() on both operands.

I did not think it would work, so I created a test, and indeed it does
not work. There's a good reason for this: generally speaking, the single
letter œ and the two letters oe are not equivalent. The Unicode
documentation for U+0153 confirms that œ has no decomposition.

Here is the XSL for the test:

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" 
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  
<xsl:output method="xml" 
            encoding="UTF-8"/>

<xsl:template match="/">
  <xsl:for-each select="('NFC', 'NFD', 'NFKC', 'NFKD')">
    <xsl:message><xsl:value-of select="concat(current(), ' ')"/>
<xsl:value-of select="normalize-unicode('cœur', current()) =
normalize-unicode('coeur', current())"/></xsl:message>
  </xsl:for-each>  
  <xsl:for-each select="('NFC', 'NFD', 'NFKC', 'NFKD')">
    <xsl:message><xsl:value-of select="concat(current(), ' ')"/>
<xsl:value-of select="normalize-unicode('e&#x301;', current()) =
normalize-unicode('&#xE9;', current())"/></xsl:message>
  </xsl:for-each>  
</xsl:template>

</xsl:stylesheet>

The first loop compares cœur and coeur after normalization. The results
are false, no matter which normalization form we use. The second loop is
for illustration purposes: it compares an é made of two Unicode code
points (e followed by the combining acute accent, U+0301) with an é made
of one Unicode code point (U+00E9). The comparisons are true in all
cases, as expected.

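If you save the XSL above as normalize-unicode.xsl, you can run it as
follows (the stylesheet serves as its own input, since the template
never reads the source document):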

$ saxon -s:normalize-unicode.xsl -xsl:normalize-unicode.xsl 

And you get:

NFC false
NFD false
NFKC false
NFKD false
NFC true
NFD true
NFKC true
NFKD true
<?xml version="1.0" encoding="UTF-8"?>
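
Since normalization alone does not bridge œ and oe, one possible
workaround is to expand the ligature explicitly on both operands before
comparing. The following is only a minimal sketch under stated
assumptions: the variable $text and its sample value are hypothetical
stand-ins for the $text of the original question, only the lowercase
ligature U+0153 is handled (the uppercase U+0152 would need its own
replace() call), and normalize-unicode() is called with one argument, so
it uses the default NFC form.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- Hypothetical sample value standing in for $text from the thread. -->
<xsl:variable name="text" select="'Monsieur Lecoeur est ici'"/>

<xsl:template match="/">
  <!-- Expand the ligature (U+0153) to 'oe' on both operands, then
       normalize both, because normalize-unicode() by itself leaves
       the ligature untouched. -->
  <xsl:message>
    <xsl:value-of select="contains(
        normalize-unicode(replace($text, '&#x153;', 'oe')),
        normalize-unicode(replace('Lec&#x153;ur', '&#x153;', 'oe')))"/>
  </xsl:message>
</xsl:template>

</xsl:stylesheet>
```

It can be run the same way as the test above, with the stylesheet as its
own source. Whether treating œ and oe as interchangeable is acceptable
depends on the data; the expansion is a deliberate choice made here, not
something Unicode normalization provides.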

Sincerely,
Louis
