Re: [xsl] special character encoding, two problems

Subject: Re: [xsl] special character encoding, two problems
From: "Graydon graydon@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 24 Oct 2014 19:13:41 -0000
On Fri, Oct 24, 2014 at 04:27:18PM -0000, Jonina Dames jdames@xxxxxxxxx scripsit:
> Hi Graydon,
> 
> Thanks for replying. I'm actually trying to get just plain ascii equivalents

Can you show me the plain ASCII equivalent for thorn?

> Right now, the function I am using is this:
> 
>     <xsl:value-of select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')"/>
> 
> What I'm unclear on is why the function is correctly converting "&#x00E9;"
> to "e", but not "&#xf8;" to "o". 

In Unicode, you have different normal forms.

The usual normal form, and the one XML documents "expect", is the
composed normal form (NFC); where there's a single code point which
represents the combination of a letter and an accent, use that single
code point.

So, use é (U+00E9) instead of e followed by a non-spacing combining
accent (U+0301).

What the *decomposed* normal form -- the NFKD in the inner
normalize-unicode call -- does is say, no, no, if we can represent this
as a letter and some modifiers, do it that way.  So we get e´ (e followed
by U+0301 COMBINING ACUTE ACCENT) and the accent can be stripped by the
replace as a member of the Unicode category "Mark, nonspacing" (Mn).

When we get to ø (U+00F8), it's not a modified o (&#x006f; LATIN SMALL
LETTER O) but some other letter that just happens to *look* something
like a Latin small letter o; it's not categorized as an o with a modifier
DESPITE being called "LATIN SMALL LETTER O WITH STROKE".  I have no idea
why; the ways of the Unicode Consortium are mysterious.  So decomposing
it doesn't produce an o and an accent; it produces U+00F8 again.  So when
the non-spacing modifiers are removed, nothing changes.
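
You can check this yourself.  A rough sketch, assuming an XSLT 2.0
processor that supports the NFKD form (the spec only requires NFC; Saxon
does the others as well); run it against any input document and it prints
the code points left after decomposition:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Print the code points that remain after NFKD decomposition. -->
  <xsl:template match="/">
    <!-- e with acute: decomposes to 101 (e) and 769 (U+0301, category Mn) -->
    <xsl:value-of select="string-to-codepoints(normalize-unicode('&#xE9;', 'NFKD'))"/>
    <xsl:text>&#10;</xsl:text>
    <!-- o with stroke: no decomposition, so it stays 248 (U+00F8) -->
    <xsl:value-of select="string-to-codepoints(normalize-unicode('&#xF8;', 'NFKD'))"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The first line of output has two code points, the second of which
matches \p{Mn}; the second line has only the one, so the replace has
nothing to remove.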

> Is there a way to make this function convert all accented latin
> letters to plain ascii characters? 

Well, technically, that's precisely and specifically what it does.  The
problem is that your clients appear to be having a disagreement with the
Unicode Consortium about which letters are really accented Latin and
which are letters in their own right.  (Which is why I keep bringing up
thorn; thorn is without question a letter with no direct ASCII analog.)

> We really need coverage for any letter that can appear in a European
> name, so this should also convert the numeric character reference for
> thorn (þ, &#xfe;) to one or more plain ascii characters, to cover
> authors from Iceland.

But what?  Thorn isn't really "th", just like eth isn't really "dh".

(From time to time we've all had clients who were perhaps a little mad.
This makes everyone much less willing to guess just how your particular
client might be mad, rather than more, because madness is such a wide
country.)

> I ran a broad test of all the accented latin letters most likely to occur in
> author names, and these 28 characters are the only ones that were not
> converted to plain ascii equivalents:
[snip list] 
> Is there a different set of flags for this function that will yield the
> result I'm looking for? 

What result *are* you looking for?  Many of those letters have no ASCII
equivalent and are not generally considered the same for sorting.
(Torvalds and Þorvalds really shouldn't be sorted as the same author,
for example.)

But, specifically, no; there are five choices of normalization scheme,
C, D, KC, KD, and "fully normalized".

"C" is "composed" and "D" is "decomposed".

The "K" stands for, I presume, "compatibility"; KC and KD are the
stronger forms that normalize away compatibility characters.  (Unicode
includes multiple representations of some characters because Unicode
combines a bunch of pre-existing character representations.  The K
variants replace those extra representations with their plain
equivalents.)  So the second argument for the decomposition is already
as strong as you can get it.

("Fully normalized" has to do with string concatenation and is composed,
anyway, so it won't help you here.)
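
To make the difference between D and KD concrete, here is a small
sketch, again assuming a processor that supports more than NFC.  It uses
the fi ligature (U+FB01) only as an illustration; I'm not assuming it is
on your list:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Canonical decomposition leaves the ligature alone: 64257 -->
    <xsl:value-of select="string-to-codepoints(normalize-unicode('&#xFB01;', 'NFD'))"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Compatibility decomposition splits it into f and i: 102 105 -->
    <xsl:value-of select="string-to-codepoints(normalize-unicode('&#xFB01;', 'NFKD'))"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Letters such as æ (U+00E6) have no decomposition of either kind, so
neither form touches them.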

> If this function cannot do that, what is the best way to convert all
> of these outlying characters? I need this conversion to happen in only
> one element of my XML, not the entire XML document. I can't use
> translate because it's a one-to-one conversion that doesn't cover the
> ligatures listed above. If normalize-unicode cannot be made to cover
> all the characters listed above, can character-maps be applied that
> act specifically on only one element?

Character maps apply to the whole result document, so they won't do what
you want here.

It looks to me like your best bet is to create a function that applies
the decompose-replace-recompose trick and then uses replace()
repeatedly to swap each remaining problematic letter for whatever,
specifically, needs to be used in its place.  Then invoke that function
to provide the contents of the specific element that needs its contents
altered like this.

It will be an ugly function but it at least gets to be very specific to
your client's particular needs.
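
A minimal, untested sketch of the shape I mean.  The function name
my:to-ascii, its namespace URI, and the element name "author" are all
made up for illustration, and the replacement strings are placeholders
for whatever your client decides each letter should become:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/ns/local"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical function: strip accents, then patch what is left. -->
  <xsl:function name="my:to-ascii" as="xs:string">
    <xsl:param name="in" as="xs:string?"/>
    <!-- Step 1: decompose, drop non-spacing marks, recompose. -->
    <xsl:variable name="stripped" as="xs:string"
        select="normalize-unicode(replace(normalize-unicode($in, 'NFKD'), '\p{Mn}', ''), 'NFKC')"/>
    <!-- Step 2: one replace() per remaining letter.  The replacement
         strings below are placeholders, not recommendations. -->
    <xsl:variable name="no-o-stroke" as="xs:string"
        select="replace(replace($stripped, '&#xF8;', 'o'), '&#xD8;', 'O')"/>
    <xsl:variable name="no-thorn" as="xs:string"
        select="replace(replace($no-o-stroke, '&#xFE;', 'th'), '&#xDE;', 'Th')"/>
    <xsl:sequence
        select="replace(replace($no-thorn, '&#xF0;', 'd'), '&#xD0;', 'D')"/>
  </xsl:function>

  <!-- Identity template: copy everything else unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only text directly inside the one element goes through the function. -->
  <xsl:template match="author/text()">
    <xsl:value-of select="my:to-ascii(.)"/>
  </xsl:template>
</xsl:stylesheet>
```

Add one more replace() pair for each letter left on your list.  Because
the override template matches only author/text(), the rest of the
document is copied through untouched.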

-- Graydon
