Subject: Re: [xsl] special character encoding, two problems
From: "Graydon graydon@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 24 Oct 2014 19:13:41 -0000
On Fri, Oct 24, 2014 at 04:27:18PM -0000, Jonina Dames jdames@xxxxxxxxx scripsit:
> Hi Graydon,
>
> Thanks for replying. I'm actually trying to get just plain ascii
> equivalents

Can you show me the plain ASCII equivalent for thorn?

> Right now, the function I am using is this:
>
> <xsl:value-of select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')"/>
>
> What I'm unclear on is why the function is correctly converting "é"
> to "e", but not "ø" to "o".

In Unicode, you have different normal forms. The usual normal form, and the one XML documents "expect", is the composed normal form (NFC): where there's a single code point which represents the combination of a letter and an accent, use that single code point. So, use é rather than e followed by a non-spacing combining accent.

What the *decomposed* normal form -- the NFKD in the inner normalize-unicode call -- says is: no, no, if we can represent this as a letter and some modifiers, do it that way. So é becomes e plus U+0301 COMBINING ACUTE ACCENT, and the combining accent can then be stripped by the replace() as a member of the Unicode category "Mark, nonspacing" (\p{Mn}).

When we get to ø, it's not a modified LATIN SMALL LETTER O but some other letter that just happens to *look* something like a latin small letter o; it's not categorized as an o with a modifier DESPITE being called "LATIN SMALL LETTER O WITH STROKE". I have no idea why; the ways of the Unicode Consortium are mysterious. So decomposing it doesn't produce an o and an accent; it produces U+00F8 unchanged. So when the non-spacing modifiers are removed, nothing happens.

> Is there a way to make this function convert all accented latin
> letters to plain ascii characters?

Well, technically, that's precisely and specifically what it does. The problem is that your clients appear to be having a disagreement with the Unicode Consortium about which letters are really accented Latin letters and which are letters in their own right.
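The behaviour described above is easy to check outside XSLT. A minimal Python sketch, using the standard unicodedata module (which implements the same Unicode normalization forms normalize-unicode() uses; the name strip_marks is mine, not part of any API):

```python
import unicodedata

# Mimics the XPath expression
#   normalize-unicode(replace(normalize-unicode(., 'NFKD'), '\p{Mn}', ''), 'NFKC')
def strip_marks(s):
    decomposed = unicodedata.normalize("NFKD", s)   # split letters from accents
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")  # drop nonspacing marks
    return unicodedata.normalize("NFKC", stripped)  # recompose what is left

print(strip_marks("é"))  # e  (NFKD gives e + U+0301, which is category Mn)
print(strip_marks("ø"))  # ø  (U+00F8 has no decomposition, so nothing to strip)
```

The same asymmetry shows up here: the accented letter decomposes and loses its mark, while ø passes through untouched.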
(Which is why I keep bringing up thorn; thorn is without question a letter with no direct ASCII analog.)

> We really need coverage for any letter that can appear in a European
> name, so this should also convert the numeric character reference for
> thorn (þ, &#254;) to one or more plain ascii characters, to cover
> authors from Iceland.

But what? Thorn isn't really "th", just as eth isn't really "dh". (From time to time we've all had clients who were perhaps a little mad. This makes everyone much less willing to guess just how your particular client might be mad, rather than more, because madness is such a wide country.)

> I ran a broad test of all the accented latin letters most likely to occur in
> author names, and these 28 characters are the only ones that were not
> converted to plain ascii equivalents:

[snip list]

> Is there a different set of flags for this function that will yield the
> result I'm looking for?

What result *are* you looking for? Many of those letters have no ASCII equivalent and are not generally considered the same for sorting. (Torvalds and Þorvalds really shouldn't be sorted as the same author, for example.)

But, specifically, no; there are five choices of normalization scheme: C, D, KC, KD, and "fully normalized". "C" is "composed" and "D" is "decomposed". The "K" stands for, I presume, "compatibility"; KC and KD are the stronger forms that also normalize away compatibility characters. (Unicode includes multiple representations of some characters because Unicode combines a bunch of pre-existing character representations. The K variants pick the most canonical representation of the character.) So the second argument of the inner decomposition call is already as strong as you can get it. ("Fully normalized" has to do with string concatenation and is composed anyway, so it won't help you here.)

> If this function cannot do that, what is the best way to convert all
> of these outlying characters?
> I need this conversion to happen in only
> one element of my XML, not the entire XML document. I can't use
> translate because it's a one-to-one conversion that doesn't cover the
> ligatures listed above. If normalize-unicode cannot be made to cover
> all the characters listed above, can character-maps be applied that
> act specifically on only one element?

Character maps apply to the whole result document, so they won't do what you want here.

It looks to me like your best bet is to create a function that applies the decompose-replace-recompose trick, and then uses replace() repeatedly to find each of your remaining problematic letters and substitute whatever, specifically, needs to be used in its place; you then invoke that function to provide the contents of the one element that needs its contents altered like this. It will be an ugly function, but it at least gets to be very specific to your client's particular needs.

-- Graydon
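The two-step function described above can be sketched in Python, with unicodedata standing in for normalize-unicode(). The FALLBACK table and the to_ascii name are illustrative assumptions only; which ASCII string stands in for þ, ø, æ and friends is exactly the client-specific decision discussed in the thread:

```python
import unicodedata

# Letters that survive the mark-stripping step, mapped to ASCII stand-ins.
# Every right-hand side here is an illustrative guess, not a Unicode-defined
# equivalence; the real table has to come from the client.
FALLBACK = {
    "ø": "o", "Ø": "O",
    "þ": "th", "Þ": "Th",
    "ð": "d", "Ð": "D",
    "æ": "ae", "Æ": "AE",
    "ß": "ss",
    "ł": "l", "Ł": "L",
}

def to_ascii(s):
    # Step 1: the decompose-strip-recompose trick from earlier in the thread.
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    recomposed = unicodedata.normalize("NFKC", stripped)
    # Step 2: substitute the remaining problem letters one at a time.
    return "".join(FALLBACK.get(ch, ch) for ch in recomposed)

print(to_ascii("Þórshöfn"))  # Thorshofn
```

In the stylesheet itself, step 2 would be the chain of replace() calls (or a recursive function over a lookup table), applied only where the contents of the one element are generated.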