Subject: Re: [xsl] special character encoding, two problems
From: "Wolfgang Laun wolfgang.laun@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 24 Oct 2014 18:22:44 -0000

The German "sharp s" has morphed from a ligature ("long" s + "z" in old letterings and fonts) to a proper letter of its own. This is very clearly reflected by the rule to replace "ß" by "ss" if "ß" is not available. This is a process similar to the one that long ago resulted in "w", a letter which was once a ligature of "uu", as its English (and Hungarian: "dupla-v") names imply. In the Folio edition, "Hamlet" begins with a double capital "V": "VVhose...".

As to the Danish and Norwegian (not Swedish!) Ø and ø: it is considered a letter of its own, as is the letter Ł (U+0141) or ł (U+0142), which occurs in Polish, among other languages. In Polish, this letter is clearly distinct in its pronunciation from "L" (more like an English "w"), and the "/" as a diacritical mark has a shape different from the one crossing the "O", so I guess there's more than one good reason not to have a diacritical mark "/" as a separate Unicode code point.

(As to indexing according to letters dismembered into US-ASCII, I can't imagine that this will be useful except as a very crude sorting criterion, e.g., "On which shelf do we stash author X?")

-W

On 24 October 2014 18:55, Michael Kay mike@xxxxxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> > What I'm unclear on is why the function is correctly converting "é"
> > to "e", but not "ø" to "o".
>
> Because Unicode normalization into decomposed form does not split xf8 into
> an "o" and a "/" modifier. Don't ask me why, probably there were some
> voluble and well educated Swedes on the committee who insisted that xf8 was
> not a modified "o".
>
> Some of the characters below are ligatures, e.g. Æ and æ and Œ, some (like
> thorn) are first-class letters in their own right that just happen not to
> be used in English.
>
> If you only need to transliterate these characters, and not the whole of
> Cyrillic, Greek, Hebrew, etc, then I think you would be best off just
> enumerating them.
>
> Michael Kay
> Saxonica
>
> Is there a way to make this function convert all accented latin letters to
> plain ascii characters? We really need coverage for any letter that can
> appear in a European name, so this should also convert the numeric
> character reference for thorn (þ, &#xFE;) to one or more plain ascii
> characters, to cover authors from Iceland.
>
> I ran a broad test of all the accented latin letters most likely to occur
> in author names, and these 28 characters are the only ones that were not
> converted to plain ascii equivalents:
>
> &#xC6;   Æ
> &#xD0;   Ð
> &#xD8;   Ø
> &#xDE;   Þ
> &#xDF;   ß
> &#xE6;   æ
> &#xF0;   ð
> &#xF8;   ø
> &#xFE;   þ
> &#x110;  Đ
> &#x111;  đ
> &#x126;  Ħ
> &#x127;  ħ
> &#x131;  ı
> &#x141;  Ł
> &#x142;  ł
> &#x14A;  Ŋ
> &#x14B;  ŋ
> &#x152;  Œ
> &#x153;  œ
> &#x166;  Ŧ
> &#x167;  ŧ
> &#x180;  ƀ
> &#x197;  Ɨ
> &#x1B5;  Ƶ
> &#x1B6;  ƶ
> &#x1E4;  Ǥ
> &#x1E5;  ǥ
>
> Is there a different set of flags for this function that will yield the
> result I'm looking for? If this function cannot do that, what is the best
> way to convert all of these outlying characters? I need this conversion to
> happen in only one element of my XML, not the entire XML document. I can't
> use translate because it's a one-to-one conversion that doesn't cover the
> ligatures listed above. If normalize-unicode cannot be made to cover all
> the characters listed above, can character-maps be applied that act
> specifically on only one element?
>
> Thanks,
> Joni
>
> On 10/24/14 9:11 AM, Eliot Kimber ekimber@xxxxxxxxxxxx wrote:
>
> I can't restrain my own pedantry: the correct term is "numeric character
> reference", not "numeric entity": http://www.w3.org/TR/REC-xml/#dt-charref
>
> Given that I think I'm the only person who ever uses the term correctly
> and consistently, we probably should have just used "numeric entity" but
> so it goes.
>
> Cheers,
>
> E.
>
> Eliot Kimber, Owner
> Contrext, LLC
> http://contrext.com
>
> On 10/23/14, 4:13 PM, "Graydon graydon@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, Oct 23, 2014 at 08:39:11PM -0000, Jonina Dames jdames@xxxxxxxxx
> scripsit:
>
> > Thanks for the advice! The
> > <xsl:value-of select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')"/>
> > function works for most of the entities, but it's still missing a
> > couple dozen characters.
>
> Terminology pedant time --
>
> &#x00E9; is a numeric entity and exactly the same thing as é, just
> written differently.
>
> &eacute; is a named entity reference (which had better be defined
> somewhere)
>
> Either, as soon as the XML document is parsed, turns into U+00E9 in some
> internal representation and they're not different from each other or the
> representation for é if someone had typed that directly in the utf-8
> input file.
>
> So when you say "entity" here I'm getting the nervous feeling that I
> don't know what you mean; can you provide some examples?
>
> > Some of the author names still have unicode entities instead of plain
> > ascii (for example, several characters with a stroke, several
> > ligatures, thorn characters, upper and lowercase). Is there a
>
> Well, examples would be good, but thorn, for example, &#x00FE; which is
> the self-same code point as þ, doesn't involve a modifier; it's one
> whole letter that doesn't exist inside ASCII.
>
> Stripping the modifiers -- which will give you e from é if you decompose
> é first, because then it's e + ˊ, which you could write e +
> &#x0301; and it would be the same -- doesn't do anything because there
> is no modifier there, it's just the single code-point for thorn.
>
> > variation of this function or a parameter that will catch and convert
> > ALL of these to plain ascii, as well as the standard acute and cedilla
> > characters?
> > Or do I need to address these outlying characters with
> > something else (not translate, since I can't use a one-to-one
> > replacement for ligature entities)?
>
> ASCII, strictly, is seven-bit; there are lots of things you can't
> represent in ASCII. &#x00E9; *is not* ASCII just because those eight
> characters all happen to be ASCII characters.
>
> So it sounds like you're trying to (either) map U+00FE, þ, to &#x00FE; or
> something like that (which is not, I cannot stress too much, ASCII; it
> might be an ASCII representation of a non-ASCII code-point, but it's
> still a non-ASCII code-point) or have þ decompose into t+h or something
> of that ilk. (Which is at least actually ASCII.)
>
> Either way you'd have to use character mappings for those; there aren't
> any modifiers to remove.
>
> Are you really compelled to deliver seven bit ASCII?
>
> And, please, some examples.
>
> -- Graydon
>
> --
> Jonina Dames
> Customer Support Specialist
> Inera Inc.
> +1 617 932 1932
> eXtyles on Twitter <https://twitter.com/extyles>
> jdames@xxxxxxxxx
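
The "just enumerate them" advice in the quoted thread could be sketched roughly as follows, in XSLT 2.0. This is only an illustration under stated assumptions, not anyone's actual stylesheet: the function name f:to-ascii, the namespace urn:example:translit, the lookup variable f:table, and the element name surname are all hypothetical, and the replacement strings (for instance "Th" for thorn, "N" for eng) are arbitrary choices rather than a standard transliteration. Only normalize-unicode, replace, string-to-codepoints, codepoints-to-string and string-join are standard functions.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:translit"
    exclude-result-prefixes="xs f">

  <!-- Hypothetical lookup table for the 28 characters that have no
       Unicode decomposition. The "to" values are illustrative only. -->
  <xsl:variable name="f:table" as="element(c)+">
    <c from="&#xC6;" to="AE"/>  <c from="&#xE6;" to="ae"/>
    <c from="&#xD0;" to="D"/>   <c from="&#xF0;" to="d"/>
    <c from="&#xD8;" to="O"/>   <c from="&#xF8;" to="o"/>
    <c from="&#xDE;" to="Th"/>  <c from="&#xFE;" to="th"/>
    <c from="&#xDF;" to="ss"/>  <c from="&#x131;" to="i"/>
    <c from="&#x110;" to="D"/>  <c from="&#x111;" to="d"/>
    <c from="&#x126;" to="H"/>  <c from="&#x127;" to="h"/>
    <c from="&#x141;" to="L"/>  <c from="&#x142;" to="l"/>
    <c from="&#x14A;" to="N"/>  <c from="&#x14B;" to="n"/>
    <c from="&#x152;" to="OE"/> <c from="&#x153;" to="oe"/>
    <c from="&#x166;" to="T"/>  <c from="&#x167;" to="t"/>
    <c from="&#x180;" to="b"/>  <c from="&#x197;" to="I"/>
    <c from="&#x1B5;" to="Z"/>  <c from="&#x1B6;" to="z"/>
    <c from="&#x1E4;" to="G"/>  <c from="&#x1E5;" to="g"/>
  </xsl:variable>

  <xsl:function name="f:to-ascii" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <!-- Step 1: the expression from the thread. Decompose, drop the
         combining marks, recompose whatever is left. -->
    <xsl:variable name="stripped" as="xs:string"
        select="normalize-unicode(
                  replace(normalize-unicode($text, 'NFKD'), '\p{Mn}', ''),
                  'NFKC')"/>
    <!-- Step 2: replace each remaining character that appears in the
         table; characters not listed pass through unchanged. -->
    <xsl:sequence select="
        string-join(
          for $cp in string-to-codepoints($stripped),
              $c in codepoints-to-string($cp)
          return ($f:table[@from = $c]/@to/string(), $c)[1],
          '')"/>
  </xsl:function>

  <!-- Identity transform: everything else is copied untouched. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Apply the conversion in one element only ("surname" is a
       placeholder for the real element name). -->
  <xsl:template match="surname/text()">
    <xsl:value-of select="f:to-ascii(.)"/>
  </xsl:template>

</xsl:stylesheet>
```

With this table, f:to-ascii('Þórður') would yield "Thordur": the acute accents are removed by step 1, while Þ and ð, which carry no combining mark, are handled by the lookup in step 2. Because the table maps one character to a string, it also covers the one-to-many cases (Æ, Œ, ß) that translate() cannot express. Any character outside both steps, such as Cyrillic or Greek, is left as it is.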