Re: [xsl] special character encoding, two problems

Subject: Re: [xsl] special character encoding, two problems
From: "Wolfgang Laun wolfgang.laun@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 24 Oct 2014 18:22:44 -0000
The German "sharp s" has morphed from a ligature ("long" s + "z" in old
letterings and fonts) to a proper letter of its own. This is very clearly
reflected by the rule to replace "ß" by "ss" if "ß" is not available. This
is a process similar to the one that has long since resulted in "w", a
letter which was once a ligature of "uu", as its English (and Hungarian:
"dupla-v") names imply. In the Folio edition, "Hamlet" begins with a double
capital "V": "VVhose...".

As to the Danish and Norwegian (not Swedish!) *Ø* and *ø*: it is considered
a letter of its own, as is the letter
Ł: U+0141 or ł: U+0142 that occurs in Polish, among other languages. In
the latter, this letter is clearly distinct in its pronunciation from "L"
(more like an English "w"), and the "/" as a diacritical mark is a shape
different from the one crossing the "O", so I guess there's more than one
good reason not to have a diacritical mark "/" as a separate Unicode code
point.
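
To illustrate, here is a minimal sketch in Python rather than XSLT, chosen only because Python's standard unicodedata module exposes the Unicode decomposition data directly and can be run standalone. The helper name is made up for this example. It shows that "é" has a decomposition into a base letter plus a combining mark, whereas "ø", "ł" and "ß" have none, so there is nothing for a normalize-and-strip approach to remove.

```python
import unicodedata


def decomposition_of(ch: str) -> str:
    """Return the Unicode decomposition mapping of ch, or '' if it has none.

    (Hypothetical helper name; it only wraps unicodedata.decomposition.)
    """
    return unicodedata.decomposition(ch)


if __name__ == "__main__":
    # e-acute, o-slash, l-stroke, sharp s
    for ch in "\u00e9\u00f8\u0142\u00df":
        mapping = decomposition_of(ch) or "(none)"
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping}")
```

For U+00E9 the mapping is the pair U+0065 U+0301; for the other three the mapping is empty.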

(As to indexing according to letters dismembered into US-ASCII, I can't
imagine that this will be useful except for a very crude sorting criterion,
e.g., "On which shelf do we stash author X?")
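
A rough sketch of such a crude sort key, again in Python rather than XSLT so that it can be run on its own. It decomposes with NFKD, drops the combining marks (category Mn), and then falls back to an enumerated table for letters that do not decompose. The table is deliberately incomplete, and apart from "ß" to "ss" the replacements in it ("th" for thorn, "o" for "ø", and so on) are assumptions made for illustration, not an established transliteration standard.

```python
import unicodedata

# Hypothetical, incomplete fallback table for letters without a decomposition.
# Only the sharp-s rule comes from common practice; the rest are assumptions.
FALLBACK = {
    "\u00c6": "AE", "\u00e6": "ae",   # AE ligature
    "\u00d8": "O",  "\u00f8": "o",    # O with stroke
    "\u00de": "Th", "\u00fe": "th",   # thorn
    "\u00df": "ss",                   # sharp s
    "\u0141": "L",  "\u0142": "l",    # L with stroke
    "\u0152": "OE", "\u0153": "oe",   # OE ligature
}


def ascii_sort_key(name: str) -> str:
    """Build a crude ASCII-oriented key for shelving or sorting a name."""
    # Step 1: split accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Step 2: drop the combining marks (general category Mn).
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Step 3: replace the remaining enumerated letters one by one.
    return "".join(FALLBACK.get(c, c) for c in stripped)


if __name__ == "__main__":
    for sample in ["Jos\u00e9", "S\u00f8ren", "\u0141ukasiewicz", "Wei\u00df"]:
        print(sample, "->", ascii_sort_key(sample))
```

Any letter missing from the table passes through unchanged, so the result is only guaranteed to be ASCII for input the table actually covers.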

-W





On 24 October 2014 18:55, Michael Kay mike@xxxxxxxxxxxx <
xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

>
> What I'm unclear on is why the function is correctly converting "&#x00E9;"
> to "e", but not "&#xf8;" to "o".
>
>
> Because Unicode normalization into decomposed form does not split xf8 into
> an "o" and a "/" modifier. Don't ask me why, probably there were some
> voluble and well educated Swedes on the committee who insisted that xf8 was
> not a modified "o".
>
> Some of the characters below are ligatures, e.g. Æ and æ and Œ, some (like
> thorn) are first-class letters in their own right that just happen not to
> be used in English.
>
> If you only need to transliterate these characters, and not the whole of
> Cyrillic, Greek, Hebrew, etc, then I think you would be best off just
> enumerating them.
>
> Michael Kay
> Saxonica
>
>
> Is there a way to make this function convert all accented latin letters to
> plain ascii characters? We really need coverage for any letter that can
> appear in a European name, so this should also convert the numeric
> character reference for thorn (þ, &#xfe;) to one or more plain ascii
> characters, to cover authors from Iceland.
>
> I ran a broad test of all the accented latin letters most likely to occur
> in author names, and these 28 characters are the only ones that were not
> converted to plain ascii equivalents:
>
> &#xc6;    Æ
> &#xd0;    Ð
> &#xd8;    Ø
> &#xde;    Þ
> &#xdf;    ß
> &#xe6;    æ
> &#xf0;    ð
> &#xf8;    ø
> &#xfe;    þ
> &#x110;    Đ
> &#x111;    đ
> &#x126;    Ħ
> &#x127;    ħ
> &#x131;    ı
> &#x141;    Ł
> &#x142;    ł
> &#x14a;    Ŋ
> &#x14b;    ŋ
> &#x152;    Œ
> &#x153;    œ
> &#x166;    Ŧ
> &#x167;    ŧ
> &#x180;    ƀ
> &#x197;    Ɨ
> &#x1b5;    Ƶ
> &#x1b6;    ƶ
> &#x1e4;    Ǥ
> &#x1e5;    ǥ
>
> Is there a different set of flags for this function that will yield the
> result I'm looking for? If this function cannot do that, what is the best
> way to convert all of these outlying characters? I need this conversion to
> happen in only one element of my XML, not the entire XML document. I can't
> use translate because it's a one-to-one conversion that doesn't cover the
> ligatures listed above. If normalize-unicode cannot be made to cover all
> the characters listed above, can character-maps be applied that act
> specifically on only one element?
>
> Thanks,
> Joni
>
>
>
> On 10/24/14 9:11 AM, Eliot Kimber ekimber@xxxxxxxxxxxx wrote:
>
> I can't restrain my own pedantry: the correct term is "numeric character
> reference", not "numeric entity": http://www.w3.org/TR/REC-xml/#dt-charref
>
> Given that I think I'm the only person who ever uses the term correctly
> and consistently, we probably should have just used "numeric entity" but
> so it goes.
>
> Cheers,
>
> E.
> -----
> Eliot Kimber, Owner
> Contrext, LLC
> http://contrext.com
>
>
>
>
> On 10/23/14, 4:13 PM, "Graydon graydon@xxxxxxxxx"
> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>
>  On Thu, Oct 23, 2014 at 08:39:11PM -0000, Jonina Dames jdames@xxxxxxxxx
> scripsit:
>
>  Thanks for the advice! The
>
> <xsl:value-of select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')"/>
>
> function works for most of the entities, but it's still missing a
> couple dozen characters.
>
>  Terminology pedant time --
>
> &#x00e9; is a numeric entity and exactly the same thing as é just
> written differently.
>
> &eacute; is a named entity reference (which had better be defined
> somewhere)
>
> Either, as soon as the XML document is parsed, turns into U+00E9 in some
> internal representation and they're not different from each other or the
> representation for é if someone had typed that directly in the utf-8
> input file.
>
> So when you say "entity" here I'm getting the nervous feeling that I
> don't know what you mean; can you provide some examples?
>
>
>  Some of the author names still have unicode entities instead of plain
> ascii (for example, several characters with a stroke, several
> ligatures, thorn characters, upper and lowercase). Is there a
>
>  Well, examples would be good, but thorn, for example, &#x00FE; which is
> the self-same code point as þ, doesn't involve a modifier; it's one
> whole letter that doesn't exist inside ASCII.
>
> Stripping the modifiers -- which will give you e from é if you decompose
> é first, because then it's e + ˊ, which you could write &#x0065; +
> &#x0301; and it would be the same -- doesn't do anything because there
> is no modifier there, it's just the single code-point for thorn.
>
>
>  variation of this function or a parameter that will catch and convert
> ALL of these to plain ascii, as well as the standard acute and cedil
> characters? Or do I need to address these outlying characters with
> something else (not translate, since I can't use a one-to-one
> replacement for ligature entities)?
>
>  ASCII, strictly, is seven-bit; there are lots of things you can't
> represent in ASCII.  &#x00e9; *is not* ASCII just because those eight
> characters all happen to be ASCII characters.
>
> So it sounds like you're trying to (either) map U+00FE, þ, to &thorn; or
> something like that (which is not, I cannot stress too much, ASCII; it
> might be an ASCII representation of a non-ASCII code-point, but it's
> still a non-ASCII code-point) or have þ decompose into t+h or something
> of that ilk.  (Which is at least actually ASCII.)
>
> Either way you'd have to use character mappings for those; there aren't
> any modifiers to remove.
>
> Are you really compelled to deliver seven bit ASCII?
>
> And, please, some examples.
>
> -- Graydon
>
>
>
>
>
> --
> Jonina Dames
> Customer Support Specialist
> Inera Inc.
> +1 617 932 1932
> eXtyles on Twitter <https://twitter.com/extyles>
> jdames@xxxxxxxxx
>
> -----------------------------------------------------------------
> This email message and any attachments are confidential. If you are not
> the intended recipient, please immediately reply to the sender or call
> 617-932-1932 and delete the message from your email system. Thank you.
> -------------------------------------------------------------------
>    XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
> EasyUnsubscribe <http://-list/293509> (by email)
>
>