Re: [xsl] special character encoding, two problems

On Thu, Oct 23, 2014 at 08:39:11PM -0000, Jonina Dames jdames@xxxxxxxxx
scripsit:
> Thanks for the advice! The <xsl:value-of
> select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')"
> /> function works for most of the entities, but it's still missing a
> couple dozen characters. 

Terminology pedant time --

&#x00e9; is a numeric character reference and exactly the same thing as é,
just written differently.

&eacute; is a named entity reference (which had better be defined
somewhere)

Either, as soon as the XML document is parsed, turns into U+00E9 in some
internal representation and they're not different from each other or the
representation for é if someone had typed that directly in the UTF-8
input file.
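
A quick way to check this outside the stylesheet is sketched below in Python, using only the standard library's xml.etree.ElementTree. Python is used purely for illustration, and the internal DTD subset that declares eacute is an assumption added so that the named reference is defined somewhere.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same character in the source text.
numeric = '<name>&#x00e9;</name>'
literal = '<name>\u00e9</name>'
# A named reference must be declared; this internal subset is illustrative.
named = '<!DOCTYPE name [<!ENTITY eacute "&#x00e9;">]><name>&eacute;</name>'

texts = [ET.fromstring(doc).text for doc in (numeric, literal, named)]

# After parsing, each text node holds the single code point U+00E9.
for text in texts:
    print(repr(text), [f"U+{ord(ch):04X}" for ch in text])
```

Once the parser has run, nothing downstream (XSLT included) can tell which spelling was in the file.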

So when you say "entity" here I'm getting the nervous feeling that I
don't know what you mean; can you provide some examples?

> Some of the author names still have unicode entities instead of plain
> ascii (for example, several characters with a stroke, several
> ligatures, thorn characters, upper and lowercase). Is there a

Well, examples would be good, but thorn, for example, &#x00FE;, which is
the self-same code point as þ, doesn't involve a modifier; it's one
whole letter that doesn't exist inside ASCII.

Stripping the modifiers -- which will give you e from é if you decompose
é first, because then it's e + U+0301 (combining acute accent), which you
could write &#x0065; + &#x0301; and it would be the same -- doesn't do
anything for thorn because there is no modifier there; it's just the
single code point for thorn.
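
The same decompose, strip, recompose sequence can be tried outside XSLT. The sketch below uses Python's unicodedata module as a stand-in for normalize-unicode() and \p{Mn}; strip_marks is a hypothetical helper name, and Python's Unicode tables may differ in version from those of a given XSLT processor.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose (NFKD), drop nonspacing marks (category Mn), recompose (NFKC)."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFKC", without_marks)

# e-acute, thorn, o-with-stroke, ae ligature
for sample in ("\u00e9", "\u00fe", "\u00f8", "\u00e6"):
    print(f"U+{ord(sample):04X} -> {strip_marks(sample)!r}")
```

Only the first sample changes (to a plain e). The other three have no decomposition in Unicode, so there is no mark for the filter to remove.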

> variation of this function or a parameter that will catch and convert
> ALL of these to plain ascii, as well as the standard acute and cedil
> characters? Or do I need to address these outlying characters with
> something else (not translate, since I can't use a one-to-one
> replacement for ligature entities)?

ASCII, strictly, is seven-bit; there are lots of things you can't
represent in ASCII.  &#x00e9; *is not* ASCII just because those eight
characters all happen to be ASCII characters.

So it sounds like you're trying to (either) map U+00FE, þ, to &thorn; or
something like that (which is not, I cannot stress too much, ASCII; it
might be an ASCII representation of a non-ASCII code-point, but it's
still a non-ASCII code-point) or have þ decompose into t+h or something
of that ilk.  (Which is at least actually ASCII.)

Either way you'd have to use character mappings for those; there aren't
any modifiers to remove.
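
To make the idea concrete, here is a minimal sketch of an explicit mapping table, again in Python for illustration only. The table contents and the names FALLBACKS and to_ascii are hypothetical; which ASCII spelling is acceptable for each letter is an editorial decision, not something Unicode defines. In XSLT 2.0 the corresponding serialization feature is xsl:character-map.

```python
import unicodedata

# Hypothetical replacements for letters that have no decomposition.
FALLBACKS = str.maketrans({
    "\u00fe": "th",  # thorn
    "\u00de": "Th",  # capital thorn
    "\u00e6": "ae",  # ae ligature
    "\u00f8": "o",   # o with stroke
    "\u0142": "l",   # l with stroke
})

def to_ascii(text: str) -> str:
    """Apply the explicit mappings, then strip combining marks."""
    mapped = text.translate(FALLBACKS)
    decomposed = unicodedata.normalize("NFKD", mapped)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Raises UnicodeEncodeError if any non-ASCII character is left over,
    # so an unmapped letter is reported rather than silently passed on.
    return stripped.encode("ascii").decode("ascii")

print(to_ascii("\u00deorvaldur J\u00f8rgensen"))
print(to_ascii("Andr\u00e9"))
```

Failing loudly on anything the table misses is a deliberate choice here: it produces the list of outlying characters that still need a decision.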

Are you really compelled to deliver seven bit ASCII?

And, please, some examples.

-- Graydon

Current Thread

Re: [xsl] special character encoding, two problems, (continued)
- Jonina Dames jdames@xxxxxxxxx - 23 Oct 2014 20:39:00 -0000
  - Graydon graydon@xxxxxxxxx - 23 Oct 2014 21:13:45 -0000 <=
    - Eliot Kimber ekimber@xxxxxxxxxxxx - 24 Oct 2014 13:11:37 -0000
    - Jonina Dames jdames@xxxxxxxxx - 24 Oct 2014 16:27:05 -0000
    - Michael Kay mike@xxxxxxxxxxxx - 24 Oct 2014 16:54:33 -0000
    - Jonina Dames jdames@xxxxxxxxx - 24 Oct 2014 17:10:43 -0000
