Re: [xsl] Flattening characters to plain latin

Subject: Re: [xsl] Flattening characters to plain latin
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Fri, 16 Feb 2007 16:06:13 +0100
Colin Paul Adams wrote:
but I was under the impression that codepoint 127 was
not part of Latin-1.

Since it's part of ASCII, it's also part of Latin-1.


Agreed (I admit, I had to look it up). But many ASCII codepoints are control characters, and most of those are not allowed in XML 1.0. &#127; (DEL) is allowed in XML 1.0, though. But (!), since the OP was talking about characters, more precisely "latin script unicode characters", I reckon control characters are not what was meant. Only the OP can answer that with certainty, of course.
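
As a rough cross-check outside XSLT, here is a minimal Python sketch of that rule. It assumes the Char production of the XML 1.0 Recommendation (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]); the name is_xml10_char is made up for this illustration.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production.

    Hypothetical helper for illustration; the ranges are transcribed from
    the Char production in the XML 1.0 Recommendation.
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# ASCII code points that XML 1.0 does not allow at all
excluded = [cp for cp in range(128) if not is_xml10_char(cp)]

print(is_xml10_char(127))  # DEL
print(is_xml10_char(1))    # the codepoint behind &#01;
print(len(excluded))
```

Under that production, DEL passes while every C0 control except tab, line feed and carriage return is rejected.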

If you go by the code chart at http://www.unicode.org/charts/PDF/U0000.pdf, it is not obvious that DEL belongs under the term Basic Latin (the caption above that char says: "Control Character"). To make matters worse, \p{IsBasicLatin} does include all characters of that chart, control characters and all. So much for terminology.
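
The gap between "block" and "kind of character" can be made concrete with a small Python sketch. It assumes the Basic Latin block is U+0000..U+007F (as in the chart above) and uses the standard unicodedata module for the general category; the names BASIC_LATIN and in_basic_latin are made up here. Python's re module has no \p{IsBasicLatin}, so block membership is tested by range instead.

```python
import unicodedata

# Assumption: the Basic Latin block spans U+0000..U+007F, as in the chart.
BASIC_LATIN = range(0x00, 0x80)


def in_basic_latin(ch: str) -> bool:
    """Hypothetical stand-in for the \\p{IsBasicLatin} block test."""
    return ord(ch) in BASIC_LATIN


# Code points of the block whose general category is Cc (control)
controls = [cp for cp in BASIC_LATIN if unicodedata.category(chr(cp)) == "Cc"]

print(in_basic_latin("\x7f"))         # DEL is inside the block ...
print(unicodedata.category("\x7f"))   # ... yet its category is Cc
print(len(controls))
```

So block membership says nothing about whether a character is a letter: a test for "Latin letters" would have to look at the category (or script) as well, not only at the block.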

This brings up an XSLT question regarding this:

matches(codepoints-to-string(01), '\p{IsBasicLatin}')

This code raises an error (FOCH0001: codepoints-to-string rejects codepoint 1 because it is not a legal XML character). However, since &#01; is never literally in the document, should this really raise an error? The string is never serialized, so the expression could just as well return "true". Or would that violate the principle that characters must be allowed by the XML version at all times?
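
For contrast, this is what happens when the character reference does appear literally in a document. It is only a sketch using Python's standard xml.etree.ElementTree (an expat-based XML 1.0 parser), not an XSLT processor, so it says nothing about how codepoints-to-string ought to behave; the helper name parses is made up.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Hypothetical helper: True if doc is accepted by the XML 1.0 parser."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


print(parses("<a>&#127;</a>"))  # reference to DEL
print(parses("<a>&#01;</a>"))   # reference to codepoint 1
```

The parser accepts the reference to DEL and rejects the reference to codepoint 1 as not well-formed, which is the in-document counterpart of the error above.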

My verdict: If the 'lt' of Michael was on purpose, I still want to grant him the "Best Original Software Snippet Based On Any XXX* Language" ;-)

* XQuery, XSLT, XPath

Cheers!
