Re: [xsl] recognize character entities

Subject: Re: [xsl] recognize character entities
From: Abel Online <abel.online@xxxxxxxxx>
Date: Wed, 30 Aug 2006 11:41:31 +0200
Florent Georges wrote:
<xsl:variable name="entity.values"
select="('&#65533;...', '&#65533;...', ...)"/>
Perhaps it is easier, if I may suggest so, to use regular expressions. I think they would require a lot less work to create, because the character entities used for MathML often fall inside ranges. Looking around at the entity tables on http://www.w3.org/TR/2003/REC-MathML2-20031021/chapter6.html#chars.entity.tables, I found that most sets are more or less complete parts of the Unicode 4.0 specification.

For instance, almost all characters in the range 0x2200 - 0x22FF are included (the Mathematical Operators block in Unicode). The regular expression for this is [&#x2200;-&#x22FF;] (XPath 2.0 regular expressions have no \x escape, so the code points are written as character references in the stylesheet). I'm not sure whether every processor digs this too, but mathematical symbols ought to be matched with the simple expression \p{Sm}.

Similar constructs are available for Greek and Cyrillic: \p{IsGreek} and \p{IsCyrillic}.
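
A minimal sketch of how such an expression could be used in an XSLT 2.0 stylesheet follows. The variable name math.chars and the sym wrapper element are made up for illustration, and the combination of classes is only an assumption about what you would want to match. Keep in mind that lowercase \p{...} matches a category or block, while uppercase \P{...} matches everything outside it.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical pattern: math symbols (category Sm), the Mathematical
       Operators block written as a code point range, Greek and Cyrillic. -->
  <xsl:variable name="math.chars"
      select="'[\p{Sm}&#x2200;-&#x22FF;\p{IsGreek}\p{IsCyrillic}]'"/>

  <!-- Wrap every matching character in a (made-up) sym element;
       all other text is copied through unchanged. -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$math.chars}">
      <xsl:matching-substring>
        <sym><xsl:value-of select="."/></sym>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

The pattern is kept in a variable because the regex attribute of xsl:analyze-string is an attribute value template: written inline there, the curly braces would have to be doubled, as in \p{{Sm}}.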

Some ranges may be too wide, but there is probably little chance that your code contains symbols that are available in Unicode but not used by MathML.

Some characters are specified by MathML with a combining diacritical mark. I think you will have to list them separately in your regular expression. The same is true for the "normal" Latin-1 characters that are part of MathML, like &amp;, &aacute;, &Acirc; etc.

Using this approach you do not have to wonder whether a character entity is written using its numeric equivalent, the hexadecimal notation or the named notation.
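
To illustrate why the notation does not matter: the XML parser resolves all three forms to the same character before the stylesheet ever sees the text. The tiny input below is a made-up example, with the forall entity declared in an internal DTD subset as an assumption (in a real MathML document the entity would come from the MathML DTD).

```xml
<!DOCTYPE p [
  <!-- Hypothetical declaration, only here to make the sample self-contained. -->
  <!ENTITY forall "&#x2200;">
]>
<!-- Decimal, hexadecimal and named notation of U+2200 (FOR ALL).
     For the text of this element, matches(., '^\p{Sm} \p{Sm} \p{Sm}$')
     should hold, since each form arrives as the same character. -->
<p>&#8704; &#x2200; &forall;</p>
```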

Of course, it will take a few hours to construct your regex, but I think it will be much easier to maintain than a list of all entity values. And, I forgot to say, you can only use this with XSLT 2.0 capable processors.

Hope this helps,

Cheers,
Abel Braaksma
http://abelleba.metacarpus.com
