RE: [xsl] Safe-guarding codepoints-to-string() from wrong input

Subject: RE: [xsl] Safe-guarding codepoints-to-string() from wrong input
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Wed, 20 Dec 2006 15:19:54 -0000

There's no obvious way of doing this within the language, other than
defining a function that knows which codepoints are valid characters.
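
As a rough sketch of such a function (not part of the original message): the
stylesheet below tests a codepoint against the Char production of XML 1.0,
that is #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
The f: namespace URI, the function name f:is-xml10-char, the template name
"main" and the default value of $input are all made up for illustration, and
passing an invalid [n] through unchanged is only one possible fallback.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/functions"
    exclude-result-prefixes="xs f">

  <xsl:output method="text"/>

  <!-- Illustrative default; supply the real string as a stylesheet parameter -->
  <xsl:param name="input" as="xs:string"
             select="'some [34]quoted[34] string'"/>

  <!-- Hypothetical helper: true if $cp is a legal XML 1.0 character.
       Decimal bounds: 55295 = #xD7FF, 57344 = #xE000,
       65533 = #xFFFD, 65536 = #x10000, 1114111 = #x10FFFF -->
  <xsl:function name="f:is-xml10-char" as="xs:boolean">
    <xsl:param name="cp" as="xs:integer"/>
    <xsl:sequence select="
        $cp = (9, 10, 13)
        or ($cp ge 32 and $cp le 55295)
        or ($cp ge 57344 and $cp le 65533)
        or ($cp ge 65536 and $cp le 1114111)"/>
  </xsl:function>

  <xsl:template name="main">
    <xsl:analyze-string select="$input" regex="\[(\d+)\]">
      <xsl:matching-substring>
        <xsl:variable name="cp" as="xs:integer"
                      select="xs:integer(regex-group(1))"/>
        <!-- Assumed fallback: an invalid codepoint leaves the matched
             text as it was, so codepoints-to-string() is never called
             with a value it would reject -->
        <xsl:value-of select="if (f:is-xml10-char($cp))
                              then codepoints-to-string($cp)
                              else ."/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

Run with "main" as the initial template, the default input should come out as
some "quoted" string, while something like [1] or [1234567] stays as it is.
For XML 1.1 the ranges would need adjusting.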

In Saxon, there's an internal method which should be easy enough to call as
an extension function:

<xsl:if test="nc:isXML11Valid($codepoint)"
xmlns:nc="java:net.sf.saxon.om.XML11Char">

or

<xsl:if test="nc:isXML10Valid($codepoint)"
xmlns:nc="java:net.sf.saxon.om.XML10Char">

depending on which version of XML you are using.

You could of course run this on all the possible codepoints to generate a
lookup file: you'll want to use keys to make the lookup efficient.
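
A minimal sketch of the key-based lookup, assuming the generated file is
called valid-codepoints.xml and holds one <c n="..."/> element per valid
codepoint (the file name, element name, attribute name, key name and f:
namespace are all assumptions, not anything prescribed above):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/functions"
    exclude-result-prefixes="xs f">

  <!-- Assumed lookup document, shaped like:
       <codepoints><c n="9"/><c n="10"/><c n="13"/><c n="32"/> ... </codepoints> -->
  <xsl:variable name="lookup" select="document('valid-codepoints.xml')"/>

  <!-- Index the entries by integer value; the cast matters because an
       untyped attribute value would not compare equal to an xs:integer -->
  <xsl:key name="valid-cp" match="c" use="xs:integer(@n)"/>

  <!-- Hypothetical helper: true if $cp is listed in the lookup document -->
  <xsl:function name="f:is-listed" as="xs:boolean">
    <xsl:param name="cp" as="xs:integer"/>
    <!-- The third argument of key() restricts the search to the lookup document -->
    <xsl:sequence select="exists(key('valid-cp', $cp, $lookup))"/>
  </xsl:function>
</xsl:stylesheet>
```

The test then becomes <xsl:if test="f:is-listed($codepoint)">, and the index
is built once rather than scanning the file for every match.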

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: Abel Braaksma [mailto:abel.online@xxxxxxxxx] 
> Sent: 20 December 2006 14:34
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] Safe-guarding codepoints-to-string() from wrong input
> 
> Hi all,
> 
> In a translation stylesheet, I take user input (an arbitrary
> string) and replace a set of numbers with a set of characters,
> like this:
> 
> $input = "some [34]quoted[34] string"
> output --> some "quoted" string
> 
> <xsl:analyze-string select="$input" regex="\[(\d+)\]">
>     <xsl:matching-substring>
>         <xsl:value-of
> select="codepoints-to-string(xs:integer(regex-group(1)))" />
>     </xsl:matching-substring>
>     <xsl:non-matching-substring>
>         <xsl:value-of select="." />
>     </xsl:non-matching-substring>
> </xsl:analyze-string>
> 
> Because we are talking about tons of data containing strings
> like the above (in text files), I'd like to make the
> codepoints-to-string() call a bit more rock-solid. In normal
> operation, it fails hard. But I'd like it to degrade
> gracefully: be liberal in what you accept.
> 
> I know that control characters are not allowed and throw an
> "Invalid XML character" error. Also, very large numbers
> (like "1234567") give a plural form of the same error (I'm
> not sure why). Some characters (I believe these are the ones
> that are not assigned in Unicode) result in an empty string
> (like "12345").
> 
> Is there a robust way of allowing/disallowing a set of 
> codepoints (other than making one huge lookup list)?
> 
> Cheers,
> Abel
