Subject: RE: [xsl] Safe-guarding codepoints-to-string() from wrong input
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Wed, 20 Dec 2006 15:19:54 -0000
There's no obvious way of doing this within the language, other than
defining a function that knows which codepoints are valid characters.

In Saxon, there's an internal method which should be easy enough to call
as an extension function:

  <xsl:if test="nc:isXML11Valid($codepoint)"
          xmlns:nc="java:net.sf.saxon.om.XML11Char">

or

  <xsl:if test="nc:isXML10Valid($codepoint)"
          xmlns:nc="java:net.sf.saxon.om.XML10Char">

depending on which version of XML you are using.

You could of course run this on all the possible codepoints to generate a
lookup file: you'll want to use keys to make the lookup efficient.

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: Abel Braaksma [mailto:abel.online@xxxxxxxxx]
> Sent: 20 December 2006 14:34
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] Safe-guarding codepoints-to-string() from wrong input
>
> Hi all,
>
> In some translation stylesheet, I take user input (an arbitrary
> string) and replace a set of numbers with a set of characters,
> like this:
>
> $input = "some [34]quoted[34] string"
> output --> some "quoted" string
>
> <xsl:analyze-string select="$input" regex="\[(\d+)\]">
>   <xsl:matching-substring>
>     <xsl:value-of
>       select="codepoints-to-string(xs:integer(regex-group(1)))" />
>   </xsl:matching-substring>
>   <xsl:non-matching-substring>
>     <xsl:value-of select="." />
>   </xsl:non-matching-substring>
> </xsl:analyze-string>
>
> Because we are talking about tons of data containing strings like
> the above (in text files), I'd like to make the
> codepoints-to-string() call a bit more rock-solid. In normal
> operation, it fails hard. But I'd like it to degrade gracefully:
> be liberal in what you accept.
>
> I know that control characters are not allowed and throw an
> "Invalid XML character" error. Also, very large numbers
> (like "1234567") give a plural of the same error (I'm
> not sure why). Some characters (I believe these are the ones
> that are not assigned in Unicode) result in an empty string
> (like "12345").
>
> Is there a robust way of allowing/disallowing a set of
> codepoints (other than making one huge lookup list)?
>
> Cheers,
> Abel
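
As an illustration of the "function that knows which codepoints are valid characters" mentioned in the reply, here is a minimal pure XSLT 2.0 sketch. It is not taken from the thread: the function name f:is-xml10-char, the namespace urn:example:local-functions, the named template main, and the choice to copy an invalid "[n]" token through unchanged are all assumptions made for the example. The ranges follow the Char production of XML 1.0, so the test only says whether a codepoint is a legal XML 1.0 character, not whether it is assigned in Unicode.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:local-functions"
    exclude-result-prefixes="xs f">

  <xsl:output method="text"/>

  <!-- Sample input from the original question; override as needed. -->
  <xsl:param name="input" as="xs:string"
             select="'some [34]quoted[34] string'"/>

  <!-- Hypothetical helper: true when $cp matches the XML 1.0 Char production:
       #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] -->
  <xsl:function name="f:is-xml10-char" as="xs:boolean">
    <xsl:param name="cp" as="xs:integer"/>
    <xsl:sequence select="$cp = (9, 10, 13)
                          or ($cp ge 32 and $cp le 55295)
                          or ($cp ge 57344 and $cp le 65533)
                          or ($cp ge 65536 and $cp le 1114111)"/>
  </xsl:function>

  <!-- Entry point: run with the named template 'main' as the initial template. -->
  <xsl:template name="main">
    <xsl:analyze-string select="$input" regex="\[(\d+)\]">
      <xsl:matching-substring>
        <xsl:variable name="cp" as="xs:integer"
                      select="xs:integer(regex-group(1))"/>
        <xsl:choose>
          <!-- Convert only codepoints that are legal XML 1.0 characters. -->
          <xsl:when test="f:is-xml10-char($cp)">
            <xsl:value-of select="codepoints-to-string($cp)"/>
          </xsl:when>
          <!-- Assumed fallback: leave the original "[n]" token in place. -->
          <xsl:otherwise>
            <xsl:value-of select="."/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With the default parameter value this should produce: some "quoted" string. A token such as [1234567] or [7] fails the range test and is passed through as-is rather than raising an "Invalid XML character" error. For XML 1.1 the ranges would differ, and the Saxon extension functions shown in the reply remain the alternative if a processor-specific solution is acceptable.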