Re: [xsl] Safe-guarding codepoints-to-string() from wrong input

Subject: Re: [xsl] Safe-guarding codepoints-to-string() from wrong input
From: Florent Georges <darkman_spam@xxxxxxxx>
Date: Wed, 20 Dec 2006 16:16:26 +0100 (CET)
Abel Braaksma wrote:

  Hi

> I know that control characters are not allowed and throw
> an "Invalid XML character" error. Also, adding very
> wide numbers (like "1234567") gives a plural of the same
> error (I'm not sure why). Some characters (I believe these
> are the ones that are not assigned in Unicode) result in
> an empty string (like "12345").

> Is there a robust way of allowing/disallowing a set of
> codepoints (other than making one huge lookup list)?

  Technically, it is not complex.  Just define a function
my:codepoints-to-string() that makes the needed checks and
does what you want when encountering an invalid codepoint.  I
think the most difficult part is identifying which
codepoints are valid.  You can use the following from the
XML recommendation as a starting point:

    /* any Unicode character, excluding the surrogate
       blocks, FFFE, and FFFF. */
    [2] Char ::= #x9
                 | #xA
                 | #xD
                 | [#x20-#xD7FF]
                 | [#xE000-#xFFFD]
                 | [#x10000-#x10FFFF]

    Document authors are encouraged to avoid "compatibility
    characters", as defined in section 6.8 of [Unicode] (see
    also D21 in section 3.6 of [Unicode3]). The characters
    defined in the following ranges are also
    discouraged. They are either control characters or
    permanently undefined Unicode characters:

    [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
    [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
    [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
    [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
    [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
    [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
    [#x10FFFE-#x10FFFF].
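
  If you also want to reject the discouraged characters, the
ranges quoted above can be turned into a test.  This is only a
sketch: the name my:is-discouraged-codepoint is made up, the
prefixes xs and my are assumed to be bound in the stylesheet,
and the ranges are taken as quoted (decimal values).  The last
clause relies on the fact that the ranges above #xFFFF are the
last two codepoints of each plane:

    <xsl:function name="my:is-discouraged-codepoint" as="xs:boolean">
      <xsl:param name="cp" as="xs:integer"/>
      <!-- #x7F-#x84, #x86-#x9F, #xFDD0-#xFDDF, then #xnFFFE-#xnFFFF
           for planes 1 to 16 -->
      <xsl:sequence select="
          ($cp ge   127 and $cp le   132)
            or ($cp ge   134 and $cp le   159)
            or ($cp ge 64976 and $cp le 64991)
            or ($cp ge 65536 and $cp le 1114111
                  and $cp mod 65536 ge 65534)"/>
    </xsl:function>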

  Once you have identified the valid codepoints, you
will have to choose what to do with the invalid ones.
For example, call codepoints-to-string() for valid
codepoints, and return the empty sequence or the empty
string for invalid ones:

    <xsl:function name="my:is-in-range" as="xs:boolean">
      <xsl:param name="value" as="xs:integer"/>
      <xsl:param name="down"  as="xs:integer"/>
      <xsl:param name="up"    as="xs:integer"/>
      <xsl:sequence select="$value ge $down and $value le $up"/>
    </xsl:function>

    <xsl:function name="my:is-valid-codepoint" as="xs:boolean">
      <xsl:param name="cp" as="xs:integer"/>
      <xsl:sequence select="
          $cp = (9, 10, 13)
            or my:is-in-range($cp,    32,   55295)
            or my:is-in-range($cp, 57344,   65533)
            or my:is-in-range($cp, 65536, 1114111)"/>
    </xsl:function>

    <xsl:function name="my:codepoint-to-string" as="xs:string?">
      <xsl:param name="cp" as="xs:integer"/>
      <xsl:if test="my:is-valid-codepoint($cp)">
        <xsl:sequence select="codepoints-to-string($cp)"/>
      </xsl:if>
    </xsl:function>

or instead the following, depending on your needs:

    <xsl:function name="my:codepoints-to-string" as="xs:string">
      <xsl:param name="cp" as="xs:integer*"/>
      <xsl:sequence select="
          codepoints-to-string($cp[my:is-valid-codepoint(.)])"/>
    </xsl:function>
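
  If you would rather keep a visible marker than silently drop
the invalid codepoints, a third variant is possible.  Again
only a sketch: the name my:codepoints-to-string-lenient is made
up, it relies on my:is-valid-codepoint() above, and using
U+FFFD (65533, REPLACEMENT CHARACTER) as the substitute is my
assumption, not something the spec requires:

    <xsl:function name="my:codepoints-to-string-lenient" as="xs:string">
      <xsl:param name="cp" as="xs:integer*"/>
      <!-- replace each invalid codepoint by U+FFFD (65533) -->
      <xsl:sequence select="
          codepoints-to-string(
            for $c in $cp return
              if ( my:is-valid-codepoint($c) ) then $c else 65533)"/>
    </xsl:function>

  For the input (72, 105, 0, 33), my:codepoints-to-string()
drops the 0 and should give 'Hi!', while the lenient variant
should give 'Hi', then U+FFFD, then '!'.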

  Regards,

--drkm