Re: [xsl] character map "range" in XSLT

Subject: Re: [xsl] character map "range" in XSLT
From: "G. Ken Holman" <gkholman@xxxxxxxxxxxxxxxxxxxx>
Date: Wed, 12 May 2010 14:18:20 -0400
At 2010-05-12 13:49 -0400, David wrote:
I'm writing an XSLT that has to translate XML to plain ASCII text. The XML contains Unicode characters, possibly any of them. I cannot control the authoring so I must handle whatever is thrown at me.

I have a few dozen specially known character translations for things like the Unicode 1/4 and degree symbols.
But I have a need to "catch all" characters that are not mapped explicitly (rather than map explicitly the entire Unicode set) and translate them into something like "<UNKNOWN CHARACTER>"


Any suggestions on how to do this? I could trivially write a post-processor to do this (maybe a dozen lines of C or Java) but if there's a feature directly in XSLT I'd love to try that.

Any ideas welcome!

You could try a general match on all text nodes and then use Unicode code points to accept only ASCII text between code points 32 and 126 (or 127, depending on your need). I've included the code point in the replacement text as a diagnostic, since that might help the reader:


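  <!-- Requires XSLT 2.0: string-to-codepoints() and if/then/else
       are XPath 2.0 features. -->
  <xsl:template match="text()">
    <!-- Visit each character of the text node as an integer code point. -->
    <xsl:for-each select="string-to-codepoints(.)">
      <!-- Note: tab (9), LF (10) and CR (13) fall outside 32..127 and are
           reported as unknown too; 127 is DEL, use 126 to exclude it. -->
      <xsl:value-of select="if ( . ge 32 and . le 127 )
                            then codepoints-to-string(.)
                            else concat('&lt;UNKNOWN CHARACTER-',.,'>')"/>
    </xsl:for-each>
  </xsl:template>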
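
As a rough sketch only, the few dozen explicit translations mentioned in the question might be folded into the same loop with xsl:choose. The two mappings shown (code point 188 for the one-quarter sign, 176 for the degree sign), their replacement strings, and the decision to let tab, LF and CR pass through are illustrative assumptions, not something taken from the original stylesheet:

  <xsl:template match="text()">
    <xsl:for-each select="string-to-codepoints(.)">
      <xsl:choose>
        <!-- Assumption: keep tab/LF/CR as well as printable ASCII. -->
        <xsl:when test=". = (9, 10, 13) or (. ge 32 and . le 126)">
          <xsl:value-of select="codepoints-to-string(.)"/>
        </xsl:when>
        <!-- Hypothetical explicit translations; add one xsl:when per
             known character. -->
        <xsl:when test=". eq 188">1/4</xsl:when>
        <xsl:when test=". eq 176"> degrees</xsl:when>
        <!-- Catch-all for every code point not handled above. -->
        <xsl:otherwise>
          <xsl:value-of select="concat('&lt;UNKNOWN CHARACTER-',.,'>')"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>

Because the explicit cases are tested before xsl:otherwise, only characters with no known translation reach the catch-all.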

It could be slow, but I think it will be faster than using substring().
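
If speed turns out to matter, a different technique that may be worth measuring is xsl:analyze-string, which lets the regular expression engine find the out-of-range characters instead of visiting every code point. This is only a sketch under the same 32 to 126 assumption, and whether it is actually faster will depend on the processor:

  <xsl:template match="text()">
    <!-- The regex matches one character at a time outside the range
         space (x20) through tilde (x7E). -->
    <xsl:analyze-string select="." regex="[^&#x20;-&#x7E;]">
      <xsl:matching-substring>
        <xsl:value-of select="concat('&lt;UNKNOWN CHARACTER-',
                                     string-to-codepoints(.),'>')"/>
      </xsl:matching-substring>
      <!-- Runs of acceptable ASCII are copied through unchanged. -->
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>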

Remember that there is an ISO DSDL standard for validating exactly this: the use of Unicode characters in an XML document. It is called CREPDL, for "Character Repertoire Description Language":

 http://www.iso.org/iso/catalogue_detail.htm?csnumber=51085
 http://www.asahi-net.or.jp/~eb2m-mrt/crepdl/ns/structure/1.0/index.xml
 http://www.assembla.com/spaces/CrepdlValidatorInFsharp

I understand you are implementing a transformation and character-level validation doesn't apply, but since you have such a requirement for using only a subset of characters, there may be a role for CREPDL in your information/validation flow in addition to what you are asking for in this post.

I hope this helps.

. . . . . . . . . . . Ken

--
XSLT/XQuery training:   after http://XMLPrague.cz 2011-03-28/04-01
Vote for your XML training:   http://www.CraneSoftwrights.com/s/i/
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/s/
G. Ken Holman                 mailto:gkholman@xxxxxxxxxxxxxxxxxxxx
Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/s/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal
