RE: [xsl] Generating numeric character references

Subject: RE: [xsl] Generating numeric character references
From: "Andrew Welch" <AWelch@xxxxxxxxxxxxxxx>
Date: Thu, 16 Jan 2003 09:44:37 -0000
I think the original poster had a problem of double escaping, such as

&amp;#173;

in their source, and they simply wanted to convert this to the correct &#173;

Wouldn't running the source XML through an identity transform give the desired result, with no need for string processing of any kind?

cheers
andrew


> -----Original Message-----
> From: Wendell Piez [mailto:wapiez@xxxxxxxxxxxxxxxx]
> Sent: 14 January 2003 21:55
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [xsl] Generating numeric character references
> 
> 
> Stuart,
> 
> The reason your task is proving difficult is that it's really not what
> it appears to be at first blush. There is a trap here, which you can
> recognize if you can clearly distinguish between XML-as-serialization
> format, and the XML document (a tree of nodes as described in the XPath
> spec) that an XSLT processor operates on.
> 
> Numeric character references may appear in XML-as-serialization; in the
> XPath tree (the "document" built by the parser and handed to the XSLT
> engine), however, these references never appear as such; rather, each
> has been converted into the character it represents.
> 
> So, for example, if your data has character reference &#x41;, your XSLT
> processor sees this as an "A". (It may come out the back as "&#x41;" if
> your serialization encoding happens not to be able to do a proper "A",
> but internally it's an "A"). Therefore, what's required with
> "&amp;#x41;" isn't to turn it into "&#x41;", but rather into "A". (Or,
> if you get my drift: you need to convert "&amp;#x41;" into "&#x41;"
> *before* your document is parsed, or an "&#x41;" into an "A" *after*
> your document is parsed.)
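
The parse-time behaviour described above can be checked outside XSLT. This is
a minimal sketch, assuming Python's standard xml.etree.ElementTree parser as a
stand-in for whatever parser feeds the XSLT engine; the one-element sample
documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A genuine character reference is resolved by the parser:
# the tree holds the character itself, not the reference.
single = ET.fromstring("<p>&#x41;</p>")
print(repr(single.text))

# A double-escaped reference: only "&amp;" is resolved, so the tree
# holds the six literal characters & # x 4 1 ;
double = ET.fromstring("<p>&amp;#x41;</p>")
print(repr(double.text))

# Writing the tree back out escapes the ampersand again.
round_trip = ET.tostring(double, encoding="unicode")
print(round_trip)
```

With this serializer, at least, a plain parse-and-write round trip puts the
"&amp;" back, so the double escaping survives unless something intervenes
before parsing or during serialization.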
> 
> You are currently trying to do the latter; and it can be done -- as
> you're discovering -- with recursive processing over text nodes,
> heuristics to recognize target substrings, and a table to map them. But
> it's not a job that XSLT lends itself to, since XSLT is as ungainly for
> processing strings as it is slick for processing nodes. Far preferable
> would be to use Perl or something else with good support for
> string handling and regular expressions, to do the former task (munge
> the &amp; entities before parsing).
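
The pre-parse route can be sketched with a regular expression. The example
below uses Python rather than Perl; the function name restore_num_char_refs
(borrowed from the template in the original question) and the sample string
are hypothetical. It assumes every "&amp;#NNN;" in the raw text is damage to
be undone, which will not hold for text that was meant to show a reference
literally or for content inside comments and CDATA sections.

```python
import re
import xml.etree.ElementTree as ET

# Matches double-escaped numeric references such as &amp;#173; or &amp;#xAD;
# (XML allows only a lowercase "x" in hexadecimal references).
DOUBLE_ESCAPED = re.compile(r"&amp;(#[0-9]+|#x[0-9A-Fa-f]+);")


def restore_num_char_refs(raw_xml):
    """Turn &amp;#NNN; back into &#NNN; in unparsed XML text."""
    return DOUBLE_ESCAPED.sub(r"&\1;", raw_xml)


raw = "<p>co&amp;#173;operate if x &lt; y &amp;amp; z</p>"
fixed = restore_num_char_refs(raw)
print(fixed)

# Only after this step is the document parsed; the reference now
# resolves to the soft hyphen (U+00AD) it was meant to be.
doc = ET.fromstring(fixed)
print(repr(doc.text))
```

Because the substitution touches only the "&amp;#...;" pattern, other escapes
such as "&lt;" and "&amp;amp;" pass through untouched and the result stays
well-formed, provided the restored numbers name legal XML characters.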
> 
> Yet -- and this is where one has to be *very* cautious -- XSLT does, at
> least in certain circumstances (i.e. with certain processors in certain
> operational contexts) give you *some* control over how a document, once
> processed, is serialized -- and *if your data is clean* this optional
> feature can be drafted into service to help with your problem. What I'm
> getting to, of course, is the dreaded disable-output-escaping....
> 
> That is, if your data is otherwise unproblematic, you can achieve your
> goal by running your document through a near-identity transform that
> disables output escaping on your text nodes. The document will emerge
> from the transform unchanged (at least as XPath sees it) but with
> "&amp;#x41;" represented as "&#x41;". This, *when parsed again*, will be
> seen as the "A" you really want.
> 
> Note that this is not (if we're really strict with our terms) a
> transformation in the XSLT sense. Rather, it's a tricky application of
> the serializer attached to most processors. It will sometimes break
> because it disables escaping on the wrong characters (so if you have any
> data such as "if x &lt; y", you're going to be in trouble unless you
> write string-processing code to catch and work around it), and it uses
> an optional feature of the language that restricts portability.
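
The failure mode with "if x &lt; y" can be imitated without an XSLT
processor. The sketch below is not disable-output-escaping itself: it is a
Python stand-in that parses a one-element document and writes its text back
raw, which is roughly what a serializer does once escaping is switched off.
The helper names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def emit_without_escaping(source):
    """Parse a one-element document and write its text back unescaped.

    Illustration only: handles a single element with text and no children.
    """
    elem = ET.fromstring(source)
    return "<{0}>{1}</{0}>".format(elem.tag, elem.text or "")


# Clean data: the only ampersand belongs to the damaged reference.
clean = emit_without_escaping("<p>soft hyphen: &amp;#173;</p>")

# Data with a legitimately escaped "<": written raw, it becomes markup.
dirty = emit_without_escaping("<p>if x &lt; y then &amp;#173;</p>")

print(clean, is_well_formed(clean))
print(dirty, is_well_formed(dirty))
```

The first result reparses to the intended soft hyphen; the second contains a
bare "<" and is rejected by the parser, which is the thumb-banging case
described in the next paragraph.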
> 
> Please consider this only a golden-hammer solution (i.e. lacking a
> better tool to do the job), and keep in mind it's easy to bang your
> thumb this way (since any anomalies in the input will make your output
> not well-formed). It is in view of these limitations that this really
> should be done in a separate pass, if with XSLT at all.
> 
> Cheers,
> Wendell
> 
>   At 03:05 PM 1/14/2003, you wrote:
> >I'd like to transform specific text substrings into numeric character
> >references during an XSLT transformation. For example, I want to
> >transform all occurrences that look like "&amp;#173;" within a string
> >into "&#173;".
> >
> >Here's the back story. I have source XML that is generated automatically
> >from HTML by a third party. The third party incorrectly handles entity
> >references, so that "&#173;" in the original HTML becomes
> >"&amp;#173;" in the XML. I want to repair the damage done. To simplify
> >things, I am only interested in documents with ISO 8859-1 encoding.
> >
> >Below is a solution [1] that I am not pleased with. It is a named
> >template that recursively parses a string, trying to replace references.
> >This requires an <xsl:when> element for each value of numeric character
> >reference that might be encountered (see the "additional cases here"
> >comment). Problems with this include linear search of values, omitted
> >values, and opportunity for error in mismatched values.
> >
> >Can anyone suggest a better approach to generating numeric character
> >references? I would be fine restricting the solution to MSXML or
> >.NET's System.Xml.Xsl XSLT processors, if that is an issue.
> >
> >Many thanks!
> >
> >Cheers,
> >Stuart
> >
> >
> >
> >[1] A less than happy solution:
> >
> >   <xsl:template name="restoreNumCharRefs">
> >     <xsl:param name="string"/>
> >
> >     <xsl:choose>
> >       <xsl:when test="contains($string, '&amp;')">
> >         <!-- Split the string around the first ampersand. -->
> >         <xsl:variable name="head"
> >                       select="substring-before($string, '&amp;')"/>
> >         <xsl:variable name="remainder"
> >                       select="substring-after($string, '&amp;')"/>
> >         <xsl:variable name="reference"
> >                       select="substring-before($remainder, ';')"/>
> >
> >         <xsl:variable name="entity">
> >           <xsl:choose>
> >             <xsl:when test="$reference='#167'">&#167;</xsl:when>
> >             <xsl:when test="$reference='#173'">&#173;</xsl:when>
> >
> >             <!-- additional cases here -->
> >
> >             <xsl:otherwise>&amp;<xsl:value-of
> >                 select="$reference"/>;</xsl:otherwise>
> >           </xsl:choose>
> >         </xsl:variable>
> >
> >         <!-- Recurse on whatever follows the reference. -->
> >         <xsl:variable name="tail">
> >           <xsl:call-template name="restoreNumCharRefs">
> >             <xsl:with-param name="string"
> >                             select="substring-after($remainder, ';')"/>
> >           </xsl:call-template>
> >         </xsl:variable>
> >
> >         <xsl:value-of select="concat($head, $entity, $tail)"/>
> >       </xsl:when>
> >       <xsl:otherwise>
> >         <xsl:value-of select="$string"/>
> >       </xsl:otherwise>
> >     </xsl:choose>
> >
> >   </xsl:template>
> >
> >
> >  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
   Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

