RE: [xsl] Generating numeric character references

Subject: RE: [xsl] Generating numeric character references
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Thu, 16 Jan 2003 12:23:45 -0500
Andy,

At 04:44 AM 1/16/2003, you wrote:
I think the original poster had a problem of double escaping, such as

& a m p ; # 1 7 3 ;

in their source, and they simply wanted to convert this to the correct & # 1 7 3 ;

Thanks for spacing for legibility. I didn't do that (and I wonder now if it made my post unintelligible to anyone -- sorry).


Wouldn't running the source XML through an identity transform give the desired result, with no need for string processing of any kind?

Well, not exactly. There's a distinction here between "reality" and "representation". To sidestep the metaphysical question of which is which (a question that is not negligible, and is indeed at the heart of a deep contention over appropriate design strategies for the XML family of specs), I'll call them "external" and "internal". "External" means XML-as-serialized; it's also the way you write a stylesheet (which is, after all, serialized XML). "Internal" means the XPath tree once the parser has done its job and handed the structured data to the processor.


What Stuart wants is to move from "Before" to "After":

          external         internal

Before    & amp; #x41;     & #x41;

After     & #x41;          A

A straight identity transform would:

1. Parse "external" into "internal". & amp; #x41; becomes & #x41;
2. Input tree is copied to output tree (identity transform)
3. Output tree is serialized: internal is expressed as external and & #x41; becomes & amp; #x41;


So the straight identity transform keeps "Before" as "Before" (as it should, being an identity transform).
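
The three steps above can be watched from outside XSLT. What follows is only a sketch, assuming Python's standard xml.etree.ElementTree as a stand-in for the parser and serializer of an XSLT processor (the element name p is made up for the illustration, and the safety spaces are dropped inside the code):

```python
import xml.etree.ElementTree as ET

# External form "Before": the ampersand of the reference is itself escaped.
before_external = "<p>&amp;#x41;</p>"

# Step 1: parsing turns external into internal.
root = ET.fromstring(before_external)
internal = root.text  # the literal characters & # x 4 1 ; and not the letter A

# Steps 2 and 3: the tree is left unchanged (identity), then serialized.
after_external = ET.tostring(root, encoding="unicode")

print(internal)
print(after_external)
```

The serializer re-escapes the ampersand, so the output is the same as the input: Before stays Before.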

But Stuart doesn't want Before; he wants After. While this may seem like it ought to be trivial (internally, what we have Before is exactly what we want externally After), it's not, since we have to get across the architectural boundary (Mike K's phrase) between the serialized XML and the parsed XML-as-XPath-tree. If parsers and serializers are doing their jobs properly, they shouldn't allow this -- an internal "& #x41;" should always serialize as "& amp; #x41;", no exceptions (please elide the safety spaces here: I just hate e-mail clients that parse plain text!).

Tom P's suggestion is to pre-process, observing that the simplest and cleanest approach is to run a routine over the external form of the XML to turn Before into After, and not to worry about the parser (not to worry about what's "internal") until he's got the data the way he wants it. Architecturally, this is a good solution (it maintains the boundary), and it'll be speedy since he'll use a tool (I think Tom would use Python ;-) well-suited for string-munging without XML parsing.
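
A pre-process of this kind might look like the sketch below. It is not Tom's actual code; the function name and the regular expression are assumptions made for illustration, and it assumes the pattern never occurs inside a CDATA section or comment where it ought to be left alone:

```python
import re
import xml.etree.ElementTree as ET

# A double-escaped numeric character reference in the raw (external) text,
# decimal or hexadecimal.
DOUBLE_ESCAPED = re.compile(r"&amp;(#[0-9]+;|#x[0-9A-Fa-f]+;)")

def unescape_numeric_refs(external_xml: str) -> str:
    """Rewrite the serialized XML so the reference loses its extra escaping."""
    return DOUBLE_ESCAPED.sub(r"&\1", external_xml)

before = "<p>&amp;#x41; &amp;amp; &amp;#173;</p>"
after = unescape_numeric_refs(before)

# Only now does a parser get involved, and it sees real character references.
parsed_text = ET.fromstring(after).text

print(after)
```

Everything that is not a numeric reference (here the doubly escaped "amp") is left alone, so the result is still well-formed when it finally reaches the parser.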

If Stuart must do this inside the XSLT processor, however, he has no choice but to work on the internal form.

His first approach was to map occurrences of the string "& #x41;" (again no space) into the correct character, "A" (and let the serializer do whatever it wants with the result). Like Tom's approach, this is safe, since it respects the boundary, but (as Stuart noted) its performance may be questionable, and it's something of a pain to program (XSLT isn't as well suited for string-processing as many other tools).
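
The same mapping on the internal form, sketched in Python rather than XSLT purely to show the logic (in XSLT it would be a recursive string-processing template, which is where the pain comes from). The helper names are hypothetical, and attributes are not handled:

```python
import re
import xml.etree.ElementTree as ET

# A reference-shaped substring inside already-parsed text.
NUMERIC_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def decode_refs(text):
    """Replace each reference-shaped substring with the character it names."""
    def to_char(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return NUMERIC_REF.sub(to_char, text) if text else text

def decode_tree(root):
    """Apply decode_refs to every text node (element text and tails)."""
    for elem in root.iter():
        elem.text = decode_refs(elem.text)
        elem.tail = decode_refs(elem.tail)
    return root

root = decode_tree(ET.fromstring("<p>&amp;#x41; and <b>&amp;#66;</b>&amp;#x43;</p>"))
result = ET.tostring(root, encoding="unicode")

print(result)
```

The tree now holds the real characters, and the serializer is free to write them however it likes.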

It's also *confusing*, since while you are really changing "& #x41;" into "A", it appears you are changing "& amp; #x41;" into "& #x41;", since of course *you see the external representation both in your source and result files, and in your XSLT*.

My evil (and more confusing) suggestion was to commandeer his serializer into writing "& #x41;" unescaped for the internal "& #x41;". (On my diagram, this amounts to using the serializer to jump diagonally, from Before-internal to After-external, instead of using an orthodox process to move vertically.)

This may be an acceptable brute force method, sometimes. It will gain speed over the internal-mapping approach. Unlike an external process (pre-process), it happens within the XSLT architecture (or rather, across it). It is fairly simple to program.

It *does* require that the data is otherwise sparkly-clean, or the wrong characters will fail to get escaped on serializing, the "XML" will not be well-formed coming out, and Stuart will be hosed, unable to parse his data until he fixes it with non-XML string-munging tools (which is what he says he's not allowed to do).
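
The risk can be shown with a small sketch that imitates an unescaped write in Python. In XSLT the usual lever for this sort of thing is disable-output-escaping, but that is an assumption about the mechanism; the function names and sample strings below are invented for the illustration:

```python
import xml.etree.ElementTree as ET

def serialize_unescaped(tag: str, internal_text: str) -> str:
    """Write the internal text verbatim, skipping the serializer's escaping."""
    return f"<{tag}>{internal_text}</{tag}>"

def is_well_formed(external_xml: str) -> bool:
    """Report whether a parser would accept the serialized result."""
    try:
        ET.fromstring(external_xml)
        return True
    except ET.ParseError:
        return False

# Sparkly-clean data: the only ampersand starts a numeric reference.
clean = serialize_unescaped("p", "&#x41;")

# Data with a stray ampersand and a less-than sign that now go out unescaped.
dirty = serialize_unescaped("p", "AT&T says 1 < 2, &#x41;")

print(clean, is_well_formed(clean))
print(dirty, is_well_formed(dirty))
```

The clean case comes out as After and parses to "A"; the dirty case is no longer XML at all, which is the hosed state described above.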

Sorry for the long post, but it's a tricky topic and one that gets lots of folks really stuck.

Cheers,
Wendell


======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD 20850                                  Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


