[xsl] encoding and NCRs; source doc as SAX events in JAXP (was: Converting &...)

From: Mike Brown <mike@xxxxxxxx>
Date: Sat, 26 Jul 2003 14:31:09 -0600 (MDT)
This discussion was mostly resolved off-list, but for the sake of
the archives, and because some of us can't get enough of clearing
up character encoding misconceptions...

Elizabeth Barham wrote:
> [re: "?" substitutions for unencodable or undecodable characters] 
> Is it possible to bypass this mechanism?

It's a feature of the codec that is doing the encoding or decoding. If you're
invoking it yourself, then sure, you may have other options, such as raising an
exception or ignoring the unknown character or byte sequence. It depends on
the API of the codec. (I'm trying to speak in relatively language-neutral
terms here.)
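
To make that concrete for Java, here is a minimal sketch of the three
behaviours using the standard java.nio.charset API. The class and method names
(CodecErrorDemo, encodeWithReplacement, encodeStrict, encodeIgnoring) are
made up for illustration, and ISO-8859-1 plus the euro sign are just a
convenient unencodable pair, not anything taken from the original problem.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CodecErrorDemo {
    // Convenience path: String.getBytes(Charset) always substitutes the
    // charset's replacement bytes ("?" for ISO-8859-1) for unmappable input.
    static byte[] encodeWithReplacement(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Strict path: the encoder throws instead of substituting.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        return encode(s, CodingErrorAction.REPORT);
    }

    // Lossy path: unencodable characters are silently dropped.
    static byte[] encodeIgnoring(String s) throws CharacterCodingException {
        return encode(s, CodingErrorAction.IGNORE);
    }

    private static byte[] encode(String s, CodingErrorAction action)
            throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        String s = "a\u20ACb"; // the euro sign has no ISO-8859-1 mapping
        System.out.println(encodeWithReplacement(s).length); // 3 bytes, middle one is '?'
        System.out.println(encodeIgnoring(s).length);        // 2 bytes
        try {
            encodeStrict(s);
        } catch (CharacterCodingException e) {
            System.out.println("strict encoder refused: " + e);
        }
    }
}
```

Decoding has the mirror-image options on CharsetDecoder. Which of these your
platform's higher-level I/O classes use by default is worth checking in their
documentation.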

> I would like to pass a byte
> into Java and not have it modified in any way.

XML manifests in an encoded form (bytes) for the purposes of network
transmission and disk storage, but *parsed* XML is no longer treated as bytes
-- instead, it is treated as Unicode string objects arranged in a logical
hierarchy (elements, attributes, etc.), and this info is communicated to the
application (your "Java") as either SAX event calls or a DOM Document object.

A numeric character reference like "&#169;" manifests in the encoded,
'physical' document as a series of bytes for each character (e.g. if it is
UTF-16LE encoded, "&" is 0x26 0x00, "#" is 0x23 0x00, and so on, since the
low-order byte comes first). When the
bytes are decoded by the parser, they become a Unicode string consisting of
the 6 characters: ampersand, number sign, digit 1, digit 6, digit 9,
semicolon. The parser recognizes this markup as longhand for the single
Unicode character: copyright symbol (Unicode character number 169), so that's
what it reports to the application.
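
A small Java sketch of both halves of that, for anyone who wants to see it
happen. NcrDemo and its method names are hypothetical; the parsing half uses
the stock JAXP DocumentBuilder and reads from a string rather than from
UTF-16LE bytes, purely to keep the example short.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class NcrDemo {
    // The 'physical' form: six characters, two bytes each in UTF-16LE,
    // low-order byte first (0x26 0x00 for "&", 0x23 0x00 for "#", ...).
    static byte[] encodedReference() {
        return "&#169;".getBytes(StandardCharsets.UTF_16LE);
    }

    // The parsed form: the parser hands the application one character.
    static String parsedText() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader("<doc>&#169;</doc>")));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(encodedReference().length);      // 12
        String text = parsedText();
        System.out.println(text.length());                  // 1
        System.out.println((int) text.charAt(0));           // 169
    }
}
```

By the time the application sees the text there is no trace of whether the
document said "&#169;", "&#xA9;", or contained the literal character.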

Your problem is most likely fixable with a very simple change to one line of
your application's code, and knowing what to fix will be possible when you
fully grasp the XML processing model and the underlying character encoding
model, as well as the nuances of your application platform's codec APIs.

I'd like to help further, but you'll need to boil it down to a simple
bit of code that reproduces the error so I can see exactly what's going on.
Off-list, please.

> But, I *do* have an XSLT question to ask as an addendum. What is the
> best way to drive the xml input of an XSLT formatter from inside a
> java class? 
> For example, let us say that I have an XSLT stylesheet that is set up
> to expect a certain format, and I have a java class whose data I would
> like to have processed by said stylesheet. It seems a waste to make a
> StringBuffer of things like "<?xml version='1.0'?><doc><t>x</t></doc>"
> and then pass it into the transformer since it would be possible to
> generate the SAX events from within the Java class.
> Looking at javax.xml.parsers.SAXParser, I notice the parse() function,
> but those seem to be dealing with incoming streams and not events.

Yes, parsers generally rely on their input being bytes, which is implicitly
mandated by the XML 1.0 spec. Convenience APIs have emerged over the years,
operating at various levels, to accept different kinds of input (URIs,
pre-decoded Unicode streams, DOM objects) but they typically all end up
converting these to bytes, behind the scenes, for the underlying parser's
benefit. [expat's, at least...]

For transformations, I *think* you can generate your source document as SAX
events that a JAXP application can utilize, but I'm not sure if you can just
create a parser and start calling handler methods, or if you have to implement
an XMLReader, or what. Maybe someone else can provide an example of how to do
it? I've never tried it myself. Note that your XSLT processor might come with
some helpful examples, e.g. examples/java/TraxExamples.java in Saxon's
distribution .zip.
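
For the archives, one way this can be done with the standard JAXP (TrAX)
interfaces is sketched below: ask a SAXTransformerFactory for a
TransformerHandler, which is a SAX ContentHandler, and call its event methods
directly. This is a sketch under assumptions, not a tested recipe for any
particular processor: the class name SaxEventsToXslt, the tiny stylesheet,
and the <doc><t>...</t></doc> shape are all invented here to match the
example in the question, and it only works if the installed
TransformerFactory advertises SAXTransformerFactory.FEATURE.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.helpers.AttributesImpl;

final class SaxEventsToXslt {
    // Illustrative stylesheet: writes the text of /doc/t as plain text.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'><xsl:value-of select='doc/t'/></xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String value) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        if (!tf.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException("factory cannot accept SAX input");
        }
        SAXTransformerFactory stf = (SAXTransformerFactory) tf;

        // The handler consumes SAX events as the source document.
        TransformerHandler handler =
            stf.newTransformerHandler(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Fire the events for <doc><t>value</t></doc>; no markup is built.
        AttributesImpl noAttrs = new AttributesImpl();
        char[] chars = value.toCharArray();
        handler.startDocument();
        handler.startElement("", "doc", "doc", noAttrs);
        handler.startElement("", "t", "t", noAttrs);
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "t", "t");
        handler.endElement("", "doc", "doc");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("x"));
    }
}
```

The other route I'm aware of is to wrap your class in an XMLReader
implementation and hand it to the transformer as a SAXSource. That is more
code, but it lets the transformer drive the parse() call itself.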

As for whether it's better than marshalling your object's data into XML
markup, I'd give some consideration to the maintainability and scalability of
your code. Generating markup is going to be easier to understand, problems
with it are going to be easy to diagnose, its output will be more widely
useful, and it will probably not be all *that* much slower, in typical use
cases, than marshalling into a series of SAX events. Just my 2c.


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
