Re: [xsl] How to read the encoding of an XML document

Subject: Re: [xsl] How to read the encoding of an XML document
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Thu, 25 Oct 2001 14:14:43 -0400
So, James,

The bottom line is that what you want to do isn't readily possible, mainly because in order to define a standard, XML has to limit the kinds of encoding that processors are required to support. Whether a given parser can read a given encoding, or whether an XSLT processor can write one out, is up to the processor. The only thing the XML standard stipulates is that a parser be able to read the standard Unicode encodings, UTF-8 and UTF-16.

One way to work around the problem would be to carry the encoding you want as a parameter. (For this purpose you could preprocess the file to look in the XML declaration and get the encoding pseudo-attribute.) Unfortunately, since you can't parameterize the output encoding in the stylesheet either, you won't be able to rely on the processor's own serializer, but will have to work around the back end as well. Maybe someone on the list could suggest how: for example, by having the processor construct a DOM and then running the DOM tree through your own serializer that would do the transcoding.
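
As a rough illustration of that workaround, here is a minimal Python sketch. It is an assumption-laden stand-in, not a recipe for any particular XSLT processor: the helper names declared_encoding and reserialize are hypothetical, Python's ElementTree plays the role of the DOM, and the regex only copes with ASCII-compatible encodings (a UTF-16 file would need its byte-order mark sniffed first).

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute inside an XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes, default: str = "UTF-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    m = _DECL.match(data[:200])
    return m.group(1).decode("ascii") if m else default

def reserialize(data: bytes) -> bytes:
    """Parse a document, then write the tree back out in the same encoding."""
    enc = declared_encoding(data)
    root = ET.fromstring(data)  # stands in for the transformation result
    # Our own "back end": serialize the tree in the encoding we carried along.
    return ET.tostring(root, encoding=enc, xml_declaration=True)

doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9</a>'
print(declared_encoding(doc))
print(reserialize(doc))
```

The point of the design is that the encoding name travels outside the tree, since the parsed tree itself no longer remembers it.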

But this is a pretty steep requirement: in effect you're saying "whatever character encoding you want to give me, that's okay", but processors aren't going to like that even in the best of all possible worlds.

Cheers,
Wendell

At 11:53 AM 10/25/01, David wrote:
> When you say Unicode, does that equate to UTF-8, UTF-16, UTF-32 or
> something else?
No. Unicode is essentially an abstract collection of characters, numbered
0 to x10FFFF (most of which slots are empty). An XML notation of &#333;
refers to abstract character number 333.
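
A small sketch of this, using Python's standard-library parser purely
as an example (any conforming parser should behave the same way): the
document below is plain ASCII, yet the parsed text is the single
character numbered 333.

```python
import xml.etree.ElementTree as ET

# The bytes of this document are pure ASCII; the reference names code point 333.
root = ET.fromstring("<a>&#333;</a>")
ch = root.text

print(len(ch))              # one character after parsing
print(ord(ch))              # its number in decimal
print(f"U+{ord(ch):04X}")   # the same number in the usual hex notation
```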

However, to store Unicode strings in files (and other places) you need
some encoding that maps bytes in the file to these characters. UTF-x are
some of those encodings (all UTF encodings have the property that they
can encode the whole Unicode range). Other encodings such as ASCII or
Latin-1 are similar, but can't encode the whole range of characters.
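
To make the difference concrete, here is a short Python sketch (the
helper can_encode is a hypothetical name introduced only for this
example). It encodes character number 333 in several encodings and
shows that the UTF encodings succeed where ASCII and Latin-1 cannot.

```python
ch = "\u014d"  # abstract character number 333

utf8 = ch.encode("utf-8")       # two bytes
utf16 = ch.encode("utf-16-be")  # two bytes, big-endian, no byte-order mark
utf32 = ch.encode("utf-32-be")  # four bytes

def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text has a byte form in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for name in ("utf-8", "utf-16", "utf-32", "latin-1", "ascii"):
    print(name, can_encode(ch, name))
```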

> Or does the answer depend upon the XML parser you are
> using, which in my case is MSXML3.0?

No. Internally the parser obviously has to use some encoding to store
things (often this is UTF-16, and it is in the case of MSXML). In some
programming APIs you need to know this, as you get handed the string,
but in XSLT you never need to know what happens internally.
Your XSLT stylesheet is an XML document, so it goes through the same
process.

Character data in the stylesheet is mapped to abstract Unicode
characters (using the encoding specified in the stylesheet)
and the same happens for the source document. It is these abstract
characters that are compared. So by then you don't need to know (and
can't find out) what encoding the original files used.

So your source might be in Latin-2 and your stylesheet might be in
Latin-1, but by the time they have both been parsed everything is in
abstract Unicode characters, and it is these that are compared
in any XSLT query. (In fact MSXML3 uses UTF-16, but this is an internal
detail that has no effect on the stylesheet.)
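
The same effect can be sketched outside XSLT. The example below is
only an illustration in Python, and it uses Latin-1 and UTF-8 as the
two encodings (an assumption made because the standard-library parser
supports both natively). The two files differ at the byte level, but
after parsing, the character data compares equal.

```python
import xml.etree.ElementTree as ET

# Same text, different bytes: e-acute is one byte in Latin-1, two in UTF-8.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\xe9</name>'
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?><name>caf\xc3\xa9</name>'

a = ET.fromstring(latin1_doc).text
b = ET.fromstring(utf8_doc).text

# After parsing, only abstract characters remain, so the comparison succeeds.
same = (a == b)
print(same)
```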

David


======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


