Re: [xsl] output to iso-8859-1 of non-iso characters, what is required action

Subject: Re: [xsl] output to iso-8859-1 of non-iso characters, what is required action
From: Michael Müller-Hillebrand <mmh@xxxxxxxxxxxxx>
Date: Wed, 7 May 2008 16:39:02 +0200
Bryan,

don't confuse 'characters' with 'bytes'. iso-8859-1 is a code page that
assigns characters to byte values in the range 0-255.

In XML a character may be represented in different ways, all perfectly
legal: A, &#65;, &#x41;
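
To make both points concrete, here is a small sketch in Python (not XSLT,
purely for illustration). It shows one character turning into different
bytes depending on the encoding, and the three spellings above parsing to
the same character. The variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# One character, different byte sequences depending on the encoding.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
latin1_bytes = e_acute.encode("iso-8859-1")
utf8_bytes = e_acute.encode("utf-8")
print(latin1_bytes)  # b'\xe9'
print(utf8_bytes)    # b'\xc3\xa9'

# Three spellings of the same character in XML source:
# the literal, a decimal reference and a hex reference.
doc = ET.fromstring("<t>A&#65;&#x41;</t>")
print(doc.text)  # AAA
```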

I seem to remember that it is entirely up to the processor which form it
chooses. If you use Saxon, there are special options to control that
behaviour (native characters, or decimal or hex character references).

Silently dropping characters is never an option for the processor. If you
do want that, you can filter with translate() to remove the unwanted
characters from text nodes. Note that translate() only removes the
characters you list explicitly.
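
As a rough analogue of that kind of filtering, sketched in Python rather
than XSLT: the helper below keeps only characters whose code point falls in
the iso-8859-1 range. The function name is hypothetical, and unlike XSLT 1.0
translate() it tests a range instead of working from an explicit list of
characters.

```python
def drop_non_latin1(text: str) -> str:
    """Keep only characters with code points 0-255 (the iso-8859-1 range)."""
    return "".join(ch for ch in text if ord(ch) <= 0xFF)


# U+00E9 survives, U+20AC (euro sign) is dropped.
filtered = drop_non_latin1("caf\u00e9 costs 3\u20ac")
print(filtered)  # café costs 3
```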

- Michael Müller-Hillebrand

On 07.05.2008 at 16:24, bryan rasmussen wrote:

Hi,

XSL 1 question. Just wanted to run something by you all:

If I specify that my output is iso-8859-1 and I output a character
that is not in iso-8859-1, for example by copying the value of a text
node from a UTF-8 document, what is the processor required to do?

1. Fail with an error.
2. Implementation-specific: the processor can decide to fail, discard
non-iso characters in the output, or provide settings so the user can
choose at processing time.
3. Remove non-iso characters from the output and do not fail.

and if it is 2 should the default be 1 or 3?

I personally go with 3, but according to the spec if it is text:

The text output method outputs the result tree by outputting the
string-value of every text node in the result tree in document order
without any escaping.

The media-type attribute is applicable for the text output method. The
default value for the media-type attribute is text/plain.

The encoding attribute identifies the encoding that the text output
method should use to convert sequences of characters to sequences of
bytes. The default is system-dependent. If the result tree contains a
character that cannot be represented in the encoding that the XSLT
processor is using for output, the XSLT processor should signal an
error.


BUT for XML


The encoding attribute specifies the preferred encoding to use for
outputting the result tree. XSLT processors are required to respect
values of UTF-8 and UTF-16. For other values, if the XSLT processor
does not support the specified encoding it may signal an error; if it
does not signal an error it should use UTF-8 or UTF-16 instead. The
XSLT processor must not use an encoding whose name does not match the
EncName production of the XML Recommendation [XML]. If no encoding
attribute is specified, then the XSLT processor should use either
UTF-8 or UTF-16. It is possible that the result tree will contain a
character that cannot be represented in the encoding that the XSLT
processor is using for output. In this case, if the character occurs
in a context where XML recognizes character references (i.e. in the
value of an attribute node or text node), then the character should be
output as a character reference; otherwise (for example if the
character occurs in the name of an element) the XSLT processor should
signal an error.

which I take to mean that if I am outputting an XML document with
iso-8859-1 encoding, and a text node in the source contains a character
that iso-8859-1 cannot represent, and I use the value of that text node
as the value of a text node in the output, then the character should
automatically be written as a character reference.

But if I am outputting a text document with iso-8859-1 then the
presence of non-iso characters in the output will raise an error.
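
The two readings above can be mimicked in Python, only as an analogy and
not as what any XSLT processor does internally: the standard error handler
"xmlcharrefreplace" behaves like the XML case, while the default strict
handler behaves like the text case. The sample string is made up.

```python
text = "price: 5 \u20ac"  # U+20AC EURO SIGN is not in iso-8859-1

# XML-like behaviour: unrepresentable characters become character references.
as_xml = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(as_xml)  # b'price: 5 &#8364;'

# text-like behaviour: strict encoding (the default) signals an error.
try:
    text.encode("iso-8859-1")
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)  # True
```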

Cheers,
Bryan Rasmussen



--
_______________________________________________________________
Michael Müller-Hillebrand: Documentation Technology
Adobe Certified Expert, FrameMaker
Solutions and Training, FrameScript, XML/XSL, Unicode
<http://cap-studio.de/> -- Tel. +49 (9131) 28747
