Re: [xsl] text extraction

Subject: Re: [xsl] text extraction
From: "Andrew Welch" <andrew.j.welch@xxxxxxxxx>
Date: Thu, 12 Oct 2006 17:05:30 +0100
On 10/12/06, Abel Braaksma <> wrote:
Andrew Welch wrote:
> On 10/12/06, mus47@xxxxxxxx <mus47@xxxxxxxx> wrote:
>> I also want to know how the output file encoding can be set to
>> ISO-8859-1 instead of UTF-8.
>> I use the xsltproc tool.
> You can set the output encoding using <xsl:output/>

But it is not guaranteed that the processor supports anything different
from UTF-8/UTF-16.

Are you sure? Interestingly the spec states:

"The value of the encoding attribute provides the value of the
encoding parameter to the serialization method. The default value is
implementation-defined, but in the case of the xml and xhtml methods
it must be either UTF-8 or UTF-16."


...which took me a little by surprise. It seems to say that when the
output method is xml or xhtml the encoding MUST be either UTF-8 or
UTF-16? Saxon doesn't seem to mind...

Also note, the first 128 code points (0-127) are encoded identically in
ISO-8859-1 and UTF-8. Only code point 128 (the euro sign in Windows-1252,
a control character in ISO-8859-1 proper) and above are treated
differently.
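
A quick way to check the byte-level overlap between the two encodings outside of XSLT is a few lines of Python. This is only a sketch using the standard codecs; the helper name `encoded_pair` is made up for illustration and has nothing to do with xsltproc or Saxon.

```python
def encoded_pair(code_point):
    """Return (ISO-8859-1 bytes, UTF-8 bytes) for a single code point."""
    ch = chr(code_point)
    return ch.encode("iso-8859-1"), ch.encode("utf-8")


# Code points 0..127 produce the same single byte in both encodings.
same_below_128 = all(a == b for a, b in map(encoded_pair, range(128)))

# From 128 to 255, ISO-8859-1 still uses one byte while UTF-8 uses two.
differ_from_128 = all(a != b for a, b in map(encoded_pair, range(128, 256)))

print(same_below_128, differ_from_128)  # True True
print(encoded_pair(0xE9))               # (b'\xe9', b'\xc3\xa9') for e-acute
```

So a file containing only ASCII is byte-for-byte the same in either encoding; the choice only starts to matter for accented letters and beyond.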

Note that ISO-8859-1 covers a far smaller character repertoire than
UTF-8, so you may end up with missing or replaced characters (not sure
what they will be replaced with, though, when they don't exist) in the
output stream.

No, you don't end up with missing or replaced characters... Any
character not in the encoding should be output as a character
reference. It's a well-known technique to use an output encoding of
US-ASCII so that all non-ASCII characters get output as character
references, which gets around encoding problems when the output is
read further down the pipe.
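
The character-reference fallback can be imitated in Python with the standard `xmlcharrefreplace` error handler. This is only a rough sketch of the idea, not what an XSLT serializer does internally; `serialize_text` and the sample string are invented for the illustration.

```python
def serialize_text(text, encoding):
    """Encode text, writing unencodable characters as numeric character references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


sample = "caf\u00e9 costs 3 \u20ac"  # contains e-acute and the euro sign

# ISO-8859-1 has e-acute but no euro sign, so only the euro becomes a reference.
latin1 = serialize_text(sample, "iso-8859-1")

# US-ASCII has neither, so both characters become references.
ascii_only = serialize_text(sample, "us-ascii")

print(latin1)      # b'caf\xe9 costs 3 &#8364;'
print(ascii_only)  # b'caf&#233; costs 3 &#8364;'
```

The US-ASCII result contains only bytes below 128, so it reads the same whether a downstream tool assumes ASCII, ISO-8859-1 or UTF-8, and an XML parser turns the references back into the original characters.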

