Re: [xsl] RE: Smart Quote Encoding

Subject: Re: [xsl] RE: Smart Quote Encoding
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Fri, 14 Sep 2007 03:17:15 +0200
Hi Roger,

see below



Roger L. Cauvin wrote:
I have a program that is receiving text-only e-mails and logging the
messages to XML.  For various reasons (including troubleshooting), I would
like to log the content of the e-mails exactly.  It sounds like that's
simply not possible in XML, at least to the extent that "text-only" can
include characters not allowed in XML.

Text-only is very much misunderstood, simply because there is no such thing as text-only. Is it IBM-437 (that is, original DOS)? Is it ASCII? Some people call it plain old ASCII when it is in fact windows-1252, and many people don't know the differences between windows-1252 and ISO-8859-1. If text-only means UTF-8, it will contain a lot of "binary"-looking bytes when viewed as windows-1252. If you have a mail from Gmail, it will be sent in UTF-7 (yes, I know, strange), which is yet another binary format for text-only.


And last but not least, if you have text-only in EBCDIC encoding, it will look like a mess in Windows. Luckily, XML can handle all these encodings, although parsers are only required to support UTF-8 and UTF-16. Saxon, in my experience, handles EBCDIC well, but not UTF-7 (the blame for that lies with Sun, who have refused for ages to include UTF-7 in Java).
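
To make the point about "text-only" concrete, here is a minimal Python sketch (Python is used only for illustration; it is not part of the original setup). It assumes the character in question is the left double smart quote U+201C and shows how the same bytes read differently depending on the encoding the viewer guesses.

```python
# The same bytes, interpreted under different encodings.
# Assumption: the character of interest is U+201C (left double smart quote).
utf8_bytes = "\u201c".encode("utf-8")

seen_as_utf8 = utf8_bytes.decode("utf-8")      # the intended single character
seen_as_cp1252 = utf8_bytes.decode("cp1252")   # three unrelated characters
seen_as_latin1 = utf8_bytes.decode("latin-1")  # one letter plus two C1 controls

# EBCDIC (IBM500 is "cp500" in Python) does not even share ASCII's letters.
ebcdic_a = "A".encode("cp500")
ebcdic_a_as_cp1252 = ebcdic_a.decode("cp1252")

print(utf8_bytes.hex(), repr(seen_as_utf8), repr(seen_as_cp1252), repr(seen_as_latin1))
print(ebcdic_a.hex(), repr(ebcdic_a_as_cp1252))
```

Nothing in the bytes themselves says which reading is right; that information has to come from the mail headers or from whatever wrote the file.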

I inserted the ISO 8859-1 encoding declaration myself.  Apparently, Saxon
6.3 doesn't support windows-1252 encoding.  Saxon 8.9J, which I just now
installed, does appear to support that encoding.  However, it still
(correctly) flags the U+18 character as illegal.

That does not depend on Saxon but on the Java version you use. Btw, why would you want to use an age-old version of Saxon? Saxon 8.9 can do XML 1.1, which can represent character U+18 (as a character reference such as &#x18;, not as a raw byte).
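
If the goal is to log mail text without tripping the parser, it may help to find the offending characters before the XML is written. The sketch below checks code points against the Char production of XML 1.0; the function names are made up for this example, and what to do with the hits (escape, replace, or switch to XML 1.1) is left open.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def illegal_xml10_chars(text: str):
    """Return (index, 'U+XXXX') pairs for characters XML 1.0 does not allow."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml10_char(ord(ch))
    ]


# Hypothetical mail fragment containing the control character U+0018.
print(illegal_xml10_chars("He said \x18hi"))
```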


But you should work out which encoding your email program is using. It may be that it stores the file in the same encoding in which it was received, but it is more likely that the email program transforms it into some other format. If you read your email through the interface of the email program (e.g., Thunderbird), you will see the result of whatever it does with the data. But when you view your email in its bare format, you'll have to find out what the email program does to it. In the case of Thunderbird, I believe it stores everything in one file and encodes it as UTF-8, but I am not sure (and you probably use a different mailer). Maybe TB even stores different text formats in one file...


More about U+18
---------------
Unfortunately, without knowing the byte sequence of that character and without a binary view of your whole file, it is quite hard to determine what the real encoding is. As David already suggested, you should check the encoding of your file.


Furthermore, you are dealing with a lot of text, it seems. If you want more control over which encoding is used, you can try unparsed-text($url, $encoding) in XSLT 2.0. Note that you must then remove the XML declaration, because the spec states that the encoding in the declaration takes precedence over the user-specified encoding.

If you want to check a bunch of encodings all at once and see if one fits, you can look up the list of supported encodings on Sun's website. Make it into a sequence, e.g.

<xsl:variable name="encodings" select=" 'utf-8', 'utf-16', 'windows-1252', 'IBM500', 'Big5', 'utf-16BE' " />

and you can loop through all possibilities using:

<xsl:value-of select=" for $enc in $encodings return ($enc, unparsed-text-available($url, $enc)) " separator=" " />
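
The same "try every candidate" idea can be sketched outside XSLT. The Python below is only an analogue of the loop above, not a replacement for it: `decodable_as` is a name invented for this example, and the sample bytes are an assumed windows-1252 fragment.

```python
def decodable_as(data: bytes, encodings):
    """Map each candidate encoding to whether the bytes decode without error."""
    results = {}
    for enc in encodings:
        try:
            data.decode(enc)
            results[enc] = True
        except (UnicodeDecodeError, LookupError):
            # LookupError: this Python build does not know the encoding name.
            results[enc] = False
    return results


# Hypothetical sample: 'He said "hi"' with windows-1252 smart quotes (93, 94).
sample = b"He said \x93hi\x94"
for name, ok in decodable_as(sample, ["ascii", "utf-8", "cp1252", "latin-1"]).items():
    print(name, ok)
```

A successful decode only rules encodings out; it does not prove a match. ISO-8859-1, for instance, accepts every possible byte, so it always "fits".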

But to me, it sounds like you have encountered a mail in the lesser-used UTF-7 encoding. You were talking about the Windows smart quote, so let's examine that. I assume your analysis is correct that the quote appears where you found "U+18". That is not really U+18, but a byte with hex value 18 that does not translate to the correct character in the encoding you guessed the file was in, and this 18h is part of a longer byte sequence that encodes the smart quote. The smart quote, I hope, is &#x201C; (or &_#x201C if the browser/mailer screws it up), which is the one MS Word uses.

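You read the file using 'ISO-8859-1'. That means that the first 128 byte values equal their counterparts in the Unicode table (i.e. bytes 00 to 7F are the same as U+0 to U+7F). In fact, ISO-8859-1 maps every byte to the code point with the same number, which is why a byte 18 gets reported as U+18.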

Now, let's see what the smart quote really should look like in certain encodings that I think are likely to be encountered in mail (all bytes are in hex):

windows-1252: 93
iso-8859-1: 3F (question mark, i.e., cannot be represented)
Big5: A1A7
UTF-8: E2809C
UTF-16: 201C
UTF-16BE: 201C
UTF-16LE: 1C20
UTF-7: +IBw- (not the hex representation, that would be: 2B4942772D)
Shift-JIS: 8167
IBM500: 3F (the serializer's substitute byte; note that in EBCDIC 3F is the SUB control code, not a question mark)
GB 18030: A1B0
MacRoman: D2 (the mac codepage actually has a 'smart quote')
EUC-JP: A1C8
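
For anyone who wants to reproduce a list like this without Saxon, here is a rough Python sketch. It assumes the quote is U+201C and uses Python's codec names, which differ from the Java ones; IBM500 is left out because the substitute byte a serializer picks for an unmappable character is implementation-specific.

```python
QUOTE = "\u201c"  # assumption: the left double smart quote

# Python codec names for the encodings listed above.
ENCODINGS = [
    "cp1252", "latin-1", "big5", "utf-8", "utf-16-be", "utf-16-le",
    "utf-7", "shift_jis", "gb18030", "mac_roman", "euc_jp",
]


def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as upper-case hex.

    errors="replace" substitutes '?' when the encoding has no such character.
    """
    return text.encode(encoding, errors="replace").hex().upper()


table = {enc: hex_bytes(QUOTE, enc) for enc in ENCODINGS}
for enc, value in table.items():
    print(f"{enc}: {value}")
```

For UTF-7 the hex is that of the ASCII text that goes over the wire, here the five characters +IBw-.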



I got this information by combining saxon:serialize, saxon:string-to-hexBinary and saxon:string-to-base64Binary (the last one for creating UTF-7, which is unsupported and not part of the Unicode standard). As you can see, none of these encodings uses byte 18 to represent the smart quote, or any other high character.


The only situation I can think of in which a byte 18 can legally be part of the byte sequence for a character is a multi-byte encoding such as UTF-16, where it is one half of a 16-bit unit. The left single smart quote U+2018, for instance, is 20 18 in UTF-16BE. As far as I know, the legacy double-byte encodings (Big5, GB 18030, etc.) keep their second bytes out of the control range, so I would not expect to find it there at all.
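
Where a byte 18 can turn up inside a multi-byte sequence is easy enough to check by brute force. The sketch below (function name and scan range are arbitrary choices for this example) encodes every code point from U+0080 up to U+2FFF in a given encoding and reports those whose bytes include 18.

```python
def code_points_with_byte(encoding: str, byte: int, start: int = 0x80, stop: int = 0x3000):
    """List code points in [start, stop) whose encoded form contains `byte`."""
    hits = []
    for cp in range(start, stop):
        try:
            encoded = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # this encoding has no such character
        if byte in encoded:
            hits.append(cp)
    return hits


for enc in ["utf-8", "utf-16-be", "big5", "gb18030", "shift_jis"]:
    hits = code_points_with_byte(enc, 0x18)
    print(enc, len(hits), [f"U+{cp:04X}" for cp in hits[:5]])
```

Within this range only UTF-16 produces hits, and U+2018, the left single smart quote, is among them.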


All this just for one character that is illegally encoded in the source? Well, email programs, like I said, save their data in a variety of formats. So either find out which one yours uses and read the file with that encoding, or set the output options to XML 1.1 and use unparsed-text.


Hope this "little" story clarifies things a bit for you.

cheers,
-- Abel Braaksma
