Subject: Re: [xsl] decoding percent-escaped octet sequences
From: Chris Maloney <voldrani@xxxxxxxxx>
Date: Fri, 20 May 2011 13:22:08 -0400
On Fri, May 20, 2011 at 12:14 PM, Julian Reschke <julian.reschke@xxxxxx> wrote:
> On 2011-05-20 17:52, Brandon Ibach wrote:
>> Generally, when you're doing string manipulations inside XSLT/XPath,
>> there really is no such thing as ISO-8859-1, UTF-8 or any other
>> encoding, since the "string" data type in XPath is just a string of
>> Unicode characters.

But Julian is right that a percent-encoded string, which represents a
byte sequence, can be considered to be encoded in one way or another.

I investigated this same kind of thing for the site I work on. I don't
have a solution for converting these to strings inside XSLT, but I
thought I'd paste some of the test cases I worked with, in case they
prove interesting or useful.

1. UTF-8 encoded single character
   A. ?term=%C3%84rzteblatt
      "Ärzteblatt"

2. Invalid character codes (ASCII control character(s), but not valid
   ISO-8859-1 or UTF-8)
   A. ?term=%02%03cat

3. Non UTF-8, ISO-8859-1, single character
   A. ?term=%C4rzteblatt
      "Ärzteblatt"

4. Invalid byte sequence (not valid UTF-8 or ISO-8859-1)
   A. ?term=%C4%83%C4cat

5. Chinese characters, UTF-8 encoded
   A. ?term=%e4%bd%a0%e5%a5%bd
      Search box: "你好"

6. ISO-8859-1 multi-byte - this sequence starts out looking like
   UTF-8, but it's not.
   A. ?term=%c4%A0%c4rzteblatt
      Search box: "Ä Ärzteblatt"

After working with this for a while, we reached the conclusion that
it's best to strictly enforce the rule that percent-encoding in URLs be
UTF-8. In other words, I think it's a bad idea to keep maintaining
ISO-8859-1 encoded URLs, because it leads to too many possible problems
that are very hard to debug.
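
For what it's worth, here is a minimal sketch of what "strictly enforce
UTF-8" could look like outside XSLT. It is in Python only because I have
no XSLT solution to show; the function name decode_term_strict is made
up for illustration, and rejecting control characters is my own
assumption based on test case 2, not something any spec requires.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


def decode_term_strict(escaped):
    """Decode a percent-escaped value, accepting only valid UTF-8.

    Returns the decoded string, or None if the octets are not valid
    UTF-8 or (an assumption of this sketch) contain control characters.
    """
    # Turn "%C3%84rzteblatt" into the raw octets b"\xc3\x84rzteblatt".
    raw = unquote_to_bytes(escaped)
    try:
        # Strict decoding: no fallback to ISO-8859-1 on failure.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
    # Assumption: treat control characters (category "Cc") as invalid,
    # which matches how test case 2 is classified.
    if any(unicodedata.category(ch) == "Cc" for ch in text):
        return None
    return text


if __name__ == "__main__":
    # The six test values, without the "?term=" prefix.
    for value in (
        "%C3%84rzteblatt",
        "%02%03cat",
        "%C4rzteblatt",
        "%C4%83%C4cat",
        "%e4%bd%a0%e5%a5%bd",
        "%c4%A0%c4rzteblatt",
    ):
        print(value, "->", repr(decode_term_strict(value)))
```

Under these assumptions only cases 1 and 5 decode; the other four are
rejected instead of being reinterpreted as ISO-8859-1, which is the
behavior argued for in the last paragraph.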