Subject: Fw: [xsl] decoding percent-escaped octet sequences
From: Hermann Stamm-Wilbrandt <STAMMW@xxxxxxxxxx>
Date: Mon, 23 May 2011 11:47:58 +0200

Trying to send again, this time not as UTF-8 email ...

----- Forwarded by Hermann Stamm-Wilbrandt/Germany/IBM on 05/23/2011 11:47 AM -----

From: Hermann Stamm-Wilbrandt/Germany/IBM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Date: 05/23/2011 10:37 AM
Subject: Re: [xsl] decoding percent-escaped octet sequences

DataPower provides a convert-http action for processing HTTP form submissions, which are non-XML. At the time this entered the product (before the acquisition by IBM in 2005), the default encoding for URL-encoded strings was ISO-8859-1. The equivalent of the convert-http action for use inside DataPower stylesheets is the dp:decode() extension function:
http://publib.boulder.ibm.com/infocenter/wsdatap/v3r8m2/index.jsp?topic=/xa35/extensionfunctions41.htm

Last year a customer asked to be able to deal with UTF-8 URL-encoded URIs (because Google returns those to them). I provided an implementation for that in a technote and a webcast:
http://www-01.ibm.com/support/docview.wss?uid=swg21412370
http://www-01.ibm.com/support/docview.wss?uid=swg27019118&aid=1#page=15

This implementation is based on the EXSLT extension function str:decode-uri() (DataPower is an XSLT 1.0 processor).
http://exslt.org/str/functions/decode-uri/index.html

I modified the stylesheet from the technote to eliminate the access to "dp:variable()". This way it even works with xsltproc, see below.

$ xsltproc utf8uriDemo.xsl utf8uriDemo.xsl
<?xml version="1.0"?>
<request xmlns:uri="http://uri"><url>/utf8uri?danish=%C3%86-%C3%98-%C3%85&amp;french=%C5%92-%C3%A6&amp;german=%C3%84-%C3%96-%C3%9C-%C3%9F&amp;spanish=%CA%A7-%EA%9D%86-%C3%91</url><base-url>/utf8uri</base-url><args src="url"><arg name="danish">Æ-Ø-Å</arg><arg name="french">Œ-æ</arg><arg name="german">Ä-Ö-Ü-ß</arg><arg name="spanish">ʧ-Ꝇ-Ñ</arg></args></request>
$
$ cat utf8uriDemo.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:str="http://exslt.org/strings"
  xmlns:uri="http://uri"
  exclude-result-prefixes="str"
>
  <xsl:template match="/">
    <xsl:variable name="url"><![CDATA[/utf8uri?danish=%C3%86-%C3%98-%C3%85&french=%C5%92-%C3%A6&german=%C3%84-%C3%96-%C3%9C-%C3%9F&spanish=%CA%A7-%EA%9D%86-%C3%91]]></xsl:variable>

    <request>
      <url><xsl:copy-of select="$url"/></url>
      <base-url>
        <xsl:copy-of select="substring-before($url,'?')"/>
      </base-url>
      <args src="url">
        <xsl:for-each select="str:tokenize(substring-after($url,'?'),'&amp;')">
          <xsl:element name="arg">
            <xsl:attribute name="name">
              <xsl:value-of select="substring-before(.,'=')"/>
            </xsl:attribute>
            <xsl:value-of select="str:decode-uri(substring-after(.,'='))"/>
          </xsl:element>
        </xsl:for-each>
      </args>
    </request>
  </xsl:template>
</xsl:stylesheet>
$

Mit besten Gruessen / Best wishes,

Hermann Stamm-Wilbrandt
Developer, XML Compiler, L3 Fixpack team lead
WebSphere DataPower SOA Appliances
https://www.ibm.com/developerworks/mydeveloperworks/blogs/HermannSW/
----------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Martin Jetter
Management: Dirk Wittkopp
Registered office: Boeblingen
Court of registry: Amtsgericht Stuttgart, HRB 243294


From: Chris Maloney <voldrani@xxxxxxxxx>
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Cc: Brandon Ibach <brandon.ibach@xxxxxxxxxxxxxxxxxxx>
Date: 05/20/2011 07:22 PM
Subject: Re: [xsl] decoding percent-escaped octet sequences

On Fri, May 20, 2011 at 12:14 PM, Julian Reschke <julian.reschke@xxxxxx> wrote:
> On 2011-05-20 17:52, Brandon Ibach wrote:
>> Generally, when you're doing string manipulations inside XSLT/XPath,
>> there really is no such thing as ISO-8859-1, UTF-8 or any other
>> encoding, since the "string" data type in XPath is just a string of
>> Unicode characters.

But Julian is right that a percent-encoded string, which represents a byte sequence, can be considered to be encoded in one way or another.

I investigated this same kind of thing for the site I work on, and I don't have a solution for how to convert these to strings inside XSLT, but I thought I'd just paste some of the test cases I worked with, in case they might prove interesting or useful.

1. UTF-8 encoded single character
   A. ?term=%C3%84rzteblatt
      "Ärzteblatt"
2. Invalid character codes (ASCII control character(s), but not valid ISO-8859-1 or UTF-8)
   A. ?term=%02%03cat
3. Non UTF-8, ISO-8859-1, single character
   A. ?term=%C4rzteblatt
      "Ärzteblatt"
4. Invalid byte sequence (not valid utf-8 or iso-8859-1)
   A. ?term=%C4%83%C4cat
5. Chinese characters, UTF-8 encoded
   A. ?term=%e4%bd%a0%e5%a5%bd
      Search box: "你好"
6. ISO-8859-1 multi-byte - this sequence starts out looking like UTF-8, but it's not.
   A. ?term=%c4%A0%c4rzteblatt
      Search box: "Ä Ärzteblatt"

After working with this for a while, we reached the conclusion that it's best to strictly enforce the rule that percent-encoding in URLs be UTF-8. In other words, I think it's a bad idea to continue to maintain ISO-8859-1 encoded URLs, because it leads to too many possible problems that are very hard to debug.
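
The "strictly UTF-8" rule can be checked against the six test cases outside XSLT. What follows is only a minimal Python sketch under stated assumptions, not code from either message: the helper names decode_strict_utf8 and has_control_chars are hypothetical, and the only library calls used are urllib.parse.unquote_to_bytes and unicodedata.category from the Python standard library.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


def decode_strict_utf8(escaped):
    """Percent-decode `escaped` to octets, then decode them as UTF-8.

    Returns the decoded string, or None if the octets are not valid UTF-8.
    (Hypothetical helper, named here for illustration only.)
    """
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def has_control_chars(text):
    """True if `text` contains any character of Unicode category Cc.

    Note that this also flags tab, CR and LF; a real filter may want
    to allow those. (Hypothetical helper.)
    """
    return any(unicodedata.category(ch) == "Cc" for ch in text)


# The term= values from the six test cases above.
CASES = [
    "%C3%84rzteblatt",        # 1. UTF-8, single character
    "%02%03cat",              # 2. control characters
    "%C4rzteblatt",           # 3. ISO-8859-1, single character
    "%C4%83%C4cat",           # 4. mixed, ends in a lone lead byte
    "%e4%bd%a0%e5%a5%bd",     # 5. Chinese, UTF-8
    "%c4%A0%c4rzteblatt",     # 6. starts out looking like UTF-8
]

if __name__ == "__main__":
    for term in CASES:
        decoded = decode_strict_utf8(term)
        if decoded is None:
            verdict = "rejected: not valid UTF-8"
        elif has_control_chars(decoded):
            verdict = "rejected: control characters"
        else:
            verdict = "accepted"
        print(f"{term!r:28} {decoded!r:20} {verdict}")
```

Under this sketch, cases 1 and 5 are accepted, and cases 3, 4 and 6 are rejected because the octets are not well-formed UTF-8. Case 2 shows why a second check is needed: the octets 02 and 03 decode without error as UTF-8, so they have to be filtered out separately.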