Re: [xsl] Fw: Select entire XML doc [FURTHER]

Subject: Re: [xsl] Fw: Select entire XML doc [FURTHER]
From: Mike Brown <mike@xxxxxxxx>
Date: Fri, 28 Feb 2003 14:56:47 -0700 (MST)
Karl Stubsjoen wrote:
> Wow... that was most awesome.  Thanks for the help, it really made a lot of
> sense.  And indeed, I do need to be careful of HTML tags becoming malformed.
> Once the XML has been properly serialized in a text area element, what is the
> proper way to deserialize it?

Do you mean you want to turn 

<someXmlData>&lt;tag&gt;chardata&lt;/tag&gt;</someXmlData>

into 

<someXmlData><tag>chardata</tag></someXmlData>

?

...This is a FAQ and is generally beyond the scope of what XML should be used
for, or what XSLT can do without extension functions. But if you insist, you
will need to write an extension function that takes the content of the
someXmlData element (or any string, really), passes it into an XML parser, and
converts the parser's results to a node-set or result tree fragment. See your
XSLT processor docs for how to write an extension function (it varies). Your
processor may already have such a function available (but likely not).
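For illustration only, here is a minimal sketch of the parsing step such an
extension function would perform, written in plain Python with the standard
library rather than against any particular XSLT processor's extension API.
The function name unescape_child_markup is hypothetical, and the sketch
assumes the escaped text holds exactly one well-formed element.

```python
import xml.etree.ElementTree as ET

def unescape_child_markup(element):
    """Parse the element's text as XML and attach the result as a child.

    Hypothetical helper: assumes element.text is one well-formed element.
    """
    # The outer parse has already turned &lt;tag&gt; into the text "<tag>",
    # so the text can be handed to a second parser as-is.
    fragment = ET.fromstring(element.text)  # raises ParseError if malformed
    element.text = None
    element.append(fragment)
    return element

doc = ET.fromstring(
    "<someXmlData>&lt;tag&gt;chardata&lt;/tag&gt;</someXmlData>"
)
unescape_child_markup(doc)
result = ET.tostring(doc, encoding="unicode")
print(result)  # <someXmlData><tag>chardata</tag></someXmlData>
```

Inside a stylesheet, the same logic would have to be wrapped in whatever
extension mechanism your processor documents.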

Or do you mean after the HTML has been rendered in the browser, and the user
submits the form having the textarea with the possibly-edited XML? That's a
whole 'nother can of worms, due to encoding issues, which I am all too happy
to write about, although it is technically off-topic for this list.

First, in general, you should not be passing XML around in HTML form data if
the intent is to have a general-purpose XML editing system. As long as you
stick to pure ASCII, or just treat the XML as an uneditable binary file,
things should be fine.

The problems begin with how form data is handled. A browser transmits the form
data, which is Unicode, encoded as if it were going into a URL. This means
that certain characters in the ASCII range (code points 0 to 127) and all
characters beyond the ASCII range (code points 128 to 1114111) are first
encoded as bytes, then represented as ASCII bytes for the characters "%xx"
where xx is the hexadecimal representation for a byte. The ASCII-range
characters always use the us-ascii encoding as the basis for the %-escaping,
while the non-ASCII characters typically (it's not enforced by any standard)
use the encoding *of the HTML document containing the form from which this
data was submitted*.

So for example if you have in your textarea the character data "¡Hola amigo!",
and the HTML with the form was utf-8 encoded, and the browser user didn't
override the interpreted encoding on their end, then the form will be
submitted using utf-8 as the basis for the %-escaped form data:

  %C2%A1Hola%20amigo!

whereas if the HTML were iso-8859-1 encoded, it would be coming through as

  %A1Hola%20amigo!
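
As a rough check of the two forms, Python's urllib.parse.quote can stand in
for the browser. This is only a sketch: a real browser does its own escaping,
and safe="!" is passed here purely so the output uses the same notation as
the examples in this message.

```python
from urllib.parse import quote

text = "¡Hola amigo!"

# Same characters, two different byte encodings underneath the %-escaping.
utf8_form = quote(text, safe="!", encoding="utf-8")
latin1_form = quote(text, safe="!", encoding="iso-8859-1")

print(utf8_form)    # %C2%A1Hola%20amigo!
print(latin1_form)  # %A1Hola%20amigo!
```

The inverted exclamation mark is U+00A1, which is the two bytes C2 A1 in
utf-8 and the single byte A1 in iso-8859-1.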

On the receiving end, the form data needs to be decoded. Most servers provide
an API for receiving decoded form data in your application, be it CGI
environment variables or getParameter() methods on HTTP request objects or
what have you. But since most browsers do not communicate the details of what
encoding they used as the basis for the %-escaping, the server makes a guess,
and usually guesses wrong. So for example, while

   %C2%A1Hola%20amigo!

unambiguously means bytes

   C2 A1 48 6F 6C 61 20 61 6D 69 67 6F 21

...the API might mistakenly assume that these are iso-8859-1 and will decode
it for you into the string "Â¡Hola amigo!". In fact, this happens quite often.
So you'll have to be prepared to transcode: re-encode the string using the
same encoding that the server assumed, and then decode it using the encoding
that you know the HTML form used (you might send the latter in a hidden form
field). Either that, or pull the raw data out of the HTTP request and properly
decode it yourself.
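
Both repair strategies can be sketched in Python, assuming the form was
utf-8 encoded and the server API wrongly assumed iso-8859-1. The variable
names are illustrative; your server framework will hand you the misdecoded
string or the raw data through its own API.

```python
from urllib.parse import unquote, unquote_to_bytes

# The utf-8 based escaping of "¡Hola amigo!" as it arrives in the request.
submitted = "%C2%A1Hola%20amigo!"

# What an API that guesses iso-8859-1 would hand to your application.
misdecoded = unquote(submitted, encoding="iso-8859-1")

# Option 1: transcode. Re-encode with the server's wrong guess to recover
# the original bytes, then decode with the encoding the form really used.
repaired = misdecoded.encode("iso-8859-1").decode("utf-8")

# Option 2: take the raw %-escaped data and decode it yourself.
from_raw = unquote_to_bytes(submitted).decode("utf-8")

print(misdecoded)  # Â¡Hola amigo!
print(repaired)    # ¡Hola amigo!
print(from_raw)    # ¡Hola amigo!
```

Option 1 only works when the wrongly assumed encoding maps every byte to a
character, as iso-8859-1 does; otherwise bytes may already have been lost.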

Once you have the properly decoded string, you can feed it to an XML parser as
a Unicode string, so that the parser will ignore the encoding declaration in
the XML's prolog. If you were to feed the raw bytes (the C2 A1 48 etc above)
to the parser, you would have to declare the encoding externally, because
there's a chance that the declaration in the prolog has become inaccurate
while it was edited and reencoded.
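
A small sketch of both routes, using Python's xml.etree.ElementTree as one
example of a parser; other parsers and languages differ in how they treat a
Unicode string that carries an encoding declaration, so check yours. The
sample document is made up: its prolog claims iso-8859-1 while the bytes are
actually utf-8, which is the stale-declaration situation described above.

```python
import xml.etree.ElementTree as ET

# Properly decoded Unicode string whose prolog no longer matches reality.
xml_text = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    "<greeting>¡Hola amigo!</greeting>"
)

# Route 1: feed the Unicode string; the declaration is not used for decoding.
from_string = ET.fromstring(xml_text)

# Route 2: feed raw bytes and declare the real encoding externally.
raw = xml_text.encode("utf-8")
parser = ET.XMLParser(encoding="utf-8")  # overrides the prolog's declaration
parser.feed(raw)
from_bytes = parser.close()

# For contrast: raw bytes with no external declaration trust the stale prolog.
naive = ET.fromstring(raw)

print(from_string.text)  # ¡Hola amigo!
print(from_bytes.text)   # ¡Hola amigo!
print(naive.text)        # Â¡Hola amigo!
```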

You didn't know what you were getting into, did you? Like I said, in general,
HTML forms and the server-side APIs for processing them are just not equipped
to be a general-purpose XML editing system, at least not in an idiot-proof
way. The culprits are really HTTP and MIME; HTML is just working around their
restrictions. And browser vendors choose the path of least disruption,
choosing not to implement some of HTML's features that could easily work
around some of these issues (e.g., they do have a way of transmitting encoding
info, but they just don't do it, to "keep people's scripts from breaking").

-- 
  Mike J. Brown   |  http://skew.org/~mike/resume/
  Denver, CO, USA |  http://skew.org/xml/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

