RE: [xsl] problem with processing CDATA tags in xml

Subject: RE: [xsl] problem with processing CDATA tags in xml
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Thu, 08 Apr 2010 11:21:46 -0400
At 09:07 AM 4/8/2010, Mike wrote:
> > Their product schema does not allow a <Value> to have
> > subtags... That's why they use CDATA.
>
> If they want Value to have subtags [properly, child elements] and the schema
> does not allow Value to have subtags, then either the schema is wrong and
> should be changed, or they are wrong and they should not be trying to create
> the subtags.

Quite so. The problem here is the input, the assumptions that go into its design, and the way that design has chosen to pretend requirements don't exist where they are inconvenient, until later, when they are needed but things have already been bollixed.


The use of CDATA marked sections as a syntactic feature is a giant red herring here.

Sometimes, well-meaning engineers want to avoid having markup inside elements so that it can be, for example, stored straightforwardly in a database that doesn't know how to handle XML.

Then someone decides they want markup after all, so it is snuck in using escaping -- pseudo-markup -- disguising it to get past the parser and into text fields. Using a CDATA marked section for this is common, but unfortunate. It seems to be different from escaping the markup one character at a time (i.e., '&lt;tag&gt;'), but it isn't, apart from being a bit neater and easier for a human being to read. That in turn gets everyone all hung up when the marked section is not preserved through transformation, even though the XML parser treats the two representations identically.
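
The equivalence is easy to check with any conforming parser. A minimal sketch in Python, using the standard library's xml.etree.ElementTree; the Value and tag names are only illustrative, not taken from the actual product schema:

```python
import xml.etree.ElementTree as ET

# The same pseudo-markup, written two ways.
escaped = "<Value>&lt;tag&gt;text&lt;/tag&gt;</Value>"
cdata = "<Value><![CDATA[<tag>text</tag>]]></Value>"

text_from_escaped = ET.fromstring(escaped).text
text_from_cdata = ET.fromstring(cdata).text

# The parser hands back the same character data either way ...
same_text = text_from_escaped == text_from_cdata

# ... and in neither case does Value have a child element.
child_count = len(ET.fromstring(cdata))

# Nothing in the parsed tree records that a CDATA section was used,
# so on re-serialization this serializer falls back to plain escaping.
reserialized = ET.tostring(ET.fromstring(cdata), encoding="unicode")
```

Here same_text is True, child_count is 0, and reserialized no longer contains a CDATA section, which is the "not preserved through transformation" effect that tends to surprise people.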

To complicate matters further, the CDATA marked section is generated using a vulnerable technique like disable-output-escaping to force CDATA delimiters into serialized output. Not only does this introduce an architectural dependency on the serialization process (which in itself might not be a problem in one's own case), but it breaks as soon as actual markup delimiters are present in the data, since a perfectly admissible bit of text like


text including less-than "<"

is wrapped in pseudo-markup and serialized, leading to

&lt;tag&gt;text including "&lt;"&lt;/tag&gt;

or the exactly equivalent

<![CDATA[<tag>text including "<"</tag>]]>

which is no better, of course, as neither can be parsed as soon as the data is unescaped and treated as if it were XML (which it isn't, and never really was). And this in turn leads to extra fun, like double-escaping.
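
The failure can be reproduced in a few lines. This is only a sketch in Python with xml.etree.ElementTree; the Value wrapper and the parses_as_xml helper are hypothetical names introduced for the illustration:

```python
import xml.etree.ElementTree as ET

def parses_as_xml(s):
    """Return True if s is a well-formed XML document, False otherwise."""
    try:
        ET.fromstring(s)
        return True
    except ET.ParseError:
        return False

# The outer document is well-formed: the CDATA section hides the payload.
wrapped = '<Value><![CDATA[<tag>text including less-than "<"</tag>]]></Value>'
outer_ok = parses_as_xml(wrapped)

# "Unescaping" just means taking the text content at face value.
payload = ET.fromstring(wrapped).text

# The payload contains a bare "<" that was never escaped at its own level,
# so it cannot be parsed as XML.
payload_ok = parses_as_xml(payload)
```

The outer document parses (outer_ok is True) while the extracted payload does not (payload_ok is False), because the literal "<" in the data and the "<" of the pseudo-tags have become indistinguishable.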

I think the lesson is twofold. First, if you really need XML at both ends, then don't use a process in the middle that does any sort of arbitrary text munging, most especially "tagging" (even or especially introducing delimiters for CDATA marked sections), without also taking care with the necessary character escaping. And second, if you really really have to do this, then you have to be extra sure -- by a combination of correct handling, validation of content and perhaps defensive measures like permissive (HTML) parsing -- that your fake XML doesn't collapse on you when you try to lean on it.
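
Both halves of that advice can be sketched in Python with xml.etree.ElementTree. The element names are illustrative, and the second variant is just one way of "taking care with the necessary character escaping", not a recommendation from the original thread:

```python
import xml.etree.ElementTree as ET

data = 'text including less-than "<"'

# Variant 1: real child elements, built through a tree API.
# The serializer escapes the data; no hand-made tags are involved.
value = ET.Element("Value")
tag = ET.SubElement(value, "tag")
tag.text = data
real_markup = ET.tostring(value, encoding="unicode")

# Variant 2: the fragment really must travel as text inside Value.
# Serialize the inner fragment properly first, so the data is escaped
# at its own level, then let the outer serializer escape the result.
inner = ET.Element("tag")
inner.text = data
inner_xml = ET.tostring(inner, encoding="unicode")

carrier = ET.Element("Value")
carrier.text = inner_xml
carried = ET.tostring(carrier, encoding="unicode")

# The receiving end parses the outer document, then parses its text content.
recovered = ET.fromstring(ET.fromstring(carried).text)
```

In the first variant the data round-trips with no special handling. In the second, the two levels of escaping are deliberate and symmetric, so recovered.text comes back equal to the original data; the trouble described above comes from escaping that is applied at one level and assumed at the other.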

Cheers,
Wendell



======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
