At 09:07 AM 4/8/2010, Mike wrote:
> Their product schema does not allow a <Value> to have
> subtags... That's why they use CDATA.
If they want Value to have subtags [properly, child elements] and the schema
does not allow Value to have subtags, then either the schema is wrong and
should be changed, or they are wrong and they should not be trying to create
the subtags.
Quite so. The problem here is the input, the assumptions that go into
its design, and the way that design has chosen to pretend
requirements don't exist where they are inconvenient, until later,
when they are needed but things have already been bollixed.
The use of CDATA marked sections as a syntactic feature is a giant
red herring here.
Sometimes, well-meaning engineers want to avoid having markup inside
elements so that it can be, for example, stored straightforwardly in
a database that doesn't know how to handle XML.
Then someone decides they want markup after all, so it is snuck in
using escaping -- pseudo-markup -- disguising it to get past the
parser and into text fields. Using a CDATA marked section for this is
common but unfortunate. It looks different from escaping the markup
one character at a time (i.e., '&lt;tag>'), but apart from being a bit
neater and easier for a human being to read, it isn't. Then everyone
gets hung up when the CDATA section is not preserved through
transformation, even though an XML parser treats the two
representations identically.
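
To make the equivalence concrete, here is a minimal sketch using Python's
standard xml.etree.ElementTree (my choice of tool for illustration, not
anything from the original thread; the <Value> content is made up). Both
spellings parse to the same string, and the CDATA delimiters do not
survive a round trip through the serializer:

```python
import xml.etree.ElementTree as ET

# Two spellings of the same pseudo-markup inside <Value> (sample data is hypothetical).
escaped = "<Value>&lt;tag&gt;some text&lt;/tag&gt;</Value>"
cdata = "<Value><![CDATA[<tag>some text</tag>]]></Value>"

escaped_text = ET.fromstring(escaped).text
cdata_text = ET.fromstring(cdata).text

# The parser reports plain character data either way; no child element exists.
print(escaped_text == cdata_text)
print(len(ET.fromstring(cdata)))  # number of child elements of <Value>

# Reserializing does not bring the CDATA section back; the text is escaped instead.
reserialized = ET.tostring(ET.fromstring(cdata), encoding="unicode")
print(reserialized)
```

Nothing downstream of the parser can tell which spelling was used, which
is why a transformation has no obligation to preserve it.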
To complicate matters further, the CDATA marked section is often generated
using a vulnerable technique like disable-output-escaping to force
CDATA delimiters into serialized output. Not only does this introduce
an architectural dependency on the serialization process (which in
itself might not be a problem in one's own case), but it breaks as soon as
actual markup delimiters are present in the data, since a perfectly
admissible bit of text like
text including less-than "<"
is wrapped in pseudo-markup and serialized, leading to
&lt;tag>text including "&lt;"&lt;/tag>
or the exactly equivalent
<![CDATA[<tag>text including "<"</tag>]]>
which is no better, of course, as neither can be parsed as soon as
the data is unescaped and treated as if it were XML (which it isn't,
and never really was). And this in turn leads to extra fun, like
double-escaping.
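
The failure is easy to reproduce. The following is only a sketch, again
assuming Python's xml.etree.ElementTree and a naive string-concatenation
step standing in for whatever process forces the CDATA delimiters into the
output (the variable names are mine):

```python
import xml.etree.ElementTree as ET

data = 'text including "<"'

# Naive "tagging": wrap the raw text in pseudo-markup with no escaping,
# then force it into a CDATA marked section by string concatenation.
pseudo = "<tag>" + data + "</tag>"
doc = "<Value><![CDATA[" + pseudo + "]]></Value>"

# The outer document is well-formed, so nothing complains yet.
inner = ET.fromstring(doc).text

# The trouble starts when the payload is unescaped and treated as XML.
try:
    ET.fromstring(inner)
    inner_is_well_formed = True
except ET.ParseError as err:
    inner_is_well_formed = False
    print("payload is not XML:", err)
```

The outer parse succeeds, so the problem stays hidden until some later
stage tries to read the payload as markup.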
I think the lesson is twofold. First, if you really need XML at both
ends, then don't use a process in the middle that does any sort of
arbitrary text munging, most especially "tagging" (even or especially
introducing delimiters for CDATA marked sections), without also
taking care with the necessary character escaping. And second, if you
really really have to do this, then you have to be extra sure -- by a
combination of correct handling, validation of content and perhaps
defensive measures like permissive (HTML) parsing -- that your fake
XML doesn't collapse on you when you try to lean on it.
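
As a rough illustration of "correct handling" plus "validation of
content", here is one way it might look in Python. The helper names
wrap_as_markup and embed are hypothetical, and this is a sketch under the
assumption that the payload must travel as text inside <Value>, not a
recommendation of the design:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def wrap_as_markup(tag_name, data):
    """Hypothetical helper: escape the character data before adding tags."""
    return "<{0}>{1}</{0}>".format(tag_name, escape(data))

def embed(fragment):
    """Hypothetical helper: check the fragment, then let the serializer escape it."""
    ET.fromstring(fragment)   # raises ET.ParseError if the fragment is not well-formed
    value = ET.Element("Value")
    value.text = fragment     # stored as text; escaping is applied on serialization
    return ET.tostring(value, encoding="unicode")

fragment = wrap_as_markup("tag", 'text including "<"')
doc = embed(fragment)
print(doc)

# The receiving end: parse the outer document, then parse the payload as XML.
recovered = ET.fromstring(ET.fromstring(doc).text)
print(recovered.text)
```

Note that the serialized document contains an escaped ampersand in front
of "lt;". Here that nesting is deliberate: one level of escaping belongs
to the payload and one to the carrier, and each is undone by exactly one
parse.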
Cheers,
Wendell
======================================================================
Wendell Piez mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================