Re: GOTCHA!

Subject: Re: GOTCHA!
From: "Oren Ben-Kiki" <oren@xxxxxxxxxxxxx>
Date: Fri, 15 Jan 1999 12:31:52 +0200
James Clark <jjc@xxxxxxxxxx> wrote:

>I wrote:
>> Now this is a hack! You stepped on another XT bug here - or a specs bug. I
>> checked the following:
>>
>> <xsl:template match="A">
>> <xsl:pi name="JavaScript">
>>  <xsl:text><![CDATA[<&>]]></xsl:text></xsl:pi>
>> </xsl:template>
>>
>> And got in the result:
>>
>> <?JavaScript <&>>
>
>Not an XT bug or a specs bug. You would get
>
><?JavaScript <&>?>
>
>which *is* well-formed. Remember that in XML a PI is terminated by ?>.

Checked again. XT version 0.5 emits '>' and not '?>' at the end of a PI.
Also, it would emit '?>' inside the content without converting it to '? >'
as per the spec. But you are right - there are three constructs which avoid
markup (comments, PIs and CDATA), and we already have access to two of them.
I'd still argue that being able to generate all possible XML text files (as
opposed to all possible XML in-memory representations) has its value, but I
understand why that would be lower priority.
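
To make the expected behaviour concrete, here is a minimal sketch in Python
(not XT's actual code; the function name serialize_pi is made up for this
example) of how a serializer could write a PI: close it with '?>' and break
up any '?>' inside the content as '? >'.

```python
def serialize_pi(target: str, data: str) -> str:
    """Write a processing instruction as XML text (illustrative only)."""
    # A PI ends at the first '?>', so that sequence must not survive in the data.
    safe = data.replace("?>", "? >")
    # '<' and '&' need no escaping inside a PI.
    return "<?" + target + " " + safe + "?>"


print(serialize_pi("JavaScript", "<&>"))  # <?JavaScript <&>?>
```

With this, the stylesheet above would give the well-formed result James
describes.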

>How often will you get ?> in Javascript? Less often than ]]> I suspect.

I believe that '?>' isn't valid JavaScript. It might appear in strings, of
course... But strings in embedded scripts are a whole painful issue by
themselves :-)

>> Or could I expect an XML/XSL processor to be smart enough to use
>> different character quoting rules within a <SCRIPT> tag?
>
>Right.


I've tried to understand how this works - it does work, to my great
surprise. I went back to the documentation...

The XML spec insists that unadorned '<' and '&' can appear only inside CDATA
sections, a PI, or a comment (section 2.4). Section 2.7 describes CDATA
sections and makes it clear they always begin with "<![CDATA[" and end with
"]]>". Section 3.2 discusses element types. It lists '#PCDATA' as a possible
type in 3.2.2 (without giving its definition, or even a link to somewhere
where it is defined - strange). It does _not_ list 'CDATA' as a valid type.
XSL is expected to always emit valid XML. And yet...

The HTML 4.0 spec does specify CDATA as the value type for the SCRIPT element
(and many other things), with a link to the _SGML_ standard. Obviously HTML
4.0 isn't XML. Yet it is a valid result-ns for XSL, and the XT processor
emits what seems to be CDATA for SCRIPT tags. Should be illegal...

The explanation is in section 2.2 of the XSL draft. In an editorial note it
states that it is possible to use the result-ns to specify non-XML output,
and lists HTML as an example. Although this is just an editorial note, _it
explicitly caters to non-XML output_. Who said the W3C isn't responsive? They
are just being shy about it, so they put it in small letters :-) In fact, it
is a very elegant way of solving the problem - it limits the damage to a
single attribute of a single tag. Neat!
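
As I understand it, the effect on the output side is roughly the following.
This is only a sketch in Python, not XT's implementation; the names
RAW_TEXT_ELEMENTS and serialize_element are invented for the example, and I
list only SCRIPT because that is the element under discussion.

```python
# Elements treated as SGML CDATA content in this sketch (assumed list;
# only SCRIPT is taken from the discussion above).
RAW_TEXT_ELEMENTS = {"SCRIPT"}


def escape_text(text: str) -> str:
    """Escape markup characters the way ordinary XML output requires."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def serialize_element(name: str, text: str, html_output: bool) -> str:
    """Write an element holding only text, escaping according to the output type."""
    if html_output and name.upper() in RAW_TEXT_ELEMENTS:
        body = text  # HTML result-ns: emit the script text untouched
    else:
        body = escape_text(text)  # XML output: '<' and '&' must be escaped
    return "<" + name + ">" + body + "</" + name + ">"


print(serialize_element("SCRIPT", "if (a < b && c) x();", html_output=True))
print(serialize_element("SCRIPT", "if (a < b && c) x();", html_output=False))
```

The only thing that changes between the two calls is the output type, which
is why the result-ns attribute alone is enough to switch the behaviour.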

Even better, this trick has the potential to settle this issue once and for
all. Consider adding an 'http://www.w3c.org/TR/rec-cdata' result-ns. This
result-ns would specify that all output elements have the content type
'CDATA', so that any text emitted by the stylesheet would not be marked up,
ever. This can't be done in an XML DTD, but neither can the HTML one.
Stylesheets using this result-ns would probably not bother to generate
elements, anyway; by using just <xsl:text> etc. they'll generate output in
arbitrary formats - without changing anything in the XSL standard itself.

>> It would also have
>> to examine the LANGUAGE attribute for it...
>
>Huh? SCRIPT in HTML 4.0 is an SGML CDATA element, which means that when
>outputting it, & and < must not be escaped to &amp; and &lt;.  This is
>independent of the scripting language.


Right. Sorry. I was thinking about the quoted strings problem - the need to
take some text and quote it so that it may be safely embedded in a
scripting language string; this would be different between scripting
languages. It's really a variant of the arbitrary text formatting issue. If
<xsl:ecmascript-string> is unacceptable, how about adding a perl-like regexp
capability to <xsl:text>? <xsl:text transform='s:["\\]:\\&amp;:g'> would do
wonders :-)
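
For anyone who doesn't read regexps fluently, the substitution I have in mind
puts a backslash in front of every double quote and backslash. A rough Python
equivalent (the function name quote_for_script_string is hypothetical, and a
real quoting routine would also have to deal with newlines and the like):

```python
import re


def quote_for_script_string(text: str) -> str:
    """Backslash-escape double quotes and backslashes (illustrative only)."""
    # r'["\\]' matches one double quote or one backslash;
    # r'\\\g<0>' emits a backslash followed by the matched character.
    return re.sub(r'["\\]', r'\\\g<0>', text)


print(quote_for_script_string('say "hi"'))  # say \"hi\"
```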

BTW, a final hack which works in XT, if the result-ns is HTML, and would
probably work in other processors as well:

<xsl:template match="...">
<SCRIPT>
<xsl:text><![CDATA[</SCRIPT>]]>
Anything you want - &lt;, &amp;, &gt;
<![CDATA[<SCRIPT>]]></xsl:text>
</SCRIPT>
</xsl:template>

Emits:

<SCRIPT></SCRIPT>
Anything you want - <, &, >
<SCRIPT></SCRIPT>

Where there's a will, there's a way :-)

    Oren Ben-Kiki


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

