As far as I can see, there are two ways (not including any binary/base64
based methods).
1) convert-to/ensure well-formedness and confine to a dedicated element
2) shove it into a CDATA section and pretend there is no markup there
Option 1:
has the disadvantage of needing a pre-processor such as HTML Tidy
to try to ensure well-formedness. Even then, there is no guarantee that
the process will work, and the input may still be rejected.
On the other hand, there is the advantage of being able to access the
content semantics provided by the HTML markup.
In the past I have done this and then done an identity transform on the
subtree (HTML) of the dedicated HTML content element.
i.e.
<xsl:template match="//myXHtmlContent//node() | @*" mode="copyXHtml">
  <xsl:copy>
    <xsl:apply-templates
      select="node()|@*"
      mode="copyXHtml"
    />
  </xsl:copy>
</xsl:template>
Using this allows me to "process" some of the markup and apply rules
such as deleting any <blink/> elements or whatever.
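
A minimal sketch of such a rule, assuming XSLT 1.0 and that the embedded
markup is in no namespace (if it is in the XHTML namespace, the element
names below need a prefix bound to it). The template matching
myXHtmlContent and the <script> rule are illustrative additions, not
part of what I described above. The explicit priority is there because a
multi-step pattern like the one in the identity template has a default
priority of 0.5, which would otherwise beat a plain name test (priority 0).

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Entry point (illustrative): switch to the copy mode for
       everything inside the dedicated element. -->
  <xsl:template match="myXHtmlContent">
    <xsl:apply-templates select="node()" mode="copyXHtml"/>
  </xsl:template>

  <!-- Identity transform for the HTML subtree, as above. -->
  <xsl:template match="//myXHtmlContent//node() | @*" mode="copyXHtml">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*" mode="copyXHtml"/>
    </xsl:copy>
  </xsl:template>

  <!-- Remove the <blink> wrapper but keep whatever it contains. -->
  <xsl:template match="blink" mode="copyXHtml" priority="1">
    <xsl:apply-templates select="node()" mode="copyXHtml"/>
  </xsl:template>

  <!-- Hypothetical second rule: drop <script> and its content. -->
  <xsl:template match="script" mode="copyXHtml" priority="1"/>

</xsl:stylesheet>
```

An empty template discards the element together with its content;
applying templates to node() inside it, as in the <blink> rule, discards
only the tags.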
Option 2:
has the advantage of safely accepting *all* [suitably encoded] content
provided the "]]>" character sequence is escaped. The disadvantage is
that the content is now dead-end data. Also, when transforming it with
XSLT, you have to remember to set disable-output-escaping="yes" if you
plan on sending that HTML content to a browser for rendering.
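
A rough sketch of what option 2 looks like in practice. The element name
myHtmlContent and the wrapping <div> are made up for illustration. A
"]]>" inside the content is escaped here by closing the CDATA section
after "]]" and opening a new one for the ">".

```xml
<!-- Input: the HTML travels as opaque text inside CDATA. -->
<myHtmlContent><![CDATA[<p>Unclosed <b>markup<br> and a literal ]]]]><![CDATA[> sequence]]></myHtmlContent>
```

```xml
<!-- Stylesheet fragment: emit the text unescaped so the browser
     sees markup rather than &lt;p&gt;. -->
<xsl:template match="myHtmlContent">
  <div class="content">
    <xsl:value-of select="." disable-output-escaping="yes"/>
  </div>
</xsl:template>
```

Note that disable-output-escaping is an optional feature in XSLT 1.0 and
only has an effect when the processor itself serializes the result, so
it is worth checking how your processor and pipeline behave.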
So there are disadvantages with both options. I'd like to know how
other people have approached this problem and I'm keen on any advice.
In particular, if people have used option 1, what is the best way to
ensure ad-hoc HTML becomes well-formed (assuming you have no control
over the composition domain)?
--
Terence Kearns ~ ph: +61 2 6201 5516
IT Database/Applications Developer
Enterprise Information Systems
Client Services Division
University of Canberra
www.canberra.edu.au
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list