RE: [xsl] Ingoring HTML

Subject: RE: [xsl] Ingoring HTML
From: Jay Burgess <lists@xxxxxxxxxxx>
Date: Fri, 17 Jun 2005 13:21:56 -0700
Jon,

Thank you very much for all of the information--especially on a Friday
afternoon. :)  You've confirmed that it's not just a flag I set somewhere, so
I'll dig into it and get it solved.

Thanks again.

Jay

-----Original Message-----
From: Jon Gorman [mailto:jonathan.gorman@xxxxxxxxx] 
Sent: Friday, June 17, 2005 3:14 PM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] Ingoring HTML

On 6/17/05, Jay Burgess <lists@xxxxxxxxxxx> wrote:
> I apologize if this is in the FAQ, but I've searched and can't find it.  (I'm
> kind of new to XSL, so I may just have not seen it.)

This is a FAQ of sorts, but I had a bit of a difficult time
finding an answer to it in Dave Pawson's FAQ as well.  Of course, I
only took a quick glance.  I'd recommend skimming the CDATA section
as well.

> 
> I've got some XML that contains HTML-formatted text.  For example:
> 
> <title>&lt;SPAN style="font-size: 13pt; font-family: Verdana; &gt;The
> &lt;b&gt;Text&lt;/b&gt; That I Want&lt;/SPAN&gt;</title>
> 

"HTML-formatted text" is a little bit nonsensical.  HTML itself says
that &lt; is a stand-in for a literal <, so what you have there is not a
tag.  Since namespaces were rather slow to get off the ground, we ended
up seeing people put so-called "HTML" in XML *cough* RSS *cough*.  But
to any XML application, this is one big chunk of text.

So, some possible advice:

1) if you can change the input format so that it uses namespaces and
actually embeds real XHTML into the documents you're creating, do so. 
Or at least have it be an option.

2) If you can't do that, I'm sure you can find a more general solution
if you hunt through the archives.  The essential solution will
probably be along the lines of looking for each &lt; and &gt; pair and
throwing out any text between them via some of the XPath/XSLT string
functions.  It might be much easier with XSLT 2.0.

3) It may be possible with a combination of d-o-e
(disable-output-escaping) and multiple transformations, regex
scripting or other techniques to replace the various &lt; and &gt; in
certain elements but not others, then reprocess that document through
your final stylesheet.  Of course, this is slightly dangerous, since
nothing guarantees the unescaped markup will be well-formed.
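
To make option 2 a bit more concrete, here is a minimal XSLT 1.0 sketch
of the string-function approach.  The template name "strip-tags" and its
"text" parameter are names I made up for illustration, and it assumes
the escaped markup never contains a > inside an attribute value and that
the text has no stray < that is meant literally.  Treat it as a starting
point, not a finished solution.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hand the string value of each title to the stripping template. -->
  <xsl:template match="title">
    <xsl:call-template name="strip-tags">
      <xsl:with-param name="text" select="string(.)"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Hypothetical helper: recursively drops everything between a <
       and the next >, keeping the text outside those pairs. -->
  <xsl:template name="strip-tags">
    <xsl:param name="text"/>
    <xsl:choose>
      <!-- Only strip when a < is followed somewhere by a >. -->
      <xsl:when test="contains($text, '&lt;') and
                      contains(substring-after($text, '&lt;'), '&gt;')">
        <!-- Emit the text that precedes the pseudo-tag. -->
        <xsl:value-of select="substring-before($text, '&lt;')"/>
        <!-- Recurse on whatever follows the closing >. -->
        <xsl:call-template name="strip-tags">
          <xsl:with-param name="text"
              select="substring-after(substring-after($text, '&lt;'), '&gt;')"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <!-- No complete pseudo-tag left: emit the remainder as-is. -->
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Run against the <title> from the question, this should leave just the
words "The Text That I Want", with the original line break preserved.
With an XSLT 2.0 processor the whole named template could probably
shrink to a single call such as
replace(., '&lt;[^&gt;]*&gt;', ''), under the same assumptions.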

Dig through the archives; there might be a more general solution
already done, or someone else will be able to give you one instead of
just giving you some ranting.  (I blame Friday afternoon and a slow
server for my current long-winded explanation of why this type of
embedding is evil.)

Short answer, it's probably not difficult as long as it's relatively
straightforward.  If the "html" inside the xml is complex at all or
you are using &lt; in other places, you might have difficulty.

It's extremely simple if you can just have the input source use
namespaces and you're comfortable with how XSLT deals with namespaces.
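
For comparison, here is a rough sketch of the namespaced case.  The
rewritten input shown in the comment and the "h" prefix are my own
assumptions about what the source could emit, not something taken from
your actual data.  Once the markup is real elements, the string value of
<title> is already the text without the tags, so there is nothing to
strip.

```xml
<?xml version="1.0"?>
<!-- Assumed input (a hypothetical rewrite of the sample in the question):

     <title xmlns:h="http://www.w3.org/1999/xhtml"><h:span
         style="font-size: 13pt; font-family: Verdana;">The
         <h:b>Text</h:b> That I Want</h:span></title>
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- The string value of an element is the concatenation of all of its
       descendant text nodes, so the span and b elements simply vanish. -->
  <xsl:template match="title">
    <xsl:value-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

If you later want to keep or restyle the markup instead of ignoring it,
you would declare the XHTML namespace in the stylesheet and write
templates that match the prefixed elements, for example h:b.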

Jon Gorman
