Re: [xsl] Ingoring HTML

Subject: Re: [xsl] Ingoring HTML
From: "Sam D. Chuparkoff" <sdc@xxxxxxxxxx>
Date: Fri, 17 Jun 2005 13:39:59 -0700
On the dangerous side, I'd try something like:

perl -ne '$c.=$_;eof&&($c=~s/&lt;(([^<>](?!&lt;))*?)&gt;//sg&print$c);' foo.xml

Because it will probably be fine. For extra danger points, you can put
it in a Makefile with no comment.

You should be able to do something similar with xsl, but of course this
isn't very safe, and I think it would be a lot more complicated.

s/&lt;(([^<>](?!&lt;))*?)&gt;//sg;

This matches '&lt;', some text, and '&gt;', where the text has no
intervening '&lt;', '<', or '>', and replaces the whole match with
nothing. I thought about actually trying to turn this content into xml,
but note there's no close quote on that style attribute in the sample!
Watch out!
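
For readers who would rather not decode the one-liner, here is a minimal
sketch of the same substitution in Python. It is only a transliteration
of the regex above, run against the raw (unparsed) file text; the
function name and the sample string (copied from the quoted message
below) are there for illustration, and the same "probably fine" caveats
apply.

```python
import re

# Same pattern as the Perl one-liner: an escaped "&lt;", then the shortest
# run of characters that are neither a literal '<' nor '>' and are not
# followed by another "&lt;", then an escaped "&gt;".
# re.DOTALL mirrors Perl's /s flag (it has no effect on this pattern, since
# the pattern contains no '.', but it keeps the two versions aligned).
ESCAPED_TAG = re.compile(r"&lt;(([^<>](?!&lt;))*?)&gt;", re.DOTALL)


def strip_escaped_tags(raw_xml: str) -> str:
    """Remove every escaped tag from the raw XML text (hypothetical helper)."""
    return ESCAPED_TAG.sub("", raw_xml)


# The sample element from the original question, including the missing
# close quote on the style attribute.
sample = (
    '<title>&lt;SPAN style="font-size: 13pt; font-family: Verdana; &gt;The\n'
    '&lt;b&gt;Text&lt;/b&gt; That I Want&lt;/SPAN&gt;</title>'
)

print(strip_escaped_tags(sample))
```

Because the pattern never tries to parse attributes, the unclosed quote
does no harm here. A lone '&lt;' that is not part of a tag is left
alone, as long as another '&lt;' turns up before the next '&gt;'.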

sdc

On Fri, 2005-06-17 at 15:13 -0500, Jon Gorman wrote:
> On 6/17/05, Jay Burgess <lists@xxxxxxxxxxx> wrote:
> > I apologize if this is in the FAQ, but I've searched and can't find it.  (I'm
> > kind of new to XSL, so I may just have not seen it.)
> 
> This is a FAQ of sorts, but I had a little bit of a difficult time
> finding an answer to it in Dave Pawson's FAQ as well.  Of course, I
> just did a quick glance.  I'd recommend skimming the CDATA section
> as well.
> 
> > 
> > I've got some XML that contains HTML-formatted text.  For example:
> > 
> > <title>&lt;SPAN style="font-size: 13pt; font-family: Verdana; &gt;The
> > &lt;b&gt;Text&lt;/b&gt; That I Want&lt;/SPAN&gt;</title>
> > 
> 
> "HTML-formatted text" is a little bit nonsensical.  HTML itself says
> that &lt; is meant as a stand-in for <, so when you have it, it's not a
> tag.  Since namespaces were rather slow to get off to a start, we ended
> up seeing people put so-called "HTML" in XML *cough* RSS *cough*.  But
> to any XML application, this is one big chunk of text.
> 
> So, some possible advice:
> 
> 1) if you can change the input format so that it uses namespaces and
> actually embeds real XHTML into the documents you're creating, do so. 
> Or at least have it be an option.
> 
> 2) If you can't do that, I'm sure you can find a more general solution
> if you hunt through the archives.  The essential solution will
> probably be along the lines of looking for &lt; and &gt;s and throwing
> any text in them out via some of the XPath/XSLT string functions. 
> Might be much easier with XSLT 2.0.
> 
> 3) It may be possible with a combination of d-o-e and doing multiple
> transformations, regex scripting or other techniques to replace the
> various &lt; and &gt; in certain elements but not others, then
> reprocess that document through your final stylesheet.  Of course, this
> makes it slightly dangerous.
> 
> Dig through the archives; there might be a more general solution
> already done, or someone else will be able to give you one instead of
> just giving you some ranting.  (I blame Friday afternoon and a slow
> server for my current long-winded explanation why this type of
> embedding is evil).
> 
> Short answer, it's probably not difficult as long as it's relatively
> straightforward.  If the "html" inside the xml is complex at all or
> you are using &lt; in other places, you might have difficulty.
> 
> Extremely simple if you can just have the input source use namespaces
> and you're comfortable with how XSLT deals with namespaces.
> 
> Jon Gorman
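
Option 2 in the quoted message (find the brackets and throw away what is
between them using string functions) can be sketched as follows. This is
Python, not XSLT, and only an illustration of the idea: the two
partition() calls stand in for what a recursive XSLT 1.0 named template
might do with substring-before() and substring-after(). Note that a
stylesheet sees the parsed string value of the element, where each &lt;
has already become a literal '<', so the sketch works on that form. The
function name is hypothetical.

```python
def strip_tags(text: str) -> str:
    """Drop every '<...>' run from a string, one tag per recursive call."""
    # Roughly substring-before(text, '<') and substring-after(text, '<').
    before, lt, rest = text.partition("<")
    if not lt:
        return text  # no '<' left, nothing more to strip
    # Everything up to the next '>' is treated as the tag body and discarded.
    _inside, gt, after = rest.partition(">")
    if not gt:
        return text  # unmatched '<': leave the remainder untouched
    return before + strip_tags(after)


# String value of the sample <title> element after XML parsing, still with
# the missing close quote on the style attribute.
title_value = (
    '<SPAN style="font-size: 13pt; font-family: Verdana; >The\n'
    '<b>Text</b> That I Want</SPAN>'
)

print(strip_tags(title_value))
```

As with the regex, this treats any '<' followed later by a '>' as a
tag, so text that uses those characters for other purposes would be
damaged.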
