RE: [xsl] Ingoring HTML - A Solution

Subject: RE: [xsl] Ingoring HTML - A Solution
From: Jay Burgess <lists@xxxxxxxxxxx>
Date: Tue, 21 Jun 2005 08:01:38 -0700
I thought I'd post a solution to my request last week to remove "HTML tags" from
a block of XML.  There may be a better way to do this, but this seems to work in
my case. Thanks for everyone's input.

<xsl:template name="strip-HTML">
    <xsl:param name="text"/>
    <xsl:choose>
        <xsl:when test="contains($text, '&gt;')">
            <xsl:choose>
                <xsl:when test="contains($text, '&lt;')">
                    <xsl:value-of select="substring-before($text, '&lt;')"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="substring-before($text, '&gt;')"/>
                </xsl:otherwise>
            </xsl:choose>
            <xsl:call-template name="strip-HTML">
                <xsl:with-param name="text" select="substring-after($text, '&gt;')"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$text"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>
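
A minimal sketch of how the template might be called from the same stylesheet, assuming the escaped markup sits in a title element as in the sample from my original question. The match pattern is only an assumption about the input, so adjust it to whatever element actually holds the text.

```xml
<!-- Hypothetical caller: the match pattern "title" is an assumption taken
     from the sample input, not something the template itself requires. -->
<xsl:template match="title">
    <xsl:copy>
        <xsl:call-template name="strip-HTML">
            <!-- Pass the element's string value. By this point the parser
                 has already turned &lt; and &gt; into literal < and >. -->
            <xsl:with-param name="text" select="."/>
        </xsl:call-template>
    </xsl:copy>
</xsl:template>
```

One caveat, as far as I can tell: because the template keys off '>' first, a bare '>' in the text that is not closing a tag gets dropped, so "a > b" comes out as "a  b".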

Jay

| Jay Burgess [Vertical Technology Group]
| "Essential Technology Links via RSS"
| http://www.vtgroup.com/

> Re: [xsl] Ingoring HTML
> Subject: Re: [xsl] Ingoring HTML
> From: "Sam D. Chuparkoff" <sdc@xxxxxxxxxx>
> Date: Fri, 17 Jun 2005 13:39:59 -0700
> 
> On the dangerous side, I'd try something like:
> 
> perl -ne '$c.=$_;eof&&($c=~s/&lt;(([^<>](?!&lt;))*?)&gt;//sg&print$c);' foo.xml
> 
> Because it will probably be fine. For extra danger points, you can put
> it in a Makefile with no comment.
> 
> You should be able to do something similar with xsl, but of course this
> isn't very safe, and I think it would be a lot more complicated.
> 
> s/&lt;(([^<>](?!&lt;))*?)&gt;//sg;
> 
> This is '&lt;' some text '&gt;' with no intervening '&lt;', '<', or '>'
> replaced with nothing. I thought about actually trying to turn this
> content into xml, but note there's no close quote on that style
> attribute! Watch out!
> 
> sdc
> 
> On Fri, 2005-06-17 at 15:13 -0500, Jon Gorman wrote:
> On 6/17/05, Jay Burgess <lists@xxxxxxxxxxx> wrote:
> > I apologize if this is in the FAQ, but I've searched and can't find it.  (I'm
> > kind of new to XSL, so I may just have not seen it.)
> 
> This is a FAQ of sorts, but I also had a bit of a difficult time
> finding an answer to it in Dave Pawson's FAQ.  Of course, I
> just did a quick glance.  I'd recommend skimming the CDATA section
> as well.
> 
> > 
> > I've got some XML that contains HTML-formatted text.  For example:
> > 
> > <title>&lt;SPAN style="font-size: 13pt; font-family: Verdana; &gt;The
> > &lt;b&gt;Text&lt;/b&gt; That I Want&lt;/SPAN&gt;</title>
> > 
> 
> "HTML-formatted text" is a little bit nonsensical.  HTML itself says
> that &lt; is meant as a stand-in for <, so when you have it, it's not a
> tag.  Since namespaces were rather slow to get started, we ended
> up seeing people put so-called "HTML" in XML *cough* RSS *cough*.  But
> to any XML application, this is one big chunk of text.
> 
> So, some possible advice:
> 
> 1) if you can change the input format so that it uses namespaces and
> actually embeds real XHTML into the documents you're creating, do so. 
> Or at least have it be an option.
> 
> 2) If you can't do that, I'm sure you can find a more general solution
> if you hunt through the archives.  The essential solution will
> probably be along the lines of looking for &lt; and &gt; and throwing
> out any text between them via some of the XPath/XSLT string functions.
> It might be much easier with XSLT 2.0.
> 
> 3) It may be possible, with a combination of d-o-e
> (disable-output-escaping), multiple transformations, regex scripting,
> or other techniques, to replace the various &lt; and &gt; in certain
> elements but not others, then reprocess that document through your
> final stylesheet.  Of course, this makes it slightly dangerous.
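
For what it's worth, the XSLT 2.0 route mentioned in (2) might look something like the sketch below. It assumes an XSLT 2.0 processor (Saxon, for instance) and a title element as in the sample input; both are assumptions, not requirements from the thread. It deletes any run from '<' to the next '>' with no other angle bracket in between, which is the same idea as the string-function approach, so it is no safer when stray '<' or '>' characters appear in real text.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- Hypothetical match pattern; "title" comes from the sample input. -->
    <xsl:template match="title">
        <xsl:copy>
            <!-- replace() is XPath 2.0. The regex matches a literal '<',
                 any run of characters other than '<' or '>', then '>'. -->
            <xsl:value-of select="replace(., '&lt;[^&lt;&gt;]*&gt;', '')"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```

Everything outside title falls through to the built-in template rules, so a real stylesheet would likely add its own templates for the rest of the document.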
> 
> Dig through the archives; there might be a more general solution
> already done, or someone else will be able to give you one instead of
> just giving you some ranting.  (I blame Friday afternoon and a slow
> server for my current long-winded explanation of why this type of
> embedding is evil.)
> 
> Short answer: it's probably not difficult as long as the embedded
> markup is relatively straightforward.  If the "html" inside the xml is
> complex at all or you are using &lt; in other places, you might have
> difficulty.
> 
> It's extremely simple if you can just have the input source use
> namespaces and you're comfortable with how XSLT deals with namespaces.
> 
> Jon Gorman
