Subject: RE: [xsl] Ingoring HTML - A Solution
From: Jay Burgess <lists@xxxxxxxxxxx>
Date: Tue, 21 Jun 2005 08:01:38 -0700

I thought I'd post a solution to my request last week to remove "HTML tags" from
a block of XML. There may be a better way to do this, but this seems to work in
my case. Thanks for everyone's input.
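
<xsl:template name="strip-HTML">
  <xsl:param name="text"/>
  <xsl:choose>
    <!-- At least one tag end remains in the text. -->
    <xsl:when test="contains($text, '&gt;')">
      <xsl:choose>
        <!-- Output whatever precedes the next tag start... -->
        <xsl:when test="contains($text, '&lt;')">
          <xsl:value-of select="substring-before($text, '&lt;')"/>
        </xsl:when>
        <!-- ...or, with no tag start left, whatever precedes the tag end. -->
        <xsl:otherwise>
          <xsl:value-of select="substring-before($text, '&gt;')"/>
        </xsl:otherwise>
      </xsl:choose>
      <!-- Recurse on the text that follows the tag end. -->
      <xsl:call-template name="strip-HTML">
        <xsl:with-param name="text" select="substring-after($text, '&gt;')"/>
      </xsl:call-template>
    </xsl:when>
    <!-- No tags left: output the remaining text unchanged. -->
    <xsl:otherwise>
      <xsl:value-of select="$text"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>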
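
A call might look roughly like this. It is only a sketch: the match pattern "title" is taken from the sample input quoted below and is an assumption about where the escaped markup lives.

```xml
<!-- Sketch only: assumes a <title> element whose text holds the escaped markup. -->
<xsl:template match="title">
  <xsl:call-template name="strip-HTML">
    <!-- Pass the element's string value as the text to clean. -->
    <xsl:with-param name="text" select="string(.)"/>
  </xsl:call-template>
</xsl:template>
```

For that sample title, the tags drop out and only the text is left (The Text That I Want, plus whatever whitespace was in the source).
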
Jay
| Jay Burgess [Vertical Technology Group]
| "Essential Technology Links via RSS"
| http://www.vtgroup.com/
> Subject: Re: [xsl] Ingoring HTML
> From: "Sam D. Chuparkoff" <sdc@xxxxxxxxxx>
> Date: Fri, 17 Jun 2005 13:39:59 -0700
>
> On the dangerous side, I'd try something like:
>
> perl -ne '$c.=$_;eof&&($c=~s/&lt;(([^<>](?!&lt;))*?)&gt;//sg&print$c);' foo.xml
>
> Because it will probably be fine. For extra danger points, you can put
> it in a Makefile with no comment.
>
> You should be able to do something similar with xsl, but of course this
> isn't very safe, and I think it would be a lot more complicated.
>
> s/&lt;(([^<>](?!&lt;))*?)&gt;//sg;
>
> This is '&lt;' some text '&gt;' with no intervening '&lt;', '<', or '>'
> replaced with nothing. I thought about actually trying to turn this
> content into xml, but note there's no close quote on that style
> attribute! Watch out!
>
> sdc
>
> On Fri, 2005-06-17 at 15:13 -0500, Jon Gorman wrote:
> On 6/17/05, Jay Burgess <lists@xxxxxxxxxxx> wrote:
> > I apologize if this is in the FAQ, but I've searched and can't find it. (I'm
> > kind of new to XSL, so I may just have not seen it.)
>
> This is a faq of sorts, but I had a little bit of a difficult time
> finding an answer to it in Dave Pawson's FAQ as well. Of course, I
> just did a quick glance. I'd recommend skimming the CDATA section
> as well.
>
> >
> > I've got some XML that contains HTML-formatted text. For example:
> >
> > <title>&lt;SPAN style="font-size: 13pt; font-family: Verdana; &gt;The
> > &lt;b&gt;Text&lt;/b&gt; That I Want&lt;/SPAN&gt;</title>
> >
>
> "HTML-formatted text" is a little bit nonsensical. HTML itself says
> that &lt; is meant as a stand-in for <, so when you have it, it's not a
> tag. Since namespaces were rather slow to get off to a start, we ended
> up seeing people put so-called "HTML" in XML *cough* RSS *cough*. But
> to any XML application, this is one big chunk of text.
>
> So, some possible advice:
>
> 1) if you can change the input format so that it uses namespaces and
> actually embeds real XHTML into the documents you're creating, do so.
> Or at least have it be an option.
>
> 2) If you can't do that, I'm sure you can find a more general solution
> if you hunt through the archives. The essential solution will
> probably be along the lines of looking for &lt; and &gt;s and throwing
> any text between them out via some of the XPath/XSLT string functions.
> Might be much easier with XSLT 2.0.
>
> 3) It may be possible with a combination of d-o-e and doing multiple
> transformations, regex scripting or other techniques to replace the
> various &lt; and &gt; in certain elements but not others, then
> reprocess that document through your final stylesheet. Of course, this
> makes it slightly dangerous.
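>
> As a rough sketch of option 2 under XSLT 2.0: the replace() function
> can drop anything that looks like a tag in one pass. This assumes a
> 2.0 processor, and the template name strip-tags is made up for
> illustration.
>
> ```xml
> <!-- Sketch only: requires XSLT 2.0; "strip-tags" is a hypothetical name. -->
> <xsl:template name="strip-tags">
>   <xsl:param name="text"/>
>   <!-- Delete each tag start, any run of characters up to the tag end, and the tag end. -->
>   <xsl:value-of select="replace($text, '&lt;[^&gt;]*&gt;', '')"/>
> </xsl:template>
> ```
>
> This treats every '<' ... '>' pair as a tag, so a literal comparison
> sign in the text would confuse it.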
>
> Dig through the archives; there might be a more general solution
> already done, or someone else will be able to give you one instead of
> just giving you some ranting. (I blame Friday afternoon and a slow
> server for my current long-winded explanation why this type of
> embedding is evil).
>
> Short answer, it's probably not difficult as long as it's relatively
> straightforward. If the "html" inside the xml is complex at all or
> you are using &lt; in other places, you might have difficulty.
>
> Extremely simple if you can just have the input source use namespaces
> and you're comfortable with how XSLT deals with namespaces.
>
> Jon Gorman