Re: [xsl] RSS feeds and disable-output-escaping="yes"

Subject: Re: [xsl] RSS feeds and disable-output-escaping="yes"
From: David Carlisle <davidc@xxxxxxxxx>
Date: Fri, 6 May 2005 12:03:31 +0100
> It's likely that the HTML isn't well-formed XML, so you're going to have to
> extract it as a string, put it through the tidy utility, parse it, and get
> it back into the stylesheet in tree form before you can manipulate it at the
> node level. 
> 
> I would tend to do this as a non-XSLT stage in a processing pipeline; you
> could also do it by calling out to an extension function.
> 

Of course Michael is probably still using XSLT1. Some of us have moved
up to XSLT2 (there's a nice implementation called saxon8...), in which
case you can handle a fair amount of "non-well-formed HTML as a string"
just using XSLT2 functions.


e.g.


h.xml:


<greeting><![CDATA[<P>Hello, <i>world!</P>]]></greeting>


h.xsl:

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:d="data:,dpc"
exclude-result-prefixes="d">

<xsl:import href="http://www.dcarlisle.demon.co.uk/htmlparse.xsl"/>

<xsl:output method="html"/>
<xsl:template match="/">
<html>
<head>
<title>Today's greeting</title>
</head>
<body>
<xsl:copy-of select="d:htmlparse(string(greeting[1]),'',true())/node()"/>
</body>
</html>
</xsl:template>


</xsl:stylesheet>



$ saxon8 h.xml  h.xsl
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <title>Today's greeting</title>
   </head>
   <body>
      <p>Hello, <i>world!</i></p><i></i></body>
</html>



The <i></i> there is an artifact of its html "recovery" mode of
re-opening automatically closed elements (looks like I should improve
that a bit one day). You can turn that off by changing true() in the
above call to false(); then you get

$ saxon8 h.xml  h.xsl
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <title>Today's greeting</title>
   </head>
   <body>
      <P>Hello, <i>world!</i></P>
   </body>
</html>

so now the <i> element has been closed but no lowercasing or other
html-specific transformations have been done, and <i> isn't re-opened.
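
For comparison, the "non-XSLT stage in a processing pipeline" suggested in the quoted text could do the same kind of tag closing before the stylesheet ever sees the markup. What follows is only a rough sketch using Python's standard html.parser; it is not how htmlparse.xsl works internally, and the names TagCloser and close_tags are made up for illustration. Unlike the false() call above, html.parser lowercases element names, and this sketch silently drops comments and doctype declarations.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that never take an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Hypothetical helper: re-serialise tag soup with every element closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # serialised output fragments
        self.stack = []  # names of the elements currently open

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "checked") get their own name as value.
        attr_text = "".join(
            " %s=%s" % (name, quoteattr(value if value is not None else name))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, attr_text))
        else:
            self.out.append("<%s%s>" % (tag, attr_text))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag with no matching start tag: ignore it
        # Close everything opened since the matching start tag, innermost first.
        while self.stack:
            name = self.stack.pop()
            self.out.append("</%s>" % name)
            if name == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def close_tags(html):
    """Return html with unclosed elements closed (no re-opening is attempted)."""
    parser = TagCloser()
    parser.feed(html)
    parser.close()
    # Close anything still open at the end of the input.
    while parser.stack:
        parser.out.append("</%s>" % parser.stack.pop())
    return "".join(parser.out)


fixed = close_tags("<P>Hello, <i>world!</P>")
print(fixed)                 # <p>Hello, <i>world!</i></p>
tree = ET.fromstring(fixed)  # now parses as well-formed XML
```

The resulting string can be handed to any XML parser and from there into an XSLT1 stylesheet in tree form. A fragment with several top-level elements would still need a wrapper element before it parses as a single document.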

David



