RE: [xsl] Multiple CDATA tags...again

Subject: RE: [xsl] Multiple CDATA tags...again
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Tue, 10 May 2005 11:28:23 +0100
This CDATA problem is odd but it's essentially a distraction. The root cause
of your problem is that you're getting some very peculiar XML out of the
database.

I don't know if this is the fault of the database vendor - it's entirely
possible that the rot started with the data that was put into the database
in the first place. You should be trying to identify where the special
characters such as ampersand got double-escaped, and fix the problem at its
origin.

Meanwhile, if you want to tidy up the rubbish that you're getting from the
database, I would think a good start would be to get rid of the
double-escaping using something like:

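<xsl:template match="text()">
  <!-- saxon:parse() takes XML source as a string, so the <x> wrapper is
       built with concat() to give the parser a well-formed document.
       Needs xmlns:saxon="http://saxon.sf.net/" (Saxon 8.x) declared on
       xsl:stylesheet. -->
  <xsl:value-of select="saxon:parse(concat('&lt;x&gt;', ., '&lt;/x&gt;'))"/>
</xsl:template>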

That's a Saxon-specific solution of course, but it's probably the easiest.
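
If Saxon isn't an option, a rough portable equivalent can be sketched in
plain XSLT 1.0 with a recursive search-and-replace template. This is only
an illustration under some assumptions: the template name "replace" and
the variable name "step1" are made up for the example, it handles only
&quot; and &amp; (add further calls for &lt;, &gt; and &apos; if your data
needs them), and the text() template is meant to be merged into your
existing stylesheet rather than used on its own.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical helper: replace every occurrence of $from in $text
       with $to. $from must not be the empty string. -->
  <xsl:template name="replace">
    <xsl:param name="text"/>
    <xsl:param name="from"/>
    <xsl:param name="to"/>
    <xsl:choose>
      <xsl:when test="contains($text, $from)">
        <xsl:value-of select="substring-before($text, $from)"/>
        <xsl:value-of select="$to"/>
        <xsl:call-template name="replace">
          <xsl:with-param name="text" select="substring-after($text, $from)"/>
          <xsl:with-param name="from" select="$from"/>
          <xsl:with-param name="to" select="$to"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Undo one level of escaping in every text node. The &amp; step
       must come last, otherwise a triple-escaped quote would be
       unescaped twice. -->
  <xsl:template match="text()">
    <xsl:variable name="step1">
      <xsl:call-template name="replace">
        <xsl:with-param name="text" select="."/>
        <xsl:with-param name="from" select="'&amp;quot;'"/>
        <xsl:with-param name="to" select="'&quot;'"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:call-template name="replace">
      <xsl:with-param name="text" select="$step1"/>
      <xsl:with-param name="from" select="'&amp;amp;'"/>
      <xsl:with-param name="to" select="'&amp;'"/>
    </xsl:call-template>
  </xsl:template>

</xsl:stylesheet>
```

Unlike the saxon:parse() approach, this leaves any escaped markup in the
text alone; it only turns the literal character sequences back into the
characters they stand for.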

Michael Kay
http://www.saxonica.com/


> -----Original Message-----
> From: mylistaddress@xxxxxxxxxx [mailto:mylistaddress@xxxxxxxxxx] 
> Sent: 10 May 2005 02:03
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: RE: [xsl] Multiple CDATA tags...again
> 
> Hi,
> Thanks for responding. I am pretty much ready to throw
> myself off of a bridge...but I guess I can't complain
> about learning on the job.
> 
> OK, here's the deal. I am sending XML requests via Java
> 1.4 to a library DB called STAR XML (made by Cuadra)
> which sends back a very verbose XML response of a news
> item. I have no control over the format of the output.
> I was able to make sense out of it (thanks to your
> responses) and transform it into a format more
> acceptable to the Verity search indexing spider.
> 
> When the output from STAR XML is HTML, the < and >
> characters are converted to &lt; and &gt; and so on. Oddly,
> it also appears to convert a quote to &amp;quot; instead
> of &quot;. When I try to index the resulting XML
> document without placing CDATA tags (not really a tag,
> right?) around the content, the indexer fails.
> The content also contains [ and ] and non-English text.
> 
> So, I added the cdata-section-elements declaration to
> my xsl:output and this is when I encountered the
> multiple cdata tags. At first I suspected they appeared
> wherever there is a line-break, but this does not
> appear to be the case. 
> 
> Here is a portion of the XML response from STAR XML:
> <Field outputName="TEXT">
> 2010 &amp;quot;We
> respectfully Wish the health of the great leader
> [yo&apos;ndude] Comarade Big John Il 
> </Field>
> 
> Here is a portion of the XSL dealing with the TEXT
> element:
> <xsl:output method="xml" omit-xml-declaration="no"
> indent="yes" cdata-section-elements="TEXT" />
> <xsl:strip-space elements="*" />
> ...
> <xsl:template match="Field">
> <xsl:if test="contains ('TEXT', @OutputFieldName)">
> <xsl:element name="{@OutputFieldName}">
> <xsl:apply-templates/>
> </xsl:element>
> </xsl:if>
> </xsl:template>
> 
> Resulting XML:
> <TEXT>
> <![CDATA[2010 &quot;We
>      ]]><![CDATA[       Respectfully Wish
> Hea]]><![CDATA[lth of the great leader
>     ]]><![CDATA[      [yo'ndude] Brother ]]><![CDATA[  
> Big John Il]      ]]>
> </TEXT> 
> 
> As you can see, the CDATAs are appearing all over the
> place. This is just a small clip. The actual doc has
> dozens. Also notice how the &quot; (no more &amp;
> before the quot;) appears now. Do I have to transform
> them again? My literal [ and ] are intact.
> 
> I visited dpawson.co.uk and read up on the doe stuff,
> but am still stuck. Could anyone recommend a book? XSLT
> cookbook? I borrowed the O'Reilly XML Hacks (and noticed
> your name) but it is slim on xsl.
> 
> Thanks so much for any help.
> 
> JW
