[xsl] character entities

Subject: [xsl] character entities
From: Joe Barwell <jbar@xxxxxxxx>
Date: Mon, 03 Nov 2008 20:32:17 +1300

Hello people,

xsl 1.0, Firefox 3.0, Zend Search Lucene, php 5.2.6.

I'm having a wee spot of bother with character entities.

What I'm trying to do:

I have data stored in xml files, which I first pass to an xsl template
in order to transform it into a more usable form (technically, I'm
"flattening" it).

This data is then put into fields within a Zend Search Lucene index, via
php (that's why I first "flattened" it).

This index data is then queried (again via php) and the results sent
to/rendered by a browser.

If I put &#241_; (minus the underscore character, which I've added so
this email is not mis-parsed) in my original xml, and use
encoding="iso-8859-1" for it and my xsl stylesheet, then my xsl
transforms that into a (Spanish) n character with a tilde on top: ñ.

If I tell ZSL to index fields using 'iso-8859-1' encoding, my Spanish n
becomes: ÃƒÂ±. If I tell ZSL to index fields using 'utf-8' encoding, my
Spanish n becomes: Ã±.
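
For illustration, here is a minimal Python sketch of how garbling of this kind can arise. It assumes the cause is UTF-8 bytes being read back as a single-byte encoding, which is a guess and not something confirmed about this setup. The variable names are hypothetical, and nothing here calls Zend Search Lucene or php.

```python
# Illustration only: hypothetical names, no Zend Search Lucene or php involved.
original = "\u00f1"  # n with tilde, the character that &#241; refers to

utf8_bytes = original.encode("utf-8")         # two bytes: C3 B1
latin1_bytes = original.encode("iso-8859-1")  # one byte:  F1

# Reading the UTF-8 bytes as ISO-8859-1 turns one character into two.
misread_once = utf8_bytes.decode("iso-8859-1")

# Re-encoding that result as UTF-8 and misreading it again doubles the damage.
# cp1252 is used for the second read because browsers commonly treat a page
# labelled iso-8859-1 as Windows-1252.
misread_twice = misread_once.encode("utf-8").decode("cp1252")

print(len(latin1_bytes), len(utf8_bytes))  # 1 2
print(misread_once)                        # two characters
print(misread_twice)                       # four characters
```

If this assumption holds, a two-character result points to one encode/decode mismatch somewhere in the chain, and a four-character result points to two.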

I've looked at dpawson on encoding, and Mike Brown's tutorial at
skew.org. They're v. good, but don't quite seem to explain where I'm
going wrong (or more likely, I'm just oblivious to what's under my nose).

I believe I need to prevent all parsers bar the browser at the end from
parsing my "special characters", right? But how?

I have tried putting a dtd with an entity declaration inside my original
xml, but although that works--i.e. using:

<!DOCTYPE wine [
<!ENTITY ntilde "&#241;">
]>

I can then put: &ntilde; inside my xml, this still gets parsed into: ñ
by my xsl, & then stored as: Ã± in lucene, and displayed as: Ã± in my
browser.
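
A small Python sketch of why the entity declaration makes no difference downstream: an XML parser resolves a declared entity and a numeric character reference to the same character before anything else sees the data. The sample text "Pe...a" and the variable names are hypothetical, and Python's standard-library parser stands in here for whatever parser the xsl processor uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents; only the way the character is written differs.
doc_with_entity = (
    '<!DOCTYPE wine [<!ENTITY ntilde "&#241;">]>'
    "<wine>Pe&ntilde;a</wine>"
)
doc_with_charref = "<wine>Pe&#241;a</wine>"

text_from_entity = ET.fromstring(doc_with_entity).text
text_from_charref = ET.fromstring(doc_with_charref).text

# Both forms arrive as the same single character, U+00F1.
print(text_from_entity == text_from_charref)  # True
print(hex(ord(text_from_entity[2])))          # 0xf1
```

So by the time the data leaves the parser, no trace of which notation was used remains; what matters is how each later stage encodes and decodes that one character.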

I've also tried playing around with php's htmlspecialchars() function,
to no avail.

Latest effort: I tried using encoding="utf-8" for all levels: my
original xml, my xsl output, and the input to ZSL's index, & I also
saved my xml file as utf-8 format, and used the Spanish n inside my xml,
i.e. ñ rather than &#241;. Doing that, the Spanish n was preserved
through the xsl output, but ZSL stores it as: Ã±, & that's also how my
browser displays it.

I've run out of ideas. Any suggestions? Ta.

Joe
