Greg,
Hopefully you'll get an answer from a real character-set junkie so you
won't have to rely on me. But as it's late on Friday....
At 06:48 PM 6/7/2002, you wrote:
Why doesn't this XML content: &amp;#8226;
produce this output: &#8226;
after parsing/xslt in my xhtml document???
Because HTML has a close enough family resemblance to XML that the
presumption is this: wherever the string '& a m p ;' appears in your input
(spaces put there to sanitize for obnoxious mailers), you want to *see* an
& character displayed in your HTML browser, which requires that it be
*represented* as '& a m p ;' in your (conformant) HTML source code (i.e.
the serialized output of your transform).
As I'm sure you saw in the FAQ, the XSLT processor, after a file is parsed,
"sees" an & character (character no. 38) where there was an escaped
character *reference* '& a m p ;' in your source. That escaping is what
allows XML to use the same character as the opening markup delimiter for,
of course, entity references.
Notice for these purposes there's no difference between "&amp;#8226;" and
"&amp; so's your mama" -- "&amp;#8226;" means "show me '&#8226;'" (the
literal, not any character by that name), so it must perforce be
represented as "&amp;#8226;" since that's the way to tell an HTML
*application* (browser) to do that.
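To watch that round trip happen, here is a minimal sketch in Python rather than XSLT. The standard library's ElementTree stands in for the parser and the serializer, and the sample input (an escaped ampersand followed by "#8226;", the decimal number of the bullet character) is my assumption, not something taken from your file. It shows only that the escaped ampersand reaches the processor as a plain & and is escaped again on the way out.

```python
import xml.etree.ElementTree as ET

# Assumed sample input: an escaped ampersand followed by "#8226;".
SOURCE = "<p>&amp;#8226;</p>"

def round_trip(source):
    """Parse the source; return (text the processor sees, re-serialized markup)."""
    root = ET.fromstring(source)
    seen_by_processor = root.text                        # references already resolved
    serialized = ET.tostring(root, encoding="unicode")   # & is escaped again here
    return seen_by_processor, serialized

seen, written = round_trip(SOURCE)
print(seen)      # a plain & followed by the characters #8226;
print(written)   # the & is written back out in escaped form
```

The string the processor works with starts with a real & character; the serializer has no way of knowing you meant it as the start of a reference, so it escapes it.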
It's bloody nigh impossible to get my XML parser (Xalan-Java) NOT to
recognize entities except for this one case where recognizing it would
solve all my problems.
Nope, it's recognizing this one too, it's just properly turning it *back*
into an entity when you are serializing the file.
The xsl list FAQ under "Entities" item 13 "Passing Entities through a
Transform" says that all entities are resolved before the transform and
implies the only way to get around this is with a perlscript to strip
entities of their ampersands. This cannot be the whole truth because:
a) xalan won't resolve &#amp; in the above example
The whole idea of changing the & into &#amp; is to stop it from being a
reference (no you're right Xalan won't resolve it), thereby allowing the
transmission of the string unchanged, so it can be twiddled back into the
entity reference. If it had been a reference going in, it would have
disappeared, leaving behind ... the character it had referred to. (Lots of
the time this is actually fine.)
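A small Python sketch of that disappearing act, again with ElementTree standing in for a real XSLT processor and with the decimal reference 8226 as an assumed example. After parsing there is only the bullet character. Whether a numeric reference shows up in the output depends on the output encoding, not on what was written in the source.

```python
import xml.etree.ElementTree as ET

# Assumed sample input: a genuine numeric character reference to the bullet.
root = ET.fromstring("<p>&#8226;</p>")

# After parsing, the reference is gone; only the character U+2022 remains.
resolved = root.text

# Serialized as UTF-8, the character is written as itself.
as_utf8 = ET.tostring(root, encoding="utf-8").decode("utf-8")

# Serialized as ASCII, the character cannot be written directly, so this
# serializer generates a fresh numeric reference for it.
as_ascii = ET.tostring(root, encoding="us-ascii").decode("ascii")

print(repr(resolved))
print(as_utf8)
print(as_ascii)
```

The reference in the ASCII output is one the serializer made up because it had to, not the original one preserved. An XSLT serializer asked for an ASCII output encoding is expected to behave similarly, but that is worth checking against your own processor.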
and b) everyone trying to produce html for posting would be screwed by
having XML docs with proper unicode references--nobody could set
stuff up so cruelly (right?)
Well, actually they had no choice, it was either be cruel to be kind, or
magically uninstall all the browsers ever deployed in the bad old days of
HTML, when browsers cared less about "standards" than about conquering the
universe. (Come to think of it, that would have been nice, I wonder why
they didn't.)
c) In XSLT Quickly, there's an example of how to define entities in the
xsl stylesheet using <xsl:text> to avoid this (p.90-91)--only you can't
use this technique on a numbered entity because evidently that's not
valid xml so they don't exist, even though they're all over the place.
Who says it's not valid XML? You can refer with a numbered character
reference (entity) to any character allowed in XML.
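A quick check of that claim, sketched in Python with the standard library's XML parser (the helper names are mine, and the bullet is just an assumed example). The decimal and hexadecimal forms resolve to the same character, and the parser rejects a reference to a character XML does not allow, such as number 0.

```python
import xml.etree.ElementTree as ET

def resolves_to(reference):
    """Hypothetical helper: parse a one-element document and return its text."""
    return ET.fromstring("<p>" + reference + "</p>").text

def is_accepted(reference):
    """Hypothetical helper: True if the parser accepts the reference."""
    try:
        resolves_to(reference)
        return True
    except ET.ParseError:
        return False

decimal_form = resolves_to("&#8226;")    # decimal reference to the bullet
hex_form = resolves_to("&#x2022;")       # the same character in hexadecimal

print(decimal_form == hex_form)
print(is_accepted("&#8226;"))
print(is_accepted("&#0;"))               # character 0 is not allowed in XML
```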
I know this is an old subject; but after hours of investigating, I still
don't get it. I need to know why the above example doesn't produce the
right numbered entity reference, and what other ways there are to preserve
entities through a transform
You can't. An entity reference cannot be preserved, period. The whole idea
is that a parser will resolve the reference, turning it into the thing you
said it was supposed to be.
That's why the canonical solutions -- such as the Perl pre- and
post-processing massages -- are all *workarounds*, not solutions. They
basically work by *disguising* the reference as
some-funky-string-not-a-reference. It's like the parser is the bouncer at
the concert and the entity reference is a beer. The Perl is putting your
beer in a paper bag; then when you get to your seat you take it out again.
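Here is the paper bag as a rough sketch, in Python instead of Perl. The placeholder syntax "[[charref:...]]" and the function names are invented for illustration, and ElementTree with a trivial attribute change stands in for the real transform. It assumes the placeholder string never occurs in your actual content.

```python
import re
import xml.etree.ElementTree as ET

# Matches numeric character references, decimal or hexadecimal.
CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")
# Matches the hypothetical placeholder used to disguise them.
BAGGED = re.compile(r"\[\[charref:(x[0-9A-Fa-f]+|[0-9]+)\]\]")

def bag(xml_source):
    """Pre-process: disguise each reference as a string the parser ignores."""
    return CHAR_REF.sub(r"[[charref:\1]]", xml_source)

def unbag(serialized):
    """Post-process: turn each placeholder back into a numeric reference."""
    return BAGGED.sub(r"&#\1;", serialized)

source = "<p>&#8226; item</p>"

root = ET.fromstring(bag(source))     # the parser sees no reference at all
root.set("class", "bullets")          # stand-in for whatever the transform does
result = unbag(ET.tostring(root, encoding="unicode"))

print(result)
```

The parser and the transform never know the reference was there, which is exactly why this counts as a workaround.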
, and possibly how unicode/numbered entities are defined and can be
redefined. There just has to be a way to do this within xslt. I'm sorry
that I still don't get this--please help anyway, somebody.
*If* you are writing your output to a file -- and always will be -- you can
use a feature supported in some XSL processors that starts with a 'd' and
has three words, two of which are "output" and "escaping" (I forget the
third). But this is *not as honest* a solution as the paper-bag workaround.
At least then you are aware of what you are doing.
Now if browsers weren't broken to begin with none of this would have been a
problem. (HotJava anyone?)
I hope that helps,
Wendell
======================================================================
Wendell Piez mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list