Re: [xsl] character entities

Subject: Re: [xsl] character entities
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Tue, 20 Nov 2001 14:59:29 +0000

>> that lists the mappings between character numbers and entities.
> That assumes that the entities do correspond to characters so that
> you can do this as a linearisation option; just asking that certain
> character numbers are output as entity references. (Of course you
> may think that the subject line above allows you to make that
> assumption, but it's best never to assume anything:-)

Yes indeed.

> Taking a random example from the dtd on which I use that shell
> script, how would you tell the serialiser to output
[snip complex XML entity]
> as &e04BL-a; ?

I think we've had this discussion before, haven't we?

First I'll say that I do think that character entity references are
the cause of 90% of the problems in this area. Of course you do get
instances where people use other entities in source documents,
particularly when dealing with document-oriented XML, but in most
transformations these should be parsed along with the rest of the
document. It's also more than a little tempting to just say use
XInclude or XLink rather than entities, which are *so* last millennium.

I think that control over how characters are output would be a good
addition in XSLT 2.0. After all, you get control over which elements
get to have CDATA section content, which is another physical
structure. Saxon gives you nice control in HTML over whether you
want native characters, character references in decimal or
hexadecimal, or character entity references. And I like the Xalan
technique of pointing to a file describing the mapping between
characters and character entity references; it would be even better if
it could take several files and could interpret DTD syntax. Actually I
notice this is partly covered by Requirement 2.7.
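
As a rough illustration of that kind of output control, here is a
minimal sketch in Python rather than XSLT. The function name
serialise_text and its style values are hypothetical, and the mapping
argument stands in for a Xalan-style mapping file; it defaults to the
HTML table in Python's html.entities module. Escaping of & and < is
left out to keep it short.

```python
from html.entities import codepoint2name


def serialise_text(text, style="entity", mapping=codepoint2name):
    """Write non-ASCII characters in the requested style.

    style is one of "native", "decimal", "hex" or "entity" (hypothetical
    names). mapping maps character numbers to entity names.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128 or style == "native":
            out.append(ch)                      # leave the character as it is
        elif style == "hex":
            out.append(f"&#x{cp:X};")           # hexadecimal character reference
        elif style == "entity" and cp in mapping:
            out.append(f"&{mapping[cp]};")      # character entity reference
        else:
            out.append(f"&#{cp};")              # decimal reference as fallback
    return "".join(out)
```

Passing a different dictionary as mapping is the equivalent of pointing
at a different mapping file; merging several dictionaries would cover
the "several files" case.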

But anyway... For general entities, one option would be to make sure
that you store the XML for the entity as canonical XML, and then do a
text-based substitution of the entity XML on the canonical XML
generated from the result tree, before finally outputting it according
to the xsl:output instructions.
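
A minimal sketch of that text-based substitution, in Python, assuming
each entity's replacement text is a single well-formed element and that
no namespaces are involved (the canonical form of a fragment taken in
isolation can differ from its form in context once namespace
declarations come into play). The function name substitute_entities is
hypothetical; canonicalize is the standard library's C14N 2.0 routine
(Python 3.8 or later).

```python
from xml.etree.ElementTree import canonicalize


def substitute_entities(result_xml, entities):
    """Replace canonical copies of entity content with entity references.

    entities maps entity names to their replacement XML, each assumed
    to be one well-formed element.
    """
    canonical = canonicalize(xml_data=result_xml)
    # Canonicalise every entity so that quoting and whitespace inside
    # tags match the canonical form of the result.
    fragments = {name: canonicalize(xml_data=xml)
                 for name, xml in entities.items()}
    # Try longer fragments first so that an entity containing another
    # entity's content is not broken up by the shorter match.
    for name in sorted(fragments, key=lambda n: len(fragments[n]), reverse=True):
        canonical = canonical.replace(fragments[name], f"&{name};")
    return canonical
```

The string returned still has to be written out according to the
xsl:output settings; that step is not shown.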

The other possibility is comparing the trees for the result and the
entity. Interestingly, it looks as though the XQuery/XPath 2.0 data
model includes the notion of 'value-equal' which includes deep
equality between node sequences. So possibly you could say that a
sequence of nodes in the result tree should be replaced by a given
entity if the sequence is value-equal to the sequence defined by a
given XML fragment. That's very probably time-consuming, especially
for 1000 entities on a long document.
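
To make the tree comparison concrete, here is a sketch in Python using
ElementTree. It is not the XQuery/XPath 2.0 definition of value-equal,
just a plain deep comparison of element name, attributes, text and
children; deep_equal and find_matches are hypothetical names.

```python
import xml.etree.ElementTree as ET


def deep_equal(a, b, compare_tail=True):
    """True if two elements have the same name, attributes, text and children."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if (a.text or "") != (b.text or ""):
        return False
    if compare_tail and (a.tail or "") != (b.tail or ""):
        return False
    if len(a) != len(b):
        return False
    return all(deep_equal(x, y) for x, y in zip(a, b))


def find_matches(root, fragment):
    """Return (parent, index) for each run of siblings equal to fragment.

    fragment is a list of elements holding the entity's content. The
    text after the last node is ignored because it belongs to whatever
    follows the entity, not to the entity itself.
    """
    n = len(fragment)
    if n == 0:
        return []
    hits = []
    for parent in root.iter():
        children = list(parent)
        for i in range(len(children) - n + 1):
            window = children[i:i + n]
            if all(deep_equal(x, y, compare_tail=(k < n - 1))
                   for k, (x, y) in enumerate(zip(window, fragment))):
                hits.append((parent, i))
    return hits
```

Every window of siblings in the result is compared against the
fragment, and that has to be repeated for each entity, which is where
the cost comes from.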

With character entity references quite often you want characters to be
included differently in the input (where they're probably native
characters or character references) to the output (where you want e.g.
HTML character entity references).

But with general entities, I wonder how often you actually want the
result tree to be examined to find whatever entities might be
included. Usually the real problem, as illustrated by your shell
script, is how to get the XSLT processor to pass through entities from
the source document or to include directly entities specified in the
stylesheet. From the stylesheet to the output you could use something
like saxon:entity-ref, which is covered by Requirement 2.8.

From the source to the output, it's a different matter because as we
know entities aren't in the data model. The only options are to
include them in the data model (which I don't think's going to happen)
or to change them into something that is in the data model (which is
essentially what you're doing with your text substitution).

I'm not exactly sure, but presumably XSLT processors access the
stylesheet before they access the source document - it would make
sense in that it would allow them to build the tree without
whitespace-only text nodes in the first place rather than stripping
them after building it.

So perhaps you could have a switch within the stylesheet - something
like an xsl:input top-level element with an include-entity-references
attribute - that governed whether entity references were included in
the node tree. You could use elements to hold the resolved content of
the entity, just in case you *did* need to have access to it within
the stylesheet, or you could point to the file holding the entity if
it was an external parsed entity (which means you could control within
the stylesheet whether you ever retrieved it or not). Then you could
use a similar instruction to saxon:entity-ref to create the entity
reference in the output by matching on the entity-reference element,
if you wanted.
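
A sketch of the output side of this idea, in Python. Everything here is
hypothetical: it assumes the node tree wraps resolved entity content in
an element called entity-reference with a name attribute, and it writes
the tree out by hand, emitting &name; wherever it meets one of those
wrappers. Namespaces are ignored.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

ENTITY_REF = "entity-reference"  # hypothetical wrapper element name


def serialise(elem):
    """Serialise elem, turning entity-reference wrappers back into references."""
    if elem.tag == ENTITY_REF:
        # Discard the resolved content and emit the reference itself.
        body = f"&{elem.get('name')};"
    else:
        attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
        inner = escape(elem.text or "") + "".join(serialise(c) for c in elem)
        body = f"<{elem.tag}{attrs}>{inner}</{elem.tag}>"
    # Text following the element belongs to the parent's content.
    return body + escape(elem.tail or "")
```

For example, a tree built from
<doc><entity-reference name="boiler"><para>Legal text</para></entity-reference></doc>
comes back out as <doc>&boiler;</doc>.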

Of course the trouble with that is that you'd need to make sure that
the XPaths in the stylesheet took into account that a 'child' element
might actually be a grandchild, within one of these entity-reference
elements. And it greatly adds to the size of the node tree (for which
reason I'd say that it shouldn't apply to character entity
references). But you might be happy to put up with that.
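
The child-versus-grandchild problem can be seen in a small Python
sketch, again assuming a hypothetical entity-reference wrapper element.
A plain child lookup misses the paragraphs inside the wrapper; the
helper logical_children (a made-up name) looks through wrappers to
recover what the stylesheet author probably meant by 'child'.

```python
import xml.etree.ElementTree as ET

ENTITY_REF = "entity-reference"  # hypothetical wrapper element name


def logical_children(element):
    """Yield child elements, looking through entity-reference wrappers."""
    for child in element:
        if child.tag == ENTITY_REF:
            # The wrapper's children are logically children of element.
            yield from logical_children(child)
        else:
            yield child


doc = ET.fromstring(
    '<doc><title/>'
    '<entity-reference name="boiler"><para/><para/></entity-reference>'
    '<para/></doc>'
)
direct_paras = doc.findall("para")                                  # skips the wrapped ones
all_paras = [c for c in logical_children(doc) if c.tag == "para"]   # includes them
```

Here direct_paras holds one element and all_paras holds three, which is
the adjustment every path in the stylesheet would have to make.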



Jeni Tennison
