Subject: Re: [xsl] Using accented characters in XML
From: Mike Brown <mike@xxxxxxxx>
Date: Fri, 18 May 2001 13:12:11 -0600 (MDT)

Alex Black wrote:
> The reason it's failing, is because that character need to be encoded as an
> 'entity' - I think I have an entities list around here somewhere.
>
> anyway, that character should be encoded as &#201;

That's not an entity, nor an entity reference. It is a character reference.

> _watch_out_ with your entities, though - I was trying to use &nbsp; (the
> ever present space in html) and my xslt processor barfed on it. I think
> sablot has trouble with named entities. I'm not sure if that's a global
> problem with xslt processors.

&nbsp; is an entity reference. It presumes there is an entity named nbsp
that has been defined. In XML there are only 5 predefined entities and thus
you can only reference those 5: lt, gt, amp, quot and apos. If you want more
you have to declare them in a DTD. XSL is XML, so, this applies equally well
to stylesheets, not just "data" XML documents.

You could think of it like this: the entity is the replacement text. The
entity reference is where you want the text to go. It is the XML parser that
makes this substitution, before your XSL processor ever sees it.

The complete set of standard character entities as used in HTML and other
SGML applications, provided in the form of convenient declarations ripe for
inclusion in a DTD, can be found at
http://www.oasis-open.org/cover/xml-ISOents.txt

To answer the original question, it is an encoding issue that will be solved
if he makes sure that his XML document properly declares its actual encoding.
I think he's leaving off the encoding declaration and it is defaulting to
UTF-8, when in fact the file is iso-8859-1 encoded. As the bytes for Éditez
are read in, this is what happens:

  É   d  i  t  e  z
  C9  64 69 74 65 7A   <== actual bytes in the file
  \   /  |  |  |  |
   \ /   |  |  |  |    <== when interpreted as utf-8...
    |    |  |  |  |
    |    i  t  e  z    <== are these characters.
    |_______________   <== The first 2 bytes are an invalid utf-8
                           sequence. The 2nd byte would have to be
                           between 80 and BF for the pair to represent
                           Unicode code points between U+0240 and U+027F
                           (some non-characters and some obscure Latin
                           characters, not what was intended)

Perhaps the XML parser chose to substitute a "?" for the invalid utf-8
sequence, whereas it should have kicked out a fatal error. It's possible
that he fed the parser a character stream (instead of bytes) in which the
substitution had already been made. If it is the former,

  <?xml version="1.0" encoding="iso-8859-1"?>

in his XML document will fix the problem.

   - Mike
_____________________________________________________________________________
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
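
For anyone who wants to check the byte-level explanation in this message, here is a minimal sketch in Python. Python is an assumption on my part and is not mentioned in the thread; the sketch uses only the standard library's xml.etree.ElementTree, so other parsers (Sablotron's included) may report these errors differently. The helper names decodes_as_utf8 and parses are made up for illustration.

```python
import xml.etree.ElementTree as ET

# "Éditez" as it sits in an iso-8859-1 file: É is the single byte C9.
latin1_bytes = "Éditez".encode("iso-8859-1")
print(latin1_bytes.hex(" ").upper())  # C9 64 69 74 65 7A


def decodes_as_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is a well-formed utf-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def parses(document) -> bool:
    """Hypothetical helper: True if ElementTree accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# C9 announces a 2-byte sequence, but 64 is not a continuation byte (80..BF).
print(decodes_as_utf8(latin1_bytes))  # False

# With a legal 2nd byte, the pair C9 xx lands in U+0240..U+027F.
lowest = bytes([0xC9, 0x80]).decode("utf-8")
highest = bytes([0xC9, 0xBF]).decode("utf-8")
print(f"U+{ord(lowest):04X} .. U+{ord(highest):04X}")  # U+0240 .. U+027F

# No encoding declaration: the parser assumes utf-8 and rejects the bytes.
undeclared = b"<p>" + latin1_bytes + b"</p>"
print(parses(undeclared))  # False

# Declaring the actual encoding lets the parser read C9 as É.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + undeclared
print(ET.fromstring(declared).text == "Éditez")  # True

# nbsp is not one of the 5 predefined entities, so referencing it fails...
print(parses("<p>a&nbsp;b</p>"))  # False
# ...while a character reference needs no declaration,
print(ET.fromstring("<p>a&#160;b</p>").text == "a\u00a0b")  # True
# and declaring the entity in a DTD makes the reference legal.
with_dtd = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'
print(ET.fromstring(with_dtd).text == "a\u00a0b")  # True
```

Note that this parser rejects the undeclared file outright rather than substituting a "?", which is the fatal-error behaviour the message says a parser should have.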