Re: [xsl] nbsp fails transformation

Subject: Re: [xsl] nbsp fails transformation
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Wed, 10 Aug 2011 17:08:16 +0200
If someone sends you a document that isn't well-formed XML, the best strategy is to get the people who produced it to mend their ways.

True. However, finding that an XML file containing &nbsp; is suddenly no longer XML must be among the most frequent unpleasant surprises that new XML programmers have to deal with. I believe it was one of my first questions to this list as well. My first reaction was: that cannot be, everybody knows &nbsp;, how can it _not_ be XML?


The thing is, XML is a very generic and extensible language, and entities are one thing that can be extended (beyond the five that are always available: &lt;, &gt;, &amp;, &apos; and &quot;). This is done by declaring the entities in the document's DOCTYPE declaration, as Patrick suggested, or by linking to an external DTD file that declares them.
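To make the difference concrete, here is a minimal sketch in Python (my choice of language; the thread itself does not use it) that parses the same content with and without an entity declaration in the internal DTD subset. The root element name "p", the sample text and the helper try_parse are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Without a declaration, &nbsp; is an undefined entity, so this is not well-formed XML.
BROKEN = "<p>a&nbsp;b</p>"

# Declaring the entity in the internal DTD subset makes the same content parse.
# The root element name "p" and the text are illustrative only.
FIXED = """<!DOCTYPE p [
  <!ENTITY nbsp "&#160;">
]>
<p>a&nbsp;b</p>"""


def try_parse(xml_text):
    """Return the root element's text, or None if the input is not well-formed."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None


print(try_parse(BROKEN))       # the undeclared entity makes parsing fail
print(repr(try_parse(FIXED)))  # the text now contains U+00A0 (no-break space)
```

The same DOCTYPE block works unchanged in front of a document fed to an XSLT processor, since the entity is resolved by the XML parser before the stylesheet ever sees the input.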

If your input comes from XHTML or HTML, this happens often. The fix is to use the original doctype declaration and make sure that the DTDs it refers to are available. That way other entities such as &mdash;, &uml; and &copy; are also recognized in the majority of cases.

You can find the declarations of all these entities here: http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_Latin-1_characters (that page also shows a typical declaration for use in XML). Download the file at http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent and refer to your local copy, and you can work with almost all XHTML/HTML input, as long as the rest is well-formed.
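An entity file of this kind is essentially a long list of <!ENTITY name "&#NNN;"> declarations. As a rough sketch of the idea, and not of the W3C file itself, the following Python generates equivalent declarations from the standard library's html.entities table and prepends them as an internal subset. Using that table as a stand-in for a downloaded .ent file is my assumption, and the helper names entity_subset and parse_with_entities are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML already predefines are skipped rather than redeclared.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


def entity_subset():
    """Build <!ENTITY ...> declarations for the HTML named entities."""
    return "\n".join(
        f'  <!ENTITY {name} "&#{code};">'
        for name, code in sorted(name2codepoint.items())
        if name not in PREDEFINED
    )


def parse_with_entities(root_name, body):
    """Parse body with the generated declarations prepended as an internal DTD subset."""
    doctype = f"<!DOCTYPE {root_name} [\n{entity_subset()}\n]>\n"
    return ET.fromstring(doctype + body)


# Sample input is illustrative; everything apart from the entities must be well-formed.
root = parse_with_entities("p", "<p>&copy;&nbsp;2011 &mdash; na&iuml;ve</p>")
print(root.text)
```

This only helps with undeclared named entities. Unclosed tags, unquoted attributes and other HTML habits still make the input non-XML.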

Kind regards,
Abel Braaksma



------------------------------------------------------------------------
From: 	Michael Kay <mike@xxxxxxxxxxxx>
Sent: 	Wednesday, August 10, 2011 10:19:17 AM
To: 	xsl-list
Cc: 	
Subject: 	Re: [xsl] nbsp fails transformation




Now since I can't even transform those files, I can't throw those entities out.

How do I handle this?

If someone sends you a document that isn't well-formed XML, the best strategy is to get the people who produced it to mend their ways. Once you start accepting bad XML (or non-XML, as I prefer to call it), all the benefits of using XML for interchange quickly become lost, and you might as well revert to using some proprietary interchange format.

Michael Kay
Saxonica
