Re: [xsl] xslt replace special characters

From: Mike Brown <mike@xxxxxxxx>
Date: Mon, 11 Nov 2002 13:38:52 -0700 (MST)

Alice Fan wrote:
> Thanks Greg.  Right in the UI, we want the user to enter their URL. Their 
> URL will most likely have name/value pairs.  Is there an easier way?  There 
> is no other way of filtering '&' before it gets processed in the XSL?

It doesn't matter if they're entering a URL/URI or not. Any text that you 
intend to put into an XML document needs to be screened, to preserve 
well-formedness / parseability.

1. Always note the following:

- non-XML characters need to be removed or replaced
  (U+0000..U+0008, U+000B, U+000C, U+000E..U+001F, U+D800..U+DFFF,
   U+FFFE..U+FFFF)

- a string is not a URI if it violates URI syntax, so if the text is
   destined for a URI-pseudotype attribute value (like href or src in 
   HTML/XHTML), characters above U+007F should be escaped by writing
   their equivalent UTF-8 bytes as '%xx' for each byte, where xx is the
   hex notation for the byte (though this isn't strictly necessary; a 
   conforming HTML user agent will do this automatically)

- additional translation of ASCII-range characters (U+0000..U+007F) in 
   text destined for URI attributes is not required but is wise, to
   ensure conformance to URI syntax; %-escape everything except
   a-z, A-Z, 0-9, and these: - _ . ! ~ * ' ( ) ; / ? : @ & = + $ , [ ]
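
The screening in part 1 can be sketched in Python roughly as follows. This is
only an illustration, not something from the original thread: the function
names strip_non_xml and escape_for_uri_attribute are made up for the example,
and deleting bad characters (rather than replacing them) is just one possible
policy.

```python
import re
from urllib.parse import quote

# Code points that XML 1.0 does not allow (the ranges listed above).
_NON_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")

# ASCII punctuation left unescaped, taken from the list above.
# quote() never escapes letters, digits, '_', '.', '-' or '~'.
_URI_SAFE = "-_.!~*'();/?:@&=+$,[]"


def strip_non_xml(text, replacement=""):
    """Remove (or replace) characters that cannot appear in XML 1.0."""
    return _NON_XML.sub(replacement, text)


def escape_for_uri_attribute(text):
    """%-escape everything outside the safe set; non-ASCII characters
    are written as the %xx form of their UTF-8 bytes."""
    return quote(text, safe=_URI_SAFE, encoding="utf-8")


# Hypothetical user input: a space and a non-ASCII character in a URL.
print(escape_for_uri_attribute("http://example.com/a b?x=1&y=\u00e9"))
# prints http://example.com/a%20b?x=1&y=%C3%A9
```

Because the safe set follows the list literally, '%' and '#' are escaped too,
so input that is already %-escaped would be escaped a second time. Whether
that is what you want depends on what you ask the user to enter. Run
strip_non_xml first; quote() cannot encode lone surrogates.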


2. If and when the XML document exists in serialized form
   (i.e., as a string, not as a DOM object), note the following:

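- if the text is not destined for a CDATA section, markup characters '&'
   and '<' need to be escaped, as does the '>' in any ']]>'

- if the text is destined for a CDATA section, it must not contain ']]>';
   nothing can be escaped inside a CDATA section, so end the section
   after the ']]' and open a new one for the '>'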

- if the text is destined for a comment, it must not contain '--'
   (how you handle such an offense is up to you)

- if the text is destined for an attribute value delimited by apostrophes,
   then apostrophes in the value must be escaped (usually use &apos;, or
   &#39; in HTML, which does not define &apos;)

- if the text is destined for an attribute value delimited by quotes,
   then quotes in the value must be escaped (usually use &quot;)

- if the text is destined for a non-URI attribute value, then tab, LF, 
   and CR need to be escaped as character references (&#9;, &#10;, &#13;)
   to facilitate round-tripping; a parser otherwise normalizes them to
   spaces
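
As a rough Python sketch of the serialization rules in part 2 (the function
names are invented for this example, and raising an error for a bad comment
is only one way to handle that case):

```python
def escape_text(text):
    """Escape character data that goes outside a CDATA section."""
    # '&' goes first so the references added afterwards are not re-escaped.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace("]]>", "]]&gt;"))


def escape_attribute(value, delimiter='"'):
    """Escape an attribute value for the given delimiter (' or ")."""
    out = value.replace("&", "&amp;").replace("<", "&lt;")
    if delimiter == '"':
        out = out.replace('"', "&quot;")
    else:
        out = out.replace("'", "&apos;")  # use &#39; instead for HTML
    # Character references survive attribute-value normalization.
    return (out.replace("\t", "&#9;")
               .replace("\n", "&#10;")
               .replace("\r", "&#13;"))


def cdata_section(text):
    """Wrap text in CDATA, splitting any ']]>' across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def comment(text):
    """Wrap text in a comment; refusing bad input is an assumed policy."""
    if "--" in text or text.endswith("-"):
        raise ValueError("comment text must not contain '--' or end in '-'")
    return "<!--" + text + "-->"


print(escape_attribute("a=1&b=2"))  # prints a=1&amp;b=2
```

For the original question, this means an '&' in a user-supplied URL only has
to become &amp; at the moment the URL is written into the serialized document.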

I probably missed one or two cases, but as you can see, you can't just slap
any old text into a document and call it XML...

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/
