RE: [xsl] nbsp is not that hard, folks

Subject: RE: [xsl] nbsp is not that hard, folks
From: "Américo Albuquerque" <aalbuquerque@xxxxxxxxxxxxxxxx>
Date: Sat, 9 Nov 2002 11:53:14 -0000
Hi there.
So, what you are saying is that &nbsp; is to XML and HTML what "#define
nbsp" is to C?

-----Original Message-----
From: owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx
[mailto:owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx] On Behalf Of Mike Brown
Sent: Friday, November 08, 2002 7:13 AM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: [xsl] nbsp is not that hard, folks


Brian Grainger wrote:
> If you're trying to escape &nbsp; in a document encoded as UTF-8, you
> have to use Unicode escaping of the UTF-8 representation of the
> entity. In this case, &nbsp; is equal to &#160;, and &#160; encoded as
> UTF-8 is \u00A0.

Good grief. No, you have your terminology badly mixed up, and you're
throwing in an irrelevant notation. "&nbsp;", "&#160;", and "\u00A0" have
nothing, NOTHING to do with UTF-8. There is something about nbsp that
just confuses the heck out of people. I think it must be the fact that
it looks like a space, and that you don't have an nbsp key on your
keyboard.

OK, read this.

1. There is a character -- an abstract unit in a "script" (a writing
system; we are using Latin right now) -- called NO-BREAK SPACE by the
Unicode Standard and ISO/IEC 10646. Unicode and ISO/IEC 10646 assign
this character an integer number, 160, which is A0 in hex. We say
Unicode all the time around here, but we mean ISO/IEC 10646 because
that's what the XML and HTML specs reference. The two standards share
the same character repertoire and numbering so there's no harm.
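
As a rough illustration of point 1 (not part of the original post), a
Python 3 sketch using only the standard unicodedata module can look up
the character's name and number. The variable names are arbitrary.

```python
import unicodedata

# Build the character from its code point, 160 (hex A0).
nbsp = chr(160)

# The name assigned by the Unicode character database.
name = unicodedata.name(nbsp)
print(name)                       # NO-BREAK SPACE
print(ord(nbsp), hex(ord(nbsp)))  # 160 0xa0
```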

2. UTF-8 is an encoding scheme that provides a way of representing any
of the approximately 1.1 million possible abstract characters in Unicode
as a sequence of 1 to 4 bytes. The UTF-8 representation of the Unicode
character 160 (no-break space) is the pair of bytes C2 A0, in that
order. In contrast, iso-8859-1 is a character map that provides a way of
representing each of the first 256 Unicode characters as a single byte.
us-ascii is an even more limited set of just the first 128, each mapped
to a single byte.
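
The byte values in point 2 can be checked with a short Python 3 sketch
(an editorial illustration, assuming only the standard codecs that ship
with Python). It encodes the same abstract character three ways.

```python
nbsp = chr(160)

# UTF-8 needs two bytes for this character.
utf8_bytes = nbsp.encode("utf-8")
# iso-8859-1 maps it to the single byte A0.
latin1_bytes = nbsp.encode("iso-8859-1")

# us-ascii covers only 0..127, so the character cannot be encoded.
try:
    nbsp.encode("us-ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bytes.hex(), latin1_bytes.hex(), ascii_ok)  # c2a0 a0 False
```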

3. This thing:  \u00A0
  - is a sequence of 6 bytes (ASCII bytes for backslash, u, zero, zero,
     A, zero);
  - has special meaning in a programming language like Java or Python,
     where it is essentially a macro for the no-break space character;
  - is used when representing the character directly as encoded bytes is
     impractical or impossible.
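
To make point 3 concrete, here is a small Python 3 sketch (added for
illustration; the variable names are made up). A raw string keeps the
six characters as typed, while an ordinary string literal turns the
escape into the single character it stands for.

```python
# Raw string: the six ASCII characters exactly as typed in source code.
as_typed = r"\u00A0"
# Ordinary string literal: the escape is replaced by one character.
as_parsed = "\u00A0"

print(len(as_typed.encode("ascii")))  # 6
print(len(as_parsed))                 # 1
print(as_parsed == chr(160))          # True
```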

4. This thing:  &#160;
or this thing:  &#xA0;
  - is to SGML applications like HTML and XML what \u00A0 is to Java &
     Python;
  - is called a character reference (or "numeric character reference").
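
A quick way to see point 4 in action is to hand both forms to an XML
parser. The sketch below is an illustration only; it assumes Python 3's
standard xml.etree.ElementTree module and a made-up element name, t.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal character references for the same character.
from_decimal = ET.fromstring("<t>&#160;</t>").text
from_hex = ET.fromstring("<t>&#xA0;</t>").text

# The parser hands back the one character in both cases.
print(from_decimal == from_hex == chr(160))  # True
```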

5. This thing:  &nbsp;
  - is to SGML applications like HTML and XML an "entity reference";
  - refers to an entity (a separate collection of information) named
     nbsp;
  - depending on the circumstances, is intended to be treated by the
     XML parser or HTML user agent as equivalent to the entity's
     "replacement text";
  - is, in HTML, predefined to have the replacement text of just one
     character, the no-break space;
  - is not defined by default in XML.
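
The difference between HTML and XML in point 5 can be sketched in
Python 3 as follows. This is an editorial example, assuming the
standard html and xml.etree.ElementTree modules; the element name t
and the inline DOCTYPE are invented for the demonstration.

```python
import html
import xml.etree.ElementTree as ET

# HTML predefines nbsp, so an HTML-aware function can resolve it.
from_html = html.unescape("&nbsp;")

# XML does not: without a declaration, the reference is an error.
try:
    ET.fromstring("<t>&nbsp;</t>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Declaring the entity in the internal DTD subset makes it usable.
declared = ET.fromstring(
    '<!DOCTYPE t [<!ENTITY nbsp "&#160;">]><t>&nbsp;</t>'
).text

print(from_html == chr(160), undeclared_ok, declared == chr(160))
```

Declaring the entity yourself, as in the last call, is one common way
to keep writing &nbsp; in XML source; using &#160; directly avoids the
need for any declaration.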

6. The thing here in between the quotes:   " "
  - is byte 0xA0;
  - is intended to be a no-break space because this email is iso-8859-1
     encoded;
  - has exactly the same meaning in an XML document as &#160;.
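
Point 6 can be checked the same way. The sketch below (an illustration,
again assuming Python 3's xml.etree.ElementTree and a made-up element
name) parses a document containing the raw byte A0, declared as
iso-8859-1, next to one that uses the character reference.

```python
import xml.etree.ElementTree as ET

# A document whose content is the single raw byte 0xA0.
raw_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><t>\xa0</t>'
# The same content written as a numeric character reference.
ref_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><t>&#160;</t>'

from_raw = ET.fromstring(raw_doc).text
from_ref = ET.fromstring(ref_doc).text

print(from_raw == from_ref == chr(160))  # True
```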

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume:   http://skew.org/~mike/resume/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


