Re: Embedding HTML in XML documents using HTML dtd

Subject: Re: Embedding HTML in XML documents using HTML dtd
From: Jeni Tennison <mail@xxxxxxxxxxxxxxxx>
Date: Mon, 09 Oct 2000 12:57:49 +0100
Mila,

>I would like to enable HTML tags within my XML file - using the HTML
>dtd. For example, if there is a list in the XML:
><list>
><UL>
>            <LI>List item 1</LI>
>            <LI>List item 2</LI>
>            </UL>
></list>
>
>What do I have to add to the XML, the DTD and the XSL to be able to
>convert this to a list when I generate an HTML file using xalan? I would
>like to use the HTML dtd to make this work.
>
>I am currently using the whole section starting at <UL> to </UL> within
>a ![CDATA[    ]] element, and reading the output in xsl with 
> <xsl:for-each select="list">
><xsl:value-of disable-output-escaping="yes" select="."/>
></xsl:for-each>
>This works, but we would prefer the solution of using the HTML dtd,
>except I am not sure how to implement that.

Certainly the CDATA section solution is less than optimal!

There are two approaches to this problem: either you can use what you know
about your XML to say "the content of a 'list' element is HTML and should
be copied directly" or you can explicitly put the UL and LI elements in the
HTML namespace within the source XML, and then within your stylesheet say
"all HTML elements should be copied".

In either case, you need to know about the xsl:copy and xsl:copy-of
elements.  xsl:copy copies the current node, but none of its contents or
attributes.  xsl:copy-of copies a node set that you select, including all
of its contents and any attributes or namespace nodes.
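
As a rough illustration of the difference, here are two alternative
templates for the same element (you would use one or the other, not both
in the same stylesheet).  I'm assuming here that the current node is the
UL element from your example:

<!-- shallow copy: on its own this emits an empty UL element;
     anything you want inside it has to be generated within xsl:copy -->
<xsl:template match="UL">
  <xsl:copy />
</xsl:template>

<!-- deep copy: emits the UL together with its LI children,
     their text, and any attributes -->
<xsl:template match="UL">
  <xsl:copy-of select="." />
</xsl:template>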

The first is simpler but less extensible: when you find a 'list' element,
you make a copy of its element content:

<xsl:template match="list">
  <xsl:copy-of select="*" />
</xsl:template>

Given an input of:

<list>
  <UL>
    <LI>List item 1</LI>
    <LI>List item 2</LI>
  </UL>
</list>

This will give:

<UL>
    <LI>List item 1</LI>
    <LI>List item 2</LI>
  </UL>

The problem is that you have to do something similar anywhere else where
you have HTML elements within your XML elements and you want them copied.
It might be that 'lists' are the only elements where HTML elements occur,
in which case this is the easiest solution.
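
If there do turn out to be a few other such elements, one way to keep this
manageable is to list them all in a single match pattern.  A sketch, where
'note' and 'description' are made-up element names standing in for
whatever else holds HTML in your documents:

<!-- 'note' and 'description' are hypothetical element names -->
<xsl:template match="list | note | description">
  <xsl:copy-of select="*" />
</xsl:template>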

The second solution is to use namespaces to explicitly say that the UL and
LI elements are HTML elements.  To do that, you associate a namespace
prefix (a string that you can choose) with a namespace name (a string that
you can choose, but that should probably be a URI pointing to a DTD,
schema, or human-readable documentation about the elements you're using).
For common XML dialects like HTML, there is usually a namespace name
defined somewhere, and using that namespace name could enable you to use
other people's stylesheets that also process elements in that namespace.
In the case of XHTML, the namespace name is:

  http://www.w3.org/1999/xhtml

You can associate the prefix 'html' with this namespace name using a
namespace attribute:

  xmlns:html="http://www.w3.org/1999/xhtml"

You don't have to use the prefix 'html' - you can use anything you want.

This attribute should be put on an element that is an ancestor of the HTML
elements (or is itself an HTML element).  A namespace attribute makes a
namespace 'in scope' (i.e. usable) for the element that it's on and all its
descendants.  Usually you'd put it on your document element (i.e. the
top-most element).  In your case, you could put it on the 'list' element:

<list xmlns:html="http://www.w3.org/1999/xhtml">
  ...
</list>

Within the 'list' element, any elements that are within the HTML namespace
need to be given qualified names to indicate that fact.  You do this by
adding the prefix (i.e. 'html') and a colon before the name of the element,
so:

<list xmlns:html="http://www.w3.org/1999/xhtml">
  <html:UL>
    <html:LI>List item 1</html:LI>
    <html:LI>List item 2</html:LI>
  </html:UL>
</list>

As a quick aside, XHTML defines that element names should be in lower case,
so I'd make this:

<list xmlns:html="http://www.w3.org/1999/xhtml">
  <html:ul>
    <html:li>List item 1</html:li>
    <html:li>List item 2</html:li>
  </html:ul>
</list>

for compliance with that standard.

In terms of the DTD for the source XML, DTDs and namespaces don't mix
particularly well: you have to use the same qualified names within the DTD
as you use within your XML, which means that the prefix is fixed within the
DTD.  [You could get around this using a parameter entity.]  If you have to
validate your source XML against a DTD, then the DTD should hold something
like:

<!ELEMENT list (html:ul)>
<!ATTLIST list
  xmlns:html CDATA #FIXED 'http://www.w3.org/1999/xhtml'>

<!ELEMENT html:ul (html:li+)>
<!ELEMENT html:li (#PCDATA)>
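
To give an idea of the parameter entity approach, here is a sketch of the
same declarations with the prefix held in one place.  The entity names
(html.prefix and so on) are ones I've made up for illustration, and this
only works in an external DTD, because parameter entity references aren't
allowed inside declarations in the internal subset:

<!-- the prefix is declared once; everything else is built from it -->
<!ENTITY % html.prefix "html">
<!ENTITY % html.ul "%html.prefix;:ul">
<!ENTITY % html.li "%html.prefix;:li">
<!ENTITY % html.xmlns.attrib
  "xmlns:%html.prefix; CDATA #FIXED 'http://www.w3.org/1999/xhtml'">

<!ELEMENT list (%html.ul;)>
<!ATTLIST list
  %html.xmlns.attrib;>

<!ELEMENT %html.ul; (%html.li;)+>
<!ELEMENT %html.li; (#PCDATA)>

Because the first declaration of an entity is the one that counts, a
document that wants a different prefix can declare its own html.prefix
entity in the internal subset of its DOCTYPE, which is read before the
external DTD.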

You may be able to draw on some of the XHTML modularisation work to import
relevant parts of the HTML DTD, but they may not be using qualified names,
I'm not sure.

Within the XSLT stylesheet, you have to ensure that all the relevant
namespaces are declared so that whenever you use a qualified name (like
'html:UL'), the namespace declaration for it is 'in scope'.  This usually
means putting the namespace attribute on the xsl:stylesheet document element:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:html="http://www.w3.org/1999/xhtml">
...
</xsl:stylesheet>

Again, you don't have to use the 'html' prefix, but you *do* have to make
sure that the namespace name (the http://www.w3.org/1999/xhtml URI) is the
same in your source XML and your stylesheet.  It's actually the namespace
name (or URI) that is used to determine the namespace that an element is
in, not its prefix.
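
To make that concrete, here is a small sketch of a stylesheet that uses the
prefix 'h' (an arbitrary choice for this example) where the source document
uses 'html'.  The template still matches, because both prefixes are bound
to the same namespace name:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml">

  <!-- 'h:ul' here matches 'html:ul' in the source document -->
  <xsl:template match="h:ul">
    <xsl:apply-templates />
  </xsl:template>

</xsl:stylesheet>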

Within your stylesheet, then, you can now place the rule "copy all HTML
elements".  The following template matches any element in the source that's
within the HTML namespace (whether it's within a 'list' or not):

<xsl:template match="html:*">
  <xsl:copy-of select="." />
</xsl:template>

However, when you're producing HTML output, copying is a bad idea because
while the XSLT processor will produce something that is technically correct
XML, it will not be interpreted correctly by the vast majority of HTML
browsers.  The above, for example, produces:

<html:ul xmlns:html="http://www.w3.org/1999/xhtml">
   <html:li>List item 1</html:li>
   <html:li>List item 2</html:li>
</html:ul>

because it literally copies everything, including the namespace nodes.
Instead, then, you should create by hand the relevant elements and
attributes, giving them names corresponding to the local part of their
name, without the namespace prefix:

<xsl:template match="html:*">
  <xsl:element name="{local-name()}">
    <xsl:for-each select="@*">
      <xsl:attribute name="{local-name()}">
        <xsl:value-of select="." />
      </xsl:attribute>
    </xsl:for-each>
    <xsl:apply-templates />
  </xsl:element>
</xsl:template>

This has the added advantage that if you have any specialised XML embedded
within your HTML elements, it will be treated as that specialised XML
rather than simply copied without paying attention to what it is.

So, to summarise:
1. declare the HTML namespace within your source document (namespace
attribute on document element)
2. change the names of HTML elements within your source document to give
them the relevant namespace prefix
3. add the namespace attribute to the DTD and change the names of the
relevant elements to reflect the namespace prefix
4. declare the HTML namespace within your stylesheet (namespace attribute
on xsl:stylesheet element)
5. use the above template to copy all HTML elements into your result
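
Putting the stylesheet side of those steps together, a minimal complete
stylesheet might look something like the following.  Treat it as a sketch
rather than a finished solution: I'm assuming that the attributes on your
HTML elements are unprefixed (which is the usual case, hence the "@*"),
that you want the html output method, and that the built-in templates are
enough to get from the 'list' element down to the HTML elements inside it:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:html="http://www.w3.org/1999/xhtml">

  <xsl:output method="html" />

  <!-- no template for 'list' is needed: the built-in templates
       simply move on to its children -->

  <!-- recreate each HTML element under its local name, so that
       no prefix or namespace declaration reaches the output -->
  <xsl:template match="html:*">
    <xsl:element name="{local-name()}">
      <!-- carry over the attributes, again by local name -->
      <xsl:for-each select="@*">
        <xsl:attribute name="{local-name()}">
          <xsl:value-of select="." />
        </xsl:attribute>
      </xsl:for-each>
      <!-- process the content, so that nested HTML elements
           come back to this same template -->
      <xsl:apply-templates />
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>

Run against the lower-case version of the 'list' example above, this
should give you plain ul and li elements with no 'html:' prefixes.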

I hope that this helps,

Jeni

Jeni Tennison
http://www.jenitennison.com/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

