[xsl] XHTML DTD aware transformation and indentation behaviour

Subject: [xsl] XHTML DTD aware transformation and indentation behaviour
From: Matthieu Ricaud-Dussarget <matthieu.ricaud@xxxxxxxxx>
Date: Thu, 02 Feb 2012 11:48:54 +0100
Hi all,

In my project I concatenate multiple XHTML files into one XML file. This aggregate file has to be edited by hand, so indentation matters for convenience.

Before I discovered XML Catalog, I used to delete all DOCTYPE declarations from the source XHTML files with a Perl script (which also replaced named entities with their UTF-8 characters). This worked fine: the concatenated files were indented exactly like the XHTML sources.

But this was a bit dangerous, in case the Perl script missed a special entity that needed replacing. And it was not really good XML practice.
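
To illustrate the preprocessing step described above, here is a minimal Python sketch. It is not the original Perl script, which is not shown in this post; the function name, the regular expressions and the choice to raise an error on an unknown entity are all assumptions made for illustration.

```python
import re
from html.entities import name2codepoint

# XML's own predefined entities must stay escaped, otherwise the markup breaks.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a DOCTYPE declaration, with or without a simple internal subset.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*(\[.*?\])?\s*>", re.DOTALL)
# Matches named entity references such as &eacute; or &nbsp;.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def strip_doctype_and_entities(xhtml: str) -> str:
    """Remove the DOCTYPE and replace named entities with literal characters.

    Hypothetical helper: it only knows the HTML 4 entity names that Python
    ships in html.entities.name2codepoint.
    """

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)  # leave &amp;, &lt;, ... untouched
        if name not in name2codepoint:
            # The risky case mentioned above: an entity the script does not know.
            raise ValueError(f"unknown entity: &{name};")
        return chr(name2codepoint[name])

    without_doctype = DOCTYPE_RE.sub("", xhtml, count=1)
    return ENTITY_RE.sub(replace, without_doctype)
```

Failing loudly on an unknown entity is one way to limit the risk of silently leaving an unresolved reference in the file, but it does not remove the more general objection that rewriting the source with regular expressions bypasses the XML parser.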

Now I'm using a local XML Catalog and I run my transformation with Saxon from the command line with these options:
-r:org.apache.xml.resolver.tools.CatalogResolver -x:org.apache.xml.resolver.tools.ResolvingXMLReader -y:org.apache.xml.resolver.tools.ResolvingXMLReader


Now to the problem. My XSL is a simple identity template:

<xsl:output method="xhtml" indent="no" encoding="UTF-8" omit-xml-declaration="no" doctype-public="-//W3C//DTD XHTML 1.1//EN" doctype-system="http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"/>

<xsl:template match="* | @* | processing-instruction() | comment()" mode="copy">
<xsl:copy copy-namespaces="no">
<xsl:apply-templates select="node()|@*" mode="copy"/>
</xsl:copy>
</xsl:template>


<xsl:template match="/">
<xsl:apply-templates mode="copy"/>
</xsl:template>

This is my XML source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>title</title>
<link href="my.css" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="my.js"></script>
</head>
<body>
<div class="body">
<div class="pageTitre_container">
<h1>
<span>Title 1</span>
</h1>
<p><span class="big">This</span> is <span class="little">a paragraphe</span></p>
<p><span class="big">This</span> is <span class="little">a paragraphe</span></p>
</div>
</div>
<table>
<caption>This is a table</caption>
<thead>
<tr>
<td>Col 1</td>
<td>Col 2</td>
<td>Col 3</td>
<td>Col 4</td>
<td>Col 5</td>
</tr>
</thead>
<tbody>
<tr>
<td> </td>
<td colspan="3" rowspan="7">
<p class="entitre-en-savoir-">À savoir</p>
<p class="no">
<span class="no-style-override-5">Certains grands magasins proposent des comparatifs très complets, prenez le temps de les parcourir. Vous pouvez également chercher des infos sur Internet via les sites des fabricants, ou sur les forums&#160;: rien ne vaut l’avis d’un consommateur pour se faire une idée précise du produit&#160;!</span>
</p>
</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
</body>
</html>


Which gives this output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>title</title><link href="my.css" rel="stylesheet" type="text/css" /><script type="text/javascript" src="my.js"></script></head><body><div class="body">
<div class="pageTitre_container">
<h1>
<span>Title 1</span>
</h1>
<p><span class="big">This</span> is <span class="little">a paragraphe</span></p>
<p><span class="big">This</span> is <span class="little">a paragraphe</span></p>
</div>
</div><table><caption>This is a table</caption><thead><tr><td>Col 1</td><td>Col 2</td><td>Col 3</td><td>Col 4</td><td>Col 5</td></tr></thead><tbody><tr><td> </td><td colspan="3" rowspan="7">
<p class="entitre-en-savoir-">À savoir</p>
<p class="no">
<span class="no-style-override-5">Certains grands magasins proposent des comparatifs très complets, prenez le temps de les parcourir. Vous pouvez également chercher des infos sur Internet via les sites des fabricants, ou sur les forums : rien ne vaut l’avis d’un consommateur pour se faire une idée précise du produit !</span>
</p>
</td><td> </td></tr><tr><td> </td><td> </td></tr><tr><td> </td><td> </td></tr><tr><td> </td><td> </td></tr><tr><td> </td><td> </td></tr><tr><td> </td><td> </td></tr><tr><td> </td><td> </td></tr><tr><td> </td><td> </td><td> </td><td> </td><td> </td></tr></tbody></table></body></html>


If I comment out the DOCTYPE in the source, I get:

<?xml version="1.0" encoding="UTF-8"?><!--<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">-->
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>title</title>
<link href="my.css" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="my.js"></script>
</head>
<body>
<div class="body">
<div class="pageTitre_container">
<h1>
<span>Title 1</span>
</h1>
<p><span class="big">This</span> is <span class="little">a paragraphe</span></p>
<p><span class="big">This</span> is <span class="little">a paragraphe</span></p>
</div>
</div>
<table>
<caption>This is a table</caption>
<thead>
<tr>
<td>Col 1</td>
<td>Col 2</td>
<td>Col 3</td>
<td>Col 4</td>
<td>Col 5</td>
</tr>
</thead>
<tbody>
<tr>
<td> </td>
<td colspan="3" rowspan="7">
<p class="entitre-en-savoir-">À savoir</p>
<p class="no">
<span class="no-style-override-5">Certains grands magasins proposent des comparatifs très complets, prenez le temps de les parcourir. Vous pouvez également chercher des infos sur Internet via les sites des fabricants, ou sur les forums : rien ne vaut l’avis d’un consommateur pour se faire une idée précise du produit !</span>
</p>
</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
</body>
</html>



The head element is now indented, and the table too. This is what I would like, but I don't want to comment out the DOCTYPE in the source.


Does it have something to do with the XHTML DTD content model? Any idea how to achieve what I'd like?

Thanks,

Matthieu.
