[xsl] XSLT1.0 conditional processing of TABLE TD

Subject: [xsl] XSLT1.0 conditional processing of TABLE TD
From: Indra Chandon <indra@xxxxxxxxxxxxxxx>
Date: Mon, 05 Jun 2006 10:55:36 +1000
Hi everyone,

I'm working with some rather poor HTML as my source XML document (I've run a quick HTML Tidy over the file, but the structure is still awkward). We are trying to extract data from an old Web site and load it into a database for a new one.

I've attempted to pull out the data using a multitude of methods, none of which quite gets me to the end point that I require. Below you will see my most recent (and, I have to add, ridiculously messy) attempt.

I really need some advice and/or code assistance with how to extract the element data so that, if a value is missing, I can still emit an empty comma-delimited field in my output.

My set up -
Editor: XMLSpy
Input: XML (poor HTML <4)
Translate: XSLT1.0
Output: Text (CSV)

Section of input -

<table border="0" cellspacing="0" cellpadding="0">
<tr>
<td>
<table align="left" width="500" border="0" cellpadding="0"
cellspacing="0">
<tr>
<td width="9" align="left" valign="top"><img src=
"/images/spacer.gif" height="5" width="9" /></td>
<td width="272" align="left" valign="top"><img src=
"/images/spacer.gif" height="5" width="272" /></td>
<td width="30" align="left" valign="top"><img src=
"/images/spacer.gif" height="5" width="30" /></td>
<td width="179" align="left" valign="top"><img src=
"/images/spacer.gif" height="5" width="179" /></td>
<td width="1" align="left" valign="top"><img src=
"/images/spacer.gif" height="5" width="1" /></td>
<td width="9" align="left" valign="top"><img src=
"/images/spacer.gif" height="5" width="9" /></td>
</tr>
<tr>
<td align="center" valign="top"></td>
<td align="left" valign="top"><font face=
"Verdana, Arial, Helvetica, Sans Serif" size="1" color="#333333"
class="bodytext"><b>Glen Petersen Architect Pty Ltd<br /></b>Suite
23 Corporate House<br />
Corporation Circuit<br />
TWEED HEADS SOUTH NSW 2486</font></td>
<td align="left" valign="top"><font face=
"Verdana, Arial, Helvetica, Sans Serif" size="1" color="#333333"
class="bodytext">Ph<br />
Fax</font></td>
<td align="left" valign="top"><font face=
"Verdana, Arial, Helvetica, Sans Serif" size="1" color="#333333"
class="bodytext"><b>07 5523 4220<br /></b>07 5523 4110</font></td>
<td align="left" valign="top"></td>
<td align="center" valign="top"></td>
</tr>
<tr>
<td colspan="6" align="center" valign="top"><img src=
"/images/spacer.gif" height="5" width="500" /></td>
</tr>
<tr>
<td align="center" valign="top"></td>
<td align="left" valign="top"><font face=
"Verdana, Arial, Helvetica, Sans Serif" size="1" color="#333333"
class="bodytext">Email <a href=
"mailto:admin@xxxxxxxxxxxxxxxxxx">admin@xxxxxxxxxxxxxxxxxx</a><br />

Web <a href=
"javascript:goURL('www.gparchitect.com.au',%20'50');">www.gparchitect.com.au</a></font></td>
<td colspan="3" align="left" valign="top"><font face=
"Verdana, Arial, Helvetica, sans-serif" size="1" color=
"#999999">Areas of practice<br />
<font size="-10"><a href="#LEGEND"><img src=
"/files/1/3692/576/1246/commercial.gif" width="16" height="16"
border="0" /></a></font></font></td>
<td align="center" valign="top"></td>
</tr>
<tr>
<td colspan="6" align="left" valign="top"><img src=
"/images/spacer.gif" height="5" width="500" /></td>
</tr>
<tr>
<td align="center" valign="top"></td>
<td colspan="4" align="left" valign="top"></td>
<td align="center" valign="top"></td>
</tr>
<tr>
<td colspan="6" align="left" valign="top"><img src=
"/files/1/1270/warm_grey.gif" height="1" width="500" /></td>
</tr>
</table>
</td>
</tr>
</table>
... the above nested table structure is repeated for each set of data that I need to extract.


Example output required -

Name, Address 1, Address 2, Town/Suburb, State, Postcode, Phone, Fax, Email, Web [new line]

Instance -
Glen Petersen Architect Pty Ltd, Suite 23 Corporate House, Corporation Circuit, Tweed Heads South, NSW, 2486, 07 5523 4220, 07 5523 4110, admin@xxxxxxxxxxxxxxxxxx, www.gparchitect.com.au [&#13;]
Greg Petersen Pty Ltd, 23 Corporate Street, Richmond, VIC, 3121, 03 9874 5402, , greg@xxxxxxxxxxxxxxx, [&#13;]


Conditions for output -
The Town/Suburb needs to be converted to lowercase (I haven't bothered to do this yet)
The State and Postcode need to be comma separated (I haven't bothered with this yet either)
The Phone needs to be comma separated if a value is given, otherwise a comma separated whitespace is required (I have been trying to work out how I can use the label in the preceding cell to determine which number is present)
The Fax, like the Phone, needs to be comma separated if a value is given, otherwise a comma separated whitespace is needed
The Email needs to be comma separated if it is given, otherwise a comma separated whitespace is required
The Web needs to be the last entry for each line, and if there is no value, then whitespace is required before the new line
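
For the Town/Suburb, State and Postcode conditions, a minimal XSLT 1.0 sketch might look like the following. It assumes the last address line always ends in a state abbreviation followed by a four-digit postcode, as in the sample; the template names split-locality and last-word are hypothetical and do not come from my stylesheet. Note that translate() only gives lowercase, so the mixed case shown in the example output would need extra work.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:output method="text"/>

   <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
   <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>

   <!-- Demonstration only: feeds the sample line to the template -->
   <xsl:template match="/">
       <xsl:call-template name="split-locality">
           <xsl:with-param name="line" select="'TWEED HEADS SOUTH NSW 2486'"/>
       </xsl:call-template>
   </xsl:template>

   <!-- Hypothetical helper: emits town,state,postcode -->
   <xsl:template name="split-locality">
       <xsl:param name="line"/>
       <xsl:variable name="s" select="normalize-space($line)"/>
       <!-- Assumption: the last four characters are the postcode -->
       <xsl:variable name="postcode" select="substring($s, string-length($s) - 3)"/>
       <xsl:variable name="rest"
           select="normalize-space(substring($s, 1, string-length($s) - 4))"/>
       <!-- Assumption: the last word of what remains is the state -->
       <xsl:variable name="state">
           <xsl:call-template name="last-word">
               <xsl:with-param name="text" select="$rest"/>
           </xsl:call-template>
       </xsl:variable>
       <xsl:variable name="town"
           select="normalize-space(substring($rest, 1, string-length($rest) - string-length($state)))"/>
       <xsl:value-of select="translate($town, $upper, $lower)"/>
       <xsl:text>,</xsl:text>
       <xsl:value-of select="$state"/>
       <xsl:text>,</xsl:text>
       <xsl:value-of select="$postcode"/>
   </xsl:template>

   <!-- Hypothetical helper: recurses until no space is left in the text -->
   <xsl:template name="last-word">
       <xsl:param name="text"/>
       <xsl:choose>
           <xsl:when test="contains($text, ' ')">
               <xsl:call-template name="last-word">
                   <xsl:with-param name="text" select="substring-after($text, ' ')"/>
               </xsl:call-template>
           </xsl:when>
           <xsl:otherwise>
               <xsl:value-of select="$text"/>
           </xsl:otherwise>
       </xsl:choose>
   </xsl:template>
</xsl:stylesheet>
```

Run against any input document, this should produce tweed heads south,NSW,2486 for the sample line. Because each comma is written unconditionally, a missing value simply leaves an empty field.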


Latest transform attempt -

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:html="http://www.w3.org/1999/xhtml">
<xsl:output method="text" omit-xml-declaration="yes" media-type="text/plain"/>


   <xsl:template match="/">
       <xsl:apply-templates/>
   </xsl:template>

   <xsl:template match="html:table/html:tr/html:td/html:table/html:tr[1]"/>
   <xsl:template match="html:table/html:tr/html:td/html:table/html:tr[3]"/>
   <xsl:template match="html:table/html:tr/html:td/html:table/html:tr[5]"/>
   <xsl:template match="html:table/html:tr/html:td/html:table/html:tr[6]"/>
   <xsl:template match="html:table/html:tr/html:td/html:table/html:tr[7]"/>

   <xsl:template match="html:font">
       <xsl:call-template name="process-node"/>
   </xsl:template>

   <xsl:template name="process-node">
       <xsl:apply-templates select="html:b"/>
       <xsl:call-template name="loop-control">
           <xsl:with-param name="n" select="count(child::node())"/>
       </xsl:call-template>
   </xsl:template>

   <xsl:template name="loop-control">
       <xsl:param name="n"/>
       <xsl:param name="x" select="1"/>
       <xsl:if test="$n != 0">
           <xsl:choose>
               <xsl:when test="child::node()[$x][(self::html:b)]">
                   <xsl:call-template name="loop-control">
                       <xsl:with-param name="n" select="$n - 1"/>
                       <xsl:with-param name="x" select="$x + 1"/>
                   </xsl:call-template>
               </xsl:when>
               <xsl:when test="child::node()[$x][(self::html:br)]">
                   <xsl:call-template name="loop-control">
                       <xsl:with-param name="n" select="$n - 1"/>
                       <xsl:with-param name="x" select="$x + 1"/>
                   </xsl:call-template>
               </xsl:when>
               <xsl:otherwise>
                   <xsl:choose>
                       <xsl:when test="child::node()[$x][contains(., 'Ph')]">
                           <xsl:call-template name="contact-numbers">
                               <xsl:with-param name="telephone" select="'true'"/>
                           </xsl:call-template>
                       </xsl:when>
                       <xsl:when test="child::node()[$x][contains(., 'Fax')]">
                           <xsl:call-template name="contact-numbers">
                               <xsl:with-param name="facsimilie" select="'true'"/>
                           </xsl:call-template>
                       </xsl:when>
                   </xsl:choose>
                   <xsl:choose>
                       <xsl:when test="self::node()[contains(., 'Areas')]">
                       </xsl:when>
                       <xsl:when test="self::node()[contains(., 'Email')]">
                           <xsl:for-each select="self::node()">
                               <xsl:if test="self::node()[contains(.,'Email')]">
                                   <xsl:call-template name="substituteEmail">
                                       <xsl:with-param name="string" select="self::node()"/>
                                   </xsl:call-template>
                               </xsl:if>
                               <xsl:if test="self::node()[contains(.,'Web')]">
                                   <xsl:call-template name="substituteWeb">
                                       <xsl:with-param name="string" select="self::node()"/>
                                   </xsl:call-template>
                                   <xsl:text>&#13;</xsl:text>
                               </xsl:if>
                           </xsl:for-each>
                       </xsl:when>
                       <xsl:otherwise>
                           <xsl:call-template name="output-text">
                               <xsl:with-param name="data" select="normalize-space(child::node()[$x])"/>
                           </xsl:call-template>
                           <xsl:call-template name="loop-control">
                               <xsl:with-param name="n" select="$n - 1"/>
                               <xsl:with-param name="x" select="$x + 1"/>
                           </xsl:call-template>
                       </xsl:otherwise>
                   </xsl:choose>
               </xsl:otherwise>
           </xsl:choose>
       </xsl:if>
   </xsl:template>

   <xsl:template match="html:b">
       <xsl:value-of select="normalize-space(.)"/>
       <xsl:text>,</xsl:text>
   </xsl:template>

   <xsl:template name="contact-numbers">
       <xsl:param name="telephone"/>
       <xsl:param name="facsimilie"/>
   </xsl:template>

   <xsl:template name="output-text">
       <xsl:param name="data"/>
       <xsl:value-of select="$data"/>
       <xsl:text>,</xsl:text>
   </xsl:template>

   <xsl:template name="substituteEmail">
       <xsl:param name="string"/>
       <xsl:param name="from" select="'Email'"/>
       <xsl:param name="to"/>
       <xsl:choose>
           <xsl:when test="contains($string, $from)">
               <xsl:value-of select="substring-before($string, $from)"/>
               <xsl:copy-of select="$to"/>
               <xsl:call-template name="substituteEmail">
                   <xsl:with-param name="string" select="substring-after($string, $from)"/>
                   <xsl:with-param name="from" select="$from"/>
                   <xsl:with-param name="to" select="$to"/>
               </xsl:call-template>
           </xsl:when>
           <xsl:otherwise>
               <xsl:value-of select="$string"/>
           </xsl:otherwise>
       </xsl:choose>
   </xsl:template>
</xsl:stylesheet>


Explanation -
I've gone for a standard <xsl:apply-templates> at the root (this was mainly because I'd initially tried to process individual nodes of the tree without getting close to the output I needed).


Next, I declare a bunch of empty, non-processing <xsl:template> rules to ensure that I don't traverse the <tr> elements that I don't need to inspect.
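
An alternative to suppressing the unwanted rows might be to select only the rows of interest. The sketch below assumes, as my empty templates already do, that every record is an inner table nested in a <td>, that row 2 holds the name and row 4 the email and web links, and that the input is in the XHTML namespace. The mode names contact and links are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:html="http://www.w3.org/1999/xhtml">
   <xsl:output method="text"/>

   <!-- Assumption: one record per table nested inside a td -->
   <xsl:template match="/">
       <xsl:apply-templates select="//html:td/html:table"/>
   </xsl:template>

   <xsl:template match="html:td/html:table">
       <!-- Visit only the rows that carry data; the rest are never selected -->
       <xsl:apply-templates select="html:tr[2]" mode="contact"/>
       <xsl:apply-templates select="html:tr[4]" mode="links"/>
       <xsl:text>&#10;</xsl:text>
   </xsl:template>

   <!-- Row 2: here only the bold company name is written out -->
   <xsl:template match="html:tr" mode="contact">
       <xsl:value-of select="normalize-space(html:td[2]//html:b)"/>
       <xsl:text>,</xsl:text>
   </xsl:template>

   <!-- Row 4: here only the text of the first link is written out -->
   <xsl:template match="html:tr" mode="links">
       <xsl:value-of select="normalize-space(html:td[2]//html:a[1])"/>
   </xsl:template>
</xsl:stylesheet>
```

With this shape the spacer rows need no templates at all, because nothing ever applies templates to them.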

Then I grab the <font> elements and use <xsl:apply-templates> for the nested <b> content and <xsl:call-template> for the other nodes in the element.

I have set up a rather cumbersome loop to extract the content using the <br /> as a token.
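
A possibly simpler route than counting child nodes is to iterate over the non-empty text nodes directly, since each <br /> already splits the content into separate text nodes. This is only a sketch: it assumes the name and address cell is the second <td> of row 2 and that the input is in the XHTML namespace, and the mode name lines is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:html="http://www.w3.org/1999/xhtml">
   <xsl:output method="text"/>

   <!-- Assumption: name and address sit in row 2, cell 2 of each inner table -->
   <xsl:template match="/">
       <xsl:apply-templates mode="lines"
           select="//html:td/html:table/html:tr[2]/html:td[2]/html:font"/>
   </xsl:template>

   <xsl:template match="html:font" mode="lines">
       <!-- Every non-blank text node is one field, whether inside b or not -->
       <xsl:for-each select=".//text()[normalize-space()]">
           <xsl:value-of select="normalize-space(.)"/>
           <xsl:if test="position() != last()">
               <xsl:text>,</xsl:text>
           </xsl:if>
       </xsl:for-each>
       <xsl:text>&#10;</xsl:text>
   </xsl:template>
</xsl:stylesheet>
```

A record with only one address line would yield one field fewer here, so padding to a fixed number of columns would still need separate handling.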

When I get to the <td> with the Ph and Fax labels, I use <xsl:call-template> with parameters (my idea was to somehow use these to determine whether to output whitespace or a value).
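
One way the labels could drive the output is to pair them with the numbers by position. The sketch below rests on an assumption I have not verified across the whole file: that the Nth label in the third cell lines up with the Nth number in the fourth cell, and that a missing number also has its label missing. The template name number-for is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:html="http://www.w3.org/1999/xhtml">
   <xsl:output method="text"/>

   <xsl:template match="/">
       <xsl:for-each select="//html:td/html:table/html:tr[2]">
           <xsl:call-template name="number-for">
               <xsl:with-param name="label" select="'Ph'"/>
           </xsl:call-template>
           <xsl:text>,</xsl:text>
           <xsl:call-template name="number-for">
               <xsl:with-param name="label" select="'Fax'"/>
           </xsl:call-template>
           <xsl:text>&#10;</xsl:text>
       </xsl:for-each>
   </xsl:template>

   <!-- Hypothetical helper; the context node is the tr of the caller -->
   <xsl:template name="number-for">
       <xsl:param name="label"/>
       <xsl:variable name="labels" select="html:td[3]//text()[normalize-space()]"/>
       <xsl:variable name="values" select="html:td[4]//text()[normalize-space()]"/>
       <xsl:for-each select="$labels">
           <xsl:variable name="p" select="position()"/>
           <!-- Write the number at the same position as the matching label -->
           <xsl:if test="normalize-space(.) = $label">
               <xsl:value-of select="normalize-space($values[position() = $p])"/>
           </xsl:if>
       </xsl:for-each>
   </xsl:template>
</xsl:stylesheet>
```

If a label is absent, nothing is written for it, and the unconditional comma leaves the field empty.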

The substituteEmail template is about stripping out the text string from the node (probably overkill, as I could skip the value in outputting, but seemed a good idea a few transforms earlier :-)).
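
Instead of stripping the Email and Web labels out of the string, it may be enough to read the text of the <a> elements themselves. This sketch assumes the email link is the one whose href starts with mailto: and that any other link in that cell is the web address; it also assumes the links sit in the second cell of row 4.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:html="http://www.w3.org/1999/xhtml">
   <xsl:output method="text"/>

   <xsl:template match="/">
       <!-- Assumption: email and web links sit in row 4, cell 2 -->
       <xsl:for-each select="//html:td/html:table/html:tr[4]/html:td[2]">
           <xsl:value-of
               select="normalize-space(.//html:a[starts-with(@href, 'mailto:')])"/>
           <xsl:text>,</xsl:text>
           <xsl:value-of
               select="normalize-space(.//html:a[not(starts-with(@href, 'mailto:'))])"/>
           <!-- A line feed is used here; swap in &#13; if that is required -->
           <xsl:text>&#10;</xsl:text>
       </xsl:for-each>
   </xsl:template>
</xsl:stylesheet>
```

When either link is missing, normalize-space() of the empty node-set returns an empty string, so the field stays blank without any extra test.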

As you can see from the code, there are lots of rather wasteful <xsl:call-template> instructions, but as this will be a one-time run, we are happy to sacrifice efficiency for the right result.

I'm not a programmer in any sense of the word, so I struggle with the kind of logic necessary for these types of tasks.

If you are able to suggest a simpler approach to any of the steps I require, I'd be immensely grateful. Likewise, if you can provide comments, corrections or improvements for the code I've provided, I'll be overjoyed. :-)

Regards,
Indra Chandon
Semantia Pty Ltd
