[xsl] CALS table model - finding 'straddle' rows

From: Feargal Hogan <feargal.hogan@xxxxxxxxx>
Date: Wed, 26 Feb 2014 12:15:07 +0000
Hi all
I am building an XSLT process to go through a large collection of XML files,
look for CALS model tables (most of the files contain at least one) and
then store the tabular data in a database.

In the first instance, many of these tables will have dissimilar structures,
but I want to use the database to analyse commonality of structure. I list
below an (incomplete) extract from an example file.
<table frame="none">
<tgroup cols="6" colsep="0" rowsep="0">
<colspec colname="1" colnum="1" colwidth="127pt" align="center"/>
<colspec colname="2" colnum="2" colwidth="39pt" align="center"/>
<colspec colname="3" colnum="3" colwidth="30pt" align="center"/>
<colspec colname="4" colnum="4" colwidth="33pt" align="center"/>
<colspec colname="5" colnum="5" colwidth="33pt" align="center"/>
<colspec colname="6" colnum="6" colwidth="87pt"/>
<thead>
    <row valign="bottom">
        <entry align="center">Product</entry>
        <entry>SKU</entry>
        <entry>Length</entry>
        <entry>Depth</entry>
        <entry align="center">Weight</entry>
        <entry align="center">Remarks</entry>
    </row>
    <row valign="bottom">
        <entry></entry>
        <entry></entry>
        <entry>(m)</entry>
        <entry>(m)</entry>
        <entry align="center">(kg) </entry>
        <entry align="center"> </entry>
    </row>
</thead>
<tbody>
    <row>
        <entry align="left" namest="1" nameend="6"><hd4>Whites</hd4></entry>
    </row>
    <row>
        <entry>Albion</entry>
        <entry>12345</entry>
        <entry>398</entry>
        <entry>15.5</entry>
        <entry> </entry>
        <entry>N/A </entry>
    </row>
    <row>
        <entry>Rotorua</entry>
        <entry>12346</entry>
        <entry>398</entry>
        <entry>15.5</entry>
        <entry> </entry>
        <entry> </entry>
    </row>
    <row>
        <entry>Quintep</entry>
        <entry>12347</entry>
        <entry>398</entry>
        <entry>15.5</entry>
        <entry> </entry>
        <entry> </entry>
    </row>
...

Because of the dissimilar structures that I know I will encounter during the
process, I am unable to create a table schema in the database that will hold
all this data, other than to store each table cell as an entity with the
following properties:
	 doc_id
	 table_id
	 row_id
	 col_id
	 col_name
	 col_units
	 entry_value

This will allow me to store both string and numeric values as strings.
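
A minimal XSLT 1.0 sketch of how such cell records might be emitted, one
tab-separated line per entry. Several things here are assumptions rather than
part of the real process: doc_id arrives as a hypothetical stylesheet
parameter, table_id is just the processor-generated id of the table, col_name
is read from the first thead row and col_units from the second (as in the
extract), and col_id is the entry's position in its row, which is only
reliable for rows without spans.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical parameter: the caller supplies the document identifier -->
  <xsl:param name="doc_id" select="''"/>

  <xsl:template match="/">
    <xsl:apply-templates select="//tbody/row/entry"/>
  </xsl:template>

  <xsl:template match="tbody/row/entry">
    <!-- col_id: position of this entry within its row (assumes no spans) -->
    <xsl:variable name="col" select="count(preceding-sibling::entry) + 1"/>
    <xsl:variable name="head" select="ancestor::tgroup[1]/thead"/>
    <xsl:value-of select="$doc_id"/>
    <xsl:text>&#9;</xsl:text>
    <!-- table_id: assumption, any stable identifier would do -->
    <xsl:value-of select="generate-id(ancestor::table[1])"/>
    <xsl:text>&#9;</xsl:text>
    <!-- row_id: 1-based position of the row within its tbody -->
    <xsl:value-of select="count(../preceding-sibling::row) + 1"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="$col"/>
    <xsl:text>&#9;</xsl:text>
    <!-- col_name and col_units: assumed to be thead rows 1 and 2 -->
    <xsl:value-of select="normalize-space($head/row[1]/entry[$col])"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="normalize-space($head/row[2]/entry[$col])"/>
    <xsl:text>&#9;</xsl:text>
    <!-- entry_value: stored as a string whatever its type -->
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```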

An additional property that is required in many instances is the 'category' as
defined in the straddle row in the example above at tbody/row[1].

Here the original document creator has added a straddle to categorise the rows
immediately following:
<row>
    <entry align="left" namest="1" nameend="6"><hd4>Whites</hd4></entry>
</row>
These straddle rows are causing me some difficulties. Where they occur, they
'categorise' the rows following UNTIL the next straddle occurs.
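
A minimal XSLT 1.0 sketch of that rule, under two assumptions: colname values
are numeric, as in the extract, so that @nameend > @namest is a numeric
comparison, and the category text is simply the string value of the straddle
row. The tab-separated output is for illustration only. Each data row looks
back along the preceding-sibling axis to the nearest straddle, so a row that
comes before any straddle gets an empty category.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:apply-templates select="//tbody/row"/>
  </xsl:template>

  <!-- Straddle rows only supply a category, so they emit nothing themselves.
       The explicit priority avoids a conflict with the plain row template. -->
  <xsl:template match="tbody/row[entry[@nameend &gt; @namest]]" priority="1"/>

  <xsl:template match="tbody/row">
    <!-- On a reverse axis, [1] picks the nearest preceding straddle row -->
    <xsl:variable name="category"
        select="normalize-space(
                  preceding-sibling::row[entry[@nameend &gt; @namest]][1])"/>
    <xsl:value-of select="$category"/>
    <xsl:text>&#9;</xsl:text>
    <!-- First cell of the data row, e.g. the product name -->
    <xsl:value-of select="normalize-space(entry[1])"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```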

My initial document analysis has indicated that there are a number of possible
'types' for the table structure in relation to these straddles:
	 Type 1. That no straddles occur in the tables - simple and easy to process
	 Type 2. That the first row in the tbody is a straddle and that there are
zero or more further straddles below this in the tbody
	 Type 3. That the table contains straddles but NOT in the first row of the
tbody
Type 3 'could' be treated as 2 separate tables: one of Type 1 (all the rows up
to but excluding the 1st straddle row) and one of Type 2 (all the rows from the
1st straddle forwards).

So it seems that the key to solving this processing problem is to identify the
position of the 1st straddle, treat everything (zero or more rows) before the
straddle as Type 1 and treat everything from the straddle forward as Type 2.

But I am having some difficulty identifying the position of the 1st straddle.
My definition of the 1st straddle, in XPath terms, is
tbody/row[entry[@nameend > @namest]][1]
This allows for the possibility that the straddle is not always keyed from
column 1 and does not always extend into the last column.
Both of these possibilities do exist in the real-world data.
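
A minimal XSLT 1.0 sketch of how that definition might split a tbody into its
Type 1 and Type 2 portions. It assumes numeric colname values, as in the
extract; with non-numeric names the comparison would never be true and every
row would land in the Type 1 portion. The variable names and the count output
are illustrative only.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:apply-templates select="//tbody"/>
  </xsl:template>

  <xsl:template match="tbody">
    <!-- First row, in document order, holding an entry that spans columns -->
    <xsl:variable name="first-straddle"
        select="row[entry[@nameend &gt; @namest]][1]"/>
    <!-- Type 1 portion: rows before the first straddle,
         or every row when the tbody has no straddle at all -->
    <xsl:variable name="plain"
        select="row[not($first-straddle)]
                | $first-straddle/preceding-sibling::row"/>
    <!-- Type 2 portion: the first straddle and everything after it -->
    <xsl:variable name="categorised"
        select="$first-straddle | $first-straddle/following-sibling::row"/>
    <xsl:text>plain=</xsl:text>
    <xsl:value-of select="count($plain)"/>
    <xsl:text> categorised=</xsl:text>
    <xsl:value-of select="count($categorised)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```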

There are many similar solutions listed on this page
http://www.dpawson.co.uk/xsl/sect2/flatfile.html#d5010e13

But I am having difficulty applying them to my instances.

Something like this may work:
<xsl:key name="straddles" match="row[entry[@nameend &gt; @namest]]"
use="??????"/>
But I'm unsure what to use for the @use attribute of the key.

When I try to define a first-straddle variable, I don't have a defining value
to pass to the key() function:
<xsl:variable name="first-straddle"
select="table/tgroup/tbody/row[generate-id() =
generate-id(key('straddles',?????))]"/>
How do I find the location of the first straddle?
What XPath statement accurately locates it?
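
One direction that might be worth trying, offered as an untested sketch rather
than a confirmed answer: key each straddle row on the generated id of its
parent tbody, so that the tbody itself supplies the value passed to key().
This again assumes numeric colname values; the text output is illustrative.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Index every straddle row by the generated id of its parent tbody -->
  <xsl:key name="straddles"
           match="tbody/row[entry[@nameend &gt; @namest]]"
           use="generate-id(..)"/>

  <xsl:template match="/">
    <xsl:apply-templates select="//tbody"/>
  </xsl:template>

  <xsl:template match="tbody">
    <!-- The context node is the tbody, so generate-id() yields the key value;
         [1] takes the first straddle in document order -->
    <xsl:variable name="first-straddle"
        select="key('straddles', generate-id())[1]"/>
    <xsl:choose>
      <xsl:when test="$first-straddle">
        <!-- 1-based position of the first straddle among the tbody rows -->
        <xsl:value-of
            select="count($first-straddle/preceding-sibling::row) + 1"/>
      </xsl:when>
      <!-- 0 signals a tbody with no straddle (Type 1) -->
      <xsl:otherwise>0</xsl:otherwise>
    </xsl:choose>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

A result of 1 would correspond to Type 2 and anything greater than 1 to Type 3.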
Thanks in advance
Feargal
