Re: [xsl] CALS table model - finding 'straddle' rows

Subject: Re: [xsl] CALS table model - finding 'straddle' rows
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Wed, 26 Feb 2014 12:37:44 +0000
It sounds like a positional grouping problem to me:

<xsl;template match="rbody">
  <xsl:for-each-group select="row" group-starting-with="row[entry [@nameend
&gt; @namest]][1]">


Michael Kay
Saxonica




On 26 Feb 2014, at 12:15, Feargal Hogan <feargal.hogan@xxxxxxxxx> wrote:

>
> Hi all
> I am building an XSLT process to go through a large collection of XML files,
looking for CALS model tables (most of the files contain at least one) and
then store the tabular data back to a database.
>
> In the first instance, many of these tables will have dissimilar structures,
but I want to use the database to analyse commonality of structure. I list
below an (incomplete) extract from an example file.
> <table frame="none">
> <tgroup cols="6" colsep="0" rowsep="0">
> <colspec colname="1" colnum="1" colwidth="127pt" align="center"/>
> <colspec colname="2" colnum="2" colwidth="39pt" align="center"/>
> <colspec colname="3" colnum="3" colwidth="30pt" align="center"/>
> <colspec colname="4" colnum="4" colwidth="33pt" align="center"/>
> <colspec colname="5" colnum="5" colwidth="33pt" align="center"/>
> <colspec colname="6" colnum="6" colwidth="87pt"/>
> <thead>
>    <row valign="bottom">
>        <entry align="center">Product</entry>
>        <entry>SKU</entry>
>        <entry>Length</entry>
>        <entry>Depth</entry>
>        <entry align="center">Weight</entry>
>        <entry align="center">Remarks</entry>
>    </row>
>    <row valign="bottom">
>        <entry></entry>
>        <entry></entry>
>        <entry>(m)</entry>
>        <entry>(m)</entry>
>        <entry align="center">(kg) </entry>
>        <entry align="center"> </entry>
>    </row>
> </thead>
> <tbody>
>    <row>
>        <entry align="left" namest="1" nameend="6"><hd4>Whites</hd4></entry>
>    </row>
>    <row>
>        <entry>Albion</entry>
>        <entry>12345</entry>
>        <entry>398</entry>
>        <entry>15.5</entry>
>        <entry> </entry>
>        <entry>N/A </entry>
>    </row>
>    <row>
>        <entry>Rotorua</entry>
>        <entry>12346</entry>
>        <entry>398</entry>
>        <entry>15.5</entry>
>        <entry> </entry>
>        <entry> </entry>
>    </row>
>    <row>
>        <entry>Quintep</entry>
>        <entry>12347</entry>
>        <entry>398</entry>
>        <entry>15.5</entry>
>        <entry> </entry>
>        <entry> </entry>
>    </row>
> ...
>
> Because of the dissimilar structures that I know I will encounter during the
process, I am unable to create a table schema in the database that will hold
all this data, other than to store each table cell as an entity with the
following properties:
> 	 doc_id
> 	 table_id
> 	 row_id
> 	 col_id
> 	 col_name
> 	 col_units
> 	 entry_value
>
> This will allow me to store both string and numeric values as strings.
>
> An additional property that is required in many instances is the 'category'
as defined in the straddle row in the example above at tbody/row[1].
>
> Here the original document creator has added a straddle to categorise the
rows immediately following:
> <row>
>    <entry align="left" namest="1" nameend="6"><hd4>Whites</hd4></entry>
> </row>
> These straddle rows are causing me some difficulties. Where they occur, they
'categorise' the rows following UNTIL the next straddle occurs.
>
> My initial document analysis has indicated that there are a number of
possible 'types' for the table structure in relation to these straddles:
> 	 Type 1. That no straddles occur in the tables - simple and easy to
process
> 	 Type 2. That the first row in the tbody is a straddle and that there are
zero or more further straddles below this in the tbody
> 	 Type 3. That the table contains straddles but NOT in the first row of the
tbody
> Type 3 'could' be treated as 2 separate tables, one of type 1 (all the rows
up to but excluding the 1st straddle row) and a Type 2 (all the rows from the
1st straddle forwards)
>
> So it seems that the key to solving this processing problem is to identify
the position of the 1st straddle, treat everything (zero or more rows) before
the straddle as Type1 and treat everything from the straddle forward as Type
2.
>
> But I am having some difficulty identifying the position of the 1st
straddle. My definition of 1st straddle - in Xpath terms - is
> tbody/row[entry [@nameend &gt; @namest]][1]
> This allows for the possibility that the straddle is not always keyed from
column 1 and does not always extend into the last column.
> Both of these possibilities do exist in the real world data.
>
> There are many similar solutions listed on this page
http://www.dpawson.co.uk/xsl/sect2/flatfile.html#d5010e13
>
> But I am having difficulty applying them to my instances.
>
> Something like this may work
> <xsl:key name="straddles" match="row[entry[@nameend &gt; @namest]]"
use="??????"/>
> But I'm unsure what to use to define the @use attribute of the key?
>
> When I try to define a first-straddle variable, I don't have a defining
value to pass to the key() function?
> <xsl:variable name="first-straddle"
select="table/tgroup/tbody/row[generate-id() =
generate-id(key('straddles',?????))]"/>
> How do I find the the location of the first straddle?
> What XPath statement accurately locates it?
> Thanks in advance
> Feargal

Current Thread