Re: [xsl] Looking for a cleaner way of auditing table cell data than this

Subject: Re: [xsl] Looking for a cleaner way of auditing table cell data than this
From: "Eliot Kimber eliot.kimber@xxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 10 Mar 2023 00:02:17 -0000
I can second the recommendation for BaseX as a tool here: it's easy to
install, it supports XML catalogs out of the box, and you can just point it at
a directory and load it up quick and easy.

If you don't need DTD-aware parsing, it's really fast. For example, on our
corpus of about 40K DITA documents I can load it from disk in about two
minutes with DTD parsing turned off.

From the BaseX GUI you can then do whatever XPath or XQuery you want to
analyze and report on your data.

If you're not familiar with XQuery I also recommend XQuery for Humanists
(https://www.tamupress.com/book/9781623498290/xquery-for-humanists/) as an
excellent introductory how-to text. The target audience is people familiar
with XML but not necessarily XML experts. I found it to provide a really solid
overview of XQuery as well as useful practical examples that you can follow
along with.

Cheers,

E.

_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368

From: Steven D. Majewski steve.majewski@xxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thursday, March 9, 2023 at 5:36 PM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx <xsl-list@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: [xsl] Looking for a cleaner way of auditing table cell data than
this

If you have a substantial library of documents you want to report on, I would
suggest you use an XQuery database like BaseX or eXist that indexes the
documents, and do the work with your XPath selector.
If I understand your question, this should select table cells (td) that have
significant (i.e. non-whitespace) text and a child element on the list (and
you can make the list a variable).

//table/td[normalize-space(.) != ''][*[local-name() = ('para',
'note', 'cnote', 'critical', 'headline', …)]]
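
A minimal sketch of how a selector like this might be run from an XSLT 2.0
stylesheet. It makes two assumptions that go beyond the expression above: it
uses //table//td because the sample markup in the original question puts cells
below intermediate elements, and it tests for a non-whitespace text node child,
since the string value of a cell also includes the text of its block children.
The variable name $block-names is hypothetical, and its list holds only the
names given above, so it would need extending.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>

  <!-- Hypothetical variable; extend with the remaining block element names. -->
  <xsl:variable name="block-names" as="xs:string*"
      select="('para', 'note', 'cnote', 'critical', 'headline')"/>

  <xsl:template match="/">
    <!-- First predicate: the cell has a non-whitespace text node child.
         Second predicate: the cell has a child element named in the list. -->
    <xsl:for-each select="//table//td[text()[normalize-space(.) != '']]
                                     [*[local-name() = $block-names]]">
      <xsl:text>Table cell with mixed content: </xsl:text>
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Comparing local-name() against a sequence of strings relies on the XPath 2.0
general comparison, which is true if any item in the sequence matches.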


On Aug 29, 2022, at 10:37 AM, Trevor Nicholls trevor@xxxxxxxxxxxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

Hi

I have a substantial library of XML documents which include a great number of
tables. As it happens the content model for table cells is promiscuous; a
table cell may contain "block" data:

<td>
  <para>blah blah.</para>
</td>

even to the extent of nested tables:

<td>
  <para>..</para>
  <table>
    <tb>
      ..
    </tb>
  </table>
</td>

or, in the case of very many simple tables, just simple text content:

<td>Y</td>
<td>N</td>

I would like to identify cases where table cells have exploited the
promiscuous schema and mixed both text and block content, for example:

<td>For example:<para>This is a bad table cell.</para></td>

I can't construct the schema so that this is illegal while the earlier
examples are valid. At least I don't think I can. But I would like to identify
these cells (and correct them, but at the moment just reporting them is
sufficient).

This is the XSL fragment I have come up with (using XSL 2), but I imagine
there is a much cleaner way of doing it and I might learn a useful technique
if I ask.

<xsl:template name="mixed-cells">
  <xsl:for-each select="//table">
    <xsl:for-each select="descendant::td[child::text()[normalize-space() !=
'']]">
      <xsl:if test="count(*[self::para | self::note | self::cnote |
self::critical | self::headline | self::error | self::define | self::qanda |
self::inset | self::ihead | self::steps | self::list | self::ol | self::inlist
| self::syntax| self::fragment | self::table]) &gt; 0">
        <xsl:text>Table cell with mixed content: </xsl:text>
        <xsl:call-template name="get-source" />
        <xsl:value-of select="$nl" />
        <xsl:text> content=</xsl:text>
        <xsl:value-of select="normalize-space(.)" />
        <xsl:value-of select="$nl" />
      </xsl:if>
    </xsl:for-each>
  </xsl:for-each>
</xsl:template>

The normalize-space() in the third line is necessary because otherwise it
picks up newlines in a sequence of block children.
The list of "block" elements in the fourth line above is incomplete, and
should probably be sourced from a variable rather than given as a literal
condition the way I have done it here.
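
One way the list might be moved into a variable, as a rough sketch only. The
variable name $block-elements is made up for illustration, the list is the
same incomplete one used in the template above, and base-uri() stands in for
the get-source template, which is not shown here. The loop structure is left
as it is in the template above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>

  <xsl:variable name="nl" select="'&#10;'"/>

  <!-- Hypothetical variable holding the (still incomplete) block list. -->
  <xsl:variable name="block-elements" as="xs:string*"
      select="('para', 'note', 'cnote', 'critical', 'headline', 'error',
               'define', 'qanda', 'inset', 'ihead', 'steps', 'list', 'ol',
               'inlist', 'syntax', 'fragment', 'table')"/>

  <xsl:template match="/">
    <xsl:call-template name="mixed-cells"/>
  </xsl:template>

  <xsl:template name="mixed-cells">
    <xsl:for-each select="//table">
      <xsl:for-each
          select="descendant::td[child::text()[normalize-space() != '']]">
        <!-- True if any child element's name appears in the list. -->
        <xsl:if test="exists(*[name() = $block-elements])">
          <xsl:text>Table cell with mixed content: </xsl:text>
          <!-- Stand-in for get-source: document URI only, no line number. -->
          <xsl:value-of select="base-uri(.)"/>
          <xsl:value-of select="$nl"/>
          <xsl:text> content=</xsl:text>
          <xsl:value-of select="normalize-space(.)"/>
          <xsl:value-of select="$nl"/>
        </xsl:if>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The name() comparison assumes the elements are in no namespace, as the
self::para style tests in the template above suggest.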
The get-source template outputs the input document name and current line
number, and $nl is what you would expect it to be.

As it stands this template is going to report nested table cells multiple
times; there might be a clever fix for this but at the moment my focus is on
the best way to identify these troublesome cells in the first place.
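
A possible way round the repeated reports, sketched under the assumption that
the repeats come from visiting a nested table once as a descendant of its
outer table and once in its own right. A single path expression returns each
node only once, so selecting all the cells in one step and letting a template
rule decide which ones to report should visit every cell exactly once. The
mode name "audit" is made up for this example, and the block list is the same
incomplete one as in the template above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- One path expression: its result holds each td once, nested or not. -->
    <xsl:apply-templates select="//table//td" mode="audit"/>
  </xsl:template>

  <!-- Cells with a non-whitespace text node child and a block child. -->
  <xsl:template mode="audit"
      match="td[text()[normalize-space() != '']]
               [para | note | cnote | critical | headline | error | define
                | qanda | inset | ihead | steps | list | ol | inlist
                | syntax | fragment | table]">
    <xsl:text>Table cell with mixed content: </xsl:text>
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Every other cell: report nothing. -->
  <xsl:template match="td" mode="audit"/>

</xsl:stylesheet>
```

The rule with predicates has a higher default priority than the plain td
rule, so no explicit priority attribute should be needed.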

cheers
T
XSL-List info and archive<http://www.mulberrytech.com/xsl/xsl-list>
EasyUnsubscribe<http://lists.mulberrytech.com/unsub/xsl-list/504751> (by
email)
