RE: [xsl] regexs, grouping (?) and XSLT2?

Subject: RE: [xsl] regexs, grouping (?) and XSLT2?
From: "Michael Kay" <mhk@xxxxxxxxx>
Date: Tue, 10 Aug 2004 15:54:52 +0100
> (Or, of course, you could create a schema for the entire document and
> make sure the source is validated against that schema, so that the
> <mods:dateIssued> element is annotated with the correct type from the
> start.)

I agree, using a schema-defined union type just for use within the
stylesheet, when you aren't using it to describe the source or result
documents, is probably over the top.

It's interesting, though: there's no intrinsic reason why casts from string
to list or union types shouldn't be allowed.
> 
> Second, having the processor assign the correct type doesn't really
> buy you anything anyway, because there's precious little support for
> the xs:gHorribleKludge datatypes in XPath 2.0. 

I think it is worthwhile treating your "union of YYYY-MM-DD, YYYY-MM, or
YYYY" as a user-defined data type (say m:date), and defining your own
function library to manipulate this type. For example, you can define
functions like m:get-year($p as m:date) to extract the year, m:make-date($s
as xs:string) to construct an instance of this type, m:compare() to compare
two instances, and so on. 

I agree that you probably make life easier if you define this by restricting
xs:string rather than as a union over xs:date, xs:gYearMonth, and xs:gYear.
This is partly, as you point out, because you can't cast to a union type,
but also because you can then exploit the fact that the three member types
have a lot in common: for example, they all start with YYYY.
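
A minimal sketch of what such a function library might look like, assuming
m:date is handled as a plain string. The namespace URI bound to "m" is
made up for illustration, and the signatures use xs:string so that the code
also runs in a processor that is not schema-aware; in a schema-aware
stylesheet you would presumably use the user-defined m:date type there
instead.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:m="http://example.org/m-date"
    exclude-result-prefixes="xs m">
  <!-- The "m" namespace URI above is hypothetical. -->

  <!-- Construct a value: accept YYYY, YYYY-MM or YYYY-MM-DD and raise
       a dynamic error for anything else. Only the lexical shape is
       checked, not whether the month or day is in range. -->
  <xsl:function name="m:make-date" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:variable name="t" select="normalize-space($s)"/>
    <xsl:sequence select="
      if (matches($t, '^[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?$'))
      then $t
      else error((), concat('Not a valid m:date: ', $s))"/>
  </xsl:function>

  <!-- Extract the year: all three forms start with YYYY. -->
  <xsl:function name="m:get-year" as="xs:integer">
    <xsl:param name="d" as="xs:string"/>
    <xsl:sequence select="xs:integer(substring($d, 1, 4))"/>
  </xsl:function>

</xsl:stylesheet>
```

With these definitions, m:get-year(m:make-date('2004-08')) returns the
integer 2004, and m:make-date('August 2004') raises an error.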

> If you were
> constructing a function to group the <mods> elements by year, it would
> look something like:
> 
> <xsl:function name="mods:year" as="xs:integer">
>   <xsl:param name="mods" as="element(mods:mods)" />
>   <xsl:variable name="temp" as="element(*, mods:date)">
>     <mods:dateIssued xsl:type="mods:date">
>       <xsl:value-of select="$mods/mods:originInfo/mods:dateIssued" />
>     </mods:dateIssued>
>   </xsl:variable>
>   <xsl:variable name="date" as="xdt:anyAtomicType"
>                 select="data($temp)" />
>   <xsl:choose>
>     <xsl:when test="$date instance of xs:date">
>       <xsl:sequence select="year-from-date($date)" />
>     </xsl:when>
>     <xsl:otherwise>
>       <xsl:sequence
>         select="xs:integer(substring(string($date), 1, 4))" />
>     </xsl:otherwise>
>   </xsl:choose>
> </xsl:function>

Just to be clear, this is a function to extract the year component. It's of
course horribly heavy to have to convert the string into a union type just
so you can then extract an integer. If mods:dateIssued were defined in the
schema as an instance of the union type m:date, you could do it like this:

<xsl:function name="m:get-year" as="xs:integer">
  <xsl:param name="date" as="element(*, m:date)"/>
  <xsl:apply-templates select="$date" mode="get-year"/>
</xsl:function>

<xsl:template match="element(*, xs:date)" mode="get-year">
  <xsl:sequence select="year-from-date(.)"/>
</xsl:template>

<xsl:template match="element(*, xs:gYearMonth) | element(*, xs:gYear)"
mode="get-year">
  <xsl:sequence select="xs:integer(substring(string(.), 1, 4))"/>
</xsl:template>

It's a shame that you have to use template rules in order to get
polymorphism - but at least it's possible in XSLT, which it isn't in XQuery!

> 
> Another point to be made is that if you have a union type, there's no
> way to compare the values within that type with each other: you can't
> compare a xs:date with a xs:gYear, so you can't sort them into the
> order that you'd expect.

You can of course define a function that will compute a sort key, and use
this sort key for comparisons.
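
A rough sketch of such a sort key, assuming the dates are available as
strings in one of the three forms YYYY, YYYY-MM or YYYY-MM-DD. The names
m:sort-key and m:compare and the "m" namespace URI are made up for
illustration. Padding the missing parts with "-00" is one possible policy:
it places a bare year before any month in that year, and a bare year-month
before any day in that month.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mods="http://www.loc.gov/mods/v3"
    xmlns:m="http://example.org/m-date"
    exclude-result-prefixes="xs m">
  <!-- The "m" namespace URI above is hypothetical. -->

  <!-- Pad to ten characters:
       '2004'       becomes '2004-00-00'
       '2004-08'    becomes '2004-08-00'
       '2004-08-10' is returned unchanged -->
  <xsl:function name="m:sort-key" as="xs:string">
    <xsl:param name="d" as="xs:string"/>
    <xsl:sequence
      select="concat($d, substring('-00-00', string-length($d) - 3))"/>
  </xsl:function>

  <!-- Returns -1, 0 or 1, comparing the keys codepoint by codepoint. -->
  <xsl:function name="m:compare" as="xs:integer">
    <xsl:param name="a" as="xs:string"/>
    <xsl:param name="b" as="xs:string"/>
    <xsl:sequence select="compare(m:sort-key($a), m:sort-key($b),
      'http://www.w3.org/2005/xpath-functions/collation/codepoint')"/>
  </xsl:function>

  <!-- Usage: copy the mods:mods elements in date order. -->
  <xsl:template match="/">
    <sorted>
      <xsl:for-each select="//mods:mods">
        <xsl:sort select="m:sort-key(
          string(mods:originInfo/mods:dateIssued[1]))"/>
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </sorted>
  </xsl:template>

</xsl:stylesheet>
```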
> 
> [FWIW, I thought that the schema-aware version would turn out to be
> simple, since Mike's been going on about how much easier life is with
> schema-awareness; I'm surprised at how complicated it turns out to be,
> and it's possible that I'm missing some easier schema-aware method.]

The biggest benefit I've seen from schema-aware processing is in validating
the result document: my experience so far is that this definitely reduces
the time it takes to produce a stylesheet that delivers correct results: not
because the code you write is any different, but because you find the bugs
more quickly.
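
As an illustration only, result validation might be requested along these
lines in a schema-aware processor. The target namespace, the schema file
report.xsd and the r:report element are all hypothetical, and a processor
that is not schema-aware will reject xsl:import-schema.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical schema describing the result vocabulary. -->
  <xsl:import-schema namespace="http://example.org/report"
                     schema-location="report.xsd"/>

  <xsl:template match="/">
    <!-- validation="strict" asks the processor to validate the result
         tree against the imported schema and to fail if it is invalid. -->
    <xsl:result-document validation="strict">
      <r:report xmlns:r="http://example.org/report">
        <xsl:apply-templates/>
      </r:report>
    </xsl:result-document>
  </xsl:template>

</xsl:stylesheet>
```

A template that produces invalid output then makes the transformation fail,
rather than the problem turning up later in whatever consumes the result.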

I've also seen some benefits in processing source documents that have been
schema-validated, by exploiting the type information. This depends very much
on the particular schema, but if the type hierarchy has been properly
designed, then I think you get many opportunities to improve the structure
(and reduce the length) of the stylesheet code by making template rules more
generic (or specific!), and by defining functions to handle common
processing in the same way you normally use methods.

The scenario of creating a schema purely for the benefit of XSLT processing
is a much less likely one, and as I think this example shows, types come
into their own when associated with nodes rather than atomic values. 

Michael Kay
