[xsl] xslt function for generating grammatical paradigms

Subject: [xsl] xslt function for generating grammatical paradigms
From: David J Birnbaum <djbpitt+xml@xxxxxxxx>
Date: Sun, 20 Apr 2008 22:10:43 -0400
Dear XSLT List,

I'm looking into developing an XSLT 2.0 stylesheet that will take a linguistic stem of the form XYZ- (where X, Y, and Z are the letters in the stem of a lexeme) and generate the full range of endings that occur on that word in the relevant grammatical paradigm. Writing up a set of <stem> elements and a set of <ending> elements and pasting together all possible combinations is easy enough; the problem is sandhi rules, which may cause both the stem-final consonant (Z in the preceding example) and the grammatical ending to change shape in certain circumstances. As a semi-hypothetical example:

1. Given stems "Zen-" and "duS-"

2. Given basic ending "y"

3. "Zen-" plus basic "y" yields "Zeny" (no changes).

4. "duS-" plus basic "y" yields "duSE" (basic "y" is replaced by "E") because it's a property of stem-final "S-" that it causes following grammatical endings that normally begin with "y" to change their first letters to "E". Sequences of "Sy" are fine elsewhere in words; this rule applies only at the juncture of stem and grammatical ending.

A brute-force solution is easy enough; just string together replace() functions like:

<xsl:variable name="temp06" select="replace($temp05, 'S-y', 'SE')"/>

(where the first rule creates $temp01 and feeds it to the rule that creates $temp02, and so on, and the function ultimately returns the output of the final replace() operation).

This type of brute-force approach would string together dozens (possibly hundreds) of these rules to account for all possible sandhi modifications. That seems inappropriately crude because the rules actually apply to *classes* of letters, so that, for example, basic "y" endings are replaced by "E" not just after "S", but after half a dozen different consonants, as well as after one or two consonant clusters (in those cases the trigger for the change is not the last stem consonant alone but the combination of the last two).

What I'm groping for, then, is an elegant rule-based function that lets me write a small number of rules by defining classes of letters to which they apply, something like "after 'S', 'Z', 'C', 'St', and 'Zd', 'y' is replaced by 'E'." As I mention above, these rules apply only at the boundary of stem plus ending; "S" can be followed by "y" elsewhere in a word. Since I've encoded my stems with trailing hyphens, I can easily distinguish "Sy" (which should be left alone) from "S-y" (which should be replaced by "SE").
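
To make the idea concrete, here is a minimal sketch of what such a class-based rule might look like, assuming the rules are kept as data and applied in order by a recursive function. The namespace URI, the function name djb:sandhi, and the <rule> element with its match and replace attributes are all invented for illustration; only the letter classes and the two stems come from the examples above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://example.org/sandhi"
    exclude-result-prefixes="xs djb">

  <xsl:output method="text"/>

  <!-- Hypothetical rule table. Each regex includes the hyphen that marks
       the stem/ending boundary, so "Sy" inside a word never matches.
       One rule covers the whole class of triggering consonants/clusters. -->
  <xsl:variable name="rules" as="element(rule)+">
    <rule match="(St|Zd|S|Z|C)-y" replace="$1E"/>
  </xsl:variable>

  <!-- Apply the rules one after another; once none are left, remove any
       boundary hyphen that no rule consumed. -->
  <xsl:function name="djb:sandhi" as="xs:string">
    <xsl:param name="form" as="xs:string"/>
    <xsl:param name="todo" as="element(rule)*"/>
    <xsl:sequence select="
      if (empty($todo))
      then translate($form, '-', '')
      else djb:sandhi(replace($form, $todo[1]/@match, $todo[1]/@replace),
                      subsequence($todo, 2))"/>
  </xsl:function>

  <!-- Named entry point (an assumption about how the stylesheet is run):
       attach the basic ending "y" to both sample stems. -->
  <xsl:template name="main">
    <xsl:for-each select="('Zen-', 'duS-')">
      <xsl:value-of select="djb:sandhi(concat(., 'y'), $rules)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Run with "main" as the initial template, this should print "Zeny" and "duSE". Adding a rule means adding one <rule> element rather than another replace() call. One caveat: attributes of literal result elements are attribute value templates, so a regex that uses curly braces (for example a {2} quantifier) would need them doubled.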

There is also a type of rule where the stem-final consonant changes but the grammatical ending doesn't, along the lines of "when 'E' follows a stem that ends in 'k', 'g', or 'x', that stem-final consonant changes into 'C', 'Z', and 'S', respectively, and the 'E' doesn't change."
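
A rule of this second type might be sketched with xsl:analyze-string and translate(), assuming the same trailing-hyphen convention for stems. The function name djb:soften and the sample stem "ruk-" are invented for illustration; the mapping of 'k', 'g', 'x' to 'C', 'Z', 'S' is the one just described.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://example.org/sandhi"
    exclude-result-prefixes="xs djb">

  <xsl:output method="text"/>

  <!-- Change stem-final k, g, x to C, Z, S before an ending in "E".
       The ending itself is left alone; the boundary hyphen is consumed. -->
  <xsl:function name="djb:soften" as="xs:string">
    <xsl:param name="form" as="xs:string"/>
    <xsl:variable name="parts" as="xs:string*">
      <xsl:analyze-string select="$form" regex="([kgx])-E">
        <xsl:matching-substring>
          <!-- translate() maps each consonant to its counterpart by position -->
          <xsl:sequence select="concat(translate(regex-group(1), 'kgx', 'CZS'), 'E')"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:sequence select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="string-join($parts, '')"/>
  </xsl:function>

  <!-- Named entry point (an assumption); "ruk-" is a made-up stem. -->
  <xsl:template name="main">
    <xsl:value-of select="djb:soften('ruk-E')"/>
  </xsl:template>

</xsl:stylesheet>
```

With the made-up input "ruk-E" this should yield "ruCE". A form that the rule does not match passes through unchanged, hyphen included, so it can still be handed to further rules.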

Finally, there is a slightly less brute-force approach where I would create not just one paradigm of basic endings plus rules to change them in certain circumstances, but several paradigms that already incorporate the changes, and I would look at the last stem consonant or two and select the appropriate paradigm. Is such a "selection" approach more appropriate for this type of problem than the "modification" approach I've been contemplating?
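
For comparison, the "selection" approach might be sketched as follows, assuming each precompiled paradigm carries a regex describing the stem-finals it applies to and the first matching paradigm wins. The <paradigm> and <ending> markup, the trigger attribute, and the function name djb:inflect are invented for illustration, and each paradigm is cut down to a single ending.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://example.org/sandhi"
    exclude-result-prefixes="xs djb">

  <xsl:output method="text"/>

  <!-- Hypothetical paradigms with the sandhi changes already built in.
       Order matters: the catch-all paradigm must come last. -->
  <xsl:variable name="paradigms" as="element(paradigm)+">
    <paradigm trigger="(St|Zd|S|Z|C)$">
      <ending>E</ending>
    </paradigm>
    <paradigm trigger=".*">
      <ending>y</ending>
    </paradigm>
  </xsl:variable>

  <!-- Strip the boundary hyphen, pick the first paradigm whose trigger
       matches the end of the stem, and attach each of its endings. -->
  <xsl:function name="djb:inflect" as="xs:string*">
    <xsl:param name="stem" as="xs:string"/>
    <xsl:variable name="bare" as="xs:string" select="replace($stem, '-$', '')"/>
    <xsl:variable name="chosen" as="element(paradigm)"
                  select="$paradigms[matches($bare, @trigger)][1]"/>
    <xsl:sequence select="for $e in $chosen/ending return concat($bare, $e)"/>
  </xsl:function>

  <!-- Named entry point (an assumption about how the stylesheet is run). -->
  <xsl:template name="main">
    <xsl:value-of select="for $s in ('Zen-', 'duS-') return djb:inflect($s)"
                  separator="&#10;"/>
  </xsl:template>

</xsl:stylesheet>
```

This should again give "Zeny" and "duSE". The trade-off shows up in the data: the function stays trivial, but every ending has to be repeated in each paradigm, whereas the "modification" approach states each ending once and each change once.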

In any case, I'd be grateful for any pointers to an elegant way of expressing this type of rule in XSLT.

Sincerely,

David
djbpitt+xml@xxxxxxxx
