[xsl] Grouping and sorting using custom collation class with Saxon

Subject: [xsl] Grouping and sorting using custom collation class with Saxon
From: Larry Hayashi <lhtrees@xxxxxxxxx>
Date: Tue, 23 Mar 2010 14:09:35 -0700
I have built a custom collation for a language that has a number of
multigraphs. Here is a sample of the sort sequence (minus non-ASCII
characters) from the Java collation class.

	("='-';'=';'*' " + /** -,=,* are used to indicate various types of
affixes and clitics. These should be ignored.*/
	"< a,A " +
   	"< '''a,'''A " + /** 'a,'A*/
	"< aa,Aa " +
	"< b,B " +
	"< c,C " +
	"< d,D " +
	"< dz,Dz " +
	"< e,E " +
	"< '''e,'''E " + /** 'e,'E*/
	"< ee,Ee " +
	"< f,F " +
	"< g,G " +
	"< gw,Gw " +
	"< gy,Gy " +
	"< h,H " +
	"< i,I " +
	"< '''i,'''I " + /** 'i,'I*/
	"< ii,Ii " +
	"< k,K " +
	"< k'''K''' " + /** k',K'*/
	"< kw,Kw " +
	"< ky,Ky " +
	"< k'''w,K'''w " +  /** k'w,K'w */
	"< k'''y,K'''y " +  /** k'y,K'y */
	"< l,L " +
	etc.
	"< '''y,'''Y ")

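For anyone who wants to try the ordering, here is a minimal sketch of
how a rule string like this can be loaded into java.text.RuleBasedCollator
and used to sort headwords. It is not the actual LangXCollation class:
the class name CollationSketch, the reduced rule subset, and the added
letters p, s and t (standing in for the elided "etc." part) are
assumptions for illustration only.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

final class CollationSketch {
    // Reduced rule set; p, s and t are assumed here so that the sample
    // words contain only characters that the rules mention.
    static final String RULES =
        "='-';'=';'*' " +      // affix/clitic markers, ignorable
        "< a,A " +
        "< '''a,'''A " +       // 'a,'A
        "< aa,Aa " +
        "< b,B " +
        "< k,K " +
        "< k''',K''' " +       // k',K'
        "< kw,Kw " +
        "< k'''w,K'''w " +     // k'w,K'w
        "< p,P " +
        "< s,S " +
        "< t,T ";

    static RuleBasedCollator collator() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad collation rules", e);
        }
    }

    // Returns a sorted copy; the input list is left untouched.
    static List<String> sort(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator());
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = List.of(
            "k'was", "kwata", "k'aba", "ksa", "kaba", "'ap", "atata");
        for (String w : sort(words)) {
            System.out.println(w);
        }
    }
}
```
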
Desired output is something like this:

a,A
**********
-ana
atata

'a,'A
**********
'ap
'atata

etc.

k,K
**********
kaba
kopii
ks=
-ks
ksa

k',K'
*********
k'aba
k'ol

kw,Kw
*********
kwduun
kwtaxs

k'w,K'w
*********
k'was
k'wiss
kwiloolag


The source XML structure for each entry looks like this:

<dictionary>
<entry>
    <lexical-unit>
        <form lang="tsi"><text>kaba=</text></form>
    </lexical-unit>
    <trait name="morph-type" value="proclitic"/>
    <sense>
        <grammatical-info value="prenominal"/>
        <gloss lang="en"><text>small</text></gloss>
    </sense>
</entry>
<!-- more entries ... -->
</dictionary>

Any suggestions as to how to most efficiently group the data according
to the parameters of the custom collation?

Currently, I build a regular expression by hand, listing the longest
multigraphs first. Regex alternation tries the alternatives left to
right, so this ordering makes the longest matching multigraph win. I
use this with xsl:analyze-string to add @alphaGroupKey to each entry as
shown below.

 <xsl:attribute name="alphaGroupKey">
   <xsl:analyze-string select="lexical-unit/form[@lang='tsi']/text"
     regex="^[-=]*((aa|Aa)|(a|A)|(kw|Kw)|(ky|Ky)|(k|K)|(a85|a84))"
     default-collation="http://saxon.sf.net/collation?class=com.lhtrees.xslt.LangXCollation;">
      <xsl:matching-substring>
        <xsl:analyze-string select="." regex="[^-=\*]+$">
          <xsl:matching-substring>
            <xsl:value-of select="."/>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:matching-substring>
   </xsl:analyze-string>
 </xsl:attribute>

I can then do the grouping of entries using for-each-group with the
attribute alphaGroupKey.
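
To make the longest-first idea concrete outside XSLT, here is a rough
Java sketch of the same key extraction and grouping. It is an
illustration under assumptions, not my stylesheet code: the class
AlphaGroupKey, its method names, and the short list of multigraphs are
all hypothetical, and keys are simply lowercased.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlphaGroupKey {
    // Builds ^[-=*]*(g1|g2|...) with the longest graphs listed first,
    // so that left-to-right alternation picks the longest multigraph.
    static Pattern buildPattern(Collection<String> graphs) {
        List<String> byLength = new ArrayList<>(graphs);
        byLength.sort((x, y) -> Integer.compare(y.length(), x.length()));
        StringJoiner alternatives = new StringJoiner("|");
        for (String g : byLength) {
            alternatives.add(Pattern.quote(g));
        }
        return Pattern.compile("^[-=*]*(" + alternatives + ")",
                               Pattern.CASE_INSENSITIVE);
    }

    // Returns the lowercased leading graph, or "" if none matches.
    static String groupKey(Pattern p, String headword) {
        Matcher m = p.matcher(headword);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : "";
    }

    // Groups headwords by key, keeping first-seen order of the keys.
    static Map<String, List<String>> group(Pattern p, List<String> words) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String w : words) {
            groups.computeIfAbsent(groupKey(p, w), k -> new ArrayList<>())
                  .add(w);
        }
        return groups;
    }

    public static void main(String[] args) {
        Pattern p = buildPattern(
            List.of("a", "'a", "aa", "k", "k'", "kw", "k'w"));
        List<String> words = List.of(
            "kaba=", "-ks", "k'aba", "kwduun", "k'was", "'ap");
        group(p, words).forEach((key, members) ->
            System.out.println(key + " : " + members));
    }
}
```

The grouping step here only buckets the words; ordering the groups and
their members would still be left to the collation.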

But I am wondering if there is a pre-existing way to use the custom
collation class to do the grouping so I don't need to build the regex
string. It seems like all of the information that is needed is already
in that class.

Larry
