RE: [xsl] Linenumbering & word index

From: "Michael Kay" <mhk@xxxxxxxxx>
Date: Fri, 6 Aug 2004 16:23:00 +0100
> -----Original Message-----
> From: James Cummings [mailto:James.Cummings@xxxxxxxxxxxxxx] 
> Sent: 06 August 2004 14:41
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [xsl] Linenumbering & word index
> 
> On Fri, 6 Aug 2004, David Carlisle wrote:
> 
> > 
> > I lost or forgot the start of this thread so I'll ignore your main
> > questions but I can answer one of the questions in comments
> 
> Right, I'll start from the beginning again then.
> In a document with a lot of poems laid out as:
> <div type="poem">
> <head>headers should be included in word index</head>
> <lg>
> <l>This is a line that really should be included</l>
> <l>This is a line that should be included</l>
> </lg>
> <p>This shouldn't be included</p>
> <lg>
> <l>This is a line that really should be included</l>
> <l>This is a line that should be included</l>
> </lg>
> </div>
> 
> What I want to produce is a word-index of 
> poem number and line number, something like:
> 
> a (4) -- 1:1, 1:2, 1:3, 1:4, 2:3, 2:5 (well, no poem 2 here ;-) )
> be (5) -- 1:head, 1:1, 1:2, 1:3, 1:4
> ...
> really (2) -- 1:1, 1:3, 2:1, 2:3 (if it was in poem 2 as well)

What I was trying to suggest was that you go in two phases:

(a) build a list containing (word, poem number, line number)
(b) group that list by word

and that the output of (a) should be a temporary tree. Sorry if the
reference to position() confused you - I was concentrating on the top-level
design, not the detail.

For example, phase 1 might actually be something like:

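<xsl:variable name="wordlist">
  <xsl:for-each select="//text()">
    <!-- tokenize() returns strings, so keep the text node for xsl:number -->
    <xsl:variable name="t" select="."/>
    <!-- xxx stands for whatever separator regex you choose -->
    <xsl:for-each select="tokenize(., xxx)">
      <word w="{.}">
        <poem><xsl:number select="$t" count="div[@type='poem']"/></poem>
        <line><xsl:number select="$t" count="l" level="any"
                          from="div[@type='poem']"/></line>
      </word>
    </xsl:for-each>
  </xsl:for-each>
</xsl:variable>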
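
Phase (b) then groups that temporary tree by word. Below is a minimal sketch of both phases as one XSLT 2.0 stylesheet. It is untested against your data and rests on several assumptions of mine: only head and l text inside div[@type='poem'] is indexed, the punctuation list and the '\s+' separator are borrowed from your frequency-list stylesheet, words are lower-cased, header words are labelled "head" instead of getting a line number, and the output is plain text. The variable name $src is just illustrative.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Phase (a): one word element per token, recording where it occurred. -->
  <xsl:variable name="wordlist">
    <!-- Assumption: only head and l inside a poem div are indexed. -->
    <xsl:for-each select="//div[@type='poem']/(head | lg/l)">
      <!-- Remember the source element: inside tokenize() the context item
           is a string, so xsl:number needs an explicit node. -->
      <xsl:variable name="src" select="."/>
      <xsl:for-each select="tokenize(lower-case(translate(., ',.!:;', '     ')),
                                     '\s+')[. ne '']">
        <word w="{.}">
          <!-- Poems are numbered in document order. -->
          <poem>
            <xsl:number select="$src" count="div[@type='poem']" level="any"/>
          </poem>
          <line>
            <xsl:choose>
              <xsl:when test="$src/self::head">head</xsl:when>
              <xsl:otherwise>
                <!-- Lines are numbered continuously within each poem. -->
                <xsl:number select="$src" count="l" level="any"
                            from="div[@type='poem']"/>
              </xsl:otherwise>
            </xsl:choose>
          </line>
        </word>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:variable>

  <!-- Phase (b): group the list by word and print word, count, references. -->
  <xsl:template match="/">
    <xsl:for-each-group select="$wordlist/word" group-by="@w">
      <xsl:sort select="current-grouping-key()"/>
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:text> (</xsl:text>
      <xsl:value-of select="count(current-group())"/>
      <xsl:text>) -- </xsl:text>
      <xsl:for-each select="current-group()">
        <xsl:if test="position() gt 1">, </xsl:if>
        <xsl:value-of select="poem"/>:<xsl:value-of select="line"/>
      </xsl:for-each>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

Because the poem and line numbers are computed once in phase (a), phase (b) never has to look back at the source document; it only sorts and groups the word elements. A word that occurs twice in one line will be listed twice for that line, which you may or may not want.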

Michael Kay

> 
> I had previously done word frequency lists as:
> -------
> <xsl:template match="/">
> <xsl:for-each-group 
> select="tokenize(lower-case(string(translate(.,',.!:;',' 
> '))),'\s+')[string(.)]" group-by=".">
>  <xsl:sort />[<xsl:value-of select="."/> - <xsl:value-of 
> select="count(current-group())"/>]
>  </xsl:for-each-group>
>  </xsl:template>
> ------
> 
> And Mike suggested I first build a temporary tree something like:
> <xsl:variable name="words">
> <xsl:for-each select="tokenize(., '\s+')">
> <word value="{.}" position="{position()}"/>
> </xsl:for-each>
> </xsl:variable>
> 
> But I don't see how I a) tokenize only the output of l/text() and
> head/text() (it complains of multiple inputs when I do so), and 
> b) how I get line-number and poem-number based on position()?
> --------------
> My completely messed up xsl so far is:
> <xsl:template match="l/text()">
> <xsl:for-each-group select="$words" group-by=".">
> <xsl:sort/>
> <xsl:value-of select="word/@value"/> --   
> <xsl:for-each select="current-group()">
> <a href="#{concat('poem',@poemnumber,'line',@linenumber)}">
> <xsl:value-of select="@poemnumber"/>:<xsl:value-of
> select="@linenumber"/></a>
> </xsl:for-each>
> </xsl:for-each-group>
> </xsl:template>
> 
> <xsl:variable name="words">
> <xsl:for-each select="tokenize(lower-case(string(translate(.,',.!:;','
>  '))),'\s+')[string(.)]">
> <!-- How do I only match text in 'head' and 'l' elements? -->
> <xsl:variable name="poemnumber">
> <!-- How do I get poem number here?  i.e. xsl:number
>      count="div[@type='poem'] when I was matching 'l' " -->
> </xsl:variable>
> <xsl:variable name="linenumber">
> <!-- How do I get line number here? i.e. xsl:number
>      from="div[@type='poem'] when I was matching 'l'-->
> </xsl:variable>
>  <word value="{.}" litposition="{position()}" poemnumber="$poemnumber"
>        linenumber="$linenumber"/>
> </xsl:for-each>
> </xsl:variable>
> 
> <!-- some of the things I don't want to match -->
> <xsl:template match="teiHeader|foreign|p|milestone|gap" 
> priority="-1" />
> ------------------
> 
> Does that clarify my confuddled state of mind?
> 
> -James
> ---
> Dr James Cummings, Oxford Text Archive, University of Oxford
> James dot Cummings at oucs dot ox dot ac dot uk 
> CALL FOR PAPERS: Digital Medievalism (Kalamazoo) and 
> Early Drama (Leeds) see http://users.ox.ac.uk/~jamesc/cfp.html
