Re: [xsl] faster complicated counting

Subject: Re: [xsl] faster complicated counting
From: Emmanuel Bégué <medusis@xxxxxxxxx>
Date: Thu, 1 Mar 2012 09:43:54 +0100
One way is to compute the respective positions in variables, and then
look them up with keys, so that each position is only computed once.

For example, for the global position, you can add at the top level of
the stylesheet:

<xsl:key name="l" match="l" use="@id"/>

<xsl:variable name="global">
	<xsl:for-each select="//l">
		<l pos="{position()}" id="{generate-id(.)}"/>
	</xsl:for-each>
</xsl:variable>

and then, in each l element, look up the value of wwp:num-global like this:

<xsl:attribute name="wwp:num-global"
	select="key('l', generate-id(.), $global)/@pos"/>
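
Put together, and extended to wwp:num-local, this might look like the
untested sketch below. The "local" variable, its composite lookup
string (lg id, hyphen, line id), and the restriction of both tables to
l elements with no @part or with @part='I' are assumptions added for
illustration; $region is the expression from the original post. Each
table is built in a single pass, so no preceding:: scan is needed for
each line.

<xsl:stylesheet version="2.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:wwp="http://www.wwp.brown.edu/ns/textbase/storage/1.0"
	xmlns="http://www.tei-c.org/ns/1.0"
	xpath-default-namespace="http://www.tei-c.org/ns/1.0">

	<!-- one key serves both tables; the third argument of key()
	     selects which table is searched -->
	<xsl:key name="l" match="l" use="@id"/>

	<!-- global table: one entry per counted line, in document order -->
	<xsl:variable name="global">
		<xsl:for-each select="//l[not(@part) or @part='I']">
			<l pos="{position()}" id="{generate-id(.)}"/>
		</xsl:for-each>
	</xsl:variable>

	<!-- local table (assumed name): one entry per (lg, line) pair,
	     for every lg that holds more than four counted lines -->
	<xsl:variable name="local">
		<xsl:for-each
			select="//lg[count(descendant::l[not(@part) or @part='I']) > 4]">
			<xsl:variable name="lg-id" select="generate-id(.)"/>
			<xsl:for-each select="descendant::l[not(@part) or @part='I']">
				<l pos="{position()}" id="{$lg-id}-{generate-id(.)}"/>
			</xsl:for-each>
		</xsl:for-each>
	</xsl:variable>

	<!-- identity transform for everything else -->
	<xsl:template match="@*|node()">
		<xsl:copy>
			<xsl:apply-templates select="@*|node()"/>
		</xsl:copy>
	</xsl:template>

	<xsl:template match="l">
		<!-- closest ancestor lg with more than four counted lines -->
		<xsl:variable name="region"
			select="ancestor::lg[count(descendant::l[not(@part) or @part='I']) > 4][1]"/>
		<xsl:copy>
			<xsl:attribute name="wwp:num-global"
				select="key('l', generate-id(.), $global)/@pos"/>
			<xsl:attribute name="wwp:num-local"
				select="key('l', concat(generate-id($region), '-', generate-id(.)), $local)/@pos"/>
			<xsl:apply-templates select="@*|node()"/>
		</xsl:copy>
	</xsl:template>

</xsl:stylesheet>

Note that an l with any other @part value, or with no qualifying
ancestor lg, finds no entry and gets an empty attribute in this sketch,
which is not what the original code outputs for those lines. The
regional count could presumably be handled with a third table built the
same way.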

Regards,
EB

2012/2/29 Syd Bauman <Syd_Bauman@xxxxxxxxx>:
> I am working with a relatively small dataset (~ 1 MiB) which uses a
> TEI encoding. In TEI, a line of verse is encoded with an <l> element
> (of which I have just about 306,000), which are grouped into groups
> (like poems or stanzas) using <lg> (for "line group").
>
> In the output of the particular process I am working on now, I'd like
> to adorn each <l> element with three new attributes that indicate the
> count of the current <l> element in various contexts:
>  wwp:num-global   = with respect to the entire document
>  wwp:num-local    = with respect to the current stanza or other
>                     small unit of poetry
>  wwp:num-regional = with respect to the current poem or other
>                     large unit of poetry
>
> So, as a toy example, see tiny.in.xml and tiny.out.xml, below.
>
> I have worked out code that gets me the desired counts. My problem is
> that all the tree-walking it does slows down my process by well over
> an order of magnitude. I am betting there is a much better way to do
> this, probably using keys or <xsl:number>, but have not been able to
> wrap my mind around it.
>
> The English-like pseudo-code for @num-local is "the count in the
> context of the closest ancestor <lg> that itself has > 4 metrical
> lines".
>
> The English-like pseudo-code for @num-regional is "the count in the
> context of the closest ancestor <lg> that has a @type that contains
> "poem" or whose first descendant <l> has n='1'".
>
> Here's what I have (note that we are only counting those <l> elements
> that have an @part of 'I' or do not have a @part attribute at all):
>
>  <xsl:attribute name="wwp:num-global">
>    <xsl:number count="l[not(@part)]|l[@part='I']" level="any"/>
>  </xsl:attribute>
>  <xsl:attribute name="wwp:num-regional">
>    <xsl:variable name="region"
>     select="(ancestor::lg[contains(@type,'poem')] |
>              ancestor::lg[descendant::l[@n eq '1']])[last()]"/>
>    <xsl:value-of
>     select="count((preceding::l[not(@part)] | preceding::l[@part='I'])
>             [ancestor::lg/generate-id() = $region/generate-id()]) + 1"/>
>  </xsl:attribute>
>  <xsl:attribute name="wwp:num-local">
>    <xsl:variable name="region"
>     select="ancestor::lg[count(descendant::l[not(@part) or @part='I']) > 4][1]"/>
>    <xsl:value-of
>     select="count((preceding::l[not(@part)] | preceding::l[@part='I'])
>             [ancestor::lg/generate-id() = $region/generate-id()]) + 1"/>
>  </xsl:attribute>
>
> Thoughts appreciated.
>
> Notes
> -----
> * Yes, I realize that the test above is for *any* descendant <l> with
>  n='1', not the first. We simply don't have any that aren't the
>  first, so I didn't worry about it.
>
> * It's pretty likely we'll change the definition of what is
>  "regional" in the near future, but it probably won't affect the
>  basic problem I'm having. I.e., I'm hoping that if someone shows me
>  how to do this "regional" better, I'll be able to do any future
>  version on my own. Cross your fingers :-)
>
>
> toy input
> --- -----
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <TEI xmlns="http://www.tei-c.org/ns/1.0"
>     xmlns:wwp="http://www.wwp.brown.edu/ns/textbase/storage/1.0">
>  <teiHeader>
>    <!-- blah, blah, blah -->
>  </teiHeader>
>  <text>
>    <body>
>      <lg type="superStructure">
>        <lg type="poem.duck">
>          <l>one</l>
>          <l>two</l>
>          <l>three</l>
>          <l>four</l>
>          <l>five</l>
>          <l>six</l>
>          <l>seven</l>
>          <l>eight</l>
>          <l>nine</l>
>          <l>ten</l>
>        </lg>
>        <lg type="poem.duck">
>          <l>one</l>
>          <l>two</l>
>          <l>three</l>
>          <l>four</l>
>          <lg type="tercet">
>            <l>five</l>
>            <l>six</l>
>            <l>seven</l>
>          </lg>
>          <l>eight</l>
>          <l>nine</l>
>          <l>ten</l>
>        </lg>
>        <lg type="poem.duck">
>          <lg type="stanza">
>            <l>one</l>
>            <l>two</l>
>            <l>three</l>
>            <l>four</l>
>            <l>five</l>
>            <l>six</l>
>            <l>seven</l>
>            <l>eight</l>
>          </lg>
>          <lg type="stanza">
>            <l>nine</l>
>            <l>ten</l>
>            <l>eleven</l>
>            <l>twelve</l>
>            <l>thirteen</l>
>            <l>fourteen</l>
>            <l>fifteen</l>
>            <l>sixteen</l>
>          </lg>
>          <lg type="stanza">
>            <l>seventeen</l>
>            <l>eighteen</l>
>            <l>nineteen</l>
>            <l>twenty</l>
>            <l>twentyone</l>
>            <l>twentytwo</l>
>            <l>twentythree</l>
>            <l>twentyfour</l>
>          </lg>
>        </lg>
>      </lg>
>    </body>
>  </text>
> </TEI>
>
> toy code
> --- ----
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>  xmlns:wwp="http://www.wwp.brown.edu/ns/textbase/storage/1.0"
>  xmlns="http://www.tei-c.org/ns/1.0"
>  xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="2.0">
>
>  <xsl:template match="/">
>    <xsl:text>&#x0A;</xsl:text>
>    <xsl:apply-templates/>
>  </xsl:template>
>  <xsl:template match="@*|text()|processing-instruction()|comment()">
>    <xsl:copy/>
>  </xsl:template>
>  <xsl:template match="*">
>    <xsl:copy>
>      <xsl:apply-templates select="@*|node()"/>
>    </xsl:copy>
>  </xsl:template>
>
>  <xsl:template match="l">
>    <xsl:copy>
>      <xsl:attribute name="wwp:num-global">
>        <xsl:number count="l[not(@part)]|l[@part='I']" level="any"/>
>      </xsl:attribute>
>      <xsl:attribute name="wwp:num-regional">
>        <xsl:variable name="region"
>          select="(ancestor::lg[contains(@type,'poem')] |
>                   ancestor::lg[descendant::l[@n eq '1']])[last()]"/>
>        <xsl:value-of
>          select="count((preceding::l[not(@part)] | preceding::l[@part='I'])
>                  [ancestor::lg/generate-id() = $region/generate-id()]) + 1"/>
>      </xsl:attribute>
>      <xsl:attribute name="wwp:num-local">
>        <xsl:variable name="region"
>          select="ancestor::lg[count(descendant::l[not(@part) or @part='I']) > 4][1]"/>
>        <xsl:value-of
>          select="count((preceding::l[not(@part)] | preceding::l[@part='I'])
>                  [ancestor::lg/generate-id() = $region/generate-id()]) + 1"/>
>      </xsl:attribute>
>      <xsl:apply-templates select="@*|node()"/>
>    </xsl:copy>
>  </xsl:template>
>
> </xsl:stylesheet>
>
> toy output
> --- ------
> <?xml version="1.0" encoding="UTF-8"?>
> <TEI xmlns="http://www.tei-c.org/ns/1.0"
>     xmlns:wwp="http://www.wwp.brown.edu/ns/textbase/storage/1.0">
>  <teiHeader>
>    <!-- blah, blah, blah -->
>  </teiHeader>
>  <text>
>    <body>
>      <lg type="superStructure">
>        <lg type="poem.duck">
>          <l wwp:num-global="1" wwp:num-regional="1" wwp:num-local="1">one</l>
>          <l wwp:num-global="2" wwp:num-regional="2" wwp:num-local="2">two</l>
>          <l wwp:num-global="3" wwp:num-regional="3" wwp:num-local="3">three</l>
>          <l wwp:num-global="4" wwp:num-regional="4" wwp:num-local="4">four</l>
>          <l wwp:num-global="5" wwp:num-regional="5" wwp:num-local="5">five</l>
>          <l wwp:num-global="6" wwp:num-regional="6" wwp:num-local="6">six</l>
>          <l wwp:num-global="7" wwp:num-regional="7" wwp:num-local="7">seven</l>
>          <l wwp:num-global="8" wwp:num-regional="8" wwp:num-local="8">eight</l>
>          <l wwp:num-global="9" wwp:num-regional="9" wwp:num-local="9">nine</l>
>          <l wwp:num-global="10" wwp:num-regional="10" wwp:num-local="10">ten</l>
>        </lg>
>        <lg type="poem.duck">
>          <l wwp:num-global="11" wwp:num-regional="1" wwp:num-local="1">one</l>
>          <l wwp:num-global="12" wwp:num-regional="2" wwp:num-local="2">two</l>
>          <l wwp:num-global="13" wwp:num-regional="3" wwp:num-local="3">three</l>
>          <l wwp:num-global="14" wwp:num-regional="4" wwp:num-local="4">four</l>
>          <lg type="tercet">
>            <l wwp:num-global="15" wwp:num-regional="5" wwp:num-local="5">five</l>
>            <l wwp:num-global="16" wwp:num-regional="6" wwp:num-local="6">six</l>
>            <l wwp:num-global="17" wwp:num-regional="7" wwp:num-local="7">seven</l>
>          </lg>
>          <l wwp:num-global="18" wwp:num-regional="8" wwp:num-local="8">eight</l>
>          <l wwp:num-global="19" wwp:num-regional="9" wwp:num-local="9">nine</l>
>          <l wwp:num-global="20" wwp:num-regional="10" wwp:num-local="10">ten</l>
>        </lg>
>        <lg type="poem.duck">
>          <lg type="stanza">
>            <l wwp:num-global="21" wwp:num-regional="1" wwp:num-local="1">one</l>
>            <l wwp:num-global="22" wwp:num-regional="2" wwp:num-local="2">two</l>
>            <l wwp:num-global="23" wwp:num-regional="3" wwp:num-local="3">three</l>
>            <l wwp:num-global="24" wwp:num-regional="4" wwp:num-local="4">four</l>
>            <l wwp:num-global="25" wwp:num-regional="5" wwp:num-local="5">five</l>
>            <l wwp:num-global="26" wwp:num-regional="6" wwp:num-local="6">six</l>
>            <l wwp:num-global="27" wwp:num-regional="7" wwp:num-local="7">seven</l>
>            <l wwp:num-global="28" wwp:num-regional="8" wwp:num-local="8">eight</l>
>          </lg>
>          <lg type="stanza">
>            <l wwp:num-global="29" wwp:num-regional="9" wwp:num-local="1">nine</l>
>            <l wwp:num-global="30" wwp:num-regional="10" wwp:num-local="2">ten</l>
>            <l wwp:num-global="31" wwp:num-regional="11" wwp:num-local="3">eleven</l>
>            <l wwp:num-global="32" wwp:num-regional="12" wwp:num-local="4">twelve</l>
>            <l wwp:num-global="33" wwp:num-regional="13" wwp:num-local="5">thirteen</l>
>            <l wwp:num-global="34" wwp:num-regional="14" wwp:num-local="6">fourteen</l>
>            <l wwp:num-global="35" wwp:num-regional="15" wwp:num-local="7">fifteen</l>
>            <l wwp:num-global="36" wwp:num-regional="16" wwp:num-local="8">sixteen</l>
>          </lg>
>          <lg type="stanza">
>            <l wwp:num-global="37" wwp:num-regional="17" wwp:num-local="1">seventeen</l>
>            <l wwp:num-global="38" wwp:num-regional="18" wwp:num-local="2">eighteen</l>
>            <l wwp:num-global="39" wwp:num-regional="19" wwp:num-local="3">nineteen</l>
>            <l wwp:num-global="40" wwp:num-regional="20" wwp:num-local="4">twenty</l>
>            <l wwp:num-global="41" wwp:num-regional="21" wwp:num-local="5">twentyone</l>
>            <l wwp:num-global="42" wwp:num-regional="22" wwp:num-local="6">twentytwo</l>
>            <l wwp:num-global="43" wwp:num-regional="23" wwp:num-local="7">twentythree</l>
>            <l wwp:num-global="44" wwp:num-regional="24" wwp:num-local="8">twentyfour</l>
>          </lg>
>        </lg>
>      </lg>
>    </body>
>  </text>
> </TEI>
