Re: [xsl] Grouping elements that have at least one common value

Subject: Re: [xsl] Grouping elements that have at least one common value
From: "Michael Kay michaelkay90@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 26 Jun 2023 09:30:47 -0000
> I also implemented Michael's  transitive closure suggestion :
>
> It works fine, but it's quite long to process. On little use-case files it's
Ok but it doesn't go to the end (after more than 15min) on my real big file
(about 33 000 GRCHOIX)
> Maybe it has to do with the function els:transitive-closure which iterates
from 1 to 100 000 000

The iteration should terminate as soon as no more elements are found, so the
upper limit shouldn't make any difference provided it is high enough. You
could add an error check to make sure that it never gets anywhere near that
(e.g xsl:assert test=". lt 1000").

I can't see any obvious error in your code; I haven't tried to investigate it
myself. Profiling it using the Saxon -TP option might yield insights; also
plotting how the elapsed time increases as the input size increases.

It's possible that the hot-spot is evaluation of the "except" operator between
large node-sets. The performance of "except" is going to be O(n long n) in the
size of the node-sets, which means (especially if the groups are small) that
the overall performance will be quadratic, or slightly worse. A better
solution might be to maintain a map whose keys are the generate-id values (or
index positions) of the nodes that have been processed; on each iteration
select the first node whose generate-id is not in that map.

Michael Kay
Saxonica

>  <xsl:template name="els:process">
>     <xsl:param name="GRCHOIX" as="element(GRCHOIX)*"/>
>     <xsl:if test="not(empty($GRCHOIX))">
>       <xsl:variable name="start-node" select="$GRCHOIX[1]"
as="element()?"/>
>       <xsl:variable name="start-node.transitive-closure"
select="els:transitive-closure($start-node)"/>
>       <GROUP>
>         <xsl:sequence select="$start-node.transitive-closure"/>
>       </GROUP>
>       <xsl:call-template name="els:process">
>         <xsl:with-param name="GRCHOIX" select="$GRCHOIX except
$start-node.transitive-closure"/>
>       </xsl:call-template>
>     </xsl:if>
>   </xsl:template>

> and is recursively called on each GRCHOIX.
> Maybe my implementation is not efficient or wrong ? Perhaps it can be fine
tuned to something less greedy ?
> Maybe 100 000 000  is too big, and I could adapt it to the size of my nodes.
Any thoughts?
>
> On the same input as in previous mail I get this correct output :
> <FORMS>
>    <GROUP>
>       <GRCHOIX>
>          <CHOIX CODE="choix-10"/>
>          <CHOIX CODE="choix-11"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-12"/>
>          <CHOIX CODE="choix-14"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-12"/>
>          <CHOIX CODE="choix-15"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-10"/>
>          <CHOIX CODE="choix-13"/>
>          <CHOIX CODE="choix-18"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-11"/>
>          <CHOIX CODE="choix-16"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-12"/>
>          <CHOIX CODE="choix-16"/>
>       </GRCHOIX>
>    </GROUP>
>    <GROUP>
>       <GRCHOIX>
>          <CHOIX CODE="choix-2"/>
>          <CHOIX CODE="choix-8"/>
>       </GRCHOIX>
>    </GROUP>
>    <GROUP>
>       <GRCHOIX>
>          <CHOIX CODE="choix-3"/>
>          <CHOIX CODE="choix-5"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-22"/>
>          <CHOIX CODE="choix-3"/>
>       </GRCHOIX>
>    </GROUP>
> </FORMS>
>
> This is my implementation :
>
> <xsl:stylesheet
>   xmlns:xsl="http://www.w3.org/1999/XSL/Transform
<http://www.w3.org/1999/XSL/Transform>"
>   xmlns:xs="http://www.w3.org/2001/XMLSchema
<http://www.w3.org/2001/XMLSchema>"
>   xmlns:els="http://www.lefebvre-sarrut.eu/ns/els
<http://www.lefebvre-sarrut.eu/ns/els>"
>   exclude-result-prefixes="#all"
>   version="3.0">
>
>   <xsl:key name="getGrchoixbyChoixCode" match="GRCHOIX" use="CHOIX/@CODE"/>
>
>   <xsl:variable name="root" select="/" as="document-node()"/>
>
>   <xsl:template match="/FORMS">
>     <xsl:copy>
>       <xsl:call-template name="els:process">
>         <xsl:with-param name="GRCHOIX" as="element()*" select="GRCHOIX"/>
>       </xsl:call-template>
>     </xsl:copy>
>   </xsl:template>
>
>   <xsl:template name="els:process">
>     <xsl:param name="GRCHOIX" as="element()*"/>
>     <xsl:if test="not(empty($GRCHOIX))">
>       <xsl:variable name="start-node" select="$GRCHOIX[1]"
as="element()?"/>
>       <xsl:variable name="start-node.transitive-closure"
select="els:transitive-closure($start-node)"/>
>       <GROUP>
>         <xsl:sequence select="$start-node.transitive-closure"/>
>       </GROUP>
>       <xsl:call-template name="els:process">
>         <xsl:with-param name="GRCHOIX" select="$GRCHOIX except
$start-node.transitive-closure"/>
>       </xsl:call-template>
>     </xsl:if>
>   </xsl:template>
>
>   <xsl:function name="els:step" as="element(GRCHOIX)*">
>     <xsl:param name="start" as="element(GRCHOIX)"/>
>     <xsl:sequence select="key('getGrchoixbyChoixCode', $start/CHOIX/@CODE,
$root)"/>
>   </xsl:function>
>
>   <xsl:function name="els:transitive-closure" as="node()*">
>     <xsl:param name="start-node" as="node()"/>
>     <xsl:iterate select="1 to 100000000">
>       <xsl:param name="result" as="node()*" select="()"/>
>       <xsl:param name="origin" as="node()*" select="$start-node"/>
>       <xsl:variable name="next-nodes" select="($origin ! els:step(.)) except
$result"/>
>       <xsl:choose>
>         <xsl:when test="empty($next-nodes)">
>           <xsl:sequence select="$result"/>
>           <xsl:break/>
>         </xsl:when>
>         <xsl:otherwise>
>           <xsl:next-iteration>
>             <xsl:with-param name="result" select="$result | $next-nodes"/>
>             <xsl:with-param name="origin" select="$next-nodes"/>
>           </xsl:next-iteration>
>         </xsl:otherwise>
>       </xsl:choose>
>     </xsl:iterate>
>   </xsl:function>
>
> </xsl:stylesheet>
>
> Joel, I'll also give a closer look to your TANs functions. We are providing
a method to deliver Maven artifacts of XSLT project (see.
https://www.youtube.com/watch?v=EJGjYQ1XAGk
<https://www.youtube.com/watch?v=EJGjYQ1XAGk>). It's easy to do, keep in touch
if you go to this solution.
> Yes Michael pointed out the XSLT 4.0 new feature, but as said, I need it
right now ;)
>
> Cheers
> Matthieu
>
> Le lun. 26 juin 2023 C  08:06, Matthieu Ricaud-Dussarget ricaudm@xxxxxxxxx
<mailto:ricaudm@xxxxxxxxx> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx
<mailto:xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>> a C)crit :
> Hi all,
>
> I go ahead with Martin's solution and have implemented all business rules
around that "grouping".
> It's fast but I realized that it can generate duplicated groups on my big
file, which is quite a problem (some people in my company will have to spend
avec 60 days working on that output as an Excel file)
>
> It's not that easy to reproduce but for example when I have this input :
> <FORMS>
>     <GRCHOIX>
>         <CHOIX CODE="choix-10"/>
>         <CHOIX CODE="choix-11"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-12"/>
>         <CHOIX CODE="choix-14"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-12"/>
>         <CHOIX CODE="choix-15"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-2"/>
>         <CHOIX CODE="choix-8"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-3"/>
>         <CHOIX CODE="choix-5"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-22"/>
>         <CHOIX CODE="choix-3"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-10"/>
>         <CHOIX CODE="choix-13"/>
>         <CHOIX CODE="choix-18"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-11"/>
>         <CHOIX CODE="choix-16"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-12"/>
>         <CHOIX CODE="choix-16"/>
>     </GRCHOIX>
> </FORMS>
>
> The output had duplicated GROUP "choix-10/choix-13/choix-18".
>
> <FORMS>
>    <GROUP>
>       <GRCHOIX>
>          <CHOIX CODE="choix-10"/>
>          <CHOIX CODE="choix-11"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-12"/>
>          <CHOIX CODE="choix-14"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-12"/>
>          <CHOIX CODE="choix-15"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-10"/>
>          <CHOIX CODE="choix-13"/>
>          <CHOIX CODE="choix-18"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-11"/>
>          <CHOIX CODE="choix-16"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-12"/>
>          <CHOIX CODE="choix-16"/>
>       </GRCHOIX>
>    </GROUP>
>    <GROUP>
>       <GRCHOIX>
>          <CHOIX CODE="choix-2"/>
>          <CHOIX CODE="choix-8"/>
>       </GRCHOIX>
>    </GROUP>
>    <GROUP>
>       <GRCHOIX>
>          <CHOIX CODE="choix-3"/>
>          <CHOIX CODE="choix-5"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-22"/>
>          <CHOIX CODE="choix-3"/>
>       </GRCHOIX>
>    </GROUP>
>    <GROUP>
>       <GRCHOIX>
>          <CHOIX CODE="choix-10"/>
>          <CHOIX CODE="choix-13"/>
>          <CHOIX CODE="choix-18"/>
>       </GRCHOIX>
>    </GROUP>
>    <GROUP/>
> </FORMS>
>
> I tried to figure out why, but there is something I don't understand in the
algorithm :
> - xsl:iterate make an iteration on $groups elements.
> - when going into the xsl:otherwise it creates a <GROUP> output.
> Does the "grouping" strategy depends on element order ? Or the same GROUP
element might still be fed at the next iteration (unlike xsl:for-each ?)
>
> I also give a try to Michael transitive closure algorithm (see next mail)
>
> Cheers
> Matthieu
>
> Le lun. 19 juin 2023 C  22:44, Joel Kalvesmaki director@xxxxxxxxxxxxx
<mailto:director@xxxxxxxxxxxxx> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx
<mailto:xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>> a C)crit :
> Hi Matthieu,
>
> Currently TAN is a static download, either through github or the
> website. Making it available through package repos is a future to-do
> item, as well as better organization into subpackages and breaking out
> dependencies. The license was designed to encourage other developers to
> develop their own variations on the code, as needed.
>
> A new function proposed for XPath 4.0, currently transitive-closure()
> (name under discussion, https://github.com/qt4cg/qtspecs/issues/554
<https://github.com/qt4cg/qtspecs/issues/554>), is
> likely to make this task more tractable, and concisely expressed.
>
> Best wishes,
>
> jk
>
>
> On 2023-06-19 01:38, Matthieu Ricaud-Dussarget ricaudm@xxxxxxxxx
<mailto:ricaudm@xxxxxxxxx> wrote:
> > Hi Joel,
> >
> > Thanks for the link to Tan library. I'm not sure I can use it for my
> > purpose, because it groups the text content of child nodes. But I
> > guess I could adapt my input or the function code.
> >
> > BTW it looks like TAN functions use a lot of other TAN functions,
> > which means I should get the whole TAN lib to make it work on my
> > project.
> >
> > How is it distributed ? Using http might probably work but it's not
> > that safe when running on a server of my company that might not be
> > connected to the internet (or with proxy restrictions for example). Is
> > TAN library published as a Maven artifact of something like that ?
> >
> > Anyway Martin's solution works really fine and performances are really
> > good so I guess I will stay on this solution for my project.
> >
> > Thanks again Martin !
> >
> > Now I have to deal with business rules around this "grouping" :)
> >
> > Thank you all for your time,
> >
> > Cheers
> >
> > Matthieu
> >
> > Le ven. 16 juin 2023 C  16:54, Joel Kalvesmaki director@xxxxxxxxxxxxx
<mailto:director@xxxxxxxxxxxxx>
> > <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx
<mailto:xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>> a C)crit :
> >
> >> Hi Matthieu,
> >>
> >> You may want to look at tan:group-elements-by-shared-node-values().
> >>
> >> Overview:
> >>
> >
https://textalign.net/release/TAN-2021/guidelines/xhtml/ch13s02.xhtml#functio
n-group-elements-by-shared-node-values
<https://textalign.net/release/TAN-2021/guidelines/xhtml/ch13s02.xhtml#functi
on-group-elements-by-shared-node-values>
> >>
> >> Code (starting line 272):
> >>
> >
https://github.com/textalign/TAN-2021/blob/master/functions/nodes/TAN-fn-node
s-standard.xsl
<https://github.com/textalign/TAN-2021/blob/master/functions/nodes/TAN-fn-nod
es-standard.xsl>
> >>
> >> Joel
> >>
> >> On 2023-06-16 05:09, Matthieu Ricaud-Dussarget ricaudm@xxxxxxxxx
<mailto:ricaudm@xxxxxxxxx>
> >> wrote:
> >>> Hi all,
> >>>
> >>> I need to group elements that have at least one common value :
> >>>
> >>> <FORMS>
> >>>
>
> --
> Joel Kalvesmaki
> Director, Text Alignment Network
> http://textalign.net <http://textalign.net/>
>
>
>
>
> --
> Matthieu Ricaud-Dussarget
> +33 6.63.25.95.58
> XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
> EasyUnsubscribe <http://lists.mulberrytech.com/unsub/xsl-list/3423379> (by
email <applewebdata://F752BB9E-F2AC-4FFA-B1CE-F42EF0640B24>)
>
>
> --
> Matthieu Ricaud-Dussarget
> +33 6.63.25.95.58
> XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
> EasyUnsubscribe <http://lists.mulberrytech.com/unsub/xsl-list/3500899> (by
email <>)

Current Thread