Re: [xsl] Grouping elements that have at least one common value

Subject: Re: [xsl] Grouping elements that have at least one common value
From: "Matthieu Ricaud-Dussarget ricaudm@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 26 Jun 2023 06:27:07 -0000
I also implemented Michael's  transitive closure suggestion :

It works fine, but it's quite long to process. On little use-case files
it's Ok but it doesn't go to the end (after more than 15min) on my real big
file (about 33 000 GRCHOIX)
Maybe it has to do with the function els:transitive-closure which iterates
from 1 to 100 000 000 and is recursively called on each GRCHOIX.
Maybe my implementation is not efficient or wrong ? Perhaps it can be fine
tuned to something less greedy ?
Maybe 100 000 000  is too big, and I could adapt it to the size of my
nodes. Any thoughts?

On the same input as in previous mail I get this correct output :
<FORMS>
   <GROUP>
      <GRCHOIX>
         <CHOIX CODE="choix-10"/>
         <CHOIX CODE="choix-11"/>
      </GRCHOIX>
      <GRCHOIX>
         <CHOIX CODE="choix-12"/>
         <CHOIX CODE="choix-14"/>
      </GRCHOIX>
      <GRCHOIX>
         <CHOIX CODE="choix-12"/>
         <CHOIX CODE="choix-15"/>
      </GRCHOIX>
      <GRCHOIX>
         <CHOIX CODE="choix-10"/>
         <CHOIX CODE="choix-13"/>
         <CHOIX CODE="choix-18"/>
      </GRCHOIX>
      <GRCHOIX>
         <CHOIX CODE="choix-11"/>
         <CHOIX CODE="choix-16"/>
      </GRCHOIX>
      <GRCHOIX>
         <CHOIX CODE="choix-12"/>
         <CHOIX CODE="choix-16"/>
      </GRCHOIX>
   </GROUP>
   <GROUP>
      <GRCHOIX>
         <CHOIX CODE="choix-2"/>
         <CHOIX CODE="choix-8"/>
      </GRCHOIX>
   </GROUP>
   <GROUP>
      <GRCHOIX>
         <CHOIX CODE="choix-3"/>
         <CHOIX CODE="choix-5"/>
      </GRCHOIX>
      <GRCHOIX>
         <CHOIX CODE="choix-22"/>
         <CHOIX CODE="choix-3"/>
      </GRCHOIX>
   </GROUP>
</FORMS>

This is my implementation :

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform";
  xmlns:xs="http://www.w3.org/2001/XMLSchema";
  xmlns:els="http://www.lefebvre-sarrut.eu/ns/els";
  exclude-result-prefixes="#all"
  version="3.0">

  <xsl:key name="getGrchoixbyChoixCode" match="GRCHOIX" use="CHOIX/@CODE"/>

  <xsl:variable name="root" select="/" as="document-node()"/>

  <xsl:template match="/FORMS">
    <xsl:copy>
      <xsl:call-template name="els:process">
        <xsl:with-param name="GRCHOIX" as="element()*" select="GRCHOIX"/>
      </xsl:call-template>
    </xsl:copy>
  </xsl:template>

  <xsl:template name="els:process">
    <xsl:param name="GRCHOIX" as="element()*"/>
    <xsl:if test="not(empty($GRCHOIX))">
      <xsl:variable name="start-node" select="$GRCHOIX[1]" as="element()?"/>
      <xsl:variable name="start-node.transitive-closure"
select="els:transitive-closure($start-node)"/>
      <GROUP>
        <xsl:sequence select="$start-node.transitive-closure"/>
      </GROUP>
      <xsl:call-template name="els:process">
        <xsl:with-param name="GRCHOIX" select="$GRCHOIX except
$start-node.transitive-closure"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <xsl:function name="els:step" as="element(GRCHOIX)*">
    <xsl:param name="start" as="element(GRCHOIX)"/>
    <xsl:sequence select="key('getGrchoixbyChoixCode', $start/CHOIX/@CODE,
$root)"/>
  </xsl:function>

  <xsl:function name="els:transitive-closure" as="node()*">
    <xsl:param name="start-node" as="node()"/>
    <xsl:iterate select="1 to 100000000">
      <xsl:param name="result" as="node()*" select="()"/>
      <xsl:param name="origin" as="node()*" select="$start-node"/>
      <xsl:variable name="next-nodes" select="($origin ! els:step(.))
except $result"/>
      <xsl:choose>
        <xsl:when test="empty($next-nodes)">
          <xsl:sequence select="$result"/>
          <xsl:break/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:next-iteration>
            <xsl:with-param name="result" select="$result | $next-nodes"/>
            <xsl:with-param name="origin" select="$next-nodes"/>
          </xsl:next-iteration>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:iterate>
  </xsl:function>

</xsl:stylesheet>

Joel, I'll also give a closer look to your TANs functions. We are providing
a method to deliver Maven artifacts of XSLT project (see.
https://www.youtube.com/watch?v=EJGjYQ1XAGk). It's easy to do, keep in
touch if you go to this solution.
Yes Michael pointed out the XSLT 4.0 new feature, but as said, I need it
right now ;)

Cheers
Matthieu

Le lun. 26 juin 2023 C  08:06, Matthieu Ricaud-Dussarget ricaudm@xxxxxxxxx <
xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> a C)crit :

> Hi all,
>
> I go ahead with Martin's solution and have implemented all business rules
> around that "grouping".
> It's fast but I realized that it can generate duplicated groups on my big
> file, which is quite a problem (some people in my company will have to
> spend avec 60 days working on that output as an Excel file)
>
> It's not that easy to reproduce but for example when I have this input :
> <FORMS>
>     <GRCHOIX>
>         <CHOIX CODE="choix-10"/>
>         <CHOIX CODE="choix-11"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-12"/>
>         <CHOIX CODE="choix-14"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-12"/>
>         <CHOIX CODE="choix-15"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-2"/>
>         <CHOIX CODE="choix-8"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-3"/>
>         <CHOIX CODE="choix-5"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-22"/>
>         <CHOIX CODE="choix-3"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-10"/>
>         <CHOIX CODE="choix-13"/>
>         <CHOIX CODE="choix-18"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-11"/>
>         <CHOIX CODE="choix-16"/>
>     </GRCHOIX>
>     <GRCHOIX>
>         <CHOIX CODE="choix-12"/>
>         <CHOIX CODE="choix-16"/>
>     </GRCHOIX>
> </FORMS>
>
> The output had duplicated GROUP "choix-10/choix-13/choix-18".
>
> <FORMS>
>    <GROUP>
>       <GRCHOIX>
>          <CHOIX CODE="choix-10"/>
>          <CHOIX CODE="choix-11"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-12"/>
>          <CHOIX CODE="choix-14"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-12"/>
>          <CHOIX CODE="choix-15"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-10"/>
>          <CHOIX CODE="choix-13"/>
>          <CHOIX CODE="choix-18"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-11"/>
>          <CHOIX CODE="choix-16"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-12"/>
>          <CHOIX CODE="choix-16"/>
>       </GRCHOIX>
>    </GROUP>
>    <GROUP>
>       <GRCHOIX>
>          <CHOIX CODE="choix-2"/>
>          <CHOIX CODE="choix-8"/>
>       </GRCHOIX>
>    </GROUP>
>    <GROUP>
>       <GRCHOIX>
>          <CHOIX CODE="choix-3"/>
>          <CHOIX CODE="choix-5"/>
>       </GRCHOIX>
>       <GRCHOIX>
>          <CHOIX CODE="choix-22"/>
>          <CHOIX CODE="choix-3"/>
>       </GRCHOIX>
>    </GROUP>
>    <GROUP>
>       <GRCHOIX>
>          <CHOIX CODE="choix-10"/>
>          <CHOIX CODE="choix-13"/>
>          <CHOIX CODE="choix-18"/>
>       </GRCHOIX>
>    </GROUP>
>    <GROUP/>
> </FORMS>
>
> I tried to figure out why, but there is something I don't understand in
> the algorithm :
> - xsl:iterate make an iteration on $groups elements.
> - when going into the xsl:otherwise it creates a <GROUP> output.
> Does the "grouping" strategy depends on element order ? Or the same GROUP
> element might still be fed at the next iteration (unlike xsl:for-each ?)
>
> I also give a try to Michael transitive closure algorithm (see next mail)
>
> Cheers
> Matthieu
>
> Le lun. 19 juin 2023 C  22:44, Joel Kalvesmaki director@xxxxxxxxxxxxx <
> xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> a C)crit :
>
>> Hi Matthieu,
>>
>> Currently TAN is a static download, either through github or the
>> website. Making it available through package repos is a future to-do
>> item, as well as better organization into subpackages and breaking out
>> dependencies. The license was designed to encourage other developers to
>> develop their own variations on the code, as needed.
>>
>> A new function proposed for XPath 4.0, currently transitive-closure()
>> (name under discussion, https://github.com/qt4cg/qtspecs/issues/554), is
>> likely to make this task more tractable, and concisely expressed.
>>
>> Best wishes,
>>
>> jk
>>
>>
>> On 2023-06-19 01:38, Matthieu Ricaud-Dussarget ricaudm@xxxxxxxxx wrote:
>> > Hi Joel,
>> >
>> > Thanks for the link to Tan library. I'm not sure I can use it for my
>> > purpose, because it groups the text content of child nodes. But I
>> > guess I could adapt my input or the function code.
>> >
>> > BTW it looks like TAN functions use a lot of other TAN functions,
>> > which means I should get the whole TAN lib to make it work on my
>> > project.
>> >
>> > How is it distributed ? Using http might probably work but it's not
>> > that safe when running on a server of my company that might not be
>> > connected to the internet (or with proxy restrictions for example). Is
>> > TAN library published as a Maven artifact of something like that ?
>> >
>> > Anyway Martin's solution works really fine and performances are really
>> > good so I guess I will stay on this solution for my project.
>> >
>> > Thanks again Martin !
>> >
>> > Now I have to deal with business rules around this "grouping" :)
>> >
>> > Thank you all for your time,
>> >
>> > Cheers
>> >
>> > Matthieu
>> >
>> > Le ven. 16 juin 2023 C  16:54, Joel Kalvesmaki director@xxxxxxxxxxxxx
>> > <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> a C)crit :
>> >
>> >> Hi Matthieu,
>> >>
>> >> You may want to look at tan:group-elements-by-shared-node-values().
>> >>
>> >> Overview:
>> >>
>> >
>>
https://textalign.net/release/TAN-2021/guidelines/xhtml/ch13s02.xhtml#functio
n-group-elements-by-shared-node-values
>> >>
>> >> Code (starting line 272):
>> >>
>> >
>>
https://github.com/textalign/TAN-2021/blob/master/functions/nodes/TAN-fn-node
s-standard.xsl
>> >>
>> >> Joel
>> >>
>> >> On 2023-06-16 05:09, Matthieu Ricaud-Dussarget ricaudm@xxxxxxxxx
>> >> wrote:
>> >>> Hi all,
>> >>>
>> >>> I need to group elements that have at least one common value :
>> >>>
>> >>> <FORMS>
>> >>>
>>
>> --
>> Joel Kalvesmaki
>> Director, Text Alignment Network
>> http://textalign.net
>>
>>
>>
>
> --
> Matthieu Ricaud-Dussarget
> +33 6.63.25.95.58
> XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
> EasyUnsubscribe <http://lists.mulberrytech.com/unsub/xsl-list/3423379> (by
> email <>)
>


--
Matthieu Ricaud-Dussarget
+33 6.63.25.95.58

Current Thread