Re: [xsl] tokenizing and counting with xsl:analyze-string

Subject: Re: [xsl] tokenizing and counting with xsl:analyze-string
From: "Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sat, 17 Oct 2020 09:39:05 -0000
If you're really keen to avoid putting temporary results in memory, then with
Saxon, I think you can do:

     <xsl:variable name="temp_result" as="xs:boolean*">
          <xsl:analyze-string select="'abhello1cdehello2fghijklhello3hello4mhello5nhello6'"
                              regex="hello[1-9]">
             <xsl:matching-substring>
                <xsl:sequence select="true()"/>
             </xsl:matching-substring>
             <xsl:non-matching-substring>
                <xsl:sequence select="false()"/>
             </xsl:non-matching-substring>
          </xsl:analyze-string>
      </xsl:variable>
      <xsl:iterate select="$temp_result">
          <xsl:param name="m" select="0" as="xs:integer"/>
          <xsl:param name="n" select="0" as="xs:integer"/>
          <xsl:on-completion>
             <result>
                 <yes count="{$m}"/>
                 <no count="{$n}"/>
             </result>
         </xsl:on-completion>
         <xsl:next-iteration>
             <xsl:with-param name="m" select="$m + xs:integer(.)"/>
             <xsl:with-param name="n" select="$n + xs:integer(not(.))"/>
         </xsl:next-iteration>
   </xsl:iterate>

This relies on the fact that Saxon will always try to inline a variable that's
only referenced once; and if the variable is a sequence, this means that the
value will be pipelined rather than being materialized in memory. For a
sequence containing a few dozen booleans, that's not going to give any
bottom-line savings. But if the sequence contains millions of items, it
might.

The `xsl:iterate` could also be replaced with a fold:

<xsl:variable name="counts"
              select="fold-left($temp_result,
                                map{true():0, false():0},
                                function($val, $next){map:put($val, $next, $val($next)+1)})"
              as="map(xs:boolean, xs:integer)"/>
<result>
    <yes count="{$counts(true())}"/>
    <no count="{$counts(false())}"/>
</result>
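
For anyone who wants to try the fold version, here is a minimal sketch of a complete stylesheet built around it. The `map` prefix has to be bound to the standard XPath map-functions namespace for `map:put` to resolve. The parameter names `$acc` and `$flag` are only illustrative renamings of `$val` and `$next`, and the `match="/"` template is borrowed from the stylesheet quoted further down. It assumes a processor that supports maps and higher-order functions.

```xml
<xsl:stylesheet version="3.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:map="http://www.w3.org/2005/xpath-functions/map"
                exclude-result-prefixes="xs map">

   <xsl:output method="xml" indent="yes"/>

   <xsl:template match="/">
      <!-- One boolean per substring: true() for a match, false() otherwise -->
      <xsl:variable name="temp_result" as="xs:boolean*">
         <xsl:analyze-string select="'abhello1cdehello2fghijklhello3hello4mhello5nhello6'"
                             regex="hello[1-9]">
            <xsl:matching-substring>
               <xsl:sequence select="true()"/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
               <xsl:sequence select="false()"/>
            </xsl:non-matching-substring>
         </xsl:analyze-string>
      </xsl:variable>

      <!-- Fold the booleans into a map, incrementing the counter stored under each key -->
      <xsl:variable name="counts" as="map(xs:boolean, xs:integer)"
                    select="fold-left($temp_result,
                                      map { true(): 0, false(): 0 },
                                      function($acc, $flag) {
                                         map:put($acc, $flag, $acc($flag) + 1)
                                      })"/>
      <result>
         <yes count="{$counts(true())}"/>
         <no count="{$counts(false())}"/>
      </result>
   </xsl:template>

</xsl:stylesheet>
```

Run against any input document, this should report 6 for `yes` and 5 for `no`: the sample string contains six matches, and the five non-empty stretches between them (`ab`, `cde`, `fghijkl`, `m`, `n`) are the non-matching substrings.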

> On 17 Oct 2020, at 10:14, Michael Kay mike@xxxxxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> You can construct a sequence of booleans, in which case you should use
> <xsl:sequence select="true()"/> in place of <xsl:value-of select="1"/>, and
> then you can use `count($temp_result[.])` and `count($temp_result[not(.)])` to
> count the number of true and false items respectively.
>
> If you want to construct the variable as a single string, you can use
> xsl:value-of as I suggested, but then you must declare the variable
> as="xs:string". But using a sequence of booleans is probably better.
>
> Michael Kay
> Saxonica
>
>
>
>> On 17 Oct 2020, at 10:04, Mukul Gandhi gandhi.mukul@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Sat, Oct 17, 2020 at 1:22 PM Michael Kay mike@xxxxxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>> With xsl:analyze-string you would still need a variable, but it could be a
>> simpler variable: for example it might just contain a "1" for a match, and a
>> "0" for a non-match; at the end you then need to count the ones and zeros,
>> which you can do with string-length(translate(...)).
>>
>> With your suggestion, below mentioned is my new XSLT stylesheet,
>>
>> <xsl:stylesheet version="3.0"
>>                 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>>                 xmlns:xs="http://www.w3.org/2001/XMLSchema"
>>                 exclude-result-prefixes="xs">
>>
>>    <xsl:output method="xml" indent="yes"/>
>>
>>    <xsl:template match="/">
>>       <xsl:variable name="temp_result" as="xs:boolean*">
>>           <xsl:analyze-string select="'abhello1cdehello2fghijklhello3hello4mhello5nhello6'"
>>                               regex="hello[1-9]">
>>              <xsl:matching-substring>
>>                 <xsl:value-of select="1"/>
>>              </xsl:matching-substring>
>>              <xsl:non-matching-substring>
>> 	        <xsl:value-of select="0"/>
>>              </xsl:non-matching-substring>
>>           </xsl:analyze-string>
>>       </xsl:variable>
>>       <result>
>>          <yes count="{count(index-of($temp_result, true()))}"/>
>>          <no count="{count(index-of($temp_result, false()))}"/>
>>       </result>
>>    </xsl:template>
>>
>> </xsl:stylesheet>
>>
>> The above stylesheet gives me the desired result.
>>
>> But the above-mentioned XSLT stylesheet doesn't do exactly what you've
>> suggested.
>>
>> I would prefer to declare my XSLT variable as follows,
>>
>> <xsl:variable name="temp_result" as="xs:string">
>>     <xsl:analyze-string ...
>> </xsl:variable>
>>
>> with the expectation that the content of this new kind of variable would be a
>> string (i.e., an atomic xs:string value) of 1s and 0s, on which I
>> could do string-length(translate(...)). Is this doable?
>>
>>
>>
>> --
>> Regards,
>> Mukul Gandhi

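The single-string variant asked about in the quoted messages could be sketched roughly as follows. This is only one way to do it, under the assumption that the whole xsl:analyze-string is wrapped in an xsl:value-of, so that the "1" and "0" text nodes are merged into one string before the as="xs:string" check is applied. The variable name `flags` is made up for the illustration.

```xml
<xsl:stylesheet version="3.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs">

   <xsl:output method="xml" indent="yes"/>

   <xsl:template match="/">
      <!-- Build one string holding a "1" per match and a "0" per non-match.
           The wrapping xsl:value-of (an assumption of this sketch) joins the
           individual text nodes with no separator. -->
      <xsl:variable name="flags" as="xs:string">
         <xsl:value-of>
            <xsl:analyze-string select="'abhello1cdehello2fghijklhello3hello4mhello5nhello6'"
                                regex="hello[1-9]">
               <xsl:matching-substring>
                  <xsl:text>1</xsl:text>
               </xsl:matching-substring>
               <xsl:non-matching-substring>
                  <xsl:text>0</xsl:text>
               </xsl:non-matching-substring>
            </xsl:analyze-string>
         </xsl:value-of>
      </xsl:variable>

      <result>
         <!-- Deleting the zeros leaves only the ones, and vice versa -->
         <yes count="{string-length(translate($flags, '0', ''))}"/>
         <no count="{string-length(translate($flags, '1', ''))}"/>
      </result>
   </xsl:template>

</xsl:stylesheet>
```

Without the wrapping xsl:value-of, the variable body would deliver several text nodes, which as="xs:string" would be expected to reject because it allows exactly one item. For the sample input the string comes out as 01010110101, giving 6 and 5.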