Re: [xsl] analyze-string regex

Subject: Re: [xsl] analyze-string regex
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Fri, 28 Mar 2014 10:15:21 +0000
I reel with horror when I see complex regular expressions like this. Anything
that relies on regex-group(9) or regex-group(11) is a nightmare.

I've usually found it's possible to split the processing into a number of
phases, and this is the only way I can preserve my sanity.
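
A rough sketch of the phased approach, not the code from this thread. It
assumes XSLT 2.0 and a simplified input: a flat list of records with no
nested braces, which is easier than the real data with its nested geometry
objects. Phase 1 isolates each record; phase 2 pulls out one field at a
time with a small regex that has a single group. The ex: namespace, the
function names ex:field and ex:options, the main template and the sample
data are all made up for the illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.com/ns/sketch"
    exclude-result-prefixes="xs ex">

  <!-- Hypothetical sample input: flat records, no nested objects. -->
  <xsl:variable name="sample" as="xs:string">[{"title":"Ottawa","qualifier":"Ontario"},{"title":"Hull","qualifier":"Quebec"}]</xsl:variable>

  <!-- Phase 1 pattern: an opening brace, anything but braces, a closing brace. -->
  <xsl:variable name="record-re" as="xs:string">\{[^\{\}]*\}</xsl:variable>

  <!-- Phase 2: the value(s) of one string-valued key within a single record.
       Assumes $key contains no regex metacharacters. -->
  <xsl:function name="ex:field" as="xs:string*">
    <xsl:param name="record" as="xs:string"/>
    <xsl:param name="key" as="xs:string"/>
    <xsl:analyze-string select="$record" regex='"{$key}"\s*:\s*"([^"]*)"'>
      <xsl:matching-substring>
        <!-- Only one group, so there is nothing to miscount. -->
        <xsl:sequence select="regex-group(1)"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <!-- Run phase 1, then phase 2 on each record found. -->
  <xsl:function name="ex:options" as="element(option)*">
    <xsl:param name="json" as="xs:string"/>
    <xsl:analyze-string select="$json" regex="{$record-re}">
      <xsl:matching-substring>
        <!-- Here "." is the text of the matched record. -->
        <option><xsl:value-of select="ex:field(., 'qualifier')[1]"/></option>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <xsl:template name="main">
    <select>
      <xsl:sequence select="ex:options($sample)"/>
    </select>
  </xsl:template>
</xsl:stylesheet>
```

Invoked with main as the initial template, this should yield two option
elements, Ontario and Quebec. No expression ever needs a group number
higher than 1, at the cost of scanning each record once per field.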

However, another approach I have seen is to build the regular expression
methodically, for example with a sequence of variables:

<xsl:variable name="number">\d+</xsl:variable>
<xsl:variable name="string">"[^"]*"</xsl:variable>
<xsl:variable name="number-or-string" select="concat($number, '|', $string)"/>

or even with function calls

<xsl:variable name="number-or-string" select="regex:choice($number,
$string)"/>

Unfortunately neither of these approaches really helps much with getting the
group numbers right, but it can make a very large regex much more
comprehensible to the reader, and more likely to be bug-free.
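
A minimal sketch of the function-call variant, assuming XSLT 2.0. Here
ex:choice is a made-up stand-in for the regex:choice call above, and the
ex: namespace, the main template, the tokens/token element names and the
sample string are likewise invented for the illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.com/ns/sketch"
    exclude-result-prefixes="xs ex">

  <!-- Hypothetical helper: wraps two alternatives in ONE capturing group.
       XPath 2.0 regexes have no non-capturing groups, so every call
       adds a group that has to be counted. -->
  <xsl:function name="ex:choice" as="xs:string">
    <xsl:param name="a" as="xs:string"/>
    <xsl:param name="b" as="xs:string"/>
    <xsl:sequence select="concat('(', $a, '|', $b, ')')"/>
  </xsl:function>

  <!-- Small pieces that can each be read and checked on their own. -->
  <xsl:variable name="number" as="xs:string">\d+</xsl:variable>
  <xsl:variable name="string" as="xs:string">"[^"]*"</xsl:variable>
  <xsl:variable name="number-or-string" as="xs:string"
                select="ex:choice($number, $string)"/>

  <xsl:template name="main">
    <tokens>
      <xsl:analyze-string select="'x = 42, y = &quot;abc&quot;'"
                          regex="{$number-or-string}">
        <xsl:matching-substring>
          <!-- Group 1 is the pair of parentheses added by ex:choice. -->
          <token><xsl:value-of select="regex-group(1)"/></token>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </tokens>
  </xsl:template>
</xsl:stylesheet>
```

Because the helper adds its own parentheses, the pieces compose safely
inside a larger expression, but each call shifts the numbers of all the
groups that follow it.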

Michael Kay
Saxonica

On 28 Mar 2014, at 09:58, Tony Graham <tgraham@xxxxxxxxxx> wrote:

> On Thu, March 27, 2014 7:19 pm, Liam R E Quin wrote:
>> On Thu, 2014-03-27 at 17:06 +0000, Rushforth, Peter wrote:
>> [...]
>>
>>> What I came up with seems to work ok:
>
> The test for regex-group(9) is redundant since if regex-group(11) is not
> an empty string, then regex-group(9) won't be an empty string:
>
> ---
> <xsl:if test="regex-group(11)"><!-- if a bbox exists we've got an option -->
>  <xsl:element name="option">
>    <xsl:if test="regex-group(9)">
> ---
>
>>>  <xsl:function name="ex:locationJson2Options">
>>>    <xsl:param name="json"/><!--           1    2
>>>    3             4                              5                6
>>>                       7              8 9 10 11                  12
>>>       13                       14                     15 16
>>>             17                                     -->
>>>    <xsl:variable name="regexps"
>>>      select="'(\{.*?(&quot;title&quot;:.*?&quot;(.*?)&quot;).*?(&quot;qualifier&quot;:.*?&quot;(.*?)&quot;).*?(&quot;type&quot;:.*?&quot;(.*?)&quot;).*?((((&quot;bbox&quot;:.*?\[(.*?)\]).*?(&quot;geometry&quot;:.*?(\{.*?\})).*?\}{1,}))|((&quot;geometry&quot;:.*?(\{.*?\})).*?\}{1,})))'"/>
>>>    <xsl:analyze-string select="$json" regex="{$regexps}" flags="s">
> ...
>> I'd also note you use &quot; a lot, so change them to " and use '....'
>> and &apos; instead. You can also build up a complex expression by making
>
> Or put the regex as the content of the xsl:variable so you don't have to
> worry about either '"' or "'".
>
> If you include 'x' in the @flags value, you can add white-space for
> readability (and more easily see where you've put in the redundant
> parentheses) as in the example below.
>
> I also suggest making variables for the positions of the significant regex
> groups and using those in regex-group() to make the code more readable.
> If the positions are calculated relative to the previous groups, your code
> is more resilient to changes in the regex (and for bunches of related
> parentheses, e.g., rBBoxGeometry (below), I'd often add a variable, e.g.,
> $rBBoxGeometryLast, for the last parentheses in the bunch and set the next
> variable relative to that to make it resilient to changes in the bunch).
>
>> smaller variables (with comments) and using concat() at the end.
>
> If you do it as content of xsl:variable, you can use xsl:value-of to refer
> to other regex variables.
>
> Regards,
>
>
> Tony Graham                                         tgraham@xxxxxxxxxx
> Consultant                                       http://www.mentea.net
> Chair, Print and Page Layout Community Group @ W3C    XML Guild member
>  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
> Mentea       XML, XSL-FO and XSLT consulting, training and programming
>
>
> <xsl:variable name="regexps" as="xs:string">
> (                         <!-- 1 -->
> \{\s*
> (                        <!-- rTitle -->
>  "title":\s*"
>  (.*?)
>  "
> )
> \s*
> (                        <!-- rQualifier -->
>  "qualifier":\s*"
>  (.*?)                   <!-- rQualifierData -->
>  "
> )
> \s*
> (                        <!-- rType -->
>  "type":\s*"
>  (.*?)
>  "
> )
> \s*
> (                        <!-- rBBoxGeometry -->
>  (
>   (
>    (                     <!-- rBBox -->
>     "bbox":\s*\[
>     (.*?)                <!-- rBBoxData -->
>     \]
>    )
>    \s*
>    (
>     "geometry":\s*
>     (
>      \{.*?\}
>     )
>    )
>    \s*\}{1,}
>   )
>  )
>  |
>  (                        <!-- rGeometry -->
>   (
>    "geometry":\s*
>    (
>     \{.*?\}
>    )
>   )
>   \s*\}{1,}
>  )
> )
> )
> </xsl:variable>
>
> <xsl:variable name="rTitle" select="2" as="xs:integer" />
> <xsl:variable name="rQualifier" select="$rTitle + 2" as="xs:integer" />
> <xsl:variable name="rQualifierData" select="$rQualifier + 1"
> as="xs:integer" />
> <xsl:variable name="rType" select="$rQualifierData + 1" as="xs:integer" />
> <xsl:variable name="rBBoxGeometry" select="$rType + 2" as="xs:integer" />
> <xsl:variable name="rBBox" select="$rBBoxGeometry + 3" as="xs:integer" />
> <xsl:variable name="rBBoxData" select="$rBBox + 1" as="xs:integer" />
>
> <xsl:function name="ex:locationJson2Options">
>  <xsl:param name="json"/>
>
>  <xsl:analyze-string select="$json" regex="{$regexps}" flags="sx">
>    <xsl:matching-substring>
>      <xsl:if test="regex-group($rBBox)">
> 	<!-- if a bbox exists we've got an option -->
> 	<xsl:element name="option">
> 	  <xsl:if test="regex-group($rBBoxGeometry)">
> 	    <xsl:attribute name="data-bbox"
> 			   select="translate(regex-group($rBBoxData),
> 				             '&#xD;&#xA;',
> 				             '')"/>
> 	  </xsl:if>
> 	  <xsl:value-of select="regex-group($rQualifierData)"/>
> 	</xsl:element>
>      </xsl:if>
>    </xsl:matching-substring>
>  </xsl:analyze-string>
> </xsl:function>
