[xsl] Complex Regex takes 201 steps in regex buddy but runs forever in Analyze-String

From: Alex Muir <alex.g.muir@xxxxxxxxx>
Date: Mon, 31 Jan 2011 18:40:18 +0000
Hi,

With the following code:
------------------------------

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:saxon="http://saxon.sf.net/"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  version="2.0"  exclude-result-prefixes="#all">
  <xsl:output method="xml" indent="no"/>


  <xsl:template match="unknown[exists(text())]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>

      <xsl:call-template name="CompleteListAnalyze">
        <xsl:with-param name="content" select="text()"/>
      </xsl:call-template>

    </xsl:copy>
  </xsl:template>


  <xsl:template name="CompleteListAnalyze">
    <xsl:param name="content"/>

    <xsl:variable name="CompleteListIdentificationRegex" >
      <xsl:text>((¤LISTITEM[^¤]+¤[^¤]+¤/LISTITEM¤)(((«[^»¤]+»|\s+|§[^§¤]+§){0,255})(¤LISTITEM[^¤]+¤[^¤]+¤/LISTITEM¤)){0,200})</xsl:text>
    </xsl:variable>

    <xsl:analyze-string select="$content"
regex="{$CompleteListIdentificationRegex}">
      <xsl:matching-substring>
        <xsl:text>¤COMPLETELIST POSITION="</xsl:text>
        <xsl:value-of select="position()"/>
        <xsl:text>" PLACEMENT=""¤</xsl:text>
        <xsl:value-of select="regex-group(1)"/>
        <xsl:text>¤/COMPLETELIST¤</xsl:text>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>


And the following input file:
----------------------------------

<?xml version="1.0" encoding="UTF-8"?>
<doc>
    <unknown>¤LISTITEM BULLET="15" TITLE="TEXT TEXT TEXT TEXT"
TYPE="SNLI"¤«§HL§FONT size="2"
id="H13211"»15«/§HL§FONT»«/§HL§TD»
   «§HL§TD id="H13213"»«/§HL§TD» «/§HL§TR» «§HL§TR
id="H13215"»«§HL§TD
id="H13216"» «/§HL§TD»«/§HL§TR» «§HL§TR valign="bottom"
id="H13218"»
      «§HL§TD id="H13220"»«/§HL§TD»         «§HL§TD colspan="2"
id="H13222"»«§HL§FONT size="2" id="H13223"»TEXT TEXT TEXT
TEXT«/§HL§FONT»¤/LISTITEM¤«/TD»         «TD id="H13225"»«/TD»
«TD id="H13227"»«/TD»         «TD id="H13229"»«/TD»         «TD
id="H13231"»«/TD»         «TD align="right" id="H13233"»¤LISTITEM
BULLET="16" TITLE="TEXT TEXT TEXT TEXT" TYPE="SNLI"¤«§HL§FONT size="2"
id="H13234"»16«/§HL§FONT»«/§HL§TD»         «§HL§TD
id="H13236"»«/§HL§TD» «/§HL§TR» «§HL§TR id="H13238"»«§HL§TD
id="H13239"» «/§HL§TD»«/§HL§TR» «§HL§TR valign="bottom"
id="H13241"»
      «§HL§TD id="H13243"»«/§HL§TD»         «§HL§TD colspan="2"
id="H13245"»«§HL§FONT size="2" id="H13246"»TEXT TEXT TEXT TEXT TEXT
«/§HL§FONT»¤/LISTITEM¤«/TD»         «TD id="H13248"»«/TD»
«TD
id="H13250"»«/TD»         «TD id="H13252"»«/TD»         «TD
id="H13254"»«/TD»         «TD align="right" id="H13256"»¤LISTITEM
BULLET="17" TITLE="TEXT TEXT TEXT TEXT" TYPE="SNLI"¤«§HL§FONT size="2"
id="H13257"»17«/§HL§FONT»«/§HL§TD»         «§HL§TD
id="H13259"»«/§HL§TD» «/§HL§TR» «§HL§TR id="H13261"»«§HL§TD
id="H13262"» «/§HL§TD»«/§HL§TR» «§HL§TR valign="bottom"
id="H13264"»
      «§HL§TD id="H13266"»«/§HL§TD»         «§HL§TD colspan="2"
id="H13268"»«§HL§FONT size="2" id="H13269"»TEXT TEXT TEXT TEXT TEXT
«/§HL§FONT»¤/LISTITEM¤</unknown>
</doc>

The regex held in the variable CompleteListIdentificationRegex runs
fine in RegexBuddy on the same input, completing in 201 steps. It
essentially matches all of the content within the above <unknown>
element.

However, the equivalent xsl:analyze-string, run in oXygen 12.1 on the
same input, keeps running and never finishes.
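
To make the structure of the pattern easier to follow, here is a sketch
of the same stylesheet with the regex spread over several lines. It
assumes the four delimiter characters are ¤, «, » and § (some mail
archives display them as two-character sequences) and relies on the
standard XSLT 2.0 flags="x" option, which strips whitespace from the
regex before matching. The layout is for readability only and is not
expected to change how the pattern matches or how long it takes.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0" exclude-result-prefixes="#all">
  <xsl:output method="xml" indent="no"/>

  <xsl:template match="unknown[exists(text())]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:call-template name="CompleteListAnalyze">
        <xsl:with-param name="content" select="text()"/>
      </xsl:call-template>
    </xsl:copy>
  </xsl:template>

  <xsl:template name="CompleteListAnalyze">
    <xsl:param name="content"/>

    <!-- Same pattern, laid out one group per line. The delimiter
         characters are an assumption; the line breaks and indentation
         are removed by flags="x" before matching. -->
    <xsl:variable name="CompleteListIdentificationRegex">
      <xsl:text>
        (
          (¤LISTITEM[^¤]+¤[^¤]+¤/LISTITEM¤)
          (
            ( ( «[^»¤]+» | \s+ | §[^§¤]+§ ){0,255} )
            (¤LISTITEM[^¤]+¤[^¤]+¤/LISTITEM¤)
          ){0,200}
        )
      </xsl:text>
    </xsl:variable>

    <xsl:analyze-string select="$content"
      regex="{$CompleteListIdentificationRegex}" flags="x">
      <xsl:matching-substring>
        <xsl:text>¤COMPLETELIST POSITION="</xsl:text>
        <xsl:value-of select="position()"/>
        <xsl:text>" PLACEMENT=""¤</xsl:text>
        <!-- Group 1 is the outermost group, i.e. the whole run of items. -->
        <xsl:value-of select="regex-group(1)"/>
        <xsl:text>¤/COMPLETELIST¤</xsl:text>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Read this way, the pattern is one LISTITEM block followed by up to 200
repetitions of "up to 255 separator tokens, then another LISTITEM
block", where a separator token is a «...» tag, a run of whitespace, or
a §...§ span.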

Any ideas?

I have been working on it for 4 hours without much progress, other than
reducing the number of execution steps in RegexBuddy by 40.

Thanks Much


--
Alex
