Subject: [xsl] Hanging regex
From: Ihe Onwuka <ihe.onwuka@xxxxxxxxx>
Date: Sat, 17 Nov 2012 12:05:19 +0000

First, let me dissect the regex.

	   <xsl:analyze-string select="." flags="x"
			       regex="(.+?)
			              ((-?\d*\s*)+$)"

It is aimed at lines of balance-sheet text such as the one below, where
we do not know in advance how many amounts will occur:

  1. Total Quick Assets                              1,511          2,829          1,694          4,429

(.+?) lazily matches the non-financial half of the line; in this case
it will gobble up "1. Total Quick Assets".

((-?\d*\s*)+$) captures the financial half, allowing for a leading
minus sign. The inner brackets are there for grouping only; they do
form a third capture group, but it is not used.
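
To make the grouping concrete, here is a minimal, self-contained sketch
that applies the same regex to a single hard-coded line. The named
template "main" and the sample string are only for illustration (they
are not part of the real stylesheet); with Saxon it can be started with
the -it:main option.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output indent="yes"/>

  <!-- Hypothetical entry point, used only for this illustration -->
  <xsl:template name="main">
    <!-- Sample line with the thousands separators already removed -->
    <xsl:variable name="line"
        select="'  1. Total Quick Assets      1511 2829     1694     4429'"/>
    <line>
      <!-- flags="x" strips the whitespace inside the regex attribute,
           so the pattern is really (.+?)((-?\d*\s*)+$) -->
      <xsl:analyze-string select="$line" flags="x"
          regex="(.+?)
                 ((-?\d*\s*)+$)">
        <xsl:matching-substring>
          <!-- group 1: the label; group 2: the whole run of amounts -->
          <label><xsl:value-of select="normalize-space(regex-group(1))"/></label>
          <amounts><xsl:value-of select="normalize-space(regex-group(2))"/></amounts>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </line>
  </xsl:template>
</xsl:stylesheet>
```

With the sample line above I would expect regex-group(1) to normalize
to "1. Total Quick Assets" and regex-group(2) to "1511 2829 1694 4429".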

Here is some test data: a file containing the following.


 I. Current Assets                                   1,871          2,829          1,694          4,429
  1. Total Quick Assets                              1,511          2,829          1,694          4,429
   Short-term financial instrument                      31             16             45              -
  2. Total Inventories                                 359              -              -              -
 II. Leased Housing Assets                               -              -              -              -
 III. Deferred Liabilities                               -              -              -              -
 III.Capital Adjustments                                 -              -            -28            -30
 V. Retained Earnings                               -2,840         -4,664         -4,363         -4,383



**********************************************************************************************************




FINANCIAL INFORMATION					             1. Financial Statements

Income Statement
------------------
									    (Unit : KRW million)
**********************************************************************************************************

Here is the stylesheet

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	exclude-result-prefixes="xs" version="2.0">
  <xsl:output indent="yes"/>
  <xsl:param name="input" as="xs:string" required="yes"/>

  <xsl:template match="/">
    <!-- read in text whilst removing comma punctuation from monetary fields -->
    <xsl:for-each select="tokenize(replace(unparsed-text($input, 'iso-8859-1'),
                                           ',(\d{3}\D)', '$1'), '\r?\n')">

       <!-- Keep only lines that contain a word character and at least
            one digit or minus sign -->
       <xsl:if test="matches(.,'\w') and matches(.,'(-|\d)+')">
         <line>
	   <xsl:analyze-string select="." flags="x"
			       regex="(.+?)
			              ((-?\d*\s*)+$)">

            <xsl:matching-substring>
              <lineItem><xsl:value-of select="normalize-space(regex-group(1))"/></lineItem>

              <yearlyFigures>
                <xsl:for-each select="tokenize(normalize-space(regex-group(2)),'\s+')">
                  <figure year='{position()}'>
                    <xsl:value-of select="."/>
                  </figure>
                </xsl:for-each>
              </yearlyFigures>
	    </xsl:matching-substring>

	    <xsl:non-matching-substring>	
              <xsl:value-of select="."/>
            </xsl:non-matching-substring>
	  </xsl:analyze-string>
         </line>
       </xsl:if>
    </xsl:for-each>
  </xsl:template>
	
</xsl:stylesheet>

This works very well.

However, if I add the following line to the data

                                                Jan.1,2005     Jan.1,2006     Jan.1,2007     Jan.1,200

it hangs.
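
For comparison, here is a sketch of a stricter pattern. It rests on an
assumption that has not been verified against the processor: that the
hang is backtracking caused by (-?\d*\s*)+, where the repeated group can
match the empty string and a long run of spaces can be split between
iterations in very many ways. The sketch requires each figure to be
either a lone "-" or an optionally signed run of digits, separated by
at least one whitespace character, and anchors the match at both ends
of the line. The named template "main" and the three sample lines are
hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output indent="yes"/>

  <!-- Hypothetical entry point, used only for this illustration -->
  <xsl:template name="main">
    <lines>
      <!-- Sample lines (commas already removed from the amounts) -->
      <xsl:for-each select="(
          ' V. Retained Earnings        -2840 -4664    -4363    -4383',
          '  2. Total Inventories         359     -        -        -',
          '            Jan.1,2005 Jan.1,2006     Jan.1,2007     Jan.1,200')">
        <line>
          <!-- With flags="x" the pattern collapses to
               ^(.+?)\s+((-?\d+|-)(\s+(-?\d+|-))*)\s*$
               so no repeated group can match the empty string -->
          <xsl:analyze-string select="." flags="x"
              regex="^(.+?) \s+
                     ( (-?\d+|-) (\s+(-?\d+|-))* )
                     \s*$">
            <xsl:matching-substring>
              <lineItem><xsl:value-of select="normalize-space(regex-group(1))"/></lineItem>
              <yearlyFigures>
                <xsl:for-each select="tokenize(regex-group(2), '\s+')">
                  <figure year="{position()}"><xsl:value-of select="."/></figure>
                </xsl:for-each>
              </yearlyFigures>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
              <xsl:value-of select="."/>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </line>
      </xsl:for-each>
    </lines>
  </xsl:template>
</xsl:stylesheet>
```

Note that this changes behaviour slightly: a line such as the date
header no longer matches at all and is copied through by
xsl:non-matching-substring.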
