Re: [xsl] Hanging regex

Subject: Re: [xsl] Hanging regex
From: David Carlisle <davidc@xxxxxxxxx>
Date: Sat, 17 Nov 2012 12:40:35 +0000
On 17/11/2012 12:05, Ihe Onwuka wrote:
First let me dissect the regex

	   <xsl:analyze-string select="." flags="x"
			       regex="(.+?)
			              ((-?\d*\s*)+$)"

is targeted at lines of balance sheet text such as below  where we do
not know how many amounts will occur

   1. Total Quick Assets                              1,511          2,829          1,694          4,429

(.+?)  lazily matches the non-financial half of the line  - in this
case it will gobble up 1. Total Quick Assets

((-?\d*\s*)+$) captures the financial half - allowing for a leading
minus sign - the inner brackets are for grouping not capture.

Here is some test data - a file containing the following


I. Current Assets 1,871 2,829 1,694 4,429
1. Total Quick Assets 1,511 2,829 1,694 4,429
Short-term financial instrument 31 16 45 -
2. Total Inventories 359 - - -
II. Leased Housing Assets - - - -
III. Deferred Liabilities - - - -
III.Capital Adjustments - - -28 -30
V. Retained Earnings -2,840 -4,664 -4,363 -4,383



**********************************************************************************************************




FINANCIAL INFORMATION 1. Financial Statements


Income Statement
------------------
									    (Unit : KRW million)
**********************************************************************************************************

Here is the stylesheet

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
         xmlns:xs="http://www.w3.org/2001/XMLSchema"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	exclude-result-prefixes="xs" version="2.0">
   <xsl:output indent="yes"/>
   <xsl:param name="input" as="xs:string" required="yes"/>

   <xsl:template match="/">
     <!-- read in text whilst removing comma punctuation from monetary
fields -->
     <xsl:for-each select="tokenize(replace(unparsed-text($input,
'iso-8859-1'),',(\d{3}\D)','$1'), '\r?\n')">

        <!-- Delete lines that don't contain alphanumeric text -->	
        <xsl:if test="matches(.,'\w') and matches(.,'(-|\d)+')">	
          <line>
	   <xsl:analyze-string select="." flags="x"
			       regex="(.+?)
			              ((-?\d*\s*)+$)">

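That's a wildly expensive regex: every part of (-?\d*\s*) is optional, so the outer + can divide a run of digits and spaces between iterations in a very large number of ways, and a backtracking engine may try them all before giving up on input it cannot match. If I change it to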


	   <xsl:analyze-string select="." flags="x"
			       regex="(.+?)
			              ((-|\d|\s)+$)">


I get identical output for your input, and adding the extra line doesn't make it take appreciably longer.
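
For anyone who wants to try the cheaper pattern in isolation, here is a minimal self-contained sketch. It is not the original stylesheet: the inline $lines variable, the named template "main", and the label, amount and text element names are all made up for illustration, and the sample strings already have their commas removed, as the replace() call in the original does. In (-|\d|\s)+ each iteration consumes exactly one character, so there is only one way to match a given run of digits, spaces and minus signs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="xs" version="2.0">
  <xsl:output indent="yes"/>

  <!-- Hypothetical inline sample lines (commas already stripped);
       the original stylesheet reads them with unparsed-text() instead -->
  <xsl:variable name="lines" as="xs:string*"
      select="('1. Total Quick Assets 1511 2829 1694 4429',
               'III.Capital Adjustments - - -28 -30',
               'FINANCIAL INFORMATION 1. Financial Statements')"/>

  <!-- Hypothetical entry point, invoked as the initial template -->
  <xsl:template name="main">
    <lines>
      <xsl:for-each select="$lines">
        <line>
          <!-- flags="x" strips the whitespace inside the regex attribute -->
          <xsl:analyze-string select="." flags="x"
                              regex="(.+?)
                                     ((-|\d|\s)+$)">
            <xsl:matching-substring>
              <!-- group 1: the descriptive half of the line -->
              <label>
                <xsl:value-of select="normalize-space(regex-group(1))"/>
              </label>
              <!-- group 2: the trailing amounts, split on single spaces -->
              <xsl:for-each
                  select="tokenize(normalize-space(regex-group(2)), ' ')">
                <amount><xsl:value-of select="."/></amount>
              </xsl:for-each>
            </xsl:matching-substring>
            <!-- lines that do not end in amounts fall through unchanged -->
            <xsl:non-matching-substring>
              <text><xsl:value-of select="."/></text>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </line>
      </xsl:for-each>
    </lines>
  </xsl:template>
</xsl:stylesheet>
```

This needs an XSLT 2.0 processor started at the named template (with Saxon, the -it:main option). The first two sample lines should come out as a label followed by four amount elements each, and the third, which does not end in amounts, as a single text element.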



David

