Subject: [xsl] Hanging regex
From: Ihe Onwuka <ihe.onwuka@xxxxxxxxx>
Date: Sat, 17 Nov 2012 12:05:19 +0000
First let me dissect the regex.

<xsl:analyze-string select="." flags="x" regex="(.+?) ((-?\d*\s*)+$)"

is targeted at lines of balance sheet text such as the one below, where we do not know how many amounts will occur:

1. Total Quick Assets 1,511 2,829 1,694 4,429

(.+?) lazily matches the non-financial half of the line. In this case it will gobble up

1. Total Quick Assets

((-?\d*\s*)+$) captures the financial half, allowing for a leading minus sign. The inner brackets are for grouping, not capture.

Here is some test data: a file containing the following

I. Current Assets 1,871 2,829 1,694 4,429
1. Total Quick Assets 1,511 2,829 1,694 4,429
Short-term financial instrument 31 16 45 -
2. Total Inventories 359 - - -
II. Leased Housing Assets - - - -
III. Deferred Liabilities - - - -
III.Capital Adjustments - - -28 -30
V. Retained Earnings -2,840 -4,664 -4,363 -4,383
**********************************************************************************************************
FINANCIAL INFORMATION
1. Financial Statements
Income Statement
------------------
(Unit : KRW million)
**********************************************************************************************************

Here is the stylesheet

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="xs" version="2.0">

  <xsl:output indent="yes"/>
  <xsl:param name="input" as="xs:string" required="yes"/>

  <xsl:template match="/">
    <!-- read in text whilst removing comma punctuation from monetary fields -->
    <xsl:for-each select="tokenize(replace(unparsed-text($input, 'iso-8859-1'),',(\d{3}\D)','$1'), '\r?\n')">
      <!-- Delete lines that don't contain alphanumeric text -->
      <xsl:if test="matches(.,'\w') and matches(.,'(-|\d)+')">
        <line>
          <xsl:analyze-string select="." flags="x" regex="(.+?)
                                                          ((-?\d*\s*)+$)">
            <xsl:matching-substring>
              <lineItem><xsl:value-of select="normalize-space(regex-group(1))"/></lineItem>
              <yearlyFigures>
                <xsl:for-each select="tokenize(normalize-space(regex-group(2)),'\s+')">
                  <figure year='{position()}'>
                    <xsl:value-of select="."/>
                  </figure>
                </xsl:for-each>
              </yearlyFigures>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
              <xsl:value-of select="."/>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </line>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

and it works very well. However, if I add the following text to the data

Jan.1,2005 Jan.1,2006 Jan.1,2007 Jan.1,200

it hangs.