RE: [xsl] csv to xml converter bug

Subject: RE: [xsl] csv to xml converter bug
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Tue, 10 Jul 2007 12:21:36 +0100
The construct

(?=X)

is allowed in some regex dialects, where it means "match X with a zero-width
positive lookahead". This is basically an assertion that X must match at the
current position, without causing X to be swallowed. But it's not allowed in
the XPath regex dialect.
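
As a small illustration in a dialect that does accept the construct, here is a sketch using Python's re module. The pattern and the test strings are made up for the example; they are not taken from the csv-to-xml stylesheet.

```python
import re

# Hypothetical pattern for illustration: match "foo" only where "bar" follows.
positive = re.compile(r"foo(?=bar)")

hit = positive.search("foobar")
miss = positive.search("foobaz")

# The lookahead checks for "bar" but does not consume it, so the match
# covers "foo" only and ends at offset 3.
print(hit.group(0), hit.end())

# No "bar" after "foo", so the assertion fails and there is no match.
print(miss)
```

The point to notice is that the match ends before "bar": the lookahead constrains where a match may occur without becoming part of it.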

This construct (a zero-width negative lookahead) isn't allowed either:

(?!X) 

This is the inverse: it asserts that X does not match at the current
position, without swallowing X.
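
Again as a sketch in Python's re module (not in XPath, where this fails): first the negative lookahead on its own with a made-up pattern, then a Python adaptation of the regex from the quoted message. The adaptation is an assumption on my part: the quotes are written unescaped, and the repeated group is made non-capturing because Python's re.split would otherwise add the captured text to its result.

```python
import re

# Hypothetical pattern for illustration: "foo" only where "bar" does NOT follow.
negative = re.compile(r"foo(?!bar)")
rejected = negative.search("foobar")   # "bar" follows, so no match
accepted = negative.search("foobaz")   # "bar" does not follow, so "foo" matches

# Adapted from the quoted regex: a comma counts as a separator only if it is
# followed by zero or more complete pairs of quotes and then no further quote,
# i.e. by an even number of quotes, which means the comma is outside quotes.
SEPARATOR = re.compile(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))')

tokens = SEPARATOR.split(',,"foo,bar",,x,,')
print(tokens)
```

Under these assumptions the comma inside "foo,bar" is followed by an odd number of quotes, so it is not treated as a separator, and the x ends up as the fifth token.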

I'm afraid I have no idea whether these constructs can be translated into
anything that the XPath regex dialect permits.
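
One direction that avoids lookahead altogether is to match each field together with the separator that ends it, instead of splitting on separators. The following is only a sketch of that idea in Python; the names FIELD and split_fields are invented for the example, it assumes a single line with no escaped quotes inside quoted values, and whether it carries over to the XPath dialect (in particular where a regex may match a zero-length string) has not been checked here.

```python
import re

# A field is either a quoted value or a run of non-comma characters
# (possibly empty), followed by a comma or the end of the line.
FIELD = re.compile(r'("[^"]*"|[^,]*)(,|$)')

def split_fields(line):
    """Split one CSV line into fields, keeping the quotes on quoted values."""
    fields = []
    pos = 0
    while True:
        m = FIELD.match(line, pos)    # always matches: both parts may be empty
        fields.append(m.group(1))
        if m.group(2) == "":          # terminated by end of line, not a comma
            break
        pos = m.end()                 # continue just after the comma
    return fields

print(split_fields(',,"foo,bar",,x,,'))
print(split_fields('"foo,bar",,"foo,bar",x,,,"foo,bar"'))
```

Because each step consumes exactly one field and one separator, empty values next to quoted values are neither dropped nor duplicated.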

Gunther Schadow can say "told you it would be needed":
http://www.stylusstudio.com/xsllist/200412/post00810.html


Michael Kay
http://www.saxonica.com/


> -----Original Message-----
> From: Andrew Welch [mailto:andrew.j.welch@xxxxxxxxx] 
> Sent: 10 July 2007 11:29
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] csv to xml converter bug
> 
> The csv-to-xml solution here:
> http://andrewjwelch.com/code/xslt/csv/csv-to-xml.html
> 
> ...has a bug where
> 
> ,,"foo,bar",,x,,
> 
> generates the tokens:
> 
> <token/>
> <token/>
> <token/>
> <token>"foo,bar"</token>
> <token/>
> <token/>
> <token>x</token>
> <token/>
> <token/>
> 
> The x should be at position 5 but is at position 7 because 
> the commas either side of the quoted values aren't being 
> included with the value itself, and are generating extra 
> tokens in the xsl:non-matching-substring block.
> 
> I've tried various ways to modify the solution to fix the 
> bug, but always ran into problems with other strings, such as:
> 
> "foo,bar",,"foo,bar",x,,,"foo,bar"
> 
> If you include leading or trailing commas with the quoted 
> values then the empty value at position 2 here gets consumed. 
>  Maybe a better regex would help here, but I couldn't write 
> one...  (Or perhaps if the non-matching-substring block had 
> access to some information about the matching-substring block...)
> 
> I had a dig around the net and found a regex[1] that could be 
> sufficient to just use with tokenize, but it causes the error:
> 
> FORX0002: Error at character 2 in regular expression
> ",(?=([^\"]*\"[^\"]*\")*(?![^\"...":
>   expected ())
> 
> It works in "The Regex Coach", but not in XSLT (with 
> Saxon 8.9.0.3b).
> 
> The code is:
> 
> <xsl:variable name="regex"
> as="xs:string">,(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))</xsl:variable>
> 
> <xsl:function name="fn:getTokens" as="xs:string+">
> 	<xsl:param name="str" as="xs:string"/>
> 	<xsl:sequence select='for $t in tokenize($str, $regex)
> 		return replace($t, "^,""|"",$|("")""", "$1")'/> 
> </xsl:function>
> 
> It's an unusual looking regex (to my novice eye) - any 
> explanation as to what's going on would be great.
> 
> thanks
> andrew
> 
> [1] http://weblogs.asp.net/prieck/archive/2004/01/16/59457.aspx
> --
> http://andrewjwelch.com
