[xsl] csv to xml converter bug

Subject: [xsl] csv to xml converter bug
From: "Andrew Welch" <andrew.j.welch@xxxxxxxxx>
Date: Tue, 10 Jul 2007 11:29:18 +0100
The csv-to-xml solution here:
http://andrewjwelch.com/code/xslt/csv/csv-to-xml.html

...has a bug where

,,"foo,bar",,x,,

generates the tokens:

<token/>
<token/>
<token/>
<token>"foo,bar"</token>
<token/>
<token/>
<token>x</token>
<token/>
<token/>

The x should be at position 5 but is at position 7, because the commas
either side of the quoted value aren't being included with the value
itself and are generating extra tokens in the
xsl:non-matching-substring block.
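
To make the mechanism concrete, here is a minimal sketch that reproduces
the nine tokens above. It is an assumed reconstruction of the general
approach (quoted runs handled in xsl:matching-substring, everything else
split on commas in xsl:non-matching-substring), not the code from the
page; the variable name "line" and the initial template name "main" are
made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output indent="yes"/>

  <!-- The failing line from the post (hypothetical variable name) -->
  <xsl:variable name="line" as="xs:string">,,"foo,bar",,x,,</xsl:variable>

  <!-- Run with "main" as the initial template; no source document needed -->
  <xsl:template name="main">
    <tokens>
      <!-- Assumed shape of the approach: match runs of quoted text... -->
      <xsl:analyze-string select="$line" regex='("[^"]*")+'>
        <xsl:matching-substring>
          <token><xsl:value-of select="."/></token>
        </xsl:matching-substring>
        <!-- ...and split whatever lies between them on commas. The commas
             adjacent to the quoted value end up here, so each one
             contributes an extra empty token. -->
        <xsl:non-matching-substring>
          <xsl:for-each select="tokenize(., ',')">
            <token><xsl:value-of select="."/></token>
          </xsl:for-each>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </tokens>
  </xsl:template>

</xsl:stylesheet>
```

The text either side of the quoted value is ",," and ",,x,,". Tokenizing
those on commas gives 3 and 5 tokens, so together with the quoted value
that makes 9 rather than the expected 7.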

I've tried various ways to modify the solution to fix the bug, but
always ran into problems with other strings, such as:

"foo,bar",,"foo,bar",x,,,"foo,bar"

If you include leading or trailing commas with the quoted values then
the empty value at position 2 here gets consumed.  Maybe a better
regex would help here, but I couldn't write one...  (Or perhaps if the
non-matching-substring block had access to some information about the
matching-substring block...)
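
For what it's worth, one candidate that sidesteps the leading/trailing
comma problem is to append a comma to the line and treat every field,
quoted or not, as "value followed by a comma". This is only a rough
sketch, assuming quoted values never contain line breaks and that any
embedded quote is doubled. The function name fn:splitLine, the namespace
URI bound to fn, and the template name "main" are hypothetical, and the
surrounding quotes are left on the values:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="urn:example:csv"
    exclude-result-prefixes="xs fn">

  <xsl:output indent="yes"/>

  <!-- Hypothetical helper: returns one string per field, empty fields included -->
  <xsl:function name="fn:splitLine" as="xs:string*">
    <xsl:param name="str" as="xs:string"/>
    <!-- Appending a comma means every field is terminated by one.
         Group 1 is either a run of quoted chunks or a run of non-commas
         (possibly empty); the terminating comma is consumed with it. -->
    <xsl:analyze-string select="concat($str, ',')"
                        regex='(("[^"]*")+|[^,]*),'>
      <xsl:matching-substring>
        <xsl:sequence select="regex-group(1)"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <!-- Run with "main" as the initial template; no source document needed -->
  <xsl:template name="main">
    <tokens>
      <xsl:for-each select="fn:splitLine(',,&quot;foo,bar&quot;,,x,,')">
        <token><xsl:value-of select="."/></token>
      </xsl:for-each>
    </tokens>
  </xsl:template>

</xsl:stylesheet>
```

With ,,"foo,bar",,x,, this should give seven tokens with x at position 5.
The empty value at position 2 of the second example string should also
survive, because each comma is consumed together with the field in front
of it rather than with the quoted value after it.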

I had a dig around the net and found a regex[1] that could be
sufficient to just use with tokenize, but it causes the error:

FORX0002: Error at character 2 in regular expression
",(?=([^\"]*\"[^\"]*\")*(?![^\"...":
 expected ())

It works in "The Regex Coach", but not in XSLT (with Saxon 8.9.0.3b).

The code is:

<xsl:variable name="regex"
as="xs:string">,(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))</xsl:variable>

<xsl:function name="fn:getTokens" as="xs:string+">
	<xsl:param name="str" as="xs:string"/>
	<xsl:sequence select='for $t in tokenize($str, $regex)
		return replace($t, "^,""|"",$|("")""", "$1")'/>
</xsl:function>

It's an unusual-looking regex (to my novice eye) - any explanation as
to what's going on would be great.

thanks
andrew

[1] http://weblogs.asp.net/prieck/archive/2004/01/16/59457.aspx
--
http://andrewjwelch.com
