Subject: RE: [xsl] Tokenizing and transforming a CSV file
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Wed, 25 Feb 2009 16:53:28 -0000
I would use xsl:analyze-string rather than tokenize(), with a regex such as

(,"[^"]*")|(,[^,]*)
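
Note that both alternatives in this regex expect a field to be preceded by a comma, so the first field on a line would not match as it stands. The following is a minimal, untested sketch of one way it might be wired into the original stylesheet. It rests on three assumptions that are not part of the suggestion above: a comma is prepended to each line before matching, empty lines are skipped, and the quotes are stripped with substring(). Escaped quotes ("") inside a quoted field are not handled.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

   <xsl:output method="xml" indent="yes"/>

   <xsl:variable name="filedata" select="unparsed-text('test.csv')"/>

   <xsl:template match="/">
      <result>
         <!-- Assumption: lines that are empty or whitespace-only
              (e.g. after a trailing newline) are skipped. -->
         <xsl:for-each select="tokenize($filedata, '\r?\n')[normalize-space(.)]">
            <record>
               <!-- Assumption: prepend a comma so that every field,
                    including the first, begins with one. -->
               <xsl:analyze-string select="concat(',', .)"
                                   regex='(,"[^"]*")|(,[^,]*)'>
                  <xsl:matching-substring>
                     <field>
                        <xsl:choose>
                           <!-- Group 1 matched: a quoted field. Drop the
                                leading comma and quote, and the closing quote. -->
                           <xsl:when test="regex-group(1) ne ''">
                              <xsl:value-of
                                 select="substring(regex-group(1), 3,
                                         string-length(regex-group(1)) - 3)"/>
                           </xsl:when>
                           <!-- Otherwise group 2 matched: an unquoted field.
                                Drop only the leading comma. -->
                           <xsl:otherwise>
                              <xsl:value-of select="substring(regex-group(2), 2)"/>
                           </xsl:otherwise>
                        </xsl:choose>
                     </field>
                  </xsl:matching-substring>
               </xsl:analyze-string>
            </record>
         </xsl:for-each>
      </result>
   </xsl:template>

</xsl:stylesheet>
```

The quoted alternative is listed first so that a field opening with a double quote is consumed up to its closing quote, embedded commas included, before the unquoted alternative is tried.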

Michael Kay 
http://www.saxonica.com/

> -----Original Message-----
> From: Mukul Gandhi [mailto:gandhi.mukul@xxxxxxxxx] 
> Sent: 25 February 2009 16:44
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] Tokenizing and transforming a CSV file
> 
> Hi all,
>   I have a CSV file (named test.csv) as follows (as an 
> example, two lines/records are shown below):
> 
> hi,"this is a long string, please tokenize me",hello,world 
> hello,please tokenize me,hi there
> 
> I want this to be transformed to the following XML:
> 
> <result>
>    <record>
>       <field>hi</field>
>       <field>this is a long string, please tokenize me</field>
>       <field>hello</field>
>       <field>world</field>
>    </record>
>    <record>
>       <field>hello</field>
>       <field>please tokenize me</field>
>       <field>hi there</field>
>    </record>
> </result>
> 
> i.e., each line/record should be tokenized by a comma, with the 
> restriction that a comma inside a double-quoted string should 
> not be treated as a delimiter.
> 
> Below is my attempt so far.
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                        version="2.0">
> 
>    <xsl:output method="xml" indent="yes" />
> 
>    <xsl:variable name="filedata" select="unparsed-text('test.csv')" />
> 
>    <xsl:template match="/">
>       <result>
>         <xsl:for-each select="tokenize($filedata, '\r?\n')">
>           <record>
>             <xsl:for-each select="tokenize(., ',')">
>               <field>
>                 <xsl:value-of select="." />
>               </field>
>             </xsl:for-each>
>           </record>
>         </xsl:for-each>
>       </result>
>    </xsl:template>
> 
> </xsl:stylesheet>
> 
> The above stylesheet produces the following output:
> 
> <result>
>    <record>
>       <field>hi</field>
>       <field>"this is a long string</field>
>       <field> please tokenize me"</field>
>       <field>hello</field>
>       <field>world</field>
>    </record>
>    <record>
>       <field>hello</field>
>       <field>please tokenize me</field>
>       <field>hi there</field>
>    </record>
> </result>
> 
> As per my requirement, the following output fragment
> 
> <field>"this is a long string</field>
> <field> please tokenize me"</field>
> 
> is wrong.
> 
> This should actually appear as:
> 
> <field>this is a long string, please tokenize me</field>
> 
> I would appreciate any help regarding this problem.
> 
> I am using XSLT 2.0 with Saxon 9.x.
> 
> 
> --
> Regards,
> Mukul Gandhi
