[xsl] Tokenizing and transforming a CSV file

Subject: [xsl] Tokenizing and transforming a CSV file
From: Mukul Gandhi <gandhi.mukul@xxxxxxxxx>
Date: Wed, 25 Feb 2009 22:14:26 +0530
Hi all,
  I have a CSV file (named test.csv) as follows (two lines/records are
shown below as an example):

hi,"this is a long string, please tokenize me",hello,world
hello,please tokenize me,hi there

I want this to be transformed to the following XML:

<result>
   <record>
      <field>hi</field>
      <field>this is a long string, please tokenize me</field>
      <field>hello</field>
      <field>world</field>
   </record>
   <record>
      <field>hello</field>
      <field>please tokenize me</field>
      <field>hi there</field>
   </record>
</result>

That is, each line/record should be tokenized on commas, with the
restriction that a comma inside a double-quoted string should not be
treated as a delimiter.

Below is my attempt so far.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

   <xsl:output method="xml" indent="yes" />

   <xsl:variable name="filedata" select="unparsed-text('test.csv')" />

   <xsl:template match="/">
      <result>
        <xsl:for-each select="tokenize($filedata, '\r?\n')">
          <record>
            <xsl:for-each select="tokenize(., ',')">
              <field>
                <xsl:value-of select="." />
              </field>
            </xsl:for-each>
          </record>
        </xsl:for-each>
      </result>
   </xsl:template>

</xsl:stylesheet>

The above stylesheet produces the following output:

<result>
   <record>
      <field>hi</field>
      <field>"this is a long string</field>
      <field> please tokenize me"</field>
      <field>hello</field>
      <field>world</field>
   </record>
   <record>
      <field>hello</field>
      <field>please tokenize me</field>
      <field>hi there</field>
   </record>
</result>

As per my requirement, the following output fragment

<field>"this is a long string</field>
<field> please tokenize me"</field>

is wrong.

This should actually appear as:

<field>this is a long string, please tokenize me</field>

I would appreciate any help regarding this problem.

I am using XSLT 2.0 with Saxon 9.x.
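
One direction that might work is sketched below. It is an unverified
sketch, not a tested solution. It replaces the inner tokenize() call with
xsl:analyze-string, using a regex whose first alternative matches a whole
double-quoted field and whose second matches a run of characters that are
neither commas nor quotes. The regex and the [. ne ''] filter on the lines
are assumptions introduced for illustration. The sketch does not handle
empty unquoted fields (a,,b), doubled quotes ("") inside a quoted field, or
quoted fields that span several lines.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

   <xsl:output method="xml" indent="yes" />

   <xsl:variable name="filedata" select="unparsed-text('test.csv')" />

   <xsl:template match="/">
      <result>
         <!-- Split the file into lines. The [. ne ''] filter is an
              addition: it skips the empty token left by a trailing
              newline at the end of the file. -->
         <xsl:for-each select="tokenize($filedata, '\r?\n')[. ne '']">
            <record>
               <!-- Group 1: contents of a double-quoted field, without
                    the quotes. Group 2: an unquoted field. The commas
                    between fields are non-matching substrings and are
                    ignored. -->
               <xsl:analyze-string select="."
                                   regex='"([^"]*)"|([^,"]+)'>
                  <xsl:matching-substring>
                     <field>
                        <!-- Only one of the two groups is non-empty
                             for any given match. -->
                        <xsl:value-of
                           select="concat(regex-group(1), regex-group(2))" />
                     </field>
                  </xsl:matching-substring>
               </xsl:analyze-string>
            </record>
         </xsl:for-each>
      </result>
   </xsl:template>

</xsl:stylesheet>
```

For the first sample line, the quoted alternative should consume
"this is a long string, please tokenize me" as a single match, so the
embedded comma never acts as a delimiter.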


-- 
Regards,
Mukul Gandhi
