RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl
From: "Pantvaidya, Vishwajit" <vpantvai@xxxxxxxxxxxxx>
Date: Thu, 22 Jun 2006 20:50:34 -0700
Thanks a lot for the xsl, Michael.

Some cells in my CSV contain commas - in those cases the entire cell value
is enclosed in quotes. So a simple tokenize that splits at comma boundaries
does not work, and I replaced the tokenize for the cells with a regex that
takes care of the quotes (is there any alternative here other than using a
regex?). I had to write the quotes in the regex as &quot;.
After this, it started taking 45 minutes to transform a CSV of 20 columns
and 35 rows.

The next problem I found was that, in columns whose values can contain
commas, not all cells are enclosed in quotes - only the cells that actually
contain a comma are quoted. So I changed the regex to allow zero or more
quotes around those cells. Now it transformed in 45 secs - surprise?
But even now, the zero-or-more-quotes regex throws it off: the CSV gets
parsed incorrectly, resulting in the wrong XML content.

So I made some changes, and the current xsl has the regex as:
<xsl:analyze-string select="."
regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),&quot;*($.*)&quot;*,(.*)">

(Now it is taking even more time - over 1 hour and still not done. Let's see
if at least the XML comes out correctly.)

Any suggestions to mitigate this regex complexity caused by the
non-uniformity of the input CSV?

Or am I better off asking the provider of the CSV to keep it uniform, so
that the cells in a column are either all quoted or all unquoted?
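
One way to avoid a whole-line regex is to match one cell at a time. A pattern
with twenty (.*) groups can force the regex engine to backtrack heavily, which
is a plausible cause of the long run times, whereas a per-cell pattern consumes
the line from left to right. The sketch below is only an illustration under
assumptions: the csv:cells function, the urn:example:csv namespace, the
input-file parameter and the main template name are made up for this example;
quoted cells are assumed to contain no embedded quotes or line breaks; every
header cell is assumed to be a valid XML element name; and no row is assumed
to be longer than the header.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:csv="urn:example:csv"
    exclude-result-prefixes="xs csv">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical parameter: location of the CSV file, header in line 1 -->
  <xsl:param name="input-file" as="xs:string" select="'data.csv'"/>

  <!-- Hypothetical helper: split one CSV line into its cells.
       A cell is either "quoted text" or a run of characters that contains
       neither a quote nor a comma. A comma is appended to the line so that
       every cell, including the last one, is followed by a comma. -->
  <xsl:function name="csv:cells" as="xs:string*">
    <xsl:param name="line" as="xs:string"/>
    <xsl:analyze-string select="concat($line, ',')"
                        regex='("([^"]*)"|([^",]*)),'>
      <xsl:matching-substring>
        <!-- Group 2 holds a quoted cell without its quotes,
             group 3 holds an unquoted cell -->
        <xsl:sequence select="if (starts-with(., '&quot;'))
                              then regex-group(2)
                              else regex-group(3)"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <xsl:template name="main">
    <!-- Read the file, split it into lines, drop blank lines -->
    <xsl:variable name="lines" as="xs:string*"
        select="tokenize(unparsed-text($input-file), '\r?\n')[normalize-space(.)]"/>
    <xsl:variable name="field-names" as="xs:string*"
        select="csv:cells($lines[1])"/>
    <doc>
      <xsl:for-each select="subsequence($lines, 2)">
        <line>
          <xsl:for-each select="csv:cells(.)">
            <xsl:variable name="p" as="xs:integer" select="position()"/>
            <!-- The p-th header name becomes the element name -->
            <xsl:element name="{$field-names[$p]}">
              <xsl:value-of select="."/>
            </xsl:element>
          </xsl:for-each>
        </line>
      </xsl:for-each>
    </doc>
  </xsl:template>

</xsl:stylesheet>
```

With an XSLT 2.0 processor the transformation would start at the named
template main, so no source document is needed. For the line a,"b,c",d the
function should yield three cells: a, then b,c, then d. Because each match is
anchored to a single cell, it makes no difference whether only some cells in a
column are quoted.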


Thanks,

Vish.

>-----Original Message-----
>From: Michael Kay [mailto:mike@xxxxxxxxxxxx]
>Sent: Thursday, June 22, 2006 12:43 AM
>To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
>Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details
>in xsl
>
>> Can anybody suggest how to convert CSV data in the format
>>
>> Field1,Field2
>> Value11,Value12
>>
>> to xml like
>>
>> <Field1>Value11</Field1>
>> <Field2>Value12</Field2>
>>
>> without hardcoding the fieldnames in the xsl?
>
><xsl:variable name="lines" as="xs:string*"
>              select="tokenize(unparsed-text($input-file), '\r?\n')"/>
><xsl:variable name="field-names" as="xs:string*"
>              select="tokenize($lines[1], ',')"/>
><xsl:for-each select="subsequence($lines, 2)">
>  <row>
>    <xsl:variable name="cells" select="tokenize(., ',')"/>
>    <xsl:for-each select="$cells">
>      <xsl:variable name="p" as="xs:integer" select="position()"/>
>      <!-- name is an attribute value template: the p-th header is the element name -->
>      <xsl:element name="{$field-names[$p]}">
>        <xsl:value-of select="."/>
>      </xsl:element>
>    </xsl:for-each>
>  </row>
></xsl:for-each>
>
>Michael Kay
>http://www.saxonica.com/
>
>
>>
>> I was thinking of something like
>>
>> <xsl:for-each select="tokenize(., ',')"> &lt;<xsl:value-of
>> select="item-at($elementNames,index-of(?parent of current
>> node?,.))"/>&gt; <xsl:value-of select="."/>
>> &lt;/<xsl:value-of
>> select="item-at($elementNames,index-of(?parent of current
>> node?,.))"/>&gt; </xsl:for-each>
>>
>> where elementNames is a tokenized list of the fieldnames -
>> but I am unable to get it to work.
>>
>>
>>
>> >-----Original Message-----
>> >From: Pantvaidya, Vishwajit
>> >Sent: Wednesday, June 21, 2006 12:17 AM
>> >To: 'xsl-list@xxxxxxxxxxxxxxxxxxxxxx'
>> >Subject: [xsl] Converting CSV to XML without hardcoding schema details
>> >in xsl
>> >
>> >Hello,
>> >
>> >I am trying to convert a CSV datafile into XML format.
>> >The headers for the CSV data are in a file header.csv e.g.
>> >Field1,Field2
>> >The data is in a file Data.csv e.g.
>> >Value11,Value12
>> >Value21,Value22
>> >
>> >I need to convert the CSV data into xml output by creating xml elements
>> >using the names in the csv header and taking the corresponding values
>> >from the data file, so that I get an xml as follows:
>> >
>> ><doc>
>> ><line>
>> ><Field1>Value11</Field1>
>> ><Field2>Value12</Field2>
>> ></line>
>> ><line>
>> ><Field1>Value21</Field1>
>> ><Field2>Value22</Field2>
>> ></line>
>> ></doc>
>> >
>> >I was trying to see if I can do this without hardcoding the header
>> >names in the xsl. I got up to the point where my xsl looks as below:
>> >
>> ><xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>> >xmlns:op="http://www.w3.org/2001/12/xquery-operators"
>> >    xmlns:xf="http://www.w3.org/2001/12/xquery-functions"
>> >version="2.0">
>> >
>> >    <xsl:output  name="xmlFormat" method="xml" indent="yes"
>> >omit-xml-declaration="yes"/>
>> >
>> >    <xsl:variable name="source1" select="'data.csv'"/>
>> >    <xsl:variable name="elementNamesList" select="'Header.csv'"/>
>> >    <xsl:variable name="encoding" select="'iso-8859-1'"/>
>> >
>> >    <xsl:variable name="elementNames"
>> >select="tokenize(unparsed-text($elementNamesList,$encoding),',')"/>
>> >    <xsl:variable name="src">
>> >        <doc>
>> >            <xsl:for-each
>> >select="tokenize(unparsed-text($source1,$encoding), '\r?\n')">
>> >                <line>
>> >                    <xsl:for-each select="tokenize(., ',')">
>> >                        &lt;<xsl:value-of
>> >select="op:item-at($elementNames,index-of(?parent of current
>> >node?,.))"/>&gt;
>> >                            <xsl:value-of select="."/>
>> >                            &lt;/<xsl:value-of
>> >select="item-at($elementNames,3)"/>&gt;
>> >                    </xsl:for-each>
>> >                </line>
>> >            </xsl:for-each>
>> >        </doc>
>> >    </xsl:variable>
>> >
>> >    <xsl:template match="/">
>> >        <xsl:result-document format = "xmlFormat" href = "src1.xml">
>> >            <xsl:copy-of select="$src"/>
>> >        </xsl:result-document>
>> >    </xsl:template>
>> >
>> ></xsl:stylesheet>
>> >
>> >In the yet-incomplete statement <xsl:value-of
>> >select="op:item-at($elementNames,index-of(?parent of current
>> >node?,.))"/>, I am trying to generate an xml element with the Nth field
>> >name from the headers name list for the Nth field value. A couple of
>> >issues/questions here:
>> >
>> >- I am getting the error "Cannot find a matching 2-argument function
>> >named {http://www.w3.org/2001/12/xquery-operators}item-at()" when I try
>> >to validate the xsl. What could be the reason?
>> >
>> >- How can I get the ?parent of current node? needed to compute the
>> >index of the current data in the data record?
>> >
>> >- Is there any other better way to do it? Any way that I can do the
>> >same using xsl:element?
>> >
>> >In general, is this the only/best way or is there any other better way
>> >to achieve the same goal?
>> >
>> >
>> >Thanks and Regards,
>> >
>> >Vish.
