RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl
From: "Pantvaidya, Vishwajit" <vpantvai@xxxxxxxxxxxxx>
Date: Fri, 23 Jun 2006 10:35:24 -0700
Thanks Nathan - considering the problems with the CSV files, I was thinking
of writing a simple Java program to convert csv to xml...


>-----Original Message-----
>From: Nathan Young -X (natyoung - Artizen at Cisco)
>[mailto:natyoung@xxxxxxxxx]
>Sent: Friday, June 23, 2006 9:53 AM
>To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
>Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details
>in xsl
>
>Hi.
>
>The rules you describe for handling cells with commas in them and cells
>with quotes in them are widely used conventions for encoding data in
>csv.  Unless you are able to prevent cells from ever containing commas
>or quotes you will not be able to make the csv "uniform" in a way that
>does not require these (or some other) irregularities.
>
>There is another way of parsing csv files that works faster than regular
>expressions, very generally by reading the file character by character
>into a buffer and applying a set of rules at each character to decide if
>you have reached the end of a cell, at which point you empty the buffer
>into a cell variable (or whatever you need to do with it) and continue.
>I think this is best not done in XSL though.
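
A minimal sketch of the character-by-character approach described above, written in Python purely for illustration. The function name `split_csv_line` is hypothetical, and the sketch assumes the common convention that a literal quote inside a quoted cell is written as two quotes; if the CSV producer follows a different rule, the quote branch needs adjusting.

```python
def split_csv_line(line):
    """Split one CSV record into cells, honouring double-quoted cells."""
    cells = []
    buffer = []        # characters of the cell currently being read
    in_quotes = False  # True while inside a "..." cell
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buffer.append('"')  # doubled quote (assumed convention)
                    i += 1
                else:
                    in_quotes = False   # closing quote of the cell
            else:
                buffer.append(ch)       # commas are ordinary text in here
        elif ch == '"':
            in_quotes = True            # opening quote of a cell
        elif ch == ',':
            cells.append(''.join(buffer))  # end of cell: empty the buffer
            buffer = []
        else:
            buffer.append(ch)
        i += 1
    cells.append(''.join(buffer))       # last cell has no trailing comma
    return cells


print(split_csv_line('Value11,"a, b",Value13'))
# ['Value11', 'a, b', 'Value13']
```

Each character is looked at once, so the cost grows linearly with the length of the line. The sketch works on a single line, so a quoted cell that contains a line break would have to be handled before the file is split into lines.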
>
>If performance is indeed an issue, you are likely to be well served by
>parsing out the csv file into a very simple XML format using another
>language.  Many existing programming languages have very robust and
>performant csv parsers for them already, so you'd have that problem
>mostly solved from the outset.
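
As one possible illustration of leaning on an existing parser, here is a short Python sketch that uses the standard `csv` and `xml.etree.ElementTree` modules. The function name `csv_to_xml` and the `doc`/`line` element names are taken from the example in the original question or made up for this sketch; it also assumes the header cells are valid XML element names and that the header is the first row of the same text.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text):
    """Turn CSV text (header row first) into a simple <doc>/<line> XML string."""
    reader = csv.reader(io.StringIO(csv_text, newline=''))
    field_names = next(reader)          # first record holds the element names
    doc = ET.Element('doc')
    for row in reader:
        if not row:                     # skip blank lines
            continue
        line = ET.SubElement(doc, 'line')
        # Pair the Nth field name with the Nth value of the row.
        for name, value in zip(field_names, row):
            ET.SubElement(line, name).text = value
    return ET.tostring(doc, encoding='unicode')


print(csv_to_xml('Field1,Field2\nValue11,"a, b"\nValue21,Value22\n'))
```

The library takes care of quoted cells and embedded commas, and the XML serializer escapes characters such as `<` and `&` in the values, so the resulting file can be fed to a stylesheet without further string handling.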
>
>------------>Nathan
>
>
>
>.:||:._.:||:._.:||:._.:||:._.:||:._.:||:._.:||:._.:||:._.:||:._.:||:._.:
>||:.
>
>Nathan Young
>Cisco.com->Interface Development
>A: ncy1717
>E: natyoung@xxxxxxxxx
>
>> -----Original Message-----
>> From: Pantvaidya, Vishwajit [mailto:vpantvai@xxxxxxxxxxxxx]
>> Sent: Thursday, June 22, 2006 8:51 PM
>> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
>> Subject: RE: [xsl] Converting CSV to XML without hardcoding
>> schema details in xsl
>>
>> Thanks a lot for the xsl, Michael.
>>
>> My CSV has commas in some cells - in those cases the entire cell value
>> is itself enclosed in quotes. So a simple tokenize that splits at comma
>> boundaries would not work - so I replaced the tokenize for the cells with
>> a regex that took care of the quotes (is there any alternative here other
>> than using regex?). I had to specify the quotes in the regex as &quot;.
>> After this, it started taking 45 minutes to transform a CSV of 20 columns
>> and 35 rows.
>>
>> The next problem I found was that in columns that contain commas in the
>> value, not all cells are enclosed in quotes - only those cells that
>> actually have commas are. So I changed the regex to account for zero or
>> more quotes. Now it transformed in 45 secs - surprise?
>> But even now, I see that the zero-or-more-quotes regex throws it off and
>> the csv gets parsed incorrectly, resulting in the wrong xml content.
>>
>> So I made some changes and the current xsl has the regex as:
>> <xsl:analyze-string select="."
>> regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),&quot;*($.*)&quot;*,(.*)">
>>
>> (Now it is taking even more time - over an hour and still not done.
>> Let's see if at least the xml comes out correctly.)
>>
>> Any suggestions to mitigate this regex complexity caused by the
>> non-uniformity of the input CSV?
>>
>> Or am I better off asking the provider of the CSV to keep it uniform,
>> so that either all cells in a column are quoted or none are?
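
One likely reason for the long run times is that a single pattern with twenty greedy `(.*)` groups gives a backtracking regex engine a very large number of ways to divide a line among the groups. A possible alternative, sketched here in Python under the same doubled-quote assumption as is common for CSV, is to match one cell at a time with a pattern that accepts either a quoted or an unquoted cell. The names `CELL` and `split_cells` are hypothetical.

```python
import re

# One cell: either "quoted text" (with "" standing for a literal quote)
# or a run of characters containing neither a quote nor a comma.
CELL = re.compile(r'"((?:[^"]|"")*)"|([^",]*)')


def split_cells(line):
    """Split a CSV line by matching one cell at a time."""
    cells = []
    pos = 0
    while True:
        m = CELL.match(line, pos)       # always matches; may be empty
        quoted, bare = m.group(1), m.group(2)
        if quoted is not None:
            cells.append(quoted.replace('""', '"'))
        else:
            cells.append(bare)
        pos = m.end()
        if pos >= len(line):            # reached the end of the line
            break
        if line[pos] != ',':            # e.g. an unterminated quote
            raise ValueError('malformed CSV near column %d' % pos)
        pos += 1                        # step over the separating comma
    return cells


print(split_cells('Value11,"a, b",Value13'))
# ['Value11', 'a, b', 'Value13']
```

Because the pattern does not care whether a given cell is quoted, a column in which only some cells are quoted needs no special handling, and the number of columns does not have to be written into the regex.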
>>
>>
>> Thanks,
>>
>> Vish.
>>
>> >-----Original Message-----
>> >From: Michael Kay [mailto:mike@xxxxxxxxxxxx]
>> >Sent: Thursday, June 22, 2006 12:43 AM
>> >To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
>> >Subject: RE: [xsl] Converting CSV to XML without hardcoding
>> schema details
>> >in xsl
>> >
>> >> Can anybody suggest how to convert CSV data in the format
>> >>
>> >> Field1,Field2
>> >> Value11,Value12
>> >>
>> >> to xml like
>> >>
>> >> <Field1>Value11</Field1>
>> >> <Field2>Value12</Field2>
>> >>
>> >> without hardcoding the fieldnames in the xsl?
>> >
>> ><xsl:variable name="lines" as="xs:string*"
>> >              select="tokenize(unparsed-text($input-file), '\r?\n')"/>
>> ><xsl:variable name="field-names" as="xs:string*"
>> >              select="tokenize($lines[1], ',')"/>
>> ><xsl:for-each select="subsequence($lines, 2)">
>> >  <row>
>> >    <xsl:variable name="cells" select="tokenize(., ',')"/>
>> >    <xsl:for-each select="$cells">
>> >      <xsl:variable name="p" as="xs:integer" select="position()"/>
>> >      <xsl:element name="{$field-names[$p]}">
>> >        <xsl:value-of select="."/>
>> >      </xsl:element>
>> >    </xsl:for-each>
>> >  </row>
>> ></xsl:for-each>
>> >
>> >Michael Kay
>> >http://www.saxonica.com/
>> >
>> >
>> >>
>> >> I was thinking of something like
>> >>
>> >> <xsl:for-each select="tokenize(., ',')"> &lt;<xsl:value-of
>> >> select="item-at($elementNames,index-of(?parent of current
>> >> node?,.))"/>&gt; <xsl:value-of select="."/>
>> >> &lt;/<xsl:value-of
>> >> select="item-at($elementNames,index-of(?parent of current
>> >> node?,.))"/>&gt; </xsl:for-each>
>> >>
>> >> where elementNames is a tokenized list of the fieldnames -
>> >> but I am unable to get it to work.
>> >>
>> >>
>> >>
>> >> >-----Original Message-----
>> >> >From: Pantvaidya, Vishwajit
>> >> >Sent: Wednesday, June 21, 2006 12:17 AM
>> >> >To: 'xsl-list@xxxxxxxxxxxxxxxxxxxxxx'
>> >> >Subject: [xsl] Converting CSV to XML without hardcoding
>> >> schema details
>> >> >in xsl
>> >> >
>> >> >Hello,
>> >> >
>> >> >I am trying to convert a CSV datafile into XML format.
>> >> >The headers for the CSV data are in a file header.csv e.g.
>> >> >Field1,Field2
>> >> >The data is in a file Data.csv e.g.
>> >> >Value11,Value12
>> >> >Value21,Value22
>> >> >
>> >> >I need to convert the CSV data into xml output by creating
>> >> xml elements
>> >> >using the names in the csv header and taking the
>> >> corresponding values
>> >> >from the data file, so that I get an xml as follows:
>> >> >
>> >> ><doc>
>> >> ><line>
>> >> ><Field1>Value11</Field1>
>> >> ><Field2>Value12</Field2>
>> >> ></line>
>> >> ><line>
>> >> ><Field1>Value21</Field1>
>> >> ><Field2>Value22</Field2>
>> >> ></line>
>> >> ></doc>
>> >> >
>> >> >I was trying to see if I can do this without hardcoding the header
>> >> >names in the xsl. I got as far as the xsl below:
>> >> >
>> >> ><xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>> >> >xmlns:op="http://www.w3.org/2001/12/xquery-operators"
>> >> >    xmlns:xf="http://www.w3.org/2001/12/xquery-functions"
>> >> >version="2.0">
>> >> >
>> >> >    <xsl:output  name="xmlFormat" method="xml" indent="yes"
>> >> >omit-xml-declaration="yes"/>
>> >> >
>> >> >    <xsl:variable name="source1" select="'data.csv'"/>
>> >> >    <xsl:variable name="elementNamesList" select="'Header.csv'"/>
>> >> >    <xsl:variable name="encoding" select="'iso-8859-1'"/>
>> >> >
>> >> >    <xsl:variable name="elementNames"
>> >> >select="tokenize(unparsed-text($elementNamesList,$encoding),',')"/>
>> >> >    <xsl:variable name="src">
>> >> >        <doc>
>> >> >            <xsl:for-each
>> >> >select="tokenize(unparsed-text($source1,$encoding), '\r?\n')">
>> >> >                <line>
>> >> >                    <xsl:for-each select="tokenize(., ',')">
>> >> >                        &lt;<xsl:value-of
>> >> >select="op:item-at($elementNames,index-of(?parent of current
>> >> >node?,.))"/>&gt;
>> >> >                            <xsl:value-of select="."/>
>> >> >                            &lt;/<xsl:value-of
>> >> >select="item-at($elementNames,3)"/>&gt;
>> >> >                    </xsl:for-each>
>> >> >                </line>
>> >> >            </xsl:for-each>
>> >> >        </doc>
>> >> >    </xsl:variable>
>> >> >
>> >> >    <xsl:template match="/">
>> >> >        <xsl:result-document format="xmlFormat" href="src1.xml">
>> >> >            <xsl:copy-of select="$src"/>
>> >> >        </xsl:result-document>
>> >> >    </xsl:template>
>> >> >
>> >> ></xsl:stylesheet>
>> >> >
>> >> >In the yet-incomplete statement <xsl:value-of
>> >> >select="op:item-at($elementNames,index-of(?parent of current
>> >> >node?,.))"/>, I am trying to generate an xml element with the Nth
>> >> >field name from the header name list for the Nth field value. A
>> >> >couple of issues/questions here:
>> >> >
>> >> >- I am getting the error "Cannot find a matching 2-argument function
>> >> >named {http://www.w3.org/2001/12/xquery-operators}item-at()" when I
>> >> >try to validate the xsl. What could be the reason?
>> >> >
>> >> >- How can I get the ?parent of current node?, which is needed to
>> >> >compute the index of the current data in the data record?
>> >> >
>> >> >- Is there a better way to do it? Any way that I can do the same
>> >> >using xsl:element?
>> >> >
>> >> >In general, is this the only/best way, or is there a better way to
>> >> >achieve the same goal?
>> >> >
>> >> >
>> >> >Thanks and Regards,
>> >> >
>> >> >Vish.
