RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl
From: "Nathan Young -X \(natyoung - Artizen at Cisco\)" <natyoung@xxxxxxxxx>
Date: Fri, 23 Jun 2006 09:52:51 -0700
Hi.

The rules you describe for handling cells with commas in them and cells
with quotes in them are widely used conventions for encoding data in
csv.  Unless you are able to prevent cells from ever containing commas
or quotes you will not be able to make the csv "uniform" in a way that
does not require these (or some other) irregularities.
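
To illustrate the convention: a minimal sketch in Python (my choice of language, not something from the thread) showing that a standard CSV writer quotes only the cells that need it, which is exactly the "non-uniform" output described. The helper name encode_row is made up for this example.

```python
import csv
import io

def encode_row(cells):
    """Encode one row using the csv module's default (minimal) quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(cells)
    return buf.getvalue().rstrip("\n")

# Only the cells containing a comma or a quote get wrapped in quotes;
# a quote inside a cell is written as two quotes.
print(encode_row(["plain", "has, comma", 'has "quote"']))
# prints: plain,"has, comma","has ""quote"""
```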

There is another way of parsing csv files that works faster than regular
expressions, very generally by reading the file character by character
into a buffer and applying a set of rules at each character to decide if
you have reached the end of a cell, at which point you empty the buffer
into a cell variable (or whatever you need to do with it) and continue.
I think this is best not done in XSL though.
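
A rough sketch of that character-by-character approach, written in Python as an assumption about which "other language" one might pick. The function name split_csv_line is hypothetical, and the sketch handles a single line only (it does not cover quoted cells that contain line breaks).

```python
def split_csv_line(line):
    """Split one CSV line into cells, honoring quoted cells and doubled quotes."""
    cells = []
    buf = []            # characters of the cell currently being read
    in_quotes = False   # True while inside a quoted cell
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buf.append('"')    # doubled quote means a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote of the cell
            else:
                buf.append(ch)         # commas in here are ordinary data
        elif ch == '"':
            in_quotes = True           # opening quote of a quoted cell
        elif ch == ',':
            cells.append("".join(buf))  # end of cell: empty the buffer
            buf = []
        else:
            buf.append(ch)
        i += 1
    cells.append("".join(buf))          # last cell on the line
    return cells

print(split_csv_line('a,"b, c",d'))
```

Each character is looked at once, so the cost grows linearly with the length of the line regardless of how many cells are quoted.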

If performance is indeed an issue, you are likely to be well served by
parsing the csv file into a very simple XML format using another
language.  Many programming languages already have robust, fast csv
parsers available, so you'd have that problem mostly solved from the
outset.
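
As one possible illustration, here is a short Python sketch using the standard csv and xml.etree.ElementTree modules. It assumes the field names are on the first line of the CSV text and are valid XML element names. The function name csv_to_xml is made up, and the doc/line element names are borrowed from the output format requested earlier in the thread.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(csv_text):
    """Turn CSV text (header row first) into a simple <doc><line>... XML string."""
    reader = csv.reader(io.StringIO(csv_text))
    field_names = next(reader)          # first row supplies the element names
    doc = ET.Element("doc")
    for row in reader:
        line = ET.SubElement(doc, "line")
        for name, value in zip(field_names, row):
            # Assumes each header is a legal XML element name.
            ET.SubElement(line, name).text = value
    return ET.tostring(doc, encoding="unicode")

print(csv_to_xml('Field1,Field2\nValue11,"Value, 12"\n'))
# prints: <doc><line><Field1>Value11</Field1><Field2>Value, 12</Field2></line></doc>
```

The resulting file can then be handed to the stylesheet as ordinary XML input, with no regex needed on the XSL side.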

------------>Nathan



.:||:._.:||:._.:||:._.:||:._.:||:._.:||:._.:||:._.:||:._.:||:._.:||:._.:
||:.

Nathan Young
Cisco.com->Interface Development
A: ncy1717
E: natyoung@xxxxxxxxx

> -----Original Message-----
> From: Pantvaidya, Vishwajit [mailto:vpantvai@xxxxxxxxxxxxx]
> Sent: Thursday, June 22, 2006 8:51 PM
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: RE: [xsl] Converting CSV to XML without hardcoding
> schema details in xsl
>
> Thanks a lot for the xsl, Michael.
>
> My CSV has commas in some cells - in those cases the entire cell value
> is itself enclosed in quotes. So a simple tokenize that splits at comma
> boundaries would not work - so I replaced the tokenize for the cells
> with a regex that took care of the quotes (is there any alternative here
> other than using regex?). I had to specify the quotes in the regex as
> &quot;. After this, it started taking 45 minutes to transform a CSV with
> 20 columns and 35 rows.
>
> Next problem I found was that for columns that contain commas in the
> value, not all cells in that column are enclosed in quotes - only those
> cells that actually have commas are enclosed in quotes. So I changed the
> regex to account for 0/more quotes. Now it transformed in 45 secs -
> surprise? But even now, I see that the 0/more quotes regex throws it off
> and the csv gets incorrectly parsed, resulting in the wrong xml content.
>
> So I made some changes and the current xsl has the regex as:
> <xsl:analyze-string select="."
> regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),&quot;*($.*)&quot;*,(.*)">
>
> (now it is taking even more time - 1 hour+ and still not done. Let's see
> if at least the xml comes out correctly.)
>
> Any suggestions to mitigate this regex complexity due to non-uniformity
> of input CSV?
>
> Or am I better off asking the provider of the CSV to keep the CSV
> uniform so that either all cells in the column are with/without quotes?
>
>
> Thanks,
>
> Vish.
>
> >-----Original Message-----
> >From: Michael Kay [mailto:mike@xxxxxxxxxxxx]
> >Sent: Thursday, June 22, 2006 12:43 AM
> >To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> >Subject: RE: [xsl] Converting CSV to XML without hardcoding
> schema details
> >in xsl
> >
> >> Can anybody suggest how to convert CSV data in the format
> >>
> >> Field1,Field2
> >> Value11,Value12
> >>
> >> to xml like
> >>
> >> <Field1>Value11</Field1>
> >> <Field2>Value12</Field2>
> >>
> >> without hardcoding the fieldnames in the xsl?
> >
> ><xsl:variable name="lines" as="xs:string*"
> >              select="tokenize(unparsed-text($input-file), '\r?\n')"/>
> ><xsl:variable name="field-names" as="xs:string*"
> >              select="tokenize($lines[1], ',')"/>
> ><xsl:for-each select="subsequence($lines, 2)">
> ><row>
> >  <xsl:variable name="cells" select="tokenize(., ',')"/>
> >  <xsl:for-each select="$cells">
> >    <xsl:variable name="p" as="xs:integer" select="position()"/>
> >    <xsl:element name="{$field-names[$p]}">
> >      <xsl:value-of select="."/>
> >    </xsl:element>
> >  </xsl:for-each>
> ></row>
> ></xsl:for-each>
> >
> >Michael Kay
> >http://www.saxonica.com/
> >
> >
> >>
> >> I was thinking of something like
> >>
> >> <xsl:for-each select="tokenize(., ',')"> &lt;<xsl:value-of
> >> select="item-at($elementNames,index-of(?parent of current
> >> node?,.))"/>&gt; <xsl:value-of select="."/>
> >> &lt;/<xsl:value-of
> >> select="item-at($elementNames,index-of(?parent of current
> >> node?,.))"/>&gt; </xsl:for-each>
> >>
> >> where elementNames is a tokenized list of the fieldnames -
> >> but I am unable to get it to work.
> >>
> >>
> >>
> >> >-----Original Message-----
> >> >From: Pantvaidya, Vishwajit
> >> >Sent: Wednesday, June 21, 2006 12:17 AM
> >> >To: 'xsl-list@xxxxxxxxxxxxxxxxxxxxxx'
> >> >Subject: [xsl] Converting CSV to XML without hardcoding
> >> schema details
> >> >in xsl
> >> >
> >> >Hello,
> >> >
> >> >I am trying to convert a CSV datafile into XML format.
> >> >The headers for the CSV data are in a file header.csv e.g.
> >> >Field1,Field2
> >> >The data is in a file Data.csv e.g.
> >> >Value11,Value12
> >> >Value21,Value22
> >> >
> >> >I need to convert the CSV data into xml output by creating
> >> xml elements
> >> >using the names in the csv header and taking the
> >> corresponding values
> >> >from the data file, so that I get an xml as follows:
> >> >
> >> ><doc>
> >> ><line>
> >> ><Field1>Value11</Field1>
> >> ><Field2>Value12</Field2>
> >> ></line>
> >> ><line>
> >> ><Field1>Value21</Field1>
> >> ><Field2>Value22</Field2>
> >> ></line>
> >> ></doc>
> >> >
> >> >I was trying to see if I can do this without hardcoding the header
> >> >names in the xsl. I reached upto the point where my xsl
> >> looks as below:
> >> >
> >> ><xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> >> >xmlns:op="http://www.w3.org/2001/12/xquery-operators"
> >> >    xmlns:xf="http://www.w3.org/2001/12/xquery-functions"
> >> >version="2.0">
> >> >
> >> >    <xsl:output  name="xmlFormat" method="xml" indent="yes"
> >> >omit-xml-declaration="yes"/>
> >> >
> >> >    <xsl:variable name="source1" select="'data.csv'"/>
> >> >    <xsl:variable name="elementNamesList" select="'Header.csv'"/>
> >> >    <xsl:variable name="encoding" select="'iso-8859-1'"/>
> >> >
> >> >    <xsl:variable name="elementNames"
> >> >select="tokenize(unparsed-text($elementNamesList,$encoding),',')"/>
> >> >    <xsl:variable name="src">
> >> >        <doc>
> >> >            <xsl:for-each
> >> >select="tokenize(unparsed-text($source1,$encoding), '\r?\n')">
> >> >                <line>
> >> >                    <xsl:for-each select="tokenize(., ',')">
> >> >                        &lt;<xsl:value-of
> >> >select="op:item-at($elementNames,index-of(?parent of current
> >> >node?,.))"/>&gt;
> >> >                            <xsl:value-of select="."/>
> >> >                            &lt;/<xsl:value-of
> >> >select="item-at($elementNames,3)"/>&gt;
> >> >                    </xsl:for-each>
> >> >                </line>
> >> >            </xsl:for-each>
> >> >        </doc>
> >> >    </xsl:variable>
> >> >
> >> >    <xsl:template match="/">
> >> >        <xsl:result-document format="xmlFormat" href="src1.xml">
> >> >            <xsl:copy-of select="$src"/>
> >> >        </xsl:result-document>
> >> >    </xsl:template>
> >> >
> >> ></xsl:stylesheet>
> >> >
> >> >In the yet-incomplete statement <xsl:value-of
> >> >select="op:item-at($elementNames,index-of(?parent of current
> >> >node?,.))"/>, I am trying to generate an xml element with
> >> the Nth field
> >> >name from the headers name list for the Nth field value. Couple of
> >> >issues/questions here:
> >> >
> >> >- I am getting the error "Cannot find a matching
> 2-argument function
> >> >named {http://www.w3.org/2001/12/xquery-operators}item-at()"
> >> when I try
> >> >to validate the xsl. What could be the reason?
> >> >
> >> >- How can I get the ?parent of current node? Needed to compute the
> >> >index of the current data in the data record?
> >> >
> >> >- Is there any other better way to do it? Any way that I
> can do the
> >> >same using xsl:element?
> >> >
> >> >In general, is this the only/best way or is there any other
> >> better way
> >> >to achieve the same goal?
> >> >
> >> >
> >> >Thanks and Regards,
> >> >
> >> >Vish.
