RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Fri, 23 Jun 2006 09:39:32 +0100
> My CSV has some commas in some cells - in those cases the 
> entire cell value is itself enclosed in quotes. So a simple 
> tokenize that splits at comma boundaries would not work - so 
> I replaced the tokenize for the cells with a regex that took 
> care of the quotes (is there any alternative here other than 
> using regex?). I had to specify the quotes in the regex as 
> &quot; After this, it started taking 45 minutes to transform 
> a 20 columns-35 rows CSV.

Are you using Saxon? Performance information is only interesting if we know
what processor you are using.
> 
> Next problem I found was that for columns that contain commas 
> in the value, all cells in that column are not enclosed in 
> quotes - only those cells that actually have commas are 
> enclosed in quotes. So I changed the regex to account for 
> 0/more quotes. Now it transformed in 45 secs - surprise?
> But even now, I see that the 0/more quotes regex throws it 
> off and the csv gets incorrectly parsed resulting in the 
> wrong xml content.
> 
> So I made some changes and the current xsl has the regex as:
> <xsl:analyze-string select="."
> regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),
> (.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),
> &quot;*($.*)&quot;*,(.*)">

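There's a lot of potential backtracking here: each "(.*)" can match commas
as well as cell content, so the regex engine may try many different ways of
dividing the line among the groups. It might be better to replace each
"(.*)," with "([^,]*)," or with "(.*?),".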

My own instinct would be to use something like:

([^"]*,|"[^"]*",)*

Michael Kay
Saxonica
