RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl
From: "Pantvaidya, Vishwajit" <vpantvai@xxxxxxxxxxxxx>
Date: Fri, 23 Jun 2006 15:25:08 -0700
>-----Original Message-----
>From: Michael Kay [mailto:mike@xxxxxxxxxxxx]
>
>> My CSV has some commas in some cells - in those cases the
>> entire cell value is itself enclosed in quotes. So a simple
>> tokenize that splits at comma boundaries would not work - so
>> I replaced the tokenize for the cells with a regex that took
>> care of the quotes (is there any alternative here other than
>> using regex?). I had to specify the quotes in the regex as
>> &quot; After this, it started taking 45 minutes to transform
>> a 20 columns-35 rows CSV.
>
>Are you using Saxon? Performance information is only interesting if we know
>what processor you are using.
[Pantvaidya, Vishwajit] Yes, I am using oXygen as the editor, which uses
Saxon 8B.

>>
>> Next problem I found was that for columns that contain commas
>> in the value, not all cells in that column are enclosed in
>> quotes - only those cells that actually have commas are
>> enclosed in quotes. So I changed the regex to account for
>> 0/more quotes. Now it transformed in 45 secs - surprise?
>> But even now, I see that the 0/more quotes regex throws it
>> off and the csv gets incorrectly parsed resulting in the
>> wrong xml content.
>>
>> So I made some changes and the current xsl has the regex as:
>> <xsl:analyze-string select="."
>> regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),
>> (.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),
>> &quot;*($.*)&quot;*,(.*)">
>
>There's a lot of potential backtracking here: it might be better to replace
>each "(.*)," with "[^,]*" or with "(.*?),".

[Pantvaidya, Vishwajit] Does "[^,]*" work the same as "(.*),"? I understand
^ to be the start-of-line metacharacter, so how does the former match
alphabetic characters?
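
For what it is worth, a leading ^ inside square brackets negates the character class instead of anchoring to the start of the line, so "[^,]*" reads as "any run of characters other than a comma", which includes letters. Below is a minimal sketch in Java. It assumes that java.util.regex treats these constructs the same way as the XPath 2.0 regex dialect used by Saxon; the class name, helper name and sample row are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String row = "abc,def,ghi"; // hypothetical sample row

        // [^,]* : ^ inside brackets negates, so this is "anything but a comma"
        System.out.println(firstMatch("[^,]*", row));

        // ([^,]*), : one field plus its comma, with no need to backtrack
        System.out.println(firstMatch("([^,]*),", row));

        // (.*), : greedy, so it first swallows the whole row and then backs
        // up to the LAST comma
        System.out.println(firstMatch("(.*),", row));
    }
}
```

The greedy "(.*)," form has to give characters back one at a time until the rest of the pattern fits, which is the backtracking cost mentioned above; the negated class stops at the first comma on its own.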

>
>My own instinct would be to use something like:
>
>([^"]*,|"[^"]*",)*
>

[Pantvaidya, Vishwajit] Oxygen would not accept this regex because "it
matches a zero-length string".
Anyway, how does this regex work? It does not seem to contain anything
that matches alphabetic characters.
And does the ,|" part match a comma or a double quote? Some fields will
actually contain both.
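
As far as I can tell, the | in that pattern separates two whole alternatives, [^"]*, and "[^"]*", and does not choose between a comma and a quote. The zero-length complaint would come from the outer *, which allows the pattern to match nothing at all. One possible workaround, sketched here in Java and not taken from the thread, is to append a comma to the line and match one field at a time, so that every match consumes at least that comma. The class name and group layout are hypothetical, java.util.regex is assumed to behave like the XPath 2.0 dialect for these constructs, and doubled quotes inside a quoted field are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldRegexDemo {
    // One field plus its trailing comma: either "quoted" or a run of non-commas.
    // Group 2 holds quoted content, group 3 holds unquoted content.
    private static final Pattern FIELD =
            Pattern.compile("(\"([^\"]*)\"|([^,]*)),");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        // Appending a comma means every field, including the last one, ends
        // with a comma, so no match is ever zero-length.
        Matcher m = FIELD.matcher(line + ",");
        while (m.find()) {
            fields.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical row: the second field is quoted and contains a comma.
        System.out.println(split("a,\"b,c\",d"));
    }
}
```

The quoted alternative is listed first so that it is tried before the plain one. The same pattern could presumably be given to xsl:analyze-string over concat(., ','), but that is untested here.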

Generally, it seems that the problems with transforming such CSVs, where
the field values may themselves contain commas, may be due to there being
no way to
- remember the current state (e.g. an opening double quote) and match the
remaining string based on knowledge of that state, i.e. something like "if
an opening double quote is encountered, then continue matching chars till
the closing double quote, else match till the next comma", or
- assign priority to specific matches over others, e.g. give preference to
matching quotes, if found, over commas.

Maybe this conversion is easier done with some Java code.
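
A rough sketch of what such Java code might look like, following the "remember the current state" idea: a single flag records whether the scanner is inside double quotes, and a comma only ends a field when the flag is off. The class and method names are hypothetical, and the sketch assumes one record per line with no doubled-quote escapes inside quoted fields.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLineScanner {
    // Splits one CSV line, tracking whether we are inside double quotes.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // the remembered state

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // Opening or closing quote: flip the state, do not copy it.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                // A comma outside quotes ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                // Ordinary character, or a comma inside quotes.
                current.append(c);
            }
        }
        // The last field has no trailing comma, so add it explicitly.
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical row with a comma inside a quoted cell.
        System.out.println(split("x,\"Smith, John\",42"));
    }
}
```

Each resulting field could then be written out as an XML element, with the element names taken from the header row so that nothing about the schema is hardcoded.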


Thanks a lot Michael for all your help...


Vish.
