RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

From: "Pantvaidya, Vishwajit" <vpantvai@xxxxxxxxxxxxx>
Date: Tue, 27 Jun 2006 15:11:10 -0700
Hi Michael,

This is what I got with the following regex:

([^"]*,|"[^"]*",)+(.*)

The ending (.*) was needed to match the last field which ends with neither a
comma nor quote.

Input CSV (3 lines):

ID,ParentID,Group,User,Title,Description,GroupBelong,EffectiveDate,Effective
Month,EffectiveDay,EffectiveYear,Months,EndDate,Name,AssumedName,Address,Typ
e,Status,Amount,AmountAggregate
1,,,,A BC - A B.
Cloud,Individual,VP,2/13/2006,February,13th,2006,36,2/12/2009,"A B C,
Inc.",D E,"38th Street, MyCity, MyState 12345",TypeA,Active,"$442,000.00
",$1.62 
2,,,,ABC- Judge
ABC,Internal,VP,3/1/2006,March,1st,2006,36,2/28/2009,"Charity Services
(""CS"")",MyCity,"ABC Blvd., MyCity, MyState
12345",TypeB,Active,"$1,442,000.00 ",$1.35


Output XML:

<doc xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <row>
      <ID>"$442,000.00 ",</ID>
      <ParentID>$1.62 </ParentID>
      <Group/>
      <User/>
      <Title/>
      <Description/>
      <GroupBelong/>
      <EffectiveDate/>
      <EffectiveMonth/>
      <EffectiveDay/>
      <EffectiveYear/>
      <Months/>
      <EndDate/>
      <Name/>
      <AssumedName/>
      <Address/>
      <Type/>
      <Status/>
      <Amount/>
      <AmountAggregate/>
   </row>
   <row>
      <ID>2,,,,ABC- Judge
ABC,Internal,VP,3/1/2006,March,1st,2006,36,2/28/2009,</ID>
      <ParentID>"Charity Services (""CS"")",MyCity,"ABC Blvd., MyCity,
MyState 12345",TypeB,Active,"$1,442,000.00 ",$1.35 </ParentID>
      <Group/>
      <User/>
      <Title/>
      <Description/>
      <GroupBelong/>
      <EffectiveDate/>
      <EffectiveMonth/>
      <EffectiveDay/>
      <EffectiveYear/>
      <Months/>
      <EndDate/>
      <Name/>
      <AssumedName/>
      <Address/>
      <Type/>
      <Status/>
      <Amount/>
      <AmountAggregate/>
   </row>
   <row/>
</doc>
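
One possible reading of the output above: a capturing group that sits inside a repetition such as (...)+ keeps only the text of its last repetition, so a single match can never return every field through that one group. The sketch below shows this with Python's re module rather than XSLT (an assumption made for illustration; the DOTALL flag and the record_tail excerpt are also assumptions, not taken from the stylesheet).

```python
import re

# The regex from this message. DOTALL is an assumption so that "." can cross
# the line breaks that appear inside quoted fields.
PATTERN = re.compile(r'([^"]*,|"[^"]*",)+(.*)', re.DOTALL)

# Hypothetical excerpt: the tail of the first data record in the input above.
record_tail = 'TypeA,Active,"$442,000.00\n",$1.62 '

match = PATTERN.fullmatch(record_tail)

# Group 1 is inside "(...)+", so it is overwritten on every repetition and
# only the text of the final repetition survives.
last_repetition = match.group(1)

# Group 2 is the trailing "(.*)", i.e. whatever follows the last comma-
# terminated field.
remainder = match.group(2)

print(repr(last_repetition))
print(repr(remainder))
```

Here group 1 ends up holding the quoted amount and group 2 the final field, which lines up with the ID and ParentID values in the first row above. Matching one field per regex match, instead of one record per match, avoids the problem.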



Thanks,

Vish.


>-----Original Message-----
>From: Pantvaidya, Vishwajit
>Sent: Tuesday, June 27, 2006 2:48 PM
>To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
>Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details
>in xsl
>
>>From: Michael Kay [mailto:mike@xxxxxxxxxxxx]
>>Sent: Saturday, June 24, 2006 12:41 AM
>>> >
>>> >There's a lot of potential backtracking here: it might be better to
>>> >replace each "(.*)," with "[^,]*" or with "(.*?),".
>>>
>>> [Pantvaidya, Vishwajit] Does "[^,]*" work the same as "(.*),"
>>> - I understand that ^ is start of line metachar. How does the
>>> former match the alphabet chars?
>>
>>No, within square brackets, ^ means "not". So [^,]* matches a sequence of
>>any characters except comma.
>>
>>The problem with your expression is that (.*) matches as many characters
>>as it can. Then it sees ",", so it backtracks to find the last comma. Then
>>it sees the next (.*), and has to backtrack again; and so on.
>>>
>>> >
>>> >My own instinct would be to use something like:
>>> >
>>> >([^"]*,|"[^"]*",)*
>>> >
>>>
>>> [Pantvaidya, Vishwajit] Oxygen would not accept this regex as
>>> "it matches a zero-length string".
>>
>>Perhaps then you want to change the final "*" to a "+".
>>
>[Pantvaidya, Vishwajit] That is the first thing I tried when the * did
>not work - but even then it does not seem to be working.
>
>>> Anyway, how does this regex work - it does not seem to have
>>> anything that matches the alphabet chars.
>>
>>See above: [^"] matches everything except quotes.
>>
>>> And does the ,|" match comma or double quotes - because
>>> actually some field will have both.
>>
>>The first alternative, [^"]*, matches any field that ends with a comma
>>and doesn't contain a quotation mark. The second alternative, "[^"]*",
>>matches any field that begins and ends with quotes (followed by a comma),
>>and might contain a comma between the quotes.
>>
>>It's very hard to find out what the exact rules for CSV files used by a
>>particular product are: for example, how it represents a field that
>>contains quotation marks as well as commas. (That's one of the great
>>advantages of XML: you can find a specification!) If you know the exact
>>rules for your particular flavour of CSV, you can adapt the regex to match
>>(well, you can if you study a bit more about regular expressions).
>>>
>>>
>>> Maybe this conversion is easier done with some Java code.
>>>
>>I'm sure it can be done using regular expressions but it looks as if you
>>need to do some learning in this area.
>>
>[Pantvaidya, Vishwajit] Thanks a lot for all the clarifications and help.
>Actually I did look at the regex documentation in the XSLT2 spec, but not
>very exhaustively - the info on back-references I found there made me feel
>that they could be useful here, e.g. to tell the regex that if a starting
>quote is found, it should look for an ending one. But the more I look into
>it, the more it seems like I may not be able to use them.
>
>Thanks and regards,
>
>Vish.
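
Following the point in the quoted exchange about knowing the exact rules of a CSV flavour, here is a minimal sketch of a per-field pattern, assuming the flavour seen in the sample data: fields are separated by commas, a field may be wrapped in double quotes, and a doubled quote ("") inside a quoted field stands for one literal quote. It is written in Python rather than XSLT, and the names FIELD and split_record are hypothetical. It also assumes the caller passes one complete record, including any line breaks inside quoted fields.

```python
import re

# One field per match: a leading comma (or the start of the record), then
# either a quoted field in which "" stands for a literal quote, or a run of
# characters containing no comma.
FIELD = re.compile(r'(?:^|,)("(?:[^"]|"")*"|[^,]*)')


def split_record(record):
    """Split one CSV record into field values under the assumed rules."""
    fields = []
    for raw in FIELD.findall(record):
        # Strip the surrounding quotes and collapse doubled quotes.
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            raw = raw[1:-1].replace('""', '"')
        fields.append(raw)
    return fields


# Shortened, hypothetical version of the second data record.
sample = '2,,"Charity Services (""CS"")",MyCity,"$1,442,000.00 ",$1.35'
print(split_record(sample))
```

The pattern uses a non-capturing group and can match an empty field, so it would likely need adapting before use in an XSLT 2.0 stylesheet. In Python itself the standard csv module would be the more robust choice.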
