RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl

From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Sat, 24 Jun 2006 08:41:06 +0100
> >There's a lot of potential backtracking here: it might be better to 
> >replace each "(.*)," with "[^,]*" or with "(.*?),".
> 
> [Pantvaidya, Vishwajit] Does "[^,]*" work the same as "(.*)," 
> - I understand that ^ is start of line metachar. How does the 
> former match the alphabet chars?

No: as the first character within square brackets, ^ means "not". So [^,]*
matches a sequence (possibly empty) of any characters except a comma.
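
A minimal sketch of the two meanings of ^, using Python's re module purely
for illustration (the choice of Python and the sample string are assumptions,
not something from this thread):

```python
import re

sample = "alpha,beta,gamma"

# Outside square brackets, ^ anchors the match to the start of the input.
anchored = re.search(r"^alpha", sample) is not None

# As the first character inside square brackets, ^ negates the class:
# [^,]* matches a run of characters that are not commas.
first_field = re.match(r"[^,]*", sample).group(0)

# Repeating the negated class picks out each comma-free run in turn.
all_fields = re.findall(r"[^,]+", sample)

print(first_field)
print(all_fields)
```

The alphabetic characters are matched simply because they are not commas.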

The problem with your expression is that (.*) matches as many characters as
it can, running to the end of the line. Then it sees ",", so it backtracks
to find the last comma. Then it sees the next "(.*)," and has to backtrack
again; and so on.
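
The difference can be seen in a small sketch (Python's re module is used here
only as a convenient way to try it out; the sample line is made up):

```python
import re

line = "one,two,three,four"

# Greedy: (.*) first swallows the whole line, then backs off to the LAST comma.
greedy = re.match(r"(.*),", line).group(1)

# Negated class: ([^,]*) cannot run past a comma, so there is nothing to undo.
negated = re.match(r"([^,]*),", line).group(1)

# Reluctant: (.*?) takes as few characters as possible before the comma.
reluctant = re.match(r"(.*?),", line).group(1)

print(greedy)
print(negated)
print(reluctant)
```

With several "(.*)," groups in a row, each one repeats the greedy
overshoot-and-retreat, which is where the cost comes from.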
> 
> >
> >My own instinct would be to use something like:
> >
> >([^"]*,|"[^"]*",)*
> >
> 
> [Pantvaidya, Vishwajit] Oxygen would not accept this regex as 
> "it matches a zero-length string".

Perhaps then you want to change the final "*" to a "+".
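
A rough check of why the "*" form draws that complaint and the "+" form
should not (sketched in Python as an assumption of convenience; Oxygen's own
diagnostics are not reproduced here, and the sample line is invented):

```python
import re

star = re.compile(r'([^"]*,|"[^"]*",)*')
plus = re.compile(r'([^"]*,|"[^"]*",)+')

# With *, zero repetitions are allowed, so the empty string matches.
star_matches_empty = star.fullmatch("") is not None

# With +, at least one repetition is required, and every repetition
# must consume a comma, so the empty string cannot match.
plus_matches_empty = plus.fullmatch("") is not None

# A line whose fields are each followed by a comma still matches.
plus_matches_line = plus.fullmatch('a,"b,c",d,') is not None
```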

> Anyway, how does this regex work - it does not seem to have 
> anything that matches the alphabet chars.

See above: [^"] matches everything except quotes.

> And does the ,|" match comma or double quotes - because 
> actually some field will have both.

The first alternative, [^"]*, (the trailing comma is part of the pattern),
matches any field that ends with a comma and doesn't contain a quotation
mark. The second alternative, "[^"]*", (again including the comma), matches
any field that begins and ends with quotes (followed by a comma), and might
contain a comma between the quotes.
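
To see how the two alternatives divide up a line, here is a small sketch in
Python (chosen only for illustration; the sample lines are invented). Note
that each field, including the last, is assumed to be followed by a comma.
The "tighter" pattern is an assumption of this sketch, not part of the
suggestion above: it also excludes commas from unquoted fields, because
[^"]* on its own can run across several unquoted fields before settling on
a comma.

```python
import re

# The two alternatives from the regex above, without the outer repetition.
field = re.compile(r'[^"]*,|"[^"]*",')

pieces = [m.group(0) for m in field.finditer('a,"b,c",d,')]

# Two adjacent unquoted fields are taken together by the first alternative.
merged = [m.group(0) for m in field.finditer('a,b,"c,d",')]

# Hypothetical tighter variant: unquoted fields may not contain commas.
tighter = re.compile(r'[^",]*,|"[^"]*",')
split = [m.group(0) for m in tighter.finditer('a,b,"c,d",')]

print(pieces)
print(merged)
print(split)
```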

It's very hard to find out what the exact rules for CSV files used by a
particular product are: for example, how it represents a field that contains
quotation marks as well as commas. (That's one of the great advantages of
XML: you can find a specification!) If you know the exact rules for your
particular flavour of CSV, you can adapt the regex to match (well, you can
if you study a bit more about regular expressions).
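
As one example of such an adaptation, here is a sketch that assumes a
flavour in which a quotation mark inside a quoted field is written as two
quotation marks. That rule, the function name split_fields, and the sample
data are all assumptions for illustration; your product's rules may differ.
The (?: ) group is Python syntax and may need to be a plain group in other
regex dialects.

```python
import re

# Assumed flavour: inside a quoted field, "" stands for one literal quote.
# Each field is either unquoted (no commas or quotes) or quoted, and is
# followed by a comma.
field = re.compile(r'([^",]*|"(?:[^"]|"")*"),')

def split_fields(line):
    """Split one CSV line into field values under the assumed rules."""
    values = []
    # Append a comma so the last field is terminated like the others.
    for m in field.finditer(line + ","):
        raw = m.group(1)
        if raw.startswith('"'):
            # Drop the enclosing quotes and undo the quote doubling.
            raw = raw[1:-1].replace('""', '"')
        values.append(raw)
    return values

print(split_fields('1,"Smith, John","said ""hi"""'))
```

Malformed input (for example an unclosed quote) is not handled; the sketch
only shows how the second alternative changes once the quoting rule is known.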
> 
> 
> Maybe this conversion is easier done with some Java code.
> 
I'm sure it can be done using regular expressions but it looks as if you
need to do some learning in this area.

Michael Kay
http://www.saxonica.com/
