Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl
From: "Pantvaidya, Vishwajit" <vpantvai@xxxxxxxxxxxxx>
Date: Fri, 23 Jun 2006 15:25:08 -0700
>-----Original Message-----
>From: Michael Kay [mailto:mike@xxxxxxxxxxxx]
>
>> My CSV has some commas in some cells - in those cases the
>> entire cell value is itself enclosed in quotes. So a simple
>> tokenize that splits at comma boundaries would not work - so
>> I replaced the tokenize for the cells with a regex that took
>> care of the quotes (is there any alternative here other than
>> using regex?). I had to specify the quotes in the regex as
>> &quot;. After this, it started taking 45 minutes to transform
>> a 20 columns-35 rows CSV.
>
>Are you using Saxon? Performance information is only interesting if we know
>what processor you are using.

[Pantvaidya, Vishwajit] Yes, I am using oxygen as editor which is using Saxon8B.

>>
>> Next problem I found was that for columns that contain commas
>> in the value, all cells in that column are not enclosed in
>> quotes - only those cells that actually have commas are
>> enclosed in quotes. So I changed the regex to account for
>> 0/more quotes. Now it transformed in 45 secs - surprise?
>> But even now, I see that the 0/more quotes regex throws it
>> off and the csv gets incorrectly parsed resulting in the
>> wrong xml content.
>>
>> So I made some changes and the current xsl has the regex as:
>> <xsl:analyze-string select="."
>> regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),
>> (.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),
>> &quot;*($.*)&quot;*,(.*)">
>
>There's a lot of potential backtracking here: it might be better to replace
>each "(.*)," with "[^,]*" or with "(.*?),".

[Pantvaidya, Vishwajit] Does "[^,]*" work the same as "(.*)," - I understand that ^ is start of line metachar. How does the former match the alphabet chars?

>
>My own instinct would be to use something like:
>
>([^"]*,|"[^"]*",)*
>

[Pantvaidya, Vishwajit] Oxygen would not accept this regex as "it matches a zero-length string". Anyway, how does this regex work - it does not seem to have anything that matches the alphabet chars.
And does the ,|" match comma or double quotes - because actually some fields will have both.

Generally, it seems that the problems with transforming such CSVs, where the field names may themselves have commas, may be due to there being no way to

- remember current state (e.g. opening double quotes) and match the remaining string based on knowledge of that state, i.e. something like "if opening double quotes encountered, then continue matching chars till closing double quote, else match till next comma", or
- assign priority to specific matches over others, e.g. give preference to matching quotes if found over commas.

Maybe this conversion is easier done with some Java code.

Thanks a lot Michael for all your help...

Vish.
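
The state-based rule described above ("if opening double quotes encountered, then continue matching chars till closing double quote, else match till next comma") could look roughly like this in Java. This is only a minimal sketch, not a tested CSV parser: the class name CsvLineSplitter and the method split are made up for illustration, and the handling of a doubled quote ("") as one literal quote inside a quoted field is an assumption that the message does not mention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, named here only for illustration.
final class CsvLineSplitter {

    // Splits one CSV line into fields. While inside quotes, commas are
    // treated as ordinary characters; outside quotes, a comma ends the field.
    // Assumption: "" inside a quoted field means one literal quote.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // the remembered state

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: keep one, skip the other
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas included
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Print each field in brackets so embedded commas are visible.
        for (String field : split("a,\"b, c\",d")) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Because the quote state is an explicit boolean, it does not matter that only some cells in a column are quoted: each cell is handled by whichever branch applies to it, and there is no backtracking.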