Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Fri, 23 Jun 2006 09:39:32 +0100

> My CSV has some commas in some cells - in those cases the
> entire cell value is itself enclosed in quotes. So a simple
> tokenize that splits at comma boundaries would not work - so
> I replaced the tokenize for the cells with a regex that took
> care of the quotes (is there any alternative here other than
> using regex?). I had to specify the quotes in the regex as
> &quot;. After this, it started taking 45 minutes to transform
> a 20 columns-35 rows CSV.

Are you using Saxon? Performance information is only interesting if we know what processor you are using.

> Next problem I found was that for columns that contain commas
> in the value, all cells in that column are not enclosed in
> quotes - only those cells that actually have commas are
> enclosed in quotes. So I changed the regex to account for
> 0/more quotes. Now it transformed in 45 secs - surprise?
> But even now, I see that the 0/more quotes regex throws it
> off and the csv gets incorrectly parsed resulting in the
> wrong xml content.
>
> So I made some changes and the current xsl has the regex as:
> <xsl:analyze-string select="."
> regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),&quot;*($.*)&quot;*,(.*)">

There's a lot of potential backtracking here: each "(.*)" can match across commas, so the engine may have to try many ways of dividing the line between the groups. It might be better to replace each "(.*)," with "([^,]*)," or with "(.*?),".

My own instinct would be to use something like:

([^"]*,|"[^"]*",)*

Michael Kay
Saxonica
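
As a rough illustration of the cell-by-cell approach suggested above, here is a minimal sketch in Python rather than XSLT, chosen only so that it is easy to run; the regex itself should carry over to xsl:analyze-string largely unchanged. The function name split_csv_line and the trick of appending a trailing comma are assumptions made for this example, not something from the thread. The unquoted alternative is tightened to [^",]* so that it cannot run across a comma, and doubled-quote escapes ("") inside a quoted cell are not handled.

```python
import re

# One cell plus its trailing comma: either a quoted cell (which may
# contain commas) or an unquoted cell with no commas and no quotes.
# Neither character class can match its own delimiter, so there is
# only one way to match each cell and very little backtracking.
CELL = re.compile(r'"([^"]*)",|([^",]*),')

def split_csv_line(line):
    """Split one CSV line into a list of cell values (hypothetical helper)."""
    text = line + ","  # sentinel comma so the last cell ends like the others
    cells = []
    pos = 0
    while pos < len(text):
        m = CELL.match(text, pos)
        if m is None:
            raise ValueError("malformed CSV near offset %d" % pos)
        # group(1) is set for a quoted cell, group(2) for an unquoted one
        cells.append(m.group(1) if m.group(1) is not None else m.group(2))
        pos = m.end()
    return cells

if __name__ == "__main__":
    print(split_csv_line('a,"b,c",d'))
```

Because every match consumes at least the trailing comma, the pattern never matches a zero-length string, which matters if the same regex is moved into xsl:analyze-string.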