Andrew Welch wrote:
er, it's from me :)
Ah, well, it is just that I saw you most often on the answering side ;)
and I didn't take the time to lookup the OP, I just broke into the thread.
replace($input, ' ^"|"$" ', '', 'x')
which will remove opening/closing quotes.
Cool, thanks. In csv quotes in quoted values are escaped by doubling
the quotes, so just removing the opening/closing quotes is the way to
go. (In my specific requirement though I can guarantee there's no
escaped quotes so translate would've been fine)
About the regex: the 'x' is not necessary if you remove my redundant
whitespace in it.
Here's a resource on regexes: http://www.regular-expressions.info/.
I've seen some regexes coming by on parsing the CSV format, to which
this comes close. The RFC-4180 is an official description of it, though
what I understand, your input is not that strict.
But I feel inclined to add something about parsing CSV. As you pointed
out, it looks simple, but in fact is not. Some solutions passed already,
but I think that one of the easiest solutions I found, was one from Owen
Rees, a couple of months back, in response to an exponentially
performing regex (which is ugly trap to fall in and depends highly on
On a side note, one of the last messages in this thread contained the
following regex, though it's a beauty and a correct regex, it cannot be
used in XSLT because it matches an empty string, which is not allowed:
Instead of thinking in matching everything at once, you can make it
easier by calling both matching-substring and non-matching-substring to
your aid (too often people tend to do everything in matching-substring):
<elem>"foo, bar", baz, "", bom, """", "foo "","" quoted"</elem>
Must become (note the empty one and the singled quotes):
<token>foo "," quoted</token>
all we have to do is remove the trailing/leading quotes and make the ""
into a single quote, i.e.:
The core regex (by Owen Rees): ("[^"]*")+
regex (not that easy to follow): ^"|"$|(")"
in this replace: replace(., "^""|""$|("")""", "$1")
will do the trick.
What is not matched, is part of your non-quoted fields. The rest becomes
easy. Non-quoted CSV can be parsed by using tokenize():
You may want to get rid of the whitespace that is trailing/leading on a
field, but not when quoted (in which case it should stay). Change the
Altogether this becomes:
<xsl:analyze-string regex='("[^"]*")+' select="$input/elem" >
<xsl:value-of select='replace(., "^""|""$|("")""", "$1")'/>
<xsl:for-each select="tokenize(., '\s*,\s*')">
This will introduce some empty nodes, but there are ample ways to get
rid of them. To only remove empty fields that are introduced by the
rather rudimentary nature of the above, replace the tokenize with this:
tokenize(replace(., '^?\s*,\s*$?', ''), '\s*,\s*')
but that won't add to readability of course (but hey, these are regexes,
there beauty is in their power, not in their looks). Other empty nodes
will be the result of tokens that actually *are* there and should be empty.
I know you didn't ask for a full RFC covering (it works with quoted
newlines, too), but since I had it on the shelf, I thought I'd share it
anyway. And you simplify it by removing the bits that involve double
A side benefit, it is tested with faulty input and 'correctly' deals
with it. This means: you can't know for sure what it does, but it does
it fast and it won't run into an exponential performance.
-- Abel Braaksma