Re: [xsl] tokenizing comma separated string with quotes

Subject: Re: [xsl] tokenizing comma separated string with quotes
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Thu, 22 Feb 2007 17:20:31 +0100
Andrew Welch wrote:

er, it's from me :)

Ah, well, it is just that I saw you most often on the answering side ;) and I didn't take the time to lookup the OP, I just broke into the thread.



replace($input, ' ^"|"$" ', '', 'x')

which will remove opening/closing quotes.

Cool, thanks. In csv quotes in quoted values are escaped by doubling the quotes, so just removing the opening/closing quotes is the way to go. (In my specific requirement though I can guarantee there's no escaped quotes so translate would've been fine)

About the regex: the 'x' is not necessary if you remove my redundant whitespace in it.


Here's a resource on regexes: http://www.regular-expressions.info/.

I've seen some regexes coming by on parsing the CSV format, to which this comes close. The RFC-4180 is an official description of it, though what I understand, your input is not that strict.

But I feel inclined to add something about parsing CSV. As you pointed out, it looks simple, but in fact is not. Some solutions passed already, but I think that one of the easiest solutions I found, was one from Owen Rees, a couple of months back, in response to an exponentially performing regex (which is ugly trap to fall in and depends highly on the implementation).

On a side note, one of the last messages in this thread contained the following regex, though it's a beauty and a correct regex, it cannot be used in XSLT because it matches an empty string, which is not allowed:
\s*"?([^"]*"?|[^,]+)\s*



Instead of thinking in matching everything at once, you can make it easier by calling both matching-substring and non-matching-substring to your aid (too often people tend to do everything in matching-substring):


Input file:
<elem>"foo, bar", baz, "", bom, """", "foo "","" quoted"</elem>

Must become (note the empty one and the singled quotes):
<token>foo, bar</token>
<token>baz</token>
<token/>
<token>bom</token>
<token>"</token>
<token>foo "," quoted</token>


The core regex (by Owen Rees): ("[^"]*")+ this matches: "foo, bar" "" """" "foo"","" quoted"

all we have to do is remove the trailing/leading quotes and make the "" into a single quote, i.e.:
regex (not that easy to follow): ^"|"$|(")"
in this replace: replace(., "^""|""$|("")""", "$1")
will do the trick.


What is not matched, is part of your non-quoted fields. The rest becomes easy. Non-quoted CSV can be parsed by using tokenize():
tokenize($input, ',')


You may want to get rid of the whitespace that is trailing/leading on a field, but not when quoted (in which case it should stay). Change the tokenizer such:
tokenize($input, '\s*,\s*')


Altogether this becomes:

<xsl:analyze-string regex='("[^"]*")+' select="$input/elem[1]" >
   <xsl:matching-substring>
       <token>
           <xsl:value-of select='replace(., "^""|""$|("")""", "$1")'/>
       </token>
   </xsl:matching-substring>
   <xsl:non-matching-substring>
       <xsl:for-each select="tokenize(., '\s*,\s*')">
           <token><xsl:value-of select="."/></token>
       </xsl:for-each>
   </xsl:non-matching-substring>
</xsl:analyze-string>


This will introduce some empty nodes, but there are ample ways to get rid of them. To only remove empty fields that are introduced by the rather rudimentary nature of the above, replace the tokenize with this:


tokenize(replace(., '^?\s*,\s*$?', ''), '\s*,\s*')

but that won't add to readability of course (but hey, these are regexes, there beauty is in their power, not in their looks). Other empty nodes will be the result of tokens that actually *are* there and should be empty.

I know you didn't ask for a full RFC covering (it works with quoted newlines, too), but since I had it on the shelf, I thought I'd share it anyway. And you simplify it by removing the bits that involve double quote escapes.

A side benefit, it is tested with faulty input and 'correctly' deals with it. This means: you can't know for sure what it does, but it does it fast and it won't run into an exponential performance.

Cheers,
-- Abel Braaksma

Current Thread