Re: [xsl] How to tokenize a comma-separated CSV record which has a field containing a string that has commas?

Subject: Re: [xsl] How to tokenize a comma-separated CSV record which has a field containing a string that has commas?
From: "Martin Honnen martin.honnen@xxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 4 Aug 2022 19:45:34 -0000
On 04.08.2022 20:53, Roger L Costello costello@xxxxxxxxx wrote:
> Hi Folks,
>
> I'm stuck.
>
> I want to tokenize this:
>
> airport,AeroPublication/airports/airport,ARPT_IDENT,12,ARPT.TXT,ARPT,
>
> into these 7 tokens:
>
> 1. airport
> 2. AeroPublication/airports/airport
> 3. ARPT_IDENT
> 4. 12
> 5. ARPT.TXT
> 6. ARPT
> 7. ''    /* empty string */
>
> And tokenize this:
>
>
> cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE, 3)"
>
> into these 7 tokens:
>
> 1. cycleDate
> 2. AeroPublication/airports/airport/cycleDate
> 3. CYCLE_DATE
> 4. 59
> 5. ARPT.TXT
> 6. ARPT
> 7. substring($ARPT_row/CYCLE__DATE, 3)   /* bonus points if you can also remove the surrounding quote symbols */
>
> Clearly this isn't the solution:
>
>        tokenize(., ',')
>
> as it erroneously breaks apart the last field (string containing commas).
>
> Suggestions?
>

If you want to do it by splitting or tokenizing on a delimiter with a
regular expression, I think most articles suggest you need a lookahead.
Pure XPath regular expressions do not support lookahead, but it is
easily used in Java, and in Saxon by adding the ';j' flag to select
the Java regex engine, e.g.


tokenize(
  'cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE, 3)"',
  ',(?=(?:[^"]*"[^"]*")*[^"]*$)',
  ';j')


Expression taken from https://www.baeldung.com/java-split-string-commas
explaining that the expression, "using positive lookahead
<https://www.baeldung.com/java-regex-lookahead-lookbehind>, tells to
split around a comma only if there are no double quotes or if there is
an even number of double quotes ahead of it."
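
To check the expression outside XSLT, here is a minimal sketch in
Python, whose re module also supports lookahead. It is not the XPath
call itself; the helper name split_record and the quote-stripping step
(the "bonus points" part of the question) are assumptions made for
illustration.

```python
import re

# Split on a comma only when an even number of double quotes follows it,
# i.e. when the comma sits outside any quoted field (same regex as above).
SPLIT_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_record(record):
    """Split one CSV record; strip one pair of surrounding quotes per field."""
    fields = SPLIT_OUTSIDE_QUOTES.split(record)
    return [
        f[1:-1] if len(f) >= 2 and f.startswith('"') and f.endswith('"') else f
        for f in fields
    ]


plain = 'airport,AeroPublication/airports/airport,ARPT_IDENT,12,ARPT.TXT,ARPT,'
quoted = ('cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,'
          'ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE, 3)"')

print(split_record(plain))   # 7 fields, the last one an empty string
print(split_record(quoted))  # 7 fields, the last one without its quotes
```

The sketch assumes one record per string and balanced quotes. It does
not unescape doubled quotes ("") inside a quoted field, and the
lookahead rescans the rest of the record at every comma, so a real CSV
parser is likely the better choice for large or irregular input. If you
try the same expression with Java's String.split, pass a negative limit
(for example split(regex, -1)); otherwise the trailing empty field of
the first record is dropped.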
