Re: [xsl] How to tokenize a comma-separated CSV record which has a field containing a string that has commas?

Subject: Re: [xsl] How to tokenize a comma-separated CSV record which has a field containing a string that has commas?
From: "Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 4 Aug 2022 19:59:32 -0000
A good case for Invisible XML (topic of the week at Balisage), but I'll leave
someone else to flesh it out.

Michael Kay
Saxonica

> On 4 Aug 2022, at 20:45, Martin Honnen martin.honnen@xxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On 04.08.2022 20:53, Roger L Costello costello@xxxxxxxxx wrote:
>> Hi Folks,
>>
>> I'm stuck.
>>
>> I want to tokenize this:
>>
>> airport,AeroPublication/airports/airport,ARPT_IDENT,12,ARPT.TXT,ARPT,
>>
>> into these 7 tokens:
>>
>> 1. airport
>> 2. AeroPublication/airports/airport
>> 3. ARPT_IDENT
>> 4. 12
>> 5. ARPT.TXT
>> 6. ARPT
>> 7. ''    /* empty string */
>>
>> And tokenize this:
>>
>>
>> cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE, 3)"
>>
>> into these 7 tokens:
>>
>> 1. cycleDate
>> 2. AeroPublication/airports/airport/cycleDate
>> 3. CYCLE_DATE
>> 4. 59
>> 5. ARPT.TXT
>> 6. ARPT
>> 7. substring($ARPT_row/CYCLE__DATE, 3)   /* bonus points if you can also remove the surrounding quote symbols */
>>
>> Clearly this isn't the solution:
>>
>>       tokenize(., ',')
>>
>> as it erroneously breaks apart the last field (string containing commas).
>>
>> Suggestions?
>>
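To make the failure concrete, here is a minimal Python sketch (Python rather than XPath, purely so it can be run standalone; the variable names are illustrative and not from the thread) that splits the second record on every comma. It yields 8 tokens instead of 7, because the comma inside the quoted field is treated as a delimiter.

```python
# The second record from the question, with its quoted last field.
record = ('cycleDate,AeroPublication/airports/airport/cycleDate,'
          'CYCLE_DATE,59,ARPT.TXT,ARPT,'
          '"substring($ARPT_row/CYCLE__DATE, 3)"')

# Naive approach: split on every comma, quoted or not.
naive_tokens = record.split(',')

print(len(naive_tokens))   # 8, not the 7 fields we want
print(naive_tokens[-2:])   # ['"substring($ARPT_row/CYCLE__DATE', ' 3)"']
```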
>
> If you want to do it by regular-expression splitting, or by tokenizing on a
> delimiter, I think most articles suggest you need a lookahead. That is not
> supported by pure XPath regular expressions, but it is easy to use in Saxon,
> whose ';j' flag switches to Java regular expressions, e.g.
>
>
>
>
> tokenize('cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE, 3)"',
>          ',(?=(?:[^"]*"[^"]*")*[^"]*$)', ';j')
>
>
>
> Expression taken from https://www.baeldung.com/java-split-string-commas,
> which explains: "using positive lookahead
> <https://www.baeldung.com/java-regex-lookahead-lookbehind>, tells to split
> around a comma only if there are no double quotes or if there is an even
> number of double quotes ahead of it."
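The same lookahead idea can be tried outside Saxon. The following is a rough Python sketch, not the XPath call above: it assumes Python's `re` module (which supports lookahead), and the names `SPLIT_OUTSIDE_QUOTES` and `tokenize_csv_record` are made up for illustration. It also strips one pair of surrounding quotes from each field, which the regex alone does not do.

```python
import re

# Split on a comma only when an even number of double quotes follows it,
# i.e. when the comma is not inside a quoted field.
SPLIT_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def tokenize_csv_record(record):
    """Split one record on unquoted commas and drop surrounding quotes."""
    fields = SPLIT_OUTSIDE_QUOTES.split(record)
    return [
        f[1:-1] if len(f) >= 2 and f.startswith('"') and f.endswith('"') else f
        for f in fields
    ]


first = 'airport,AeroPublication/airports/airport,ARPT_IDENT,12,ARPT.TXT,ARPT,'
second = ('cycleDate,AeroPublication/airports/airport/cycleDate,'
          'CYCLE_DATE,59,ARPT.TXT,ARPT,'
          '"substring($ARPT_row/CYCLE__DATE, 3)"')

print(tokenize_csv_record(first))        # 7 tokens, the last one is ''
print(tokenize_csv_record(second)[-1])   # substring($ARPT_row/CYCLE__DATE, 3)
```

This assumes the quotes in a record are balanced, and it does not unescape doubled quotes ("") inside a field. For general CSV input a real CSV parser, such as Python's `csv` module, is likely the safer choice.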