Re: [xsl] How to tokenize a comma-separated CSV record which has a field containing a string that has commas?

Subject: Re: [xsl] How to tokenize a comma-separated CSV record which has a field containing a string that has commas?
From: "C. M. Sperberg-McQueen cmsmcq@xxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 4 Aug 2022 21:32:17 -0000
"Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> writes:

> A good case for Invisible XML (topic of the week at Balisage), but
> I'll leave someone else to flesh it out.

Invitation accepted.  

>  On 04.08.2022 20:53, Roger L Costello costello@xxxxxxxxx wrote:

> ... I want to tokenize this:

> airport,AeroPublication/airports/airport,ARPT_IDENT,12,ARPT.TXT,ARPT,

> into these 7 tokens:

> 1. airport
> 2. AeroPublication/airports/airport
> 3. ARPT_IDENT
> 4. 12
> 5. ARPT.TXT
> 6. ARPT
> 7. ''    /* empty string */

> And tokenize this:

> cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE, 3)"

> into these 7 tokens:

> 1. cycleDate
> 2. AeroPublication/airports/airport/cycleDate
> 3. CYCLE_DATE
> 4. 59
> 5. ARPT.TXT
> 6. ARPT
> 7. substring($ARPT_row/CYCLE__DATE, 3)   /* bonus points if you can also remove the surrounding quote symbols */

> Clearly this isn't the solution:
>
>       tokenize(., ',')
>
> as it erroneously breaks apart the last field (string containing commas).
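
To make the failure concrete, here is a small check.  It is written in
Python purely for illustration (the question itself concerns XPath's
tokenize()), and the variable names are my own; splitting on every
comma behaves the same way in either language.

```python
# The second sample record from the question, copied verbatim.
record = ('cycleDate,AeroPublication/airports/airport/cycleDate,'
          'CYCLE_DATE,59,ARPT.TXT,ARPT,'
          '"substring($ARPT_row/CYCLE__DATE, 3)"')

# Splitting on every comma ignores the quotation marks entirely.
naive = record.split(",")

print(len(naive))    # more pieces than the 7 fields wanted
print(naive[-2:])    # the quoted field, cut in two at its inner comma
```

The last two pieces are the two halves of the quoted field, each still
carrying one of the quotation marks.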

Depending on the data sources producing the CSV you need to parse, there
is likely to be an indefinitely long series of complications, so take
what follows with a grain of salt.  But for comma-separated values in
which every value is either an unquoted string containing no commas and
no newlines, or a double-quoted string (possibly containing commas,
newlines, and/or doubled double quotation marks), an Invisible XML
grammar like this will produce the tokenization you describe:

         data: line ++ newline.
         line: field ++ -','.
        field: quoted; unquoted.
      -quoted:  -#22, (~[#22]; (-#22, #22))*, -#22.
    -unquoted: null; ~[#22; ","; #A; #D], ~[","; #A; #D]*.
        -null: {} +"''". 
     -newline: #D?, #A.
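
For readers without an ixml processor at hand, the rules in the grammar
can be cross-checked with a hand-written scanner.  What follows is only
a sketch in Python, not ixml and not the output of any ixml tool: the
function name tokenize_csv is made up for illustration, and it returns
lists of strings rather than the XML an ixml parse would produce.

```python
def tokenize_csv(text):
    """Split CSV text into lines of fields, mirroring the grammar's rules."""
    lines, fields = [], []
    i, n = 0, len(text)
    while True:
        if i < n and text[i] == '"':
            # Quoted field: the opening quotation mark is dropped.
            i += 1
            chars = []
            while True:
                if i >= n:
                    raise ValueError("unterminated quoted field")
                if text[i] == '"':
                    if text[i + 1:i + 2] == '"':
                        chars.append('"')   # doubled quote yields one quote
                        i += 2
                        continue
                    i += 1                  # closing quotation mark is dropped
                    break
                chars.append(text[i])
                i += 1
            value = "".join(chars)
        else:
            # Unquoted field: runs up to the next comma, CR, or LF.
            start = i
            while i < n and text[i] not in ",\r\n":
                i += 1
            # An empty unquoted field is reported as two apostrophes.
            value = text[start:i] or "''"
        fields.append(value)

        if i < n and text[i] == ",":
            i += 1                          # comma: another field follows
            continue
        lines.append(fields)                # otherwise the line is complete
        fields = []
        if i >= n:
            return lines
        if text.startswith("\r\n", i):
            i += 2
        elif text[i] == "\n":
            i += 1
        else:
            raise ValueError(f"unexpected character at offset {i}")


sample = ('airport,AeroPublication/airports/airport,ARPT_IDENT,12,ARPT.TXT,ARPT,\n'
          'cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,'
          'ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE, 3)"')
for line in tokenize_csv(sample):
    print(line)
```

Because lines are separated (not terminated) by newlines, input ending
in a newline yields one extra line holding a single null field, both in
this sketch and, as far as I can see, in the grammar.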

Or, with comments and explanations: the incoming data stream to be
parsed is a sequence of one or more lines separated by newlines (see
below).

         data: line ++ newline.

A line is one or more fields separated by commas.  An empty line is thus
understood as containing one empty field.

         line: field ++ -','.

A field is either a quoted field or an unquoted field.

        field: quoted; unquoted.

A quoted field begins and ends with a quotation mark (which, for
legibility and to spare my Emacs mode the trauma of unmarked quotation
marks, I write using its hex value: double quotation mark is U+0022,
which we can write #22), and in the middle contains zero or more
occurrences of

  - any character which is not a double quotation mark  
  - a pair of double quotation marks, only one of which is copied to the
    output

The double quotation marks at the beginning and ending of the quoted
field are suppressed (by writing a minus sign in front of the '#22').

      -quoted:  -#22, (~[#22]; (-#22, #22))*, -#22.
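
The same rule can be approximated with a regular expression, which may
help anyone more at home with regexes than with ixml.  This is a sketch
in Python under my own naming (QUOTED and read_quoted are not part of
any library): the pattern plays the role of the rule's body, and the
replace() call plays the role of the minus sign that drops one quotation
mark of each doubled pair.

```python
import re

# Opening quote, then any run of non-quote characters or doubled
# quotes, then the closing quote.  Group 1 is the field content.
QUOTED = re.compile(r'"((?:[^"]|"")*)"')


def read_quoted(text, pos=0):
    """Read a quoted field starting at text[pos]; return (value, next_pos)."""
    m = QUOTED.match(text, pos)
    if m is None:
        raise ValueError("no quoted field at this position")
    # Collapse each doubled quotation mark to a single one.
    return m.group(1).replace('""', '"'), m.end()


value, next_pos = read_quoted('"a ""b"", c",x')
print(value)       # the field content with outer quotes removed
print(next_pos)    # offset of the comma that follows the field
```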

An unquoted field is either a null field (see below) or a sequence of
one or more characters which can be any character other than a comma, a
carriage return, or a linefeed; the first character must not be a double
quotation mark.

    -unquoted: null; ~[#22; ","; #A; #D], ~[","; #A; #D]*.
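
The non-null branch of this rule corresponds closely to a two-part
character-class pattern.  The Python sketch below is an illustration
only; the name UNQUOTED is mine, and the null branch is deliberately
left out since it is handled by its own rule.

```python
import re

# First character: anything but a quotation mark, comma, CR, or LF.
# Remaining characters: anything but a comma, CR, or LF.
UNQUOTED = re.compile(r'[^",\r\n][^,\r\n]*')

for candidate in ['ARPT.TXT', 'say "hi"', '"x', 'a,b', '']:
    ok = UNQUOTED.fullmatch(candidate) is not None
    print(repr(candidate), ok)
```

Note that a quotation mark is rejected only in first position; later in
the field it is an ordinary character.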

The description of the desired output shows an empty field being
tokenized as a pair of apostrophes; that seems unusual, but if that's
what is required, we can produce it.  We do so by distinguishing null
fields as a special subcase of unquoted fields.  In the input, a null
field is the empty string.  When we recognize one, we insert two
apostrophes into the output.

        -null: {} +"''". 

A newline is either a linefeed or a carriage-return + linefeed pair.

     -newline: #D?, #A.

It would be great, I think, if anyone could take the time to show how a
Daffodil (DFDL) processor could be used for this tokenization task.


-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
