Subject: Re: [xsl] How to tokenize a comma-separated CSV record which has a field containing a string that has commas?
From: "C. M. Sperberg-McQueen cmsmcq@xxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 4 Aug 2022 21:32:17 -0000
"Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> writes: > A good case for Invisible XML (topic of the week at Balisage), but > I'll leave someone else to flesh it out. Invitation accepted. > On 04.08.2022 20:53, Roger L Costello costello@xxxxxxxxx wrote: > ... I want to tokenize this: > airport,AeroPublication/airports/airport,ARPT_IDENT,12,ARPT.TXT,ARPT, > into these 7 tokens: > 1. airport > 2. AeroPublication/airports/airport > 3. ARPT_IDENT > 4. 12 > 5. ARPT.TXT > 6. ARPT > 7. '' /* empty string */ > And tokenize this: > cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE, 3)" > into these 7 tokens: > 1. cycleDate > 2. AeroPublication/airports/airport/cycleDate > 3. CYCLE_DATE > 4. 59 > 5. ARPT.TXT > 6. APRT > 7. substring($ARPT_row/CYCLE__DATE, 3) /* bonus points if you can also remove the surrounding quote symbols) */ > Clearly this isn't the solution: > > tokenize(. ',') > > as it erroneously breaks apart the last field (string containing commas). Depending on the data sources producing the CSV you need to parse, there is likely to be an indefinitely long series of complications, so take what follows with a grain of salt. But for comma-separated values in which every value is either an unquoted string containing no commas and no newlines, or a double-quoted string possibly containing commas, newlines, and/or doubled double quotation marks), a grammar like this will produce the tokenization you describe: data: line ++ newline. line: field ++ -','. field: quoted; unquoted. -quoted: -#22, (~[#22]; (-#22, #22))*, -#22. -unquoted: null; ~[#22; ","; #A; #D], ~[","; #A; #D]*. -null: {} +"''". -newline: #D?, #A. Or, with comments and explanations: the incoming data stream to be parsed is a sequence of one or more lines separated by newlines (see below). data: line ++ newline. A line is one or more fields separated by commas. 
An empty line is thus understood as containing one empty field. line: field ++ -','. A field is either a quoted field or an unquoted field. field: quoted; unquoted. A quoted field begins and ends with a quotation mark (which, for legibility and to spare my Emacs mode the trauma of unmarked quotation marks, I write using its hex value: double quotation mark is U+0022, which we can write #22), and in the middle contains by zero or more occurrences of - any character which is not a double quotation mark - a pair of double quotation marks, only one of which is copied to the output The double quotation marks at the beginning and ending of the quoted field are suppressed (by writing a minus sign in front of the '#22'). -quoted: -#22, (~[#22]; (-#22, #22))*, -#22. An unquoted field is either a null field (see below) or a sequence of one or more characters which can be any character other than a comma, a carriage return, or a linefeed; the first character must not be a double quotation mark. -unquoted: null; ~[#22; ","; #A; #D], ~[","; #A; #D]*. The description of the desired output shows an empty field being tokenized as a pair of apostrophes; that seems unusual, but if that's what is required, we can produce it. We do so by distinguishing null fields as a special subcase of unquoted fields. In the input, a null field is the empty string. When we recognize one, we insert two apostrophes into the output. -null: {} +"''". A newline is either a linefeed or a carriage-return + linefeed pair. -newline: #D?, #A. It would be great, I think, if anyone could take the time to show how a Daffodil (DFDL) processor could be used for this tokenization task. -- C. M. Sperberg-McQueen Black Mesa Technologies LLC http://blackmesatech.com
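
For readers without an ixml processor at hand, the field rules of the
grammar can be approximated by hand. What follows is a minimal Python
sketch, not an ixml implementation: the function name tokenize_record
is hypothetical, and it assumes the caller has already isolated a
single record (so the data and newline rules are not covered). It
drops the outer quotation marks of a quoted field, collapses each
doubled quotation mark to one, and emits two apostrophes for an empty
unquoted field, as the null rule does.

```python
def tokenize_record(record: str) -> list[str]:
    """Split one CSV record into fields (hypothetical helper, see lead-in)."""
    fields = []
    i, n = 0, len(record)
    while True:
        if i < n and record[i] == '"':
            # Quoted field: skip the opening quote, then copy characters
            # until the closing quote; a doubled quote yields one quote.
            i += 1
            buf = []
            while True:
                if i >= n:
                    raise ValueError("unterminated quoted field")
                if record[i] == '"':
                    if i + 1 < n and record[i + 1] == '"':
                        buf.append('"')
                        i += 2
                    else:
                        i += 1  # closing quote, not copied
                        break
                else:
                    buf.append(record[i])
                    i += 1
            fields.append("".join(buf))
        else:
            # Unquoted field: everything up to the next comma (or the end).
            j = record.find(",", i)
            if j == -1:
                j = n
            text = record[i:j]
            # Mirror the grammar's null rule: an empty field becomes ''.
            fields.append(text if text else "''")
            i = j
        if i == n:
            return fields
        if record[i] != ",":
            raise ValueError(f"expected comma at offset {i}")
        i += 1  # skip the separator; a trailing comma yields a final null field


if __name__ == "__main__":
    print(tokenize_record(
        "airport,AeroPublication/airports/airport,ARPT_IDENT,12,ARPT.TXT,ARPT,"
    ))
```

Note that, as in the grammar, only an empty unquoted field is replaced
by two apostrophes; an empty quoted field ("") yields an empty string.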