Re: [xsl] How to tokenize a comma-separated CSV record which has a field containing a string that has commas?

Subject: Re: [xsl] How to tokenize a comma-separated CSV record which has a field containing a string that has commas?
From: "Alan Painter alan.painter@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 5 Aug 2022 16:11:32 -0000
Hi Roger,

I've used the following XSLT2 templates and funtions for about 10 years and
they have worked for me in a lot of different situations with CSV input,
coming from Excel or other, including with TAB and semicolon-separated
cells.

There are functions that could be called statically rather than dynamically
if the listSeparator is known ahead of time, if performance is important.

The main simplification comes from adding the listSeparator (comma for a
CSV but sometimes semicolon on the continent and even can be TAB) at the
end of each line before parsing it.

If a CSV column (here called a *Cell*) has a double quote starting it, it
must also have a double quote at the end and any embedded quotes are
doubled/escaped.
The unescape function detects these and does the necessary substring and
replacement work.

In order to read this from Martin Honnen's excellent xslt fiddle, I've used
as input an XML document with a root element *csvText* containing the CSV
text from your example:


*<csvText>cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE,
3)"</csvText>*

The complete XSLT with the templates for parsing is below.  Please note
that the template matching "/" is only for retrieving the CSV text from the
test document above.

You'll notice that it's in XSLT2 and could be shorter in XSLT3.

You'll also notice that in the case that there is an error in the CSV,
notably that a Cell starts with a quote but does not end with a quote,
there is an attribute "nonMatching" added to the output Cell.  You can
check the output Cell elements for an attribute to determine if the CSV
input was poorly formatted.

Hoping that you find this useful.

Best regards

-alan

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
  xmlns:xsl = "http://www.w3.org/1999/XSL/Transform";
  xmlns:f   = "local:functions"
  xmlns:xs  = "http://www.w3.org/2001/XMLSchema";
  exclude-result-prefixes="#all"
  version="2.0">

  <xsl:output method="xml" indent="yes" />

  <!-- The following template is just for retrieving the test CSV text from
the XML input document. -->
  <xsl:template match="/" >
    <xsl:call-template name="readCsvToLines">
      <xsl:with-param name="inputText"     select="/csvText" />
      <xsl:with-param name="listSeparator" select="','" />
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="readCsvToLines" as="element(Line)*">
    <xsl:param name="inputText"     as="xs:string?" />
    <xsl:param name="listSeparator" as="xs:string"  />

    <xsl:analyze-string select="$inputText" regex="\r?\n" >
      <xsl:non-matching-substring>
        <Line>
          <xsl:call-template name="readCsvLineToCells">
            <xsl:with-param name="inputLine"     select="." />
            <xsl:with-param name="listSeparator" select="$listSeparator" />
          </xsl:call-template>
        </Line>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <xsl:variable name="QUOTE"      as="xs:string" select="'&quot;'"
      />
  <xsl:variable name="TWO_QUOTES" as="xs:string" select="concat($QUOTE,
$QUOTE)" />

  <xsl:template name="readCsvLineToCells" as="element(Cell)*" >
    <xsl:param name="inputLine"     as="xs:string?" />
    <xsl:param name="listSeparator" as="xs:string"  />

    <xsl:analyze-string select="concat($inputLine, $listSeparator)"
regex="{f:buildRegex($listSeparator)}" >
      <xsl:matching-substring>
        <Cell>
          <xsl:value-of select="f:unescapeQuotes(regex-group(1))" />
        </Cell>
      </xsl:matching-substring>

      <xsl:non-matching-substring>
        <Cell nonMatching="{concat('Error in CSV format: text did not
match: ', .)}" >
          <xsl:value-of select="concat('Error in CSV format: text did not
match: ', .)" />
        </Cell>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <xsl:function name="f:buildRegex" as="xs:string" >
    <xsl:param name="listSeparator" as="xs:string" />

    <xsl:variable name="charsNotSepQuote" as="xs:string"
select="concat('[^', $QUOTE, $listSeparator, ']*')" />
    <xsl:variable name="charsNotQuote"    as="xs:string"
select="concat('[^', $QUOTE,                 ']*')" />
    <xsl:variable name="charsInQuotes"    as="xs:string"
select="concat($QUOTE, '(', $charsNotQuote, '|', $TWO_QUOTES, ')*',
$QUOTE)" />

    <xsl:sequence select="concat('(', $charsNotSepQuote, '|',
$charsInQuotes, ')', $listSeparator)" />
  </xsl:function>

  <xsl:function name="f:unescapeQuotes" as="xs:string" >
    <xsl:param name="rawLine" as="xs:string" />

    <xsl:choose>

    <xsl:when test="starts-with($rawLine, $QUOTE)" >
      <xsl:variable name="quotesRemoved" as="xs:string"
                    select="substring($rawLine, 2,
string-length($rawLine)-2)" />
      <xsl:value-of select="replace($quotesRemoved, $TWO_QUOTES, $QUOTE)" />
    </xsl:when>

    <xsl:otherwise>
      <xsl:value-of select="$rawLine" />
    </xsl:otherwise>

    </xsl:choose>
  </xsl:function>

</xsl:stylesheet>



On Thu, Aug 4, 2022 at 8:53 PM Roger L Costello costello@xxxxxxxxx <
xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> Hi Folks,
>
> I'm stuck.
>
> I want to tokenize this:
>
> airport,AeroPublication/airports/airport,ARPT_IDENT,12,ARPT.TXT,ARPT,
>
> into these 7 tokens:
>
> 1. airport
> 2. AeroPublication/airports/airport
> 3. ARPT_IDENT
> 4. 12
> 5. ARPT.TXT
> 6. ARPT
> 7. ''    /* empty string */
>
> And tokenize this:
>
> cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE,
> 3)"
>
> into these 7 tokens:
>
> 1. cycleDate
> 2. AeroPublication/airports/airport/cycleDate
> 3. CYCLE_DATE
> 4. 59
> 5. ARPT.TXT
> 6. APRT
> 7. substring($ARPT_row/CYCLE__DATE, 3)   /* bonus points if you can also
> remove the surrounding quote symbols) */
>
> Clearly this isn't the solution:
>
>       tokenize(. ',')
>
> as it erroneously breaks apart the last field (string containing commas).
>
> Suggestions?
>
> /Roger

Current Thread