Subject: Re: [xsl] How to tokenize a comma-separated CSV record which has a field containing a string that has commas? From: "Alan Painter alan.painter@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> Date: Fri, 5 Aug 2022 16:11:32 -0000 |
Hi Roger, I've used the following XSLT2 templates and funtions for about 10 years and they have worked for me in a lot of different situations with CSV input, coming from Excel or other, including with TAB and semicolon-separated cells. There are functions that could be called statically rather than dynamically if the listSeparator is known ahead of time, if performance is important. The main simplification comes from adding the listSeparator (comma for a CSV but sometimes semicolon on the continent and even can be TAB) at the end of each line before parsing it. If a CSV column (here called a *Cell*) has a double quote starting it, it must also have a double quote at the end and any embedded quotes are doubled/escaped. The unescape function detects these and does the necessary substring and replacement work. In order to read this from Martin Honnen's excellent xslt fiddle, I've used as input an XML document with a root element *csvText* containing the CSV text from your example: *<csvText>cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE, 3)"</csvText>* The complete XSLT with the templates for parsing is below. Please note that the template matching "/" is only for retrieving the CSV text from the test document above. You'll notice that it's in XSLT2 and could be shorter in XSLT3. You'll also notice that in the case that there is an error in the CSV, notably that a Cell starts with a quote but does not end with a quote, there is an attribute "nonMatching" added to the output Cell. You can check the output Cell elements for an attribute to determine if the CSV input was poorly formatted. Hoping that you find this useful. Best regards -alan <?xml version="1.0" encoding="utf-8"?> <xsl:stylesheet xmlns:xsl = "http://www.w3.org/1999/XSL/Transform" xmlns:f = "local:functions" xmlns:xs = "http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="#all" version="2.0"> <xsl:output method="xml" indent="yes" /> <!-- The following template is just for retrieving the test CSV text from the XML input document. --> <xsl:template match="/" > <xsl:call-template name="readCsvToLines"> <xsl:with-param name="inputText" select="/csvText" /> <xsl:with-param name="listSeparator" select="','" /> </xsl:call-template> </xsl:template> <xsl:template name="readCsvToLines" as="element(Line)*"> <xsl:param name="inputText" as="xs:string?" /> <xsl:param name="listSeparator" as="xs:string" /> <xsl:analyze-string select="$inputText" regex="\r?\n" > <xsl:non-matching-substring> <Line> <xsl:call-template name="readCsvLineToCells"> <xsl:with-param name="inputLine" select="." /> <xsl:with-param name="listSeparator" select="$listSeparator" /> </xsl:call-template> </Line> </xsl:non-matching-substring> </xsl:analyze-string> </xsl:template> <xsl:variable name="QUOTE" as="xs:string" select="'"'" /> <xsl:variable name="TWO_QUOTES" as="xs:string" select="concat($QUOTE, $QUOTE)" /> <xsl:template name="readCsvLineToCells" as="element(Cell)*" > <xsl:param name="inputLine" as="xs:string?" /> <xsl:param name="listSeparator" as="xs:string" /> <xsl:analyze-string select="concat($inputLine, $listSeparator)" regex="{f:buildRegex($listSeparator)}" > <xsl:matching-substring> <Cell> <xsl:value-of select="f:unescapeQuotes(regex-group(1))" /> </Cell> </xsl:matching-substring> <xsl:non-matching-substring> <Cell nonMatching="{concat('Error in CSV format: text did not match: ', .)}" > <xsl:value-of select="concat('Error in CSV format: text did not match: ', .)" /> </Cell> </xsl:non-matching-substring> </xsl:analyze-string> </xsl:template> <xsl:function name="f:buildRegex" as="xs:string" > <xsl:param name="listSeparator" as="xs:string" /> <xsl:variable name="charsNotSepQuote" as="xs:string" select="concat('[^', $QUOTE, $listSeparator, ']*')" /> <xsl:variable name="charsNotQuote" as="xs:string" select="concat('[^', $QUOTE, ']*')" /> <xsl:variable name="charsInQuotes" as="xs:string" select="concat($QUOTE, '(', $charsNotQuote, '|', $TWO_QUOTES, ')*', $QUOTE)" /> <xsl:sequence select="concat('(', $charsNotSepQuote, '|', $charsInQuotes, ')', $listSeparator)" /> </xsl:function> <xsl:function name="f:unescapeQuotes" as="xs:string" > <xsl:param name="rawLine" as="xs:string" /> <xsl:choose> <xsl:when test="starts-with($rawLine, $QUOTE)" > <xsl:variable name="quotesRemoved" as="xs:string" select="substring($rawLine, 2, string-length($rawLine)-2)" /> <xsl:value-of select="replace($quotesRemoved, $TWO_QUOTES, $QUOTE)" /> </xsl:when> <xsl:otherwise> <xsl:value-of select="$rawLine" /> </xsl:otherwise> </xsl:choose> </xsl:function> </xsl:stylesheet> On Thu, Aug 4, 2022 at 8:53 PM Roger L Costello costello@xxxxxxxxx < xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote: > Hi Folks, > > I'm stuck. > > I want to tokenize this: > > airport,AeroPublication/airports/airport,ARPT_IDENT,12,ARPT.TXT,ARPT, > > into these 7 tokens: > > 1. airport > 2. AeroPublication/airports/airport > 3. ARPT_IDENT > 4. 12 > 5. ARPT.TXT > 6. ARPT > 7. '' /* empty string */ > > And tokenize this: > > cycleDate,AeroPublication/airports/airport/cycleDate,CYCLE_DATE,59,ARPT.TXT,ARPT,"substring($ARPT_row/CYCLE__DATE, > 3)" > > into these 7 tokens: > > 1. cycleDate > 2. AeroPublication/airports/airport/cycleDate > 3. CYCLE_DATE > 4. 59 > 5. ARPT.TXT > 6. APRT > 7. substring($ARPT_row/CYCLE__DATE, 3) /* bonus points if you can also > remove the surrounding quote symbols) */ > > Clearly this isn't the solution: > > tokenize(. ',') > > as it erroneously breaks apart the last field (string containing commas). > > Suggestions? > > /Roger
Current Thread |
---|
|
<- Previous | Index | Next -> |
---|---|---|
Re: [xsl] How to tokenize a comma-s, C. M. Sperberg-McQue | Thread | [xsl] [XSLT 2 or 3 - Diacritics] Re, Christophe Marchand |
Re: [xsl] How to tokenize a comma-s, C. M. Sperberg-McQue | Date | [xsl] [XSLT 2 or 3 - Diacritics] Re, Christophe Marchand |
Month |