[xsl] Using xsl:analyze-string and regex to parse long lines with white-space

Subject: [xsl] Using xsl:analyze-string and regex to parse long lines with white-space
From: Rob Newman <rlnewman@xxxxxxxx>
Date: Tue, 19 Jun 2007 11:38:54 -0700
Hi All,

I have an input file "input.xml":

input.xml
-------------
<pfarr>
<pfstring name="dlsite">
q330 0000 345 1169760599.99999 TA_D03A 921 47 -123 0.0325 regular internet hosted 1172293472.07035
q330 0123 234 9999999999.99900 TA_HAST 1005 36 -121 0.5558 regular internet hosted 1172293966.53652
q330 0234 123 1157317200.00000 TA_U04C 718 36 -120 0.7886 vsat spacenet 1172298386.07728
</pfstring>
</pfarr>


I am trying to parse the contents of <pfstring> to extract, for each line, the 5th column ("TA_D03A" in the example), the 10th ("regular internet") and the 11th ("hosted"), and write them to "output.xml" like this:

output.xml
---------------
<dlsites>
	<site name="TA_D03A">
		<comt>regular internet</comt>
		<comp>hosted</comp>
	</site>
	<site name="TA_HAST">
		<comt>regular internet</comt>
		<comp>hosted</comp>
	</site>
	<site name="TA_U04C">
		<comt>vsat</comt>
		<comp>spacenet</comp>
	</site>
</dlsites>

Each entry in input.xml/pfarr/pfstring is on a new line. I am trying to use the regex functions and have the following, but it does not seem to be working:

transform.xsl
-----------------
<?xml version="1.0" encoding="ISO-8859-1"?>

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" />


<xsl:template match="/">
    <dlsites>
        <xsl:apply-templates select="/pfarr/pfstring" />
    </dlsites>
</xsl:template>

<xsl:template match="pfstring[@name = 'dlsite']">
    <xsl:variable name="elValue" select="." />

<xsl:analyze-string select="$elValue" regex="\s*(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+\n">

        <xsl:matching-substring>
            <xsl:variable name="dlname" select="regex-group(5)" />
            <site name="{@dlname}">
                <comt><xsl:value-of select="regex-group(10)"/></comt>
                <comp><xsl:value-of select="regex-group(11)"/></comp>
            </site>
        </xsl:matching-substring>

        <xsl:non-matching-substring>
            <unknown>
                <xsl:value-of select="$elValue"/>
            </unknown>
        </xsl:non-matching-substring>

</xsl:analyze-string>

</xsl:template>

</xsl:stylesheet>

Is this the most efficient way of processing this type of file? It is highly likely that I have something wrong in the regex section - any pointers would be appreciated. The XSLT processor I am using is Saxon 8.9J.
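
For comparison, here is a minimal sketch of one way the line-by-line matching might be written. It is not a definitive fix. It assumes that fields are separated by spaces or tabs, that the first nine fields and the last two contain no embedded whitespace, and that only the 10th field (e.g. "regular internet") may contain spaces. It relies on the "m" flag so that ^ and $ anchor at each line, on \S+ instead of the greedy (.*), and on the reluctant (.+?) for the 10th field. It also reads the site name from regex-group(5) directly rather than from an attribute reference such as {@dlname}.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" indent="yes" />

<xsl:template match="/">
    <dlsites>
        <xsl:apply-templates select="/pfarr/pfstring[@name = 'dlsite']" />
    </dlsites>
</xsl:template>

<xsl:template match="pfstring[@name = 'dlsite']">
    <!-- flags="m": ^ and $ match at the start and end of every line.
         Groups 1-9 and 11-12 are single whitespace-free tokens;
         group 10 is reluctant, so it takes everything up to the last
         two tokens. ASSUMPTION: only the 10th field may contain spaces. -->
    <xsl:analyze-string select="." flags="m"
        regex="^[ \t]*(\S+)[ \t]+(\S+)[ \t]+(\S+)[ \t]+(\S+)[ \t]+(\S+)[ \t]+(\S+)[ \t]+(\S+)[ \t]+(\S+)[ \t]+(\S+)[ \t]+(.+?)[ \t]+(\S+)[ \t]+(\S+)[ \t]*$">
        <xsl:matching-substring>
            <!-- Group numbers correspond to column numbers in the input line -->
            <site name="{regex-group(5)}">
                <comt><xsl:value-of select="regex-group(10)"/></comt>
                <comp><xsl:value-of select="regex-group(11)"/></comp>
            </site>
        </xsl:matching-substring>
        <!-- No xsl:non-matching-substring: the newlines between lines,
             and any line that does not fit the pattern, are dropped. -->
    </xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>
```

Under those assumptions, each of the three sample lines should yield one <site> element. Because the regex attribute is an attribute value template, the pattern deliberately avoids curly-brace quantifiers such as {4}, which would need to be written as {{4}}.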

Thanks in advance!
- Rob Newman
