Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)

I wrote:
> I'll think some more...

And of course had an idea immediately I went to bed, and therefore
couldn't sleep...

In XSLT, you *select* a bunch of nodes to process, the processor goes
through them one by one, and you have templates that *match* those
nodes and provide whatever output you want for them. This has proved a
very flexible way of going about things, especially in cases where you
have deeply nested, unpredictable structures.

So to deal with strings that have deeply nested, unpredictable
structures (such as David's example), perhaps that same kind of
approach would work. You need a way of selecting a sequence of strings
and applying templates to them, where the templates have regular
expression patterns. Something along the lines of:

  <!-- Category: instruction -->
  <xsl:apply-regexp-templates
    select = string-sequence-expression
    mode   = qname>
    <!-- Content: (xsl:sort | xsl:with-param)* -->
  </xsl:apply-regexp-templates>

You also need something for declaring regular expression templates
that match those strings. Something along the lines of:

  <!-- Category: declaration -->
  <xsl:regexp-template
    match    = regular-expression
    priority = number
    mode     = qname>
    <!-- Content: (xsl:param*, content-constructor) -->
  </xsl:regexp-template>

When you use xsl:apply-regexp-templates, the processor goes through
the string sequence one string at a time, in the sorted order (or the
original order if there's no xsl:sort). For each string it finds the
highest-priority template in the given mode that matches the entire
string, and uses that to create content. Note that there are no
implied priorities, so you have to use the priority attribute if a
string might match more than one template. Modes and parameters work
in the usual way.

Within the xsl:regexp-template element, the context item is the string
that's matched by the template; the context position is its position
within the (sorted) string sequence to which regexp templates were
applied; the context size is the length of that sequence.

The evaluation context includes a current match, which is a sequence
of strings - the subexpressions from the match regular expression. You
can retrieve this sequence using the current-match() function.

[Or something along those lines - there are lots of possibilities
 for how you get hold of that information.]

Taking a simple example:

  <xsl:apply-regexp-templates select="'13/1/02'" mode="date" />

The processor applies templates to the date; there are multiple
templates in date mode (for different date formats), but the one that
matches with the highest priority is:

<xsl:regexp-template match="([0-9]{1,2})/([0-9]{1,2})/([0-9]{2})"
                     mode="date">
  <xsl:variable name="day"
                select="format-number(current-match()[1], '00')" />
  <xsl:variable name="month"
                select="format-number(current-match()[2], '00')" />
  <xsl:variable name="year"
                select="if (current-match()[3] > 30)
                        then (current-match()[3] + 1900)
                        else (current-match()[3] + 2000)" />
  <xsl:value-of select="($year, $month, $day)"
                separator="-" />
</xsl:regexp-template>
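
To make the dispatch rules concrete, here is a rough Python sketch of
how a processor might behave. It is only an illustration of the idea:
the RegexpTemplates class, its method names and the fallback template
are all made up for this sketch, and skipping strings that match no
template is an assumption (the proposal above doesn't say what should
happen in that case).

```python
import re


class RegexpTemplates:
    """Hypothetical stand-in for a set of xsl:regexp-template declarations."""

    def __init__(self):
        self._templates = []  # (priority, compiled regexp, mode, body)

    def template(self, match, priority=0.0, mode=None):
        """Register a template body, like declaring xsl:regexp-template."""
        def register(body):
            self._templates.append((priority, re.compile(match), mode, body))
            return body
        return register

    def apply(self, strings, mode=None):
        """Rough equivalent of xsl:apply-regexp-templates."""
        strings = list(strings)
        results = []
        for position, string in enumerate(strings, start=1):
            candidates = []
            for priority, regexp, template_mode, body in self._templates:
                if template_mode != mode:
                    continue
                found = regexp.fullmatch(string)  # must match the entire string
                if found:
                    candidates.append((priority, found, body))
            if not candidates:
                continue  # assumption: unmatched strings produce nothing
            _, found, body = max(candidates, key=lambda c: c[0])
            # current_match plays the role of current-match(); position and
            # len(strings) play the role of context position and size.
            results.append(body(string, list(found.groups()),
                                position, len(strings)))
        return results


templates = RegexpTemplates()


@templates.template(r'([0-9]{1,2})/([0-9]{1,2})/([0-9]{2})', mode='date')
def iso_date(string, current_match, position, size):
    day, month, year = (int(part) for part in current_match)
    year += 1900 if year > 30 else 2000
    return f'{year}-{month:02d}-{day:02d}'


@templates.template(r'.*', priority=-1, mode='date')
def fallback(string, current_match, position, size):
    return f'unrecognised: {string}'


print(templates.apply(['13/1/02', 'tomorrow'], mode='date'))
# ['2002-01-13', 'unrecognised: tomorrow']
```

Both templates match '13/1/02', so the explicit priority on the
fallback is what makes the date template win.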


To supplement the template pattern, there should be an instruction
that merges the xsl:apply-regexp-templates and the
xsl:regexp-template:

  <!-- Category: instruction -->
  <xsl:match
    select = string-expression
    regexp = regular-expression>
    <!-- Content: (xsl:sort*, content-constructor) -->
  </xsl:match>

For simple cases like the above, this allows you to just do:

  <xsl:match select="'13/1/02'"
             regexp="([0-9]{1,2})/([0-9]{1,2})/([0-9]{2})">
    <xsl:variable name="day"
                  select="format-number(current-match()[1], '00')" />
    <xsl:variable name="month"
                  select="format-number(current-match()[2], '00')" />
    <xsl:variable name="year"
                  select="if (current-match()[3] > 30)
                          then (current-match()[3] + 1900)
                          else (current-match()[3] + 2000)" />
    <xsl:value-of select="($year, $month, $day)"
                  separator="-" />
  </xsl:match>


To make it easier to construct string sequences to which to apply
regular expression templates, I suggest a function (or two, perhaps,
given the general avoidance of function overloading) that basically
tokenises a string based on a regular expression. The signatures
would be:

  tokenize(string $string, string $regexp) => string*
  tokenize(string $string, string $start-regexp, string $end-regexp)
    => string*

The first form splits $string into a sequence of strings in which
every even-positioned string matches the $regexp and the
odd-positioned strings are the text in between (possibly empty). For
example:

  tokenize(' foo  bar   baz', '\s+')
    => ('', ' ', 'foo', '  ', 'bar', '   ', 'baz')
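
A minimal Python sketch of the first form, just to pin down the
odd/even convention. Two things here are assumptions rather than part
of the proposal: Python's regular expression dialect stands in for
whatever dialect XSLT would use, and a trailing non-matching string is
always included (even if empty) so that the alternation stays regular.

```python
import re


def tokenize(string, regexp):
    """Split string so that even-positioned items (counting from 1)
    are the substrings matching regexp and odd-positioned items are
    the text between them."""
    tokens = []
    previous_end = 0
    for match in re.finditer(regexp, string):
        if match.start() == match.end():
            continue  # assumption: empty matches are ignored
        tokens.append(string[previous_end:match.start()])  # odd position
        tokens.append(match.group(0))                      # even position
        previous_end = match.end()
    tokens.append(string[previous_end:])  # trailing text, possibly ''
    return tokens


print(tokenize(' foo  bar   baz', r'\s+'))
# ['', ' ', 'foo', '  ', 'bar', '   ', 'baz']
```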

The second form does a similar thing, except that the even-positioned
strings must begin with a match for the $start-regexp and end with a
match for the $end-regexp. What's more, each even-positioned string in
the result must be balanced - it must contain as many substrings
matching the $start-regexp as substrings matching the $end-regexp
(with no overlapping). For example:

  tokenize('this is \bold{bold \italic{and italic}} text',
           '\\[a-z]+\{', '\}')
    => ('this is ', '\bold{bold \italic{and italic}}', ' text')
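
One way the balancing could work is a simple depth counter, sketched
below in Python. This is an illustration under assumptions, not a
definition: the name tokenize_balanced is invented, Python's regular
expression dialect is used, neither regexp is expected to match the
empty string, and back references from the end regexp into the start
regexp are not handled. An opener with no balancing closer takes the
rest of the string with it.

```python
import re


def tokenize_balanced(string, start_regexp, end_regexp):
    """Split string into alternating plain text (odd positions) and
    balanced start...end substrings (even positions)."""
    start = re.compile(start_regexp)
    end = re.compile(end_regexp)
    tokens = []
    text_start = 0
    while True:
        opener = start.search(string, text_start)
        if opener is None:
            break
        depth = 1
        cursor = opener.end()
        while depth > 0:
            next_end = end.search(string, cursor)
            if next_end is None:
                break  # ran out of closers: unbalanced
            next_start = start.search(string, cursor)
            if next_start is not None and next_start.start() < next_end.start():
                depth += 1  # a nested opener comes first
                cursor = next_start.end()
            else:
                depth -= 1  # a closer comes first
                cursor = next_end.end()
        tokens.append(string[text_start:opener.start()])
        if depth > 0:
            # Unbalanced: the rest of the string becomes the last item.
            tokens.append(string[opener.start():])
            return tokens
        tokens.append(string[opener.start():cursor])
        text_start = cursor
    tokens.append(string[text_start:])
    return tokens


print(tokenize_balanced(r'this is \bold{bold \italic{and italic}} text',
                        r'\\[a-z]+\{', r'\}'))
# ['this is ', '\\bold{bold \\italic{and italic}}', ' text']
```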

Note that any odd-positioned string in the result may contain a
substring that matches the $end-regexp; similarly, the last string in
the result may start with a match for the $start-regexp, if there's no
matching $end-regexp. Also, in some strings the substring matching the
$start-regexp may overlap with the substring matching the $end-regexp.
    
To make it easier to manage formats like messy HTML, where you need
the $end-regexp to contain something from the $start-regexp,
$end-regexp can contain back references to subexpressions within
$start-regexp, in the form \1...\N. For example (not escaping <s for
readability):

  tokenize('this <img src="glyph.gif"> is <b>bold</b> text',
           '<([a-z]+)[^>]*>', '</\1>')
    => ('this ', '<img src="glyph.gif"> is <b>bold</b> text')


The fact that the tokenize() function takes regular expression strings
means that it's possible to construct regular expressions on the fly.
The fact that you *can't* do that with the other regular expression
constructs (their attributes aren't attribute value templates) means
that their regular expressions can be parsed when the processor first
reads the stylesheet rather than at runtime, which is good for
efficiency, I think, especially considering how many regular
expression templates you might have.

I think that the regular expressions in tokenize() give you all you
actually need. For example, to go through a piece of text and add an
em element around every occurrence of $keyword (as a whole word) in
the text, you could use:

  <xsl:for-each
    select="tokenize($text, concat('\W+', $keyword, '\W+'))">
    <xsl:choose>
      <xsl:when test="position() mod 2 = 1">
        <xsl:value-of select="." />
      </xsl:when>
      <xsl:otherwise>
        <xsl:for-each select="tokenize(., '\W+')">
          <xsl:choose>
            <xsl:when test="position() mod 2 = 0 or . = ''">
              <xsl:value-of select="." />
            </xsl:when>
            <xsl:otherwise>
              <em>
                <xsl:value-of select="." />
              </em>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:for-each>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each>

But if you have a static regular expression (and you don't have to
worry about bracket balancing) it's simpler to use xsl:match or
xsl:apply-regexp-templates instead:

  <xsl:for-each
    select="tokenize($text, concat('\W+', $keyword, '\W+'))">
    <xsl:choose>
      <xsl:when test="position() mod 2 = 1">
        <xsl:value-of select="." />
      </xsl:when>
      <xsl:otherwise>
        <xsl:match select="." regexp="(\W+)(.*?)(\W+)">
          <xsl:value-of select="current-match()[1]" />
          <em>
            <xsl:value-of select="current-match()[2]" />
          </em>
          <xsl:value-of select="current-match()[3]" />
        </xsl:match>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each>
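
For comparison, the same keyword-wrapping logic can be sketched in
Python. Again this is only an approximation under assumptions: the
function names are invented, Python's \W is not necessarily the same
as the XSLT dialect's, the keyword is escaped with re.escape (the
concat() above uses it as-is), and the middle group is reluctant
(.*?) so that all the trailing non-word characters land in the third
group rather than inside the em element.

```python
import re
from xml.sax.saxutils import escape


def tokenize(string, regexp):
    """Even-positioned items match regexp; odd ones are the text between."""
    tokens = []
    previous_end = 0
    for match in re.finditer(regexp, string):
        if match.start() == match.end():
            continue  # ignore empty matches
        tokens.append(string[previous_end:match.start()])
        tokens.append(match.group(0))
        previous_end = match.end()
    tokens.append(string[previous_end:])
    return tokens


def emphasise(text, keyword):
    """Wrap whole-word occurrences of keyword in <em>...</em>."""
    pattern = r'\W+' + re.escape(keyword) + r'\W+'
    output = []
    for position, token in enumerate(tokenize(text, pattern), start=1):
        if position % 2 == 1:
            output.append(escape(token))  # plain text between matches
        else:
            # Plays the role of xsl:match with current-match()[1..3].
            before, word, after = re.fullmatch(
                r'(\W+)(.*?)(\W+)', token, re.DOTALL).groups()
            output.append(f'{escape(before)}<em>{escape(word)}</em>{escape(after)}')
    return ''.join(output)


print(emphasise('a big cat  sat on a catalogue', 'cat'))
# a big <em>cat</em>  sat on a catalogue
```

Like the stylesheet version, this needs non-word characters on both
sides of the keyword, so an occurrence at the very start or end of
the text is left alone.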

I think a lot of this could be refined, but that as a general approach
it might be feasible. Any thoughts?
  
Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list