Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)

Subject: Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Sat, 12 Jan 2002 18:36:08 +0000

Hi David,

> One of my main problems is that currently I can't see a way to specify
> things that would actually address the main use case I have for this.
> (Which is a real use case for the day-job not something I just made
> up:-)

Cripes. Who knows where we'll end up if we start looking at real-life
use cases rather than theoretical examples. ;)

> Suppose you had a document which was marked up as XML but in which the
> mathematics was marked up as 
>
> <maths>
> \frac{-b \pm \sqrt{b^2 -4ac}}{2a}
> </maths>
>
> and you had 96000 of these math expressions in the document
> collection (which can be either one document linked via external
> entities or 1200 separate documents, according to taste).
>
> So you knock up some XSL that transforms the XML bits into XHTML,
> but what do you do with the mathematics (which, as always, is the
> most interesting part)?

Seriously? You write an extension function to create the relevant XML.
Or you do it in a pre-processing step.

Honestly, I can't see much difference between having to handle this
syntax and having to handle:

<maths>
  <![CDATA[
  <mfrac>
    <mrow>
      <mrow><mo>-</mo><mi>b</mi></mrow>
      <mo>&PlusMinus;</mo>
      <msqrt>
        <msup><mi>b</mi><mn>2</mn></msup>
        <mo>-</mo>
        <mrow>
          <mn>4</mn>
          <mo>&InvisibleTimes;</mo>
          <mi>a</mi>
          <mo>&InvisibleTimes;</mo>
          <mi>c</mi>
        </mrow>
      </msqrt>
    </mrow>
    <mrow><mn>2</mn><mo>&InvisibleTimes;</mo><mi>a</mi></mrow>
  </mfrac>
  ]]>
</maths>

(Or, of course, more usually, dealing with nested HTML syntax within
CDATA structures.)

But anyway, I like the LEX/YACC idea, and think it might be the way to
go. You basically create a grammar for the structure, using regular
expressions to represent the component parts.

Something along the lines of:

  $row      => ($frac)|($sqrt)|($expr)
  $frac     =>  \\frac\{($row)\}\{($row)\}
  $sqrt     =>  \\sqrt\{($row)\}
  $expr     => ($times)|(($operand)?(\s*($operator)\s*($operand))+)
  $times    => ($operand){2,}
  $operand  => ($row)|($sup)|($number)|($ident)
  $sup      => ($operand)\^($operand)
  $operator => \\pm|\-
  $number   => -?[0-9]+(\.[0-9]+)?
  $ident    => [a-z][a-z0-9]*

These assignments could be done with xsl:regexp elements, which would
work like variable-binding elements, but would be used purely for
identifying regular expressions. For example:

<xsl:regexp name="row" select="'($frac)|($sqrt)|($expr)'" />
<xsl:regexp name="frac" select="'\\frac\{($row)\}\{($row)\}'" />
...
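
To make the substitution idea concrete, here is a rough sketch in
Python rather than XSLT (since xsl:regexp is only a suggestion). The
RULES table, the "atom" rule and the expand() helper are all made up for
illustration, and it only covers a non-recursive subset: a recursive
grammar like the one above can't be flattened into a single regular
expression this way, so the sketch simply refuses it.

```python
import re

# Hypothetical, non-recursive subset of the grammar; "atom" is invented here.
RULES = {
    "number": r"-?[0-9]+(?:\.[0-9]+)?",
    "ident": r"[a-z][a-z0-9]*",
    "atom": r"($number)|($ident)",
}

# A reference to another rule is written ($name), as in the grammar above.
REF = re.compile(r"\(\$([a-z]+)\)")


def expand(name, rules, seen=()):
    """Replace each ($ref) in a rule with a named group holding that rule.

    Assumes each rule is referenced at most once, because Python's re
    module does not allow two groups with the same name in one pattern.
    """
    if name in seen:
        raise ValueError(f"recursive rule: {name}")

    def substitute(match):
        ref = match.group(1)
        return f"(?P<{ref}>{expand(ref, rules, seen + (name,))})"

    return REF.sub(substitute, rules[name])
```

Compiling expand("atom", RULES) gives a pattern whose named groups tell
you which subexpression matched, which is roughly the information the
element names in a match tree would carry.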

Then you'd have an xsl:match instruction with a select attribute, whose
value is evaluated and converted to a string, and a regexp attribute
holding the regular expression to match against that string. For example:

  <xsl:match select="'\frac{-b \pm \sqrt{b^2 -4ac}}{2a}'"
             regexp="($row)">
    ...
  </xsl:match>

Within the xsl:match, a current-match() function would give you access
to a tree that represents the matched portions of the string. In this
case it would look something like the following (but with less
whitespace, obviously):

  <row start="1" end="34">
    <frac start="1" end="34">
      \frac{
      <row start="7" end="29">
        <expr start="7" end="29">
          <operator start="7" end="8">-</operator>
          <operand start="8" end="9">
            <ident start="8" end="9">b</ident>
          </operand>
          <operator start="10" end="13">\pm</operator>
          <operand start="14" end="29">
            <row start="14" end="29">
              <sqrt start="14" end="29">
                <row start="20" end="28">
                  <expr start="20" end="28">
                    <operand start="20" end="23">
                      <sup start="20" end="23">
                        <operand start="20" end="21">
                          <ident start="20" end="21">b</ident>
                        </operand>
                        ^
                        <operand start="22" end="23">
                          <number start="22" end="23">2</number>
                        </operand>
                      </sup>
                    </operand>
                    <operator start="24" end="25">-</operator>
                    <operand start="25" end="28">
                      <row start="25" end="28">
                        <expr start="25" end="28">
                          <times start="25" end="28">
                            <number start="25" end="26">4</number>
                            <ident start="26" end="27">a</ident>
                            <ident start="27" end="28">c</ident>
                          </times>
                        </expr>
                      </row>
                    </operand>
                  </expr>
                </row>
              </sqrt>
            </row>
          </operand>
        </expr>
      </row>
      }{
      <row start="31" end="33">
        <expr start="31" end="33">
          <times start="31" end="33">
            <number start="31" end="32">2</number>
            <ident start="32" end="33">a</ident>
          </times>
        </expr>
      </row>
      }
    </frac>
  </row>

(Of course it would look different with a different grammar.)
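
As a sanity check that such a tree can be built, here is a rough Python
sketch of a hand-written recursive-descent parser for a cut-down version
of the grammar. Everything in it is an assumption made for illustration:
identifiers are single letters, numbers are unsigned, the operand and
expr wrapper elements are dropped, and the function names are invented.
It is not how xsl:match would be implemented, just a demonstration that
the start/end bookkeeping is straightforward.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, hypothetical token set (single-letter idents, unsigned numbers).
TOKENS = {
    "operator": re.compile(r"\\pm|-"),
    "number": re.compile(r"[0-9]+(?:\.[0-9]+)?"),
    "ident": re.compile(r"[a-z]"),
}


def node(name, start, end, children=(), text=None):
    # Offsets are reported 1-based with an exclusive end, as in the tree above.
    el = ET.Element(name, start=str(start + 1), end=str(end + 1))
    el.text = text
    el.extend(children)
    return el


def token(name, s, i):
    # Try one token at offset i; return (element or None, new offset).
    m = TOKENS[name].match(s, i)
    return (node(name, i, m.end(), text=m.group()), m.end()) if m else (None, i)


def expect(s, i, literal):
    if not s.startswith(literal, i):
        raise ValueError(f"expected {literal!r} at offset {i}")
    return i + len(literal)


def parse_base(s, i):
    # base := \frac{row}{row} | \sqrt{row} | number | ident
    if s.startswith("\\frac{", i):
        top, j = parse_row(s, i + 6)
        bottom, j = parse_row(s, expect(s, j, "}{"))
        j = expect(s, j, "}")
        return node("frac", i, j, [top, bottom]), j
    if s.startswith("\\sqrt{", i):
        inner, j = parse_row(s, i + 6)
        j = expect(s, j, "}")
        return node("sqrt", i, j, [inner]), j
    for name in ("number", "ident"):
        el, j = token(name, s, i)
        if el is not None:
            return el, j
    return None, i


def parse_atom(s, i):
    # atom := base ('^' base)?
    base, j = parse_base(s, i)
    if base is not None and s.startswith("^", j):
        exponent, k = parse_base(s, j + 1)
        if exponent is None:
            raise ValueError(f"expected exponent at offset {j + 1}")
        return node("sup", i, k, [base, exponent]), k
    return base, j


def parse_factor(s, i):
    # factor := atom+ ; two or more juxtaposed atoms become a <times>.
    atoms, j = [], i
    while True:
        el, k = parse_atom(s, j)
        if el is None:
            break
        atoms.append(el)
        j = k
    if len(atoms) > 1:
        return node("times", i, j, atoms), j
    return (atoms[0] if atoms else None), j


def parse_row(s, i):
    # row := factor? (ws operator ws factor)* ; must not be empty.
    children = []
    first, j = parse_factor(s, i)
    if first is not None:
        children.append(first)
    while True:
        k = j
        while k < len(s) and s[k].isspace():
            k += 1
        op, k = token("operator", s, k)
        if op is None:
            break
        while k < len(s) and s[k].isspace():
            k += 1
        operand, k = parse_factor(s, k)
        if operand is None:
            raise ValueError(f"expected operand at offset {k}")
        children += [op, operand]
        j = k
    if not children:
        raise ValueError(f"empty row at offset {i}")
    return node("row", i, j, children), j


def parse(s):
    tree, end = parse_row(s, 0)
    if end != len(s):
        raise ValueError(f"unparsed input at offset {end}")
    return tree
```

Calling ET.tostring(parse(r"\frac{-b \pm \sqrt{b^2 -4ac}}{2a}"),
encoding="unicode") serialises the result. The tree is flatter than the
one above because of the simplified grammar, but the offsets on the
row, frac, sqrt, sup and times elements come out the same.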
  
I am fairly convinced that it's possible to create an implementation
to do this, based on the fact that there are plenty of lexers out
there that do roughly the same thing.

I am less convinced that it would be particularly efficient,
especially if the regular expressions could be constructed on the fly
(which, as I've stated, I think they should be).

On the other hand, I doubt that the full functionality would be
required very frequently (except by strange people like you), and I
think that the more common uses (where the named subexpressions are
static) could be optimised without too many problems (the grammar
compiled into code during the creation of the stylesheet).

Perhaps it could be restricted by making the subexpressions static by
definition (xsl:regexp would only use its content to define the
subexpression), with only the regexp attribute in xsl:match being an
attribute value template.
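
The cost trade-off can be sketched in Python (again, just an
illustration under my own assumptions: STATIC_RULES and compile_match()
are invented names, and the rules are a toy subset). With static
subexpressions, the only dynamic part is the regexp string itself, so a
processor could expand and compile each distinct string once and reuse
the result.

```python
import re
from functools import lru_cache

# Hypothetical static subexpressions, fixed when the stylesheet is compiled.
STATIC_RULES = {
    "number": r"[0-9]+",
    "ident": r"[a-z]",
}


@lru_cache(maxsize=None)
def compile_match(regexp):
    """Expand ($name) references and compile, once per distinct regexp string."""
    expanded = re.sub(
        r"\(\$([a-z]+)\)",
        lambda m: f"(?:{STATIC_RULES[m.group(1)]})",
        regexp,
    )
    return re.compile(expanded)
```

Repeated calls with the same regexp string return the same compiled
object, so a regexp attribute that happens to evaluate to the same value
for every node only pays the compilation cost once.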

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

