[xsl] Re: Regular expression functions (Was: Re: comments on December F&O draft)

Subject: [xsl] Re: Regular expression functions (Was: Re: comments on December F&O draft)
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Sat, 12 Jan 2002 11:18:01 -0800 (PST)
> But anyway, I like the LEX/YACC idea, and think it might be the way to
> go. You basically create a grammar for the structure, using regular
> expressions to represent the component parts.

Hi Jeni,

YACC is a parser generator for LALR(1) grammars (a subset of context-free grammars),
which describe languages much more powerful than what regular expressions can
describe.

> 
> Something along the lines of:
> 
>   $row      => ($frac)|($sqrt)|($expr)
>   $frac     =>  \\frac\{($row)\}\{($row)\}
>   $sqrt     =>  \\sqrt\{($row)\}
>   $expr     => ($times)|(($operand)?(\s*($operator)\s*($operand))+)
>   $times    => ($operand){2,}
>   $operand  => ($row)|($sup)|($number)|($ident)
>   $sup      => ($operand)\^($operand)
>   $operator => \\pm|\-
>   $number   => -?[0-9]+(\.[0-9]+)?
>   $ident    => [a-z][a-z0-9]*

This is not a regular grammar but a context-free grammar (although it is not
formally defined and I have to guess which exactly is the set of terminal and
non-terminal symbols, as well as which is the starting symbol).

It seems that $row, $frac, $sqrt, $expr and $operand are the non-terminal symbols.
In this case this is a context-free grammar that does not describe a regular
language (a language that can be described/generated by a set of regular
expressions). This is so, because in a set of rules describing a regular language
the rules can only have terminal symbols and then one variable. The rule for $expr
is in clear violation of this principle.

Therefore, the language described by the above grammar cannot be matched by a set of
regular expressions.

> 
> These assignments could be done with xsl:regexp elements, which would
> work like variable-binding elements, but would be used purely for
> identifying regular expressions. For example:

This cannot be done with regular expressions -- see above.

> 
> <xsl:regexp name="row" select="'($frac)|($sqrt)|($expr)'" />
> <xsl:regexp name="frac" select="'\\frac\{($row)\}\{($row)\}'" />
> ...
> 
> Then you'd have a xsl:match instruction which would select an
> expression that it turned into a string, and have a regexp attribute
> which held a regular expression. For example:
> 
>   <xsl:match select="'\frac{-b \pm \sqrt{b^2 -4ac}}{2a}'"
>              regexp="($row)">
>     ...
>   </xsl:match>
> 
> Within the xsl:match, a current-match() function would give you access
> to a tree that represents the matched portions of the string. In this
> case it would look something like the following (but with less
> whitespace, obviously):

[long tree snipped]

> (Of course it would look different with a different grammar.)
>   
> I am fairly convinced that it's possible to create an implementation
> to do this, based on the fact that there are plenty of lexers out
> there that do roughly the same thing.

As shown above, what would be needed is not a "lexer" but a parser for CF grammars.

In case a parser generator is used, it has to be fed with the rules of the grammar.
It then generates a table that guides the parsing process. This table typically
shows the action to be taken and the next state of the parser, if it is in a given
state, the top of the stack contains a given symbol and the input contains a given
terminal symbol. This table-driven parser will need a lexical analizer to call when
each next non-terminal from the input is needed.

You're suggesting this to be done dynamically at transformation time.

Not that this is impossible, but this is something much more complex and less
efficient than matching regular expressions.


Cheers,
Dimitre Novatchev.


__________________________________________________
Do You Yahoo!?
Send FREE video emails in Yahoo! Mail!
http://promo.yahoo.com/videomail/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


Current Thread