Subject: Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Sat, 12 Jan 2002 18:36:08 +0000

Hi David,

> One of my main problems is that currently I can't see a way to specify
> things that would actually address the main use case I have for this.
> (Which is a real use case for the day-job not something I just made
> up:-)

Cripes. Who knows where we'll end up if we start looking at real-life
use cases rather than theoretical examples. ;)

> Suppose you had a document which was marked up as XML but in which the
> mathematics was marked up as
>
> <maths>
> \frac{-b \pm \sqrt{b^2 -4ac}}{2a}
> </maths>
>
> and you had 96000 of these math expressions in the document
> collection (which can be either one document linked via external
> entities or 1200 separate documents, according to taste).
>
> So you knock up some XSL that transforms the XML bits into XHTML,
> but what do you do with the mathematics (which, as always, is the
> most interesting part)?

Seriously? You write an extension function to create the relevant XML.
Or you do it in a pre-processing step. Honestly, I can't see much
difference between having to handle this syntax and having to handle:

  <maths>
    <![CDATA[
    <mfrac>
      <mrow>
        <mrow><mo>-</mo><mi>b</mi></mrow>
        <mo>±</mo>
        <msqrt>
          <msup><mi>b</mi><mn>2</mn></msup>
          <mo>-</mo>
          <mrow>
            <mn>4</mn>
            <mo>⁢</mo>
            <mi>a</mi>
            <mo>⁢</mo>
            <mi>c</mi>
          </mrow>
        </msqrt>
      </mrow>
      <mrow><mn>2</mn><mo>⁢</mo><mi>a</mi></mrow>
    </mfrac>
    ]]>
  </maths>

(Or, of course, more usually, dealing with nested HTML syntax within
CDATA structures.)

But anyway, I like the LEX/YACC idea, and think it might be the way to
go. You basically create a grammar for the structure, using regular
expressions to represent the component parts. Something along the
lines of:

  $row => ($frac)|($sqrt)|($expr)
  $frac => \\frac\{($row)\}\{($row)\}
  $sqrt => \\sqrt\{($row)\}
  $expr => ($times)|(($operand)?(\s*($operator)\s*($operand))+)
  $times => ($operand){2,}
  $operand => ($row)|($sup)|($number)|($ident)
  $sup => ($operand)\^($operand)
  $operator => \\pm|\-
  $number => -?[0-9]+(\.[0-9]+)?
  $ident => [a-z][a-z0-9]*

These assignments could be done with xsl:regexp elements, which would
work like variable-binding elements, but would be used purely for
identifying regular expressions. For example:

  <xsl:regexp name="row" select="'($frac)|($sqrt)|($expr)'" />
  <xsl:regexp name="frac" select="'\\frac\{($row)\}\{($row)\}'" />
  ...

Then you'd have an xsl:match instruction with a select attribute
holding an expression whose result is turned into a string, and a
regexp attribute holding a regular expression. For example:

  <xsl:match select="'\frac{-b \pm \sqrt{b^2 -4ac}}{2a}'"
             regexp="($row)">
    ...
  </xsl:match>

Within the xsl:match, a current-match() function would give you access
to a tree that represents the matched portions of the string. In this
case it would look something like the following (but with less
whitespace, obviously):

  <row start="1" end="34">
    <frac start="1" end="34">
      \frac{
      <row start="7" end="29">
        <expr start="7" end="29">
          <operator start="7" end="8">-</operator>
          <operand start="8" end="9">
            <ident start="8" end="9">b</ident>
          </operand>
          <operator start="10" end="13">\pm</operator>
          <operand start="14" end="29">
            <row start="14" end="29">
              <sqrt start="14" end="29">
                <row start="20" end="28">
                  <expr start="20" end="28">
                    <operand start="20" end="23">
                      <sup start="20" end="23">
                        <operand start="20" end="21">
                          <ident start="20" end="21">b</ident>
                        </operand>
                        ^
                        <number start="22" end="23">2</number>
                      </sup>
                    </operand>
                    <operator start="24" end="25">-</operator>
                    <operand start="25" end="28">
                      <row start="25" end="28">
                        <expr start="25" end="28">
                          <times start="25" end="28">
                            <number start="25" end="26">4</number>
                            <ident start="26" end="27">a</ident>
                            <ident start="27" end="28">c</ident>
                          </times>
                        </expr>
                      </row>
                    </operand>
                  </expr>
                </row>
              </sqrt>
            </row>
          </operand>
        </expr>
      </row>
      }{
      <row start="31" end="33">
        <expr start="31" end="33">
          <times start="31" end="33">
            <number start="31" end="32">2</number>
            <ident start="32" end="33">a</ident>
          </times>
        </expr>
      </row>
      }
    </frac>
  </row>

(Of
course it would look different with a different grammar.)

I am fairly convinced that it's possible to create an implementation
to do this, based on the fact that there are plenty of lexers out
there that do roughly the same thing. I am less convinced that it
would be particularly efficient, in particular if the regular
expressions could be constructed on the fly (which, as I've stated, I
think they should be).

On the other hand, I doubt that the full functionality would be
required very frequently (except by strange people like you), and I
think that the more common uses (where the named subexpressions are
static) could be optimised without too many problems (the grammar
compiled into code during the creation of the stylesheet).

Perhaps it could be restricted by making the subexpressions static by
definition (xsl:regexp would only use its content to define the
subexpression), with only the regexp attribute in xsl:match being an
attribute value template.

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
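
The remark about static named subexpressions being compilable ahead of time can be sketched in Python. This is only an illustration under assumptions, not the semantics of the proposed xsl:regexp: the `expand` and `tokenise` helpers, the `$token` rule and the cut-down `$row` rule are hypothetical, while `$operator`, `$number`, `$ident` and `$sqrt` are copied from the grammar in the message. It expands `($name)` references into named groups once, reports offsets in the same 1-based, end-exclusive style as the tree, and refuses recursive rules, which a plain regular expression cannot express and which would need a real parser.

```python
import re

# Named subexpressions. $operator, $number, $ident and $sqrt are copied
# from the message; $token and this cut-down $row are hypothetical.
DEFS = {
    "operator": r"\\pm|\-",
    "number": r"-?[0-9]+(\.[0-9]+)?",
    "ident": r"[a-z][a-z0-9]*",
    "token": r"($operator)|($number)|($ident)",
    "sqrt": r"\\sqrt\{($row)\}",
    "row": r"($sqrt)|($token)",
}

# A reference to another named subexpression looks like ($name).
REF = re.compile(r"\(\$([a-z]+)\)")


def expand(name, defs, seen=()):
    """Return the regex source for `name`, with each ($ref) replaced by
    a named group holding the expanded definition of `ref`."""
    if name in seen:
        raise ValueError(f"${name} is recursive; a plain regex cannot express it")

    def substitute(match):
        ref = match.group(1)
        return f"(?P<{ref}>{expand(ref, defs, seen + (name,))})"

    return REF.sub(substitute, defs[name])


def tokenise(text, pattern):
    """Yield (name, start, end, matched_text) for each token, skipping
    whitespace. start is 1-based and end is exclusive."""
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = pattern.match(text, pos)
        if match is None:
            raise ValueError(f"no token at offset {pos + 1}")
        yield match.lastgroup, match.start() + 1, match.end() + 1, match.group()
        pos = match.end()


# "Static" case: the grammar is expanded and compiled once, up front.
TOKEN = re.compile(expand("token", DEFS))

if __name__ == "__main__":
    for token in tokenise(r"-b \pm 2a", TOKEN):
        print(token)
```

Calling `expand("row", DEFS)` raises `ValueError`, because `$row` refers to `$sqrt`, which refers back to `$row`. Note also that `$ident` as written is greedy, so this sketch would read `ac` as a single identifier rather than as the two identifiers shown in the tree.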