Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)

Subject: Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)
From: David Carlisle <davidc@xxxxxxxxx>
Date: Sat, 12 Jan 2002 17:04:21 GMT

Jeni,

>
>   \para{\italic{this} is \bold{bold \italic{and italic}} text.}

Ohh looks just like TeX, we'll get you using that yet...

I can think of two ways of attacking the above with regexp.

* Plan A (which is the way I'd do it in emacs) is to have a regexp replace

\\\([a-z]*\){\([^{}]*\)}  to <\1>\2</\1>

This matches innermost groups first: they don't have any nested {}, so
you can easily find the matching }.
As the replace also removes the {}, you just need a loop which terminates
once the regexp no longer matches, so the replacements go

 
\para{\italic{this} is \bold{bold \italic{and italic}} text.}

\para{<italic>this</italic> is \bold{bold <italic>and italic</italic>} text.}

\para{<italic>this</italic> is <bold>bold <italic>and italic</italic></bold> text.}

<para><italic>this</italic> is <bold>bold <italic>and italic</italic></bold> text.</para>

(generated the above using emacs:-)
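
For anyone without emacs to hand, the loop above can be sketched in Python.
This is only an illustration of the idea, not what I ran: the function name
plan_a is made up, and it treats the generated markup as plain string data,
which is exactly the thing we'd want to avoid in XSLT.

```python
import re

# Innermost \name{...} group: the body may not contain any braces.
# (Python's re syntax, so the groups are written ( ) rather than \( \).)
INNERMOST = re.compile(r"\\([a-z]*)\{([^{}]*)\}")

def plan_a(text):
    """Rewrite innermost groups as <name>body</name> until none are left."""
    while True:
        # subn returns the new string and the number of replacements made.
        text, count = INNERMOST.subn(r"<\1>\2</\1>", text)
        if count == 0:
            return text

result = plan_a(r"\para{\italic{this} is \bold{bold \italic{and italic}} text.}")
print(result)
```

Each pass removes one level of braces, so the loop runs once per nesting
level plus one final pass that finds nothing to replace.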

That's fine, but it requires either that you consider the XML markup just to
be part of the string (which is what I did here, but what we want to
avoid in XSLT) or that your regexps can match across mixed content
models, i.e. instead of [^{}]* meaning any character other than a brace,
you'd need something that says any character-or-node other than a brace.


The alternative to Plan A is of course:

Plan 2:
work from the outside in: (This is the way I'd do it in omnimark)
Basically the plan here is not to try to match a whole matching brace
clause but just to match each start and end in turn, maintaining a
counter that increments on { and decrements on } so you know what
matches with what.
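
A rough Python sketch of that outside-in scan follows (not omnimark, and the
names plan_2 and open_names are invented for the example). It assumes a stack
of open names in place of a bare counter, since the name is needed to write
the end tag; the stack depth is the counter.

```python
import re

# Either an opening \name{ (group 1 holds the name) or a closing }.
TOKEN = re.compile(r"\\([a-z]*)\{|\}")

def plan_2(text):
    """Match each start and end in turn, tracking what is currently open."""
    out = []
    open_names = []  # len(open_names) plays the role of the counter
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])  # plain text before this token
        pos = m.end()
        if m.group(1) is not None:
            # Opening brace: the counter goes up.
            open_names.append(m.group(1))
            out.append("<%s>" % m.group(1))
        else:
            # Closing brace: the counter goes down, closing the newest name.
            if not open_names:
                raise ValueError("unmatched } at offset %d" % m.start())
            out.append("</%s>" % open_names.pop())
    if open_names:
        raise ValueError("unclosed group: %s" % open_names[-1])
    out.append(text[pos:])
    return "".join(out)

print(plan_2(r"\para{\italic{this} is \bold{bold \italic{and italic}} text.}"))
```

This makes a single pass over the input, whereas the plan A loop rescans the
string once per nesting level.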

It's a bit hard to fit that counter model into the XSLT world view but
there is a variant, 

plan 2':
I suspect that one way to attack this in xslt2 is just to have two
simple regexp replaces

\\\([a-z]*\){  -> <start name="\1"/>

}              -> <end/>


so after doing the regexp matching I'd have:

<start name="para"/><start name="italic"/>this<end/> is <start name="bold"/>bold <start name="italic"/>and italic<end/><end/> text.<end/>

so now we've got rid of that flat string and replaced it by something
that's still flat but is mixed content with empty element nodes and
text.
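
The two replaces are simple enough to try out in Python. A minimal sketch
follows, with the caveat that it builds the markers as a string, whereas the
point in XSLT would be for the replace to generate real empty element nodes;
the function name flatten_markers is made up.

```python
import re

def flatten_markers(text):
    """Turn \\name{ into <start name="name"/> and } into <end/>."""
    # Do the openers first; the replacement text contains no braces,
    # so the second step only ever sees the original closing braces.
    text = re.sub(r"\\([a-z]*)\{", r'<start name="\1"/>', text)
    return text.replace("}", "<end/>")

flat = flatten_markers(r"\para{\italic{this} is \bold{bold \italic{and italic}} text.}")
print(flat)
```

Neither replace needs to know anything about nesting, which is what makes
this variant attractive.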

Getting from that flat mixed content to a hierarchical element tree is
just the famous xslt grouping problem which a typical Gumbie Cat ought
to be able to do in her sleep, especially if given the xslt2 grouping
constructs.
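
For comparison, here is the same flat-to-tree step sketched outside XSLT, in
Python with the standard xml.etree.ElementTree module. It is a stand-in for
the grouping step, not the xslt2 grouping constructs themselves; the names
nest and append_text and the wrapper elements flat and root are assumptions
made for the example.

```python
import xml.etree.ElementTree as ET

def append_text(parent, s):
    """Add text at the end of parent's content (ElementTree text/tail model)."""
    if not s:
        return
    if len(parent):
        # Text after the last child lives in that child's tail.
        parent[-1].tail = (parent[-1].tail or "") + s
    else:
        parent.text = (parent.text or "") + s

def nest(flat_xml):
    """Build a nested tree from flat <start name="..."/> ... <end/> markers."""
    flat = ET.fromstring("<flat>" + flat_xml + "</flat>")
    root = ET.Element("root")
    stack = [root]  # stack[-1] is the element currently being filled
    append_text(root, flat.text)
    for marker in flat:
        if marker.tag == "start":
            stack.append(ET.SubElement(stack[-1], marker.get("name")))
        elif marker.tag == "end":
            if len(stack) == 1:
                raise ValueError("unmatched <end/>")
            stack.pop()
        # Text following a marker belongs to whatever is now open.
        append_text(stack[-1], marker.tail)
    if len(stack) != 1:
        raise ValueError("unclosed <start/>")
    return root

flat_input = ('<start name="para"/><start name="italic"/>this<end/> is '
              '<start name="bold"/>bold <start name="italic"/>and italic'
              '<end/><end/> text.<end/>')
tree = nest(flat_input)
print(ET.tostring(tree[0], encoding="unicode"))
```

The regexp stage has already done the string work by this point, so this
step only ever sees nodes and text.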




So while I'm tempted to see if plan A can be made to work, as the
two-stage plan 2' doesn't seem so clean in some ways, I suspect that
integrating plan 2' would be much simpler, as you wouldn't have to extend
regexp searching to search mixed content, just extend regexp replace so
it can generate mixed content.

David

