RE: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)

From: "Marc Portier" <mpo@xxxxxxxxxxxxxxxx>
Date: Thu, 10 Jan 2002 01:52:33 +0100
David,

> Interesting!
Thanx

>
> If I understand your syntax it wasn't so far from the syntax and
> functionality of the sketch I made earlier. Nice to know I'm not
> completely mad (or at least not alone:-)

togetherness sure is the joy of both the sane and the insane :-)

the syntax match is what kinda pushed us to drop it into the discussion
there were loads of reasons not to...
1. it's far from being finished, but I guess input from various places early
is great (just want to be honest about it being a running thought, with some
growing understanding around it.)

by the way the <regex><![CDATA[ subelement is in the implementation now,
makes it a bit more useable for the processing of HTML input

2. kinda felt it not really matching the title of the thread (or even the
purpose of the list)... apart from taking up some flavor of xslt syntax (to
make people comfortable), it has very little to do with XSL (actually, it's
implemented as complying to the jaxp saxparser interface that needs the so
called regexslt-sheet to be set as a property)

the regexslt goal is 'uptranslating' as Steven pointed out, so we start off
from having an input that is never considered as being XML-structured

the xslt regex functions as I understand it kinda need to work on XML input
files that are 'not structured enough???' meaning you're searching for
patterns on text only inside the boundaries of parent nodes, right?

the regexslt approach now kinda allows us also to match across the node
boundaries which makes it more suitable for input that is what I would call
'badly structured' (HTML, as you might guess, often applies)
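
To make the cross-boundary point concrete, here is a minimal sketch in Python (not the actual JAXP-based implementation). The sample input, the pattern and the `uptranslate` name are all made up for illustration; the only idea taken from the above is that the input is treated as plain text, so one pattern may span several tags.

```python
import re

# Hypothetical "badly structured" HTML: one logical heading split over two <b> runs.
html = "<p><b>Chapter</b> <b>1</b>: Introduction</p>"

# The input is treated as plain text, so the pattern is free to span tag boundaries.
pattern = re.compile(r"<b>(\w+)</b>\s*<b>(\d+)</b>:\s*([^<]+)")

def uptranslate(text):
    # Matched text only ever becomes element content; the markup itself is known up front.
    return pattern.sub(
        r"<heading><kind>\1</kind><number>\2</number><title>\3</title></heading>",
        text,
    )

print(uptranslate(html))
```

An XPath-style regex function working inside a single text node would see "Chapter", "1" and ": Introduction" as three separate strings and could not match them as one unit.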


(euh while typing...: it would be a fairly easy job to make the current
implementation behave as a jaxp transformer as well, the most naive
implementation would be to start off with serializing the inputsource to a
string and start regexslting... in this appearance it would be fairly easy
to let it operate on subnodes passed to it by some xslt parser... after all
having something that works (and can be modified) helps out in thinking and
talking about it by trying and incrementally changing, or eventually
throwing away cause you've finally proven you don't need it anyway :-))

in any case I hope some of our work and findings could help your discussion
as well.

>
> I agree if your regexps are matching html with lots of < in them
> the quoting in attributes can be a pain (see parallel thread on <<)
> but in the xslt context at least, probably it should be in an attribute
> anyway. As Xpath gets more power, and so xpaths get longer there may at
> some point be a general requirement to offer an element content
> alternative to xpath match and select attributes so that " and ' don't
> need to be quoted at all and cdata can be used in place of &lt;s if
> desired, but that seems to be a general xpath issue (in an xslt context,
> if not for your tool) rather than something specific to regexps.
>
> having implemented the beast, do you have insights into the trickier
> questions that I managed to duck so far by denying all knowledge of my
> own suggestion?
>
> In particular how do multiple matches work?
most of your questions have one simple answer: in the most naive way :-)
some people (that earn more money) would call it the most intuitive way :-)

multiple matches on the same input are written after each other, and
executed after each other working on that input... they both get their view
on it, and get to match it

<matcher regex="rg1" >
	...output format that gets repeated after each other for every match event on the input
</matcher>
<matcher regex="rg2" >
	...other output format that gets repeated after each other for every match event on the input
</matcher>

both live inside an element that has selected a certain matched group from
the parent they live in
if that parent is a matcher itself, they themselves can select one of its
mapped groups
if they don't they work on whatever it got as input (unmodified that is)
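
A rough sketch of that sequential behaviour, in Python rather than the real implementation. The function name `run_matchers`, the (regex, template) pairs and the sample text are assumptions for illustration, and what happens to unmatched text is left out on purpose since it is not described here.

```python
import re

def run_matchers(text, matchers):
    """Apply each (regex, template) pair, in document order, to the same input.

    Every matcher gets its own view of the unmodified input; the output of
    one matcher is never fed to the next one.
    """
    out = []
    for regex, template in matchers:
        for m in re.finditer(regex, text):
            # The output format is repeated once per match event.
            out.append(m.expand(template))
    return "".join(out)

# Hypothetical sheet with two sibling matchers.
matchers = [
    (r"(\d{4})-(\d{2})-(\d{2})", r"<date><year>\1</year></date>"),
    (r"\b([A-Z][a-z]+)\b", r"<name>\1</name>"),
]
result = run_matchers("Marc wrote on 2002-01-10", matchers)
print(result)
```

Because the second matcher scans the original text and not the first matcher's output, the order of the matchers only decides the order of the output, not what each one gets to see.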

> is one match/replace happening at a time in the order of your
> call-matcher elements or are they being searched concurrently with some
> priority rules saying which happens in the case of overlaps?
haven't followed the discussion up to here (Steven's posting made me a
greenhorn member of this list :-))

answer: the first: order of call-matcher (or matcher, we allow inline
definition and reuse)

so eh, we kinda decided we can't have overlaps :-), we take a very
straightforward approach in never looking at our own output... we had
discussions on this more xslt-template like behavior, but pretty soon came
down to 'we're not going to rewrite xsl.'  the single goal in life for
regexslt is ignoring badly placed tags and inserting missing tags in a
text... the fancy juggling around should be done afterwards with (one or
more) xslt that work on the output of regexslt

e.g. the current cli implementation takes a config parameter that applies a
default xsl on the output before serializing to file

this is also why we don't allow for matched text to be promoted to element
or attribute names... we merely enrich with up front known markup, tree
reorganization should happen with xslt afterwards..

that said however, we'd like to invite you to find reasons why we should
leave this simplicity in design

on the other hand, xslt could bounce some partial node tree off to
regexslt and get some node tree back as explained earlier...

>
> If your first call-matcher replaces some bit of the input string by
> some tree fragment of  interspersed text and elements  (and
> text contained within elements) then does the second call-matcher just
> match on the remaining parts of the original string or does it also get
> to work on the text in the tree structure that's being built by the
> earlier matches?

as earlier: we never work on our own output (xslt should)
input is chunked or passed (unchanged) down to nested matchers... nothing
more (yet)
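
The chunking could be sketched like this in Python. The dict layout with `tag`, `regex`, `select` and `children` keys is purely hypothetical (the real sheet is XML), and so is the choice to wrap each match in one element; the sketch only shows that a nested matcher receives either one matched group of its parent or the parent's input unchanged, never any generated output.

```python
import re

def apply(matcher, text):
    """Recursively apply a matcher to text; generated output is never re-scanned."""
    parts = []
    for m in re.finditer(matcher["regex"], text):
        inner = []
        for child in matcher.get("children", []):
            group = child.get("select")
            # A child either selects one group of the parent's match,
            # or works on the parent's input, unmodified.
            chunk = m.group(group) if group is not None else text
            inner.append(apply(child, chunk))
        body = "".join(inner) if inner else m.group(0)
        # Element names come from the sheet, never from matched text.
        parts.append("<{0}>{1}</{0}>".format(matcher["tag"], body))
    return "".join(parts)

# Hypothetical sheet: each "key=v1,v2" pair becomes a row with nested values.
row = {
    "tag": "row",
    "regex": r"(\w+)=([\d,]+)",
    "children": [
        {"tag": "key", "regex": r".+", "select": 1},
        {"tag": "value", "regex": r"\d+", "select": 2},
    ],
}
print(apply(row, "a=1,2 b=3"))
```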

some other thoughts that crossed our minds
- conditional matching
the only conditional thing we see at this stage as needed is having a
fallback matcher that should only kick in when the first returns no matches.
all other conditional stuff is in fact in the magic of the regexes itself
(as far as we see now)
if it matches, it will loop through all the matches, and in every match it
will hierarchically pass down matched groups to the nested output formatters
or nested matchers to select one from as their input.

- filling named arrays with matches to be used later
all of which should still be simple as long as we keep up the assumption of
the pure sequential nature of our processing.
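
The fallback matcher from the first thought could look roughly like this in Python. It is only a thought experiment, nothing like this is in the implementation yet; `match_with_fallback` and the two date patterns are invented for the example.

```python
import re

def match_with_fallback(text, primary, fallback, template):
    """Use the fallback regex only when the primary one returns no matches at all."""
    matches = list(re.finditer(primary, text))
    if not matches:
        matches = list(re.finditer(fallback, text))
    # Either way, the same output format is repeated for every match.
    return [m.expand(template) for m in matches]

# Hypothetical patterns: ISO dates first, slash-separated dates as fallback.
iso = r"(\d{4})-\d{2}-\d{2}"
slashed = r"\d{2}/\d{2}/(\d{4})"
template = r"<year>\1</year>"

print(match_with_fallback("sent 10/01/2002", iso, slashed, template))
```

Since the decision is made once per input and not per match, this still fits the purely sequential processing assumed above.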

-marc=

>
> David
>


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

