Subject: RE: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)
From: "Marc Portier" <mpo@xxxxxxxxxxxxxxxx>
Date: Thu, 10 Jan 2002 01:52:33 +0100

David,

> Interesting!

Thanx

> If I understand your syntax it wasn't so far from the syntax and
> functionality of the sketch I made earlier. Nice to know I'm not
> completely mad (or at least not alone:-)

togetherness sure is the joy of both the sane and the insane :-)

the syntax match is what kinda pushed us to drop it into the discussion; there were loads of reasons not to...

1. it's far from being finished, but I guess input from various places early is great (just want to be honest about it being a running thought, with some growing understanding around it.) by the way, the <regex><![CDATA[ subelement is in the implementation now, which makes it a bit more usable for the processing of HTML input

2. kinda felt it not really matching the title of the thread (or even the purpose of the list)... apart from taking up some flavor of xslt syntax (to make people comfortable), it has very little to do with XSL (actually, it's implemented as complying to the jaxp saxparser interface, which needs the so-called regexslt-sheet to be set as a property)

the regexslt goal is 'uptranslating' as Steven pointed out, so we start off from an input that is never considered as being XML-structured

the xslt regex functions, as I understand it, kinda need to work on XML input files that are 'not structured enough???', meaning you're searching for patterns on text only inside the boundaries of parent nodes, right?

the regexslt approach now kinda allows us also to match across the node boundaries, which makes it more suitable for input that is what I would call 'badly structured' (HTML, as you might guess, often applies)

(euh while typing...: it would be a fairly easy job to make the current implementation behave as a jaxp transformer as well; the most naive implementation would be to start off with serializing the inputsource to a string and start regexslting... in this appearance it would be fairly easy to let it operate on subnodes passed to it by some xslt parser...
after all, having something that works (and can be modified) helps out in thinking and talking about it by trying and incrementally changing, or eventually throwing away cause you've finally proven you don't need it anyway :-))

in any case I hope some of our work and findings could help your discussion as well.

> I agree if your regexps are matching html with lots of < in them
> the quoting in attributes can be a pain (see parallel thread on <<)
> but in the xslt context at least, probably it should be in an attribute
> anyway. As Xpath gets more power, and so xpaths get longer there may at
> some point be a general requirement to offer an element content
> alternative to xpath match and select attributes so that " and ' don't
> need to be quoted at all and cdata can be used in place of <s if
> desired, but that seems to be a general xpath issue (in an xslt context,
> if not for your tool) rather than something specific to regexps.
>
> having implemented the beast, do you have insights into the trickier
> questions that I managed to duck so far by denying all knowledge of my
> own suggestion?
>
> In particular how do multiple matches work?

most of your questions have one simple answer: in the most naive way :-)
some people (that earn more money) would call it the most intuitive way :-)

multiple matches on the same input are written after each other, and executed after each other working on that input...
they both get their view on it, and get to match it

<matcher regex="rg1" >
  ...output format that gets repeated after each other for every match event on the input
</matcher>
<matcher regex="rg2" >
  ...other output format that gets repeated after each other for every match event on the input
</matcher>

both live inside an element that has selected a certain matched group from the parent they live in. if that parent is a matcher itself, they themselves can select one of its mapped groups; if they don't, they work on whatever it got as input (unmodified, that is)

> is one match/replace happening at a time in the order of your
> call-matcher elements or are they being searched concurrently with some
> priority rules saying which happens in the case of overlaps?

haven't followed the discussion up to here (Steven's posting made me a greenhorn member of this list :-))

answer: the first: order of call-matcher (or matcher, we allow inline definition and reuse)

so eh, we kinda decided we can't have overlaps :-), we take a very straightforward approach in never looking at our own output...

we had discussions on this more xslt-template-like behavior, but pretty soon came down to 'we're not going to rewrite xsl.'

the single goal in life for regexslt is ignoring badly placed tags and inserting missing tags in a text... the fancy juggling around should be done afterwards with (one or more) xslt that work on the output of regexslt
e.g. the current cli implementation takes a config parameter that applies a default xsl on the output before serializing to file

this is also why we don't allow for matched text to be promoted to element or attribute names... we merely enrich with up-front known markup; the tree reorganization should happen with xslt afterwards..

that said however, we'd like to invite you to find reasons why we should leave this simplicity in design

on the other hand, xslt could bounce some partial node tree off to regexslt and get some node tree back as explained earlier...
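
A minimal sketch of the sequential behaviour described above, written in Python rather than in regexslt itself. The function name `run_matchers` and the (pattern, template) representation are assumptions made purely for illustration; they are not part of the actual regexslt implementation. Each matcher scans the same unmodified input in the order it is listed, and its output is never fed back in as input to a later matcher.

```python
import re

def run_matchers(text, matchers):
    """Apply each (pattern, template) pair to the same, unmodified input.

    Matchers run in the order they are listed. The output of one matcher
    is only appended to the result; it is never rescanned by another.
    """
    output = []
    for pattern, template in matchers:
        for match in re.finditer(pattern, text):
            # The output format is repeated once per match event.
            output.append(match.expand(template))
    return "".join(output)

# Hypothetical matchers standing in for rg1 and rg2.
matchers = [
    (r"(\d+)", r"<num>\1</num>"),
    (r"([a-z]+)", r"<word>\1</word>"),
]

print(run_matchers("ab 12 cd", matchers))
# prints: <num>12</num><word>ab</word><word>cd</word>
```

Note that the second pattern would also match the letters in "num" if it were run on the first matcher's output; because both matchers only ever see the original input, that kind of overlap cannot arise in this sketch.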
> If your first call-matcher replaces some bit of the input string by
> some tree fragment of interspersed text and elements (and
> text contained within elements) then does the second call-matcher just
> match on the remaining parts of the original string or does it also get
> to work on the text in the tree structure that's being built by the
> earlier matches?

as earlier: we never work on our own output (xslt should); input is chunked or passed (unchanged) down to nested matchers... nothing more (yet)

some other thoughts that crossed our minds:

- conditional matching
  the only conditional thing we see as needed at this stage is having a fallback matcher that should only kick in when the first returns no matches. all other conditional stuff is in fact in the magic of the regexes themselves (as far as we see now): if it matches, it will loop through all the matches, and in every match it will hierarchically pass down matched groups to the nested output formatters or nested matchers to select one from as their input.

- filling named arrays with matches to be used later

all of which should still be simple as long as we keep up the assumption of the pure sequential nature of our processing.

-marc=

> David

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
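
The fallback matcher and the handing down of matched groups to nested matchers, as described in the message above, could look roughly like this in Python. All names here (`apply_matcher`, `with_fallback`, `items`, `listing`, `plain`) and the patterns are hypothetical and chosen only to illustrate the idea; none of them come from regexslt.

```python
import re

def apply_matcher(text, regex, render):
    """Loop over every match of regex in text and render each one.

    render receives the match object, so it can hand any matched group
    down to a nested matcher as that matcher's input.
    Returns None when the regex does not match at all.
    """
    pieces = [render(m) for m in re.finditer(regex, text)]
    return "".join(pieces) if pieces else None

def with_fallback(text, primary, fallback):
    """Try the primary matcher; use the fallback only if it found nothing."""
    result = primary(text)
    return result if result is not None else fallback(text)

def items(text):
    # Nested matcher: sees only the group its parent passed down.
    return apply_matcher(text, r"\w+", lambda m: f"<item>{m.group(0)}</item>")

def listing(text):
    # Parent matcher: selects group 1 and passes it down unchanged.
    return apply_matcher(
        text, r"\[(.*?)\]",
        lambda m: f"<list>{items(m.group(1)) or ''}</list>")

def plain(text):
    # Fallback matcher: wraps the input when no list was found.
    return apply_matcher(text, r".+", lambda m: f"<para>{m.group(0)}</para>")

print(with_fallback("[a, b]", listing, plain))
# prints: <list><item>a</item><item>b</item></list>
print(with_fallback("no brackets", listing, plain))
# prints: <para>no brackets</para>
```

Everything stays strictly sequential: a nested matcher only ever sees the group text handed to it, and the fallback is consulted once, after the primary matcher has reported no matches.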