Subject: RE: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)
From: "Steven Noels" <stevenn@xxxxxxxxxxxxxxxx>
Date: Wed, 9 Jan 2002 22:04:15 +0100
> -----Original Message-----
> From: owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> [mailto:owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx]On Behalf Of Michael Kay
> Sent: Wednesday 9 January 2002 12:40
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: RE: Regular expression functions (Was: Re: [xsl] comments on
> December F&O draft)

> I'm interested in your exploration of the use-cases for regexp matching
> and possible XSLT constructs to support those use cases, though so far
> I've had difficulty following the "make-it-up-as-you-go-along" style of
> specification!
>
> Mike Kay

We are currently working on a little tool (packaged as a Cocoon generator, an Ant task and a CLI app) that is more or less Omnimark-like, i.e. it enables you to 'uptranslate' a non-XML document (HTML, delimited ASCII, ...) to an XML document. We baptised it Regexslt since it borrows (a little bit) from the XSLT language design. It is based on the Jakarta ORO regex library.

Using the input document (which can be a URL)

    http://www.bloomberg.com/bbn/technology.html

and this regexslt specification:

    <?xml version="1.0" encoding="UTF-8"?>
    <regexslt xmlns="http://outerx.org/ns/regexslt/transform/1.0">
      <element name="feed">
        <element name="title">
          <text>Bloomberg > Technology</text>
        </element>
        <element name="url">
          <text>http://www.bloomberg.com/bbn/technology.html</text>
        </element>
        <call-matcher name="feeddate"/>
        <call-matcher name="items"/>
      </element>
      <matcher regex="CLASS="story3">([^<]+)<BR></SPAN></FONT></STRONG><FONT\sCOLOR="#333333"\sFACE="sans-serif,\sarial"><SPAN\sCLASS="story">([^<]+)&nbsp;(.+)<A\sHREF="([^"]+)">More" name="items">
        <element name="item">
          <element name="blurb">
            <value-of select-group="1"/>
          </element>
          <element name="body">
            <value-of select-group="2"/>
          </element>
          <element name="url">
            <value-of select-group="4"/>
          </element>
        </element>
      </matcher>
      <matcher regex="<SPAN\sCLASS="date">([^<]+)</SPAN>" name="feeddate">
        <element name="date">
          <value-of select-group="1"/>
        </element>
      </matcher>
    </regexslt>

it is transformed into

    <?xml version="1.0" encoding="UTF-8"?>
    <feed>
      <title>Bloomberg > Technology</title>
      <url>http://www.bloomberg.com/bbn/technology.html</url>
      <date>Wed, 09 Jan 2002, 3:48pm EST</date>
      <item>
        <blurb>Oracle, BEA, Software Stocks Surge After SAP Says 2001 Sales Beat Forecast</blurb>
        <body>The shares of Oracle Corp., BEA Systems Inc. and other software companies surged after SAP AG, the largest maker of business-management programs, said it surpassed a lowered 2001 sales forecast.</body>
        <url>http://quote.bloomberg.com/fgcgi.cgi?ptitle=Technology%20News&s1=blk&tp=ad_topright_tech&T=markets_bfgcgi_content99.ht&s2=ad_right1_technology&bt=ad_position1_technology&middle=ad_frame2_technology&s=APDyfihUCT3JhY2xl</url>
      </item>
      [...]
    </feed>

One of the things that doesn't work well at the moment is specifying the regex as an attribute of the <matcher> element. We will avoid this by putting the regex inside a CDATA section of a <regex> subelement (this will be optional; we are testing it right now). Not sure whether this is good practice, advice welcome. It is only partially related to this discussion, of course.

We plan on releasing regexslt "when it's ready" (weeks, not months) under a liberal license (ASF). People who are willing to play around with it can contact me. There is also an XML Schema for the language (we found validation of the transformation sheet very important). But we would much more appreciate criticism and suggestions from the people on this thread :-)

Pointers to other regex libraries that are more up to par with Perl regexes would be welcome, too.

Regards,

Steven Noels
http://outerthought.org/
(+32)478 292900

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
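
To make the attribute-versus-CDATA question concrete, here is a minimal sketch in Python. It is not Regexslt itself: Python's `re` and `xml.etree.ElementTree` stand in for Jakarta ORO and the real tool, the exact shape of the `<regex>` subelement is an assumption (the message only describes it), and `read_regex` and `apply_matcher` are hypothetical helper names. It shows the `feeddate` matcher written both ways, checks that both yield the same pattern, and applies that pattern to a one-line HTML sample.

```python
import re
import xml.etree.ElementTree as ET

# Namespace taken from the specification in the message.
NS = "http://outerx.org/ns/regexslt/transform/1.0"

# Attribute form: '<', '>' and '"' inside the regex must be escaped as
# entities, otherwise the specification is not well-formed XML.
ATTRIBUTE_FORM = (
    '<matcher xmlns="' + NS + '" name="feeddate" '
    'regex="&lt;SPAN\\sCLASS=&quot;date&quot;&gt;([^&lt;]+)&lt;/SPAN&gt;"/>'
)

# Assumed <regex> child form: the pattern is written verbatim inside CDATA.
CDATA_FORM = (
    '<matcher xmlns="' + NS + '" name="feeddate">'
    '<regex><![CDATA[<SPAN\\sCLASS="date">([^<]+)</SPAN>]]></regex>'
    '</matcher>'
)


def read_regex(matcher_xml):
    """Return the pattern from the regex attribute or, failing that, the <regex> child."""
    matcher = ET.fromstring(matcher_xml)
    from_attribute = matcher.get("regex")
    if from_attribute is not None:
        return from_attribute
    return matcher.find("{%s}regex" % NS).text


def apply_matcher(pattern, text, element_name):
    """Build one element per match, filled with capture group 1."""
    elements = []
    for match in re.finditer(pattern, text):
        element = ET.Element(element_name)
        element.text = match.group(1)
        elements.append(element)
    return elements


if __name__ == "__main__":
    sample = '<SPAN CLASS="date">Wed, 09 Jan 2002, 3:48pm EST</SPAN>'
    for element in apply_matcher(read_regex(CDATA_FORM), sample, "date"):
        print(ET.tostring(element, encoding="unicode"))
```

Both forms parse to the same pattern string; the difference is only in how readable the pattern is inside the specification file. A pattern that itself contains `]]>` could not be placed in a single CDATA section, which may be one reason to keep the attribute form available.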