Subject: Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Sun, 13 Jan 2002 10:28:23 +0000

I wrote:
> I'll think some more...

And of course had an idea immediately I went to bed, and therefore couldn't sleep...

In XSLT, you *select* a bunch of nodes to process, the processor goes through them one by one, and you have templates that *match* those nodes and provide whatever output you want for them. This has proved a very flexible way of going about things, especially in cases where you have deeply nested, unpredictable structures.

So to deal with strings that have deeply nested, unpredictable structures (such as David's example), perhaps that same kind of approach would work.

You need a way of selecting a sequence of strings and applying templates to them, where the templates have regular expression patterns. Something along the lines of:

<!-- Category: instruction -->
<xsl:apply-regexp-templates
  select = string-sequence-expression
  mode = qname>
  <!-- Content: (xsl:sort | xsl:with-param)* -->
</xsl:apply-regexp-templates>

You also need something for declaring regular expression templates that match those strings. Something along the lines of:

<!-- Category: declaration -->
<xsl:regexp-template
  match = regular-expression
  priority = number
  mode = qname>
  <!-- Content: (xsl:param*, content-constructor) -->
</xsl:regexp-template>

When you use xsl:apply-regexp-templates, the processor goes through the string sequence one string at a time in the sorted order (or original order) and tries to find a template that matches the entire string. It finds the highest-priority template that matches the entire string (note that there are no implied priorities, so you have to use the priority attribute if a string might match more than one template), and uses that to create content. Modes and parameters work in the usual way.

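To make the dispatch rule above concrete, here is a rough Python sketch of it. Everything in it is hypothetical scaffolding rather than part of the proposal: the `TEMPLATES` registry, the `regexp_template` decorator, and the handler signature are invented for illustration, and passing unmatched strings through unchanged is an assumption, since the proposal doesn't say what should happen to them.

```python
import re

# Hypothetical registry of (mode, priority, compiled pattern, handler).
TEMPLATES = []

def regexp_template(match, mode=None, priority=0.0):
    """Register a handler, roughly like declaring an xsl:regexp-template."""
    def register(handler):
        TEMPLATES.append((mode, priority, re.compile(match), handler))
        return handler
    return register

def apply_regexp_templates(strings, mode=None):
    """For each string, run the highest-priority template in the given mode
    whose pattern matches the *entire* string."""
    results = []
    for s in strings:
        candidates = []
        for t_mode, priority, pattern, handler in TEMPLATES:
            m = pattern.fullmatch(s) if t_mode == mode else None
            if m is not None:
                candidates.append((priority, m, handler))
        if not candidates:
            results.append(s)  # assumption: unmatched strings pass through
            continue
        # No implied priorities: only the explicit priority value decides.
        _, m, handler = max(candidates, key=lambda c: c[0])
        results.append(handler(s, m.groups()))
    return results

@regexp_template(r".*", mode="date", priority=-1)
def unknown_date(string, groups):
    return "unknown: " + string

@regexp_template(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{2})", mode="date", priority=1)
def short_date(string, groups):
    day, month, yy = (int(g) for g in groups)
    year = 1900 + yy if yy > 30 else 2000 + yy
    return f"{year}-{month:02d}-{day:02d}"

print(apply_regexp_templates(["13/1/02", "tomorrow"], mode="date"))
# ['2002-01-13', 'unknown: tomorrow']
```

Both templates match '13/1/02' in its entirety, so it is only the explicit priority that picks the more specific one.
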
Within the xsl:regexp-template element, the context item is the string that's matched by the template; the context position is its position within the (sorted) string sequence to which regexp templates were applied; the context size is the length of that sequence.

The evaluation context includes a current match, which is a sequence of strings - the subexpressions from the match regular expression. You can retrieve this sequence using the current-match() function. [Or something along those lines - there are lots of possibilities for how you get hold of that information.]

Taking a simple example:

<xsl:apply-regexp-templates select="'13/1/02'" mode="date" />

The processor applies templates to the date; there are multiple templates in date mode (for different date formats), but the one that matches with the highest priority is:

<xsl:regexp-template match="([0-9]{1,2})/([0-9]{1,2})/([0-9]{2})"
                     mode="date">
  <xsl:variable name="day"
                select="format-number(current-match()[1], '00')" />
  <xsl:variable name="month"
                select="format-number(current-match()[2], '00')" />
  <xsl:variable name="year"
                select="if (current-match()[3] > 30)
                        then (current-match()[3] + 1900)
                        else (current-match()[3] + 2000)" />
  <xsl:value-of select="($year, $month, $day)" separator="-" />
</xsl:regexp-template>

To supplement the template pattern, there should be an instruction that merges the xsl:apply-regexp-templates and the xsl:regexp-template:

<!-- Category: instruction -->
<xsl:match
  select = string-expression
  regexp = regular-expression>
  <!-- Content: (xsl:sort*, content-constructor) -->
</xsl:match>

For simple cases like the above, this allows you to just do:

<xsl:match select="'13/1/02'"
           regexp="([0-9]{1,2})/([0-9]{1,2})/([0-9]{2})">
  <xsl:variable name="day"
                select="format-number(current-match()[1], '00')" />
  <xsl:variable name="month"
                select="format-number(current-match()[2], '00')" />
  <xsl:variable name="year"
                select="if (current-match()[3] > 30)
                        then (current-match()[3] + 1900)
                        else (current-match()[3] + 2000)" />
  <xsl:value-of select="($year, $month, $day)" separator="-" />
</xsl:match>

To make it easier to construct string sequences to which to apply regular expression templates, I suggest a function (or two, perhaps, given the general avoidance of function overloading) that basically tokenises a string based on a regular expression. The signature of the function would be:

tokenize(string $string, string $regexp) => string*
tokenize(string $string, string $start-regexp, string $end-regexp) => string*

The first form splits $string into a sequence of strings. Every even string matches the $regexp. For example:

tokenize(' foo bar baz', '\s+')
  => ('', ' ', 'foo', ' ', 'bar', ' ', 'baz')

The second form does a similar thing, except that the even-positioned strings must begin with the $start-regexp and end with the $end-regexp. What's more, each even string in the result must be balanced - it must contain an equal number of substrings matching the $start-regexp as match the $end-regexp (with no overlapping). For example:

tokenize('this is \bold{bold \italic{and italic}} text', '\\[a-z]+\{', '\}')
  => ('this is ', '\bold{bold \italic{and italic}}', ' text')

Note that any odd string in the result may contain a substring that matches the $end-regexp; similarly, the last string in the result may start with a match for the $start-regexp, if there's no matching $end-regexp. Also, in some strings the substring matching the $start-regexp may overlap with the substring matching the $end-regexp.

To make it easier to manage formats like messy HTML, where you need the $end-regexp to contain something from the $start-regexp, $end-regexp can contain back references to subexpressions within $start-regexp, in the form \1...\N.

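As a way of checking the two forms of tokenize() described above against their examples, here is a rough Python sketch of one possible reading. It is only that: the names `tokenize` and `tokenize_balanced` are split in two because Python has no overloading, both functions assume that the patterns never match the empty string, and the balanced form stops at the first unbalanced opener and does not attempt back references from $end-regexp into $start-regexp.

```python
import re

def tokenize(string, regexp):
    """One-regexp form: even-positioned items (counting from 1) are the
    substrings that match regexp, odd-positioned items are the text between."""
    parts, last = [], 0
    for m in re.finditer(regexp, string):
        if m.start() == m.end():
            continue  # assumption: empty matches are ignored
        parts.append(string[last:m.start()])
        parts.append(m.group(0))
        last = m.end()
    parts.append(string[last:])
    return parts

def tokenize_balanced(string, start_regexp, end_regexp):
    """Two-regexp form: each even-positioned item begins with a start match,
    ends with an end match, and contains equal numbers of both."""
    start, end = re.compile(start_regexp), re.compile(end_regexp)
    parts, last = [], 0
    while True:
        opener = start.search(string, last)
        if opener is None:
            break
        depth, scan = 1, opener.end()
        while depth > 0:
            s = start.search(string, scan)
            e = end.search(string, scan)
            if e is None:
                break  # ran out of closers
            if s is not None and s.start() < e.start():
                depth, scan = depth + 1, s.end()  # nested opener
            else:
                depth, scan = depth - 1, e.end()  # closer
        if depth > 0:
            break  # unbalanced: the rest stays in the final odd string
        parts.append(string[last:opener.start()])
        parts.append(string[opener.start():scan])
        last = scan
    parts.append(string[last:])
    return parts

print(tokenize(' foo bar baz', r'\s+'))
# ['', ' ', 'foo', ' ', 'bar', ' ', 'baz']
print(tokenize_balanced(r'this is \bold{bold \italic{and italic}} text',
                        r'\\[a-z]+\{', r'\}'))
# ['this is ', '\\bold{bold \\italic{and italic}}', ' text']
```

The nesting counter is what keeps '\bold{bold \italic{and italic}}' together as a single item rather than ending it at the first '}'.
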
For example, using a back reference in $end-regexp (not escaping <s for readability):

tokenize('this <img src="glyph.gif"> is <b>bold</b> text', '<([a-z]+)>', '</\1>')
  => ('this ', '<img src="glyph.gif"> is <b>bold</b> text')

The fact that the tokenize() function takes regular expression strings means that it's possible to construct regular expressions on the fly. The fact that you *can't* construct regular expressions with the other regular expression constructs (they don't have attribute value templates), means that they can be parsed when the processor first reads the stylesheet rather than at runtime, which is good for efficiency, I think, especially considering how many regular expression templates you might have.

I think that the regular expressions in tokenize() give you all you actually need. For example, to go through a piece of text and add an em element around every occurrence of $keyword (as a whole word) in the text, you could use:

<xsl:for-each select="tokenize($text, concat('\W+', $keyword, '\W+'))">
  <xsl:choose>
    <xsl:when test="position() mod 2 = 1">
      <xsl:value-of select="." />
    </xsl:when>
    <xsl:otherwise>
      <xsl:for-each select="tokenize(., '\W+')">
        <xsl:choose>
          <xsl:when test="position() mod 2 = 0">
            <xsl:value-of select="." />
          </xsl:when>
          <xsl:otherwise>
            <em><xsl:value-of select="." /></em>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>
    </xsl:otherwise>
  </xsl:choose>
</xsl:for-each>

But if you have a static regular expression (and you don't have to worry about bracket balancing) it's simpler to use xsl:match or xsl:apply-regexp-templates instead:

<xsl:for-each select="tokenize($text, concat('\W+', $keyword, '\W+'))">
  <xsl:choose>
    <xsl:when test="position() mod 2 = 1">
      <xsl:value-of select="." />
    </xsl:when>
    <xsl:otherwise>
      <xsl:match select="." regexp="(\W+)(.*)(\W+)">
        <xsl:value-of select="current-match()[1]" />
        <em><xsl:value-of select="current-match()[2]" /></em>
        <xsl:value-of select="current-match()[3]" />
      </xsl:match>
    </xsl:otherwise>
  </xsl:choose>
</xsl:for-each>

I think a lot of this could be refined, but that as a general approach it might be feasible. Any thoughts?

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list