Re: [xsl] Analyzing text by extracting substrings that match regex patterns

From: "David Carlisle d.p.carlisle@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 14 Mar 2025 09:52:37 -0000
It isn't clear if you are just looking for xpath solutions or if xsl is OK.

xsl:analyze-string

was introduced for exactly this kind of use; it gives access to each
substring matched by a ()-group in the regex.
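
As a minimal sketch of that instruction (XSLT 2.0 or later), the stylesheet
below extracts every substring of $TEXT that matches $PATTERN. The parameter
names follow the quoted message; the named template "main" and the default
values are assumptions made only for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Illustrative defaults taken from the quoted message -->
  <xsl:param name="TEXT" select="'The person put 12 dollars into the jar'"/>
  <xsl:param name="PATTERN" select="'([0-9]+)'"/>

  <!-- Hypothetical entry point: run with "main" as the initial template -->
  <xsl:template name="main">
    <!-- The regex attribute is an attribute value template, so {$PATTERN}
         is evaluated at run time -->
    <xsl:analyze-string select="$TEXT" regex="{$PATTERN}">
      <xsl:matching-substring>
        <!-- regex-group(1) is the text captured by the first ()-group;
             "." would give the whole matched substring -->
        <xsl:value-of select="regex-group(1)"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:matching-substring>
      <!-- No xsl:non-matching-substring: unmatched text is discarded -->
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With the defaults above this should output 12 followed by a newline. Unlike
the replace() approach, a string with no match produces no output rather
than the original string.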

Newer XPath versions (3.0 and later) also provide this as a function, analyze-string():

https://qt4cg.org/specifications/xpath-functions-40/Overview.html#func-analyze-string
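
A minimal sketch of the function form, assuming an XSLT 3.0 processor: the
function returns an analyze-string-result element whose match children (in
the http://www.w3.org/2005/xpath-functions namespace) hold the matched
substrings. The template name "main" and the parameter defaults are
illustrative assumptions.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions">

  <xsl:output method="text"/>

  <!-- Illustrative defaults taken from the quoted message -->
  <xsl:param name="TEXT" select="'HELLO WORLD'"/>
  <xsl:param name="PATTERN" select="'A|E|I|O|U'"/>

  <!-- Hypothetical entry point: run with "main" as the initial template -->
  <xsl:template name="main">
    <!-- Each fn:match element holds one matched substring;
         fn:non-match elements (the text in between) are ignored here -->
    <xsl:value-of select="analyze-string($TEXT, $PATTERN)/fn:match"
                  separator=" "/>
  </xsl:template>

</xsl:stylesheet>
```

With the defaults above this should output E O O, one entry per vowel
matched. Wrapping the path as (analyze-string($TEXT, $PATTERN)/fn:match)[1]
would return only the first match.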

David


On Thu, 13 Mar 2025 at 22:01, Roger L Costello costello@xxxxxxxxx <
xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> One way of analyzing text is to look for patterns in the text: find the
> substrings that match the patterns.
>
> The following XPath determines if the string in $TEXT contains a vowel:
>
> matches($TEXT, 'A|E|I|O|U')
>
> If there is a match then which vowel was matched? A? E? I? O? or U? It may
> be desirable to know that. The matches() function doesn't tell.
>
> We might wish to know if $TEXT contains integers. If the value of $TEXT is
>
> "The person put 12 dollars into the jar"
>
> and we look in $TEXT for the pattern [0-9]+ then we expect to get the
> answer "12".
>
> General Problem Statement: If the value of $TEXT is a string and the value
> of $PATTERN is a regex pattern, then extract the substring of $TEXT that
> matches $PATTERN.
>
> As stated above, the XPath matches() function indicates whether $PATTERN
> is matched in $TEXT, but it does not tell you the substring that matched.
>
> Liam Quin provided a wicked cool XPath expression that does the job:
>
> replace($TEXT,'^.*?(' || $PATTERN || ').*$', '$1')
>
> Let's see what that is doing. The value of the first argument $TEXT is a
> string. We want some or all of the string replaced. The second argument
> identifies the portion of the string to be replaced. Suppose the value of
> $PATTERN is [0-9]+ then the second argument is this regex:
>
> ^.*?([0-9]+).*$
>
> where,
>
> ^ means "start of the string"
>
> .*? means "gobble up zero or more of any character, but gobble only until
> you get to a string that matches the next part of the regex (the question
> mark signals that the characters are to be gobbled up in a 'non-greedy'
> fashion)."
>
> ([0-9]+) means "one or more digits, i.e., an integer. By wrapping [0-9]+
> within parentheses, we can refer to it using $1."
>
> .* means "zero or more of any character."
>
> $ means "end of the string"
>
> The third argument of replace() is $1. Recall that $1 refers to the
> substring captured by ([0-9]+).
>
> Thus, here's what Liam's solution is doing: replace the entire string in
> $TEXT with the substring that matches [0-9]+.
>
> Examples:
>
> replace('The person put 12 dollars into the jar', '^.*?([0-9]+).*$', '$1')
> returns '12'
>
> replace('HELLO WORLD', '^.*?(A|E|I|O|U).*$', '$1') returns 'E'
>
> From a technologist's point of view, using replace() to extract the
> substring that matches the pattern is wicked cool, but the solution is
> complex and nonintuitive: why would anyone think "replace" when trying to
> obtain the substring matched by a pattern?
>
> SNOBOL solves the problem much more simply.
>
> In order to find out which string is matched, a name may be attached to a
> pattern. If the pattern matches, the matched substring is assigned as value
> to the name. This feature is called value assignment in pattern matching. A
> name is attached to a pattern by using the binary assignment operator
> indicated by a period. For example
>
> NVOWEL = VOWEL . V
>
> assigns NVOWEL a pattern that matches a vowel. The name V is attached to
> this pattern. If the pattern matches, the substring it matches is assigned
> to V. For example, if the value of TEXT is "HELLO WORLD", the statement
>
> TEXT NVOWEL
>
> succeeds and the string E is assigned to V.
>
> Names may be assigned to components of patterns in as many places as
> desired. In this way the substrings matched by different components of the
> pattern can be determined. A simple example is
>
> DVOWEL = VOWEL . V1 VOWEL . V2
>
> in which V1 is attached to the first vowel and V2 to the second. If DVOWEL
> matches, the individual vowels are assigned to V1 and V2.
>
> Let's see how to express the example shown above where we extract the
> integer in a string.
>
> SNOBOL does not use [0-9]+ to represent a series of digits; instead,
> SNOBOL uses
>
> SPAN('0123456789')
>
> which matches a contiguous sequence of digits. So, here's how to extract
> the integer in TEXT:
>
> TEXT SPAN('0123456789') . V
>
> If the value of TEXT is 'The person put 12 dollars into the jar' then the
> value of V is '12'.
>
> In my opinion, SNOBOL provides a superior solution to the task of
> extracting the substring of TEXT that matches PATTERN.
