[xsl] Analyzing text by extracting substrings that match regex patterns

Subject: [xsl] Analyzing text by extracting substrings that match regex patterns
From: "Roger L Costello costello@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 13 Mar 2025 22:00:57 -0000

One way of analyzing text is to look for patterns in the text: find the
substrings that match the patterns.

The following XPath determines if the string in $TEXT contains an uppercase
vowel:

matches($TEXT, 'A|E|I|O|U')

If there is a match, which vowel was matched: A, E, I, O, or U? It may be
desirable to know that, but the matches() function doesn't tell; it returns
only true or false.
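
A small illustration of that limitation (the second input string is made up
for this sketch; each line is a separate expression):

```xpath
matches('HELLO WORLD', 'A|E|I|O|U')   (: true(): some vowel is present, but not which one :)
matches('RHYTHM', 'A|E|I|O|U')        (: false(): none of A, E, I, O, U occurs :)
```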

We might wish to know if $TEXT contains integers. If the value of $TEXT is

"The person put 12 dollars into the jar"

and we look in $TEXT for the pattern [0-9]+ then we expect to get the answer
"12".

General Problem Statement: If the value of $TEXT is a string and the value of
$PATTERN is a regex pattern, then extract the substring of $TEXT that matches
$PATTERN.

As stated above, the XPath matches() function indicates whether $PATTERN is
matched in $TEXT, but it does not tell you the substring that matched.

Liam Quin provided a wicked cool XPath expression that does the job:

replace($TEXT,'^.*?(' || $PATTERN || ').*$', '$1')

Let's see what that is doing. The first argument, $TEXT, is the string to be
processed. The second argument is a regex identifying the portion of the
string to be replaced. If the value of $PATTERN is [0-9]+, then the second
argument is this regex:

^.*?([0-9]+).*$

where,

^ means "start of the string"

.*? means "gobble up zero or more of any character, but only until you reach
a substring that matches the next part of the regex (the question mark
signals that the characters are to be gobbled up in a 'non-greedy' fashion)."

([0-9]+) means "one or more digits, i.e., an integer. By wrapping [0-9]+
within parentheses, we can refer to it using $1."

.* means "zero or more of any character."

$ means "end of the string"
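
To see why the question mark matters, it may help to compare the non-greedy
and greedy forms on a string that contains two integers. The input string
below is made up for this sketch:

```xpath
(: non-greedy .*? stops at the first place where [0-9]+ can match :)
replace('abc 12 def 34', '^.*?([0-9]+).*$', '$1')   (: returns '12' :)

(: greedy .* runs to the end of the string and backs up only as far as it must :)
replace('abc 12 def 34', '^.*([0-9]+).*$', '$1')    (: returns '4' :)
```

Without the question mark, the leading .* swallows everything except the last
digit, so the captured "integer" is just '4'.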

The third argument of replace() is $1. Recall that $1 refers to whatever the
parenthesized [0-9]+ matched.

Thus, here's what Liam's solution is doing: the regex matches the entire
string in $TEXT, and replace() swaps it for the first substring that matches
[0-9]+.
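
One thing to watch for: when nothing in $TEXT matches $PATTERN, the regex as a
whole does not match either, so replace() hands back the input unchanged
rather than an empty result. A minimal sketch of one possible guard follows.
It assumes $TEXT contains no newlines and that $PATTERN is a simple pattern
such as [0-9]+ that cannot match the empty string; the sample string is made
up for illustration.

```xpath
(: with no match, replace() returns its input as-is :)
replace('no digits here', '^.*?([0-9]+).*$', '$1')   (: returns 'no digits here' :)

(: guard with matches() so that "no match" yields the empty sequence :)
let $TEXT := 'no digits here',
    $PATTERN := '[0-9]+'
return
  if (matches($TEXT, $PATTERN))
  then replace($TEXT, '^.*?(' || $PATTERN || ').*$', '$1')
  else ()
```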

Examples:

replace('The person put 12 dollars into the jar', '^.*?([0-9]+).*$', '$1')
returns '12'

replace('HELLO WORLD', '^.*?(A|E|I|O|U).*$', '$1') returns 'E'
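
For comparison, and not as part of Liam's solution, XPath 3.0 and later also
provide analyze-string(), which returns an element whose match children hold
the matched substrings. A sketch, using the *:match wildcard so that no
namespace prefix needs to be bound:

```xpath
(: first substring that matches the pattern :)
analyze-string('The person put 12 dollars into the jar', '[0-9]+')
  /*:match[1]/string()                               (: returns '12' :)

(: dropping the [1] yields every match, in order :)
analyze-string('HELLO WORLD', 'A|E|I|O|U')/*:match/string()
                                                     (: returns 'E', 'O', 'O' :)
```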
