Subject: [xsl] Analyzing text by extracting substrings that match regex patterns
From: "Roger L Costello costello@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 13 Mar 2025 22:00:57 -0000

One way of analyzing text is to look for patterns in the text, that is, to
find the substrings that match the patterns.
The following XPath determines if the string in $TEXT contains a vowel:
matches($TEXT, 'A|E|I|O|U')
If there is a match, then which vowel was matched? A? E? I? O? or U? It may be
desirable to know that. The matches() function doesn't tell us; it returns
only true or false.
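To make the limitation concrete, here is a minimal sketch in Python rather
than XPath. Python's re module is only a stand-in here, and contains_match is
a name made up for this illustration. Like matches(), it answers yes or no and
nothing more:

```python
import re

def contains_match(text: str, pattern: str) -> bool:
    """Rough analogue of XPath matches(): True if pattern occurs anywhere in text.

    Hypothetical helper for illustration; not part of XPath or of Python's re.
    """
    return re.search(pattern, text) is not None

# Both calls only say whether a vowel is present, not which one was found.
print(contains_match("HELLO WORLD", "A|E|I|O|U"))  # True
print(contains_match("RHYTHM", "A|E|I|O|U"))       # False
```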
We might wish to know if $TEXT contains integers. If the value of $TEXT is
"The person put 12 dollars into the jar"
and we look in $TEXT for the pattern [0-9]+ then we expect to get the answer
"12".
General Problem Statement: If the value of $TEXT is a string and the value of
$PATTERN is a regex pattern, then extract the substring of $TEXT that matches
$PATTERN.
As stated above, the XPath matches() function indicates whether $PATTERN is
matched in $TEXT, but it does not tell you the substring that matched.
Liam Quin provided a wicked cool XPath expression that does the job:
replace($TEXT,'^.*?(' || $PATTERN || ').*$', '$1')
Let's see what that is doing. The value of the first argument $TEXT is a
string. We want some or all of the string replaced. The second argument
identifies the portion of the string to be replaced. Suppose the value of
$PATTERN is [0-9]+. Then the second argument is this regex:
^.*?([0-9]+).*$
where,
^ means "start of the string"
.*? means "gobble up zero or more of any character, but gobble only until you
get to a string that matches the next part of the regex (the question mark
signals that the characters are to be gobbled up in a 'non-greedy' fashion)."
([0-9]+) means "one or more digits, i.e., an integer. By wrapping [0-9]+
within parentheses, we can refer to it using $1."
.* means "zero or more of any character."
$ means "end of the string"
The third argument of replace() is '$1'. Recall that $1 refers to whatever
the parenthesized group ([0-9]+) matched.
Thus, here's what Liam's solution is doing: replace the entire string in $TEXT
with the first substring that matches [0-9]+.
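The same idea can be sketched outside XPath. The following is a rough Python
analogue, not Liam's expression itself. It assumes Python's re module treats
these simple patterns the same way XPath does, which holds for [0-9]+ and
plain alternation, although the two regex dialects are not identical in
general. The function name extract_first is hypothetical, and \1 is Python's
spelling of $1:

```python
import re

def extract_first(text: str, pattern: str) -> str:
    """Replace the whole string with the first substring matching pattern.

    Hypothetical helper mirroring replace($TEXT, '^.*?(PATTERN).*$', '$1').
    Assumes single-line text that does contain a match.
    """
    # ^.*?  lazily skips everything before the first match
    # (...) captures the match itself as group 1
    # .*$   swallows the rest of the string
    wrapped = "^.*?(" + pattern + ").*$"
    return re.sub(wrapped, r"\1", text)

print(extract_first("The person put 12 dollars into the jar", "[0-9]+"))  # 12
```

Because the non-greedy prefix stops at the earliest possible position, only
the first match survives: extract_first('a1b22', '[0-9]+') gives '1'.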
Examples:
replace('The person put 12 dollars into the jar', '^.*?([0-9]+).*$', '$1')
returns '12'
replace('HELLO WORLD', '^.*?(A|E|I|O|U).*$', '$1') returns 'E'
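One thing worth checking before relying on the expression: when $PATTERN does
not occur in $TEXT, the wrapped regex matches nothing, so replace() has
nothing to replace and hands back $TEXT unchanged. A guard with matches()
avoids mistaking the whole input for a match. Here is a sketch of that guard
in Python, again only as a stand-in for XPath. The name extract_or_none is
hypothetical, and single-line input is assumed, since . does not match a
newline by default:

```python
import re
from typing import Optional

def extract_or_none(text: str, pattern: str) -> Optional[str]:
    """Return the first substring matching pattern, or None if there is none.

    Hypothetical helper; the re.search test plays the role of
    matches($TEXT, $PATTERN) in XPath.
    """
    if re.search(pattern, text) is None:
        return None
    return re.sub("^.*?(" + pattern + ").*$", r"\1", text)

# Without the guard, a non-matching input comes back untouched.
print(re.sub("^.*?([0-9]+).*$", r"\1", "no digits here"))  # no digits here

# With the guard, the absence of a match is reported explicitly.
print(extract_or_none("no digits here", "[0-9]+"))  # None
print(extract_or_none("HELLO WORLD", "A|E|I|O|U"))  # E
```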