Subject: [xsl] Analyzing text by extracting substrings that match regex patterns
From: "Roger L Costello costello@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 13 Mar 2025 22:00:57 -0000

One way of analyzing text is to look for patterns in it, that is, to find the substrings that match the patterns.

The following XPath determines whether the string in $TEXT contains an uppercase vowel:

    matches($TEXT, 'A|E|I|O|U')

If there is a match, which vowel was matched? A? E? I? O? Or U? It may be desirable to know that. The matches() function doesn't tell.

We might also wish to know whether $TEXT contains integers. If the value of $TEXT is "The person put 12 dollars into the jar" and we look in $TEXT for the pattern [0-9]+ then we expect to get the answer "12".

General Problem Statement: If the value of $TEXT is a string and the value of $PATTERN is a regex pattern, then extract the substring of $TEXT that matches $PATTERN.

As stated above, the XPath matches() function indicates whether $PATTERN is matched in $TEXT, but it does not tell you the substring that matched.

Liam Quin provided a wicked cool XPath expression that does the job:

    replace($TEXT, '^.*?(' || $PATTERN || ').*$', '$1')

Let's see what that is doing.

The value of the first argument, $TEXT, is a string. We want some or all of the string replaced.

The second argument identifies the portion of the string to be replaced. Suppose the value of $PATTERN is [0-9]+. Then the second argument is this regex:

    ^.*?([0-9]+).*$

where:

    ^ means "start of the string."
    .*? means "gobble up zero or more of any character, but only until you get to a string that matches the next part of the regex." The question mark signals that the characters are to be gobbled up in a non-greedy fashion.
    ([0-9]+) means "one or more digits, i.e., an integer." By wrapping [0-9]+ within parentheses, we can refer to what it matched using $1.
    .* means "zero or more of any character."
    $ means "end of the string."

The third argument of replace() is $1. Recall that $1 refers to the substring matched by ([0-9]+).

Thus, here's what Liam's solution is doing: the regex matches the entire string in $TEXT, and replace() substitutes for it the first substring that matches [0-9]+.

Examples:

    replace('The person put 12 dollars into the jar', '^.*?([0-9]+).*$', '$1')
    returns '12'

    replace('HELLO WORLD', '^.*?(A|E|I|O|U).*$', '$1')
    returns 'E'
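
One caveat may be worth sketching. When $PATTERN does not occur anywhere in $TEXT, the regex in the second argument matches nothing, so replace() returns $TEXT unchanged rather than an empty result. A guarded variant is shown below. It is only a sketch, and it assumes XPath 3.0 or later (for the || operator) and that $TEXT contains no line breaks, since without the "s" flag "." does not match a newline. The choice to return the empty sequence on no match is an assumption of this sketch, not part of Liam's expression.

```xpath
(: Return the first substring of $TEXT that matches $PATTERN,
   or the empty sequence when there is no match at all. :)
if (matches($TEXT, $PATTERN))
then replace($TEXT, '^.*?(' || $PATTERN || ').*$', '$1')
else ()
```

For instance, with $TEXT set to 'no digits here' and $PATTERN set to '[0-9]+', the unguarded replace() gives back 'no digits here', whereas the guarded form yields the empty sequence.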