Subject: Re: [xsl] Analyzing text by extracting substrings that match regex patterns
From: "David Carlisle d.p.carlisle@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 14 Mar 2025 09:52:37 -0000
It isn't clear if you are just looking for XPath solutions or if XSL is OK.

xsl:analyze-string was introduced for exactly this kind of use; it gives
access to each substring matched by a ()-group in the regex.

Newer XPath versions even have an XPath version of this as well:

https://qt4cg.org/specifications/xpath-functions-40/Overview.html#func-analyze-string

David

On Thu, 13 Mar 2025 at 22:01, Roger L Costello costello@xxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> One way of analyzing text is to look for patterns in the text: find the
> substrings that match the patterns.
>
> The following XPath determines if the string in $TEXT contains a vowel:
>
>     matches($TEXT, 'A|E|I|O|U')
>
> If there is a match then which vowel was matched? A? E? I? O? or U? It may
> be desirable to know that. The matches() function doesn't tell.
>
> We might wish to know if $TEXT contains integers. If the value of $TEXT is
>
>     "The person put 12 dollars into the jar"
>
> and we look in $TEXT for the pattern [0-9]+ then we expect to get the
> answer "12".
>
> General Problem Statement: If the value of $TEXT is a string and the value
> of $PATTERN is a regex pattern, then extract the substring of $TEXT that
> matches $PATTERN.
>
> As stated above, the XPath matches() function indicates whether $PATTERN
> is matched in $TEXT, but it does not tell you the substring that matched.
>
> Liam Quin provided a wicked cool XPath expression that does the job:
>
>     replace($TEXT,'^.*?(' || $PATTERN || ').*$', '$1')
>
> Let's see what that is doing. The value of the first argument $TEXT is a
> string. We want some or all of the string replaced. The second argument
> identifies the portion of the string to be replaced. Suppose the value of
> $PATTERN is [0-9]+ then the second argument is this regex:
>
>     ^.*?([0-9]+).*$
>
> where,
>
> ^ means "start of the string"
>
> .*? means "gobble up zero or more of any character, but gobble only until
> you get to a string that matches the next part of the regex (the question
> mark signals that the characters are to be gobbled up in a 'non-greedy'
> fashion)."
>
> ([0-9]+) means "one or more digits, i.e., an integer. By wrapping [0-9]+
> within parentheses, we can refer to it using $1."
>
> .* means "zero or more of any character."
>
> $ means "end of the string"
>
> The third argument of replace() is $1. Recall that $1 refers to the
> substring matched by ([0-9]+).
>
> Thus, here's what Liam's solution is doing: replace the entire string in
> $TEXT with the substring that matches [0-9]+.
>
> Examples:
>
>     replace('The person put 12 dollars into the jar', '^.*?([0-9]+).*$', '$1')
>     returns '12'
>
>     replace('HELLO WORLD', '^.*?(A|E|I|O|U).*$', '$1') returns 'E'
>
> From a technologist's point of view, using replace() to extract the
> substring that matches the pattern is wicked cool, but the solution is
> complex and nonintuitive: why would anyone think "replace" when trying to
> obtain the substring matched by a pattern?
>
> SNOBOL solves the problem much more simply.
>
> In order to find out which string is matched, a name may be attached to a
> pattern. If the pattern matches, the matched substring is assigned as value
> to the name. This feature is called value assignment in pattern matching. A
> name is attached to a pattern by using the binary assignment operator
> indicated by a period. For example
>
>     NVOWEL = VOWEL . V
>
> assigns NVOWEL a pattern that matches a vowel. The name V is attached to
> this pattern. If the pattern matches, the substring it matches is assigned
> to V. For example, if the value of TEXT is "HELLO WORLD", the statement
>
>     TEXT NVOWEL
>
> succeeds and the string E is assigned to V.
>
> Names may be assigned to components of patterns in as many places as
> desired. In this way the substrings matched by different components of the
> pattern can be determined. A simple example is
>
>     DVOWEL = VOWEL . V1 VOWEL . V2
>
> in which V1 is attached to the first vowel and V2 to the second. If DVOWEL
> matches, the individual vowels are assigned to V1 and V2.
>
> Let's see how to express the example shown above where we extract the
> integer in a string.
>
> SNOBOL does not use [0-9]+ to represent a series of digits; instead,
> SNOBOL uses
>
>     SPAN('0123456789')
>
> which matches a contiguous sequence of digits. So, here's how to extract
> the integer in TEXT:
>
>     TEXT SPAN('0123456789') . V
>
> If the value of TEXT is 'The person put 12 dollars into the jar' then the
> value of V is '12'.
>
> In my opinion, SNOBOL provides a superior solution to the task of
> extracting the substring of TEXT that matches PATTERN.
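
As a rough illustration of the xsl:analyze-string approach mentioned in the reply, here is a minimal sketch applied to the integer example from the quoted message. It is not taken from the thread: the stylesheet wrapper, the named template "main", and the use of stylesheet parameters for TEXT and PATTERN are assumptions made for the example, and it assumes an XSLT 2.0 (or later) processor that can start at a named template.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Values from the quoted message; parameter names are illustrative. -->
  <xsl:param name="TEXT" select="'The person put 12 dollars into the jar'"/>
  <xsl:param name="PATTERN" select="'[0-9]+'"/>

  <!-- Hypothetical entry point: run with "main" as the initial template. -->
  <xsl:template name="main">
    <!-- The regex attribute is an attribute value template, so {$PATTERN}
         is expanded; the surrounding parentheses make it group 1. -->
    <xsl:analyze-string select="$TEXT" regex="({$PATTERN})">
      <xsl:matching-substring>
        <!-- Emit each matched substring on its own line. -->
        <xsl:value-of select="regex-group(1)"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With the parameter values shown, the only match is 12, so that is the single line of output. Unlike the replace() idiom, the xsl:matching-substring body runs once for every match, so a text containing several integers would yield one line per integer. Because regex is an attribute value template, any literal curly braces in a regex written directly in the attribute would need to be doubled.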