Re: [xsl] I desire this function: substring-before(string, regex charset)

From: "G. Ken Holman g.ken.holman@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 13 Apr 2025 12:59:44 -0000

Wouldn't the non-greedy '?' quantifier in XPath regular expressions help you here?

replace('THEN THE CURTAIN FELL','(^.*?)[AEIOU].*$','$1')

... returns "TH".

I hope this helps. But I haven't tested edge cases.
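
Two edge cases seem worth guarding against, offered here as a sketch rather than a tested solution. When the input contains no vowel, the pattern does not match and replace() returns the input unchanged, whereas substring-before() returns a zero-length string when there is no match. Also, '.' does not match a newline unless the 's' flag is supplied. Assuming XPath 3.0 or later (for the let expression), and with $s as a purely illustrative variable name:

let $s := 'THEN THE CURTAIN FELL'
return
  if (matches($s, '[AEIOU]'))
  then replace($s, '^(.*?)[AEIOU].*$', '$1', 's')
  else ''

For this input the result is still "TH"; for an input without a vowel, such as 'XYZ', it is the zero-length string.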

. . . . . . Ken

At 13/04/2025 12:43 +0000, Roger L Costello costello@xxxxxxxxx wrote:
Hi Folks,

The XPath substring-before function returns "that part of the given input string that occurs before the first occurrence of the string given in $arg2." [definition from SAXON web page]

substring-before($arg1 as xs:string?, $arg2 as xs:string?) --> xs:string

It's a shame that the value of $arg2 can't be a regex character set, e.g.,

substring-before("THEN THE CURTAIN FELL", '[AEIOU]')

returns TH.

Even better, it would be nice if there were a third argument specifying that you also want the character that was matched from the character set:

substring-before("THEN THE CURTAIN FELL", '[AEIOU]', 'plus matched charset character')

returns THE.

I believe such a function would be useful.
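
Until such a function exists, something close to it can probably be written as a user-defined function. The following is only a sketch under stated assumptions: the name f:substring-before-charset, its parameter names, and the boolean third argument are invented for illustration; the "function" namespace URI is arbitrary; and $charset is assumed to be a regex that matches exactly one character, such as '[AEIOU]'. It relies on tokenize(), whose first token is the text before the first match.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:f="function"
                exclude-result-prefixes="#all"
                version="3.0">

    <!-- Hypothetical helper, not a standard XPath function -->
    <xsl:function name="f:substring-before-charset" as="xs:string">
        <xsl:param name="text" as="xs:string"/>
        <!-- assumed to match exactly one character, e.g. '[AEIOU]' -->
        <xsl:param name="charset" as="xs:string"/>
        <xsl:param name="include-match" as="xs:boolean"/>
        <xsl:choose>
            <!-- no character from the set: mirror substring-before() -->
            <xsl:when test="not(matches($text, $charset))">
                <xsl:sequence select="''"/>
            </xsl:when>
            <xsl:otherwise>
                <!-- the first token is the text before the first match -->
                <xsl:variable name="before" as="xs:string"
                              select="tokenize($text, $charset)[1]"/>
                <xsl:sequence select="if ($include-match)
                                      then substring($text, 1, string-length($before) + 1)
                                      else $before"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:function>

    <xsl:template match="/*">
        <xsl:message select="f:substring-before-charset('THEN THE CURTAIN FELL', '[AEIOU]', false())"/>
        <xsl:message select="f:substring-before-charset('THEN THE CURTAIN FELL', '[AEIOU]', true())"/>
    </xsl:template>

</xsl:stylesheet>

With the sample text, the two messages should be TH and THE respectively.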

SNOBOL has such a function.

Let's see how such functionality could be used. I have this text:

THEN THE CURTAIN FELL

Fetch the string preceding the first vowel, plus the vowel:

THE

However, instead of fetching the string plus vowel, modify the text by nullifying the string plus vowel:

N THE CURTAIN FELL

Repeat on the new, shortened text.

Here is the text as it is repeatedly shortened:

THEN THE CURTAIN FELL
N THE CURTAIN FELL
 CURTAIN FELL
RTAIN FELL
IN FELL
N FELL
LL

General Problem Statement: There is a text string. There is a character set. Strip off the string prior to the first occurrence of a character from the character set, plus the character. Repeat until the end of text is reached.

Below I show how to implement this in SNOBOL and then in XSLT. My XSLT solution is large and complex. Is there a simpler, shorter solution?

First, the SNOBOL solution:

Assign the variable TEXT a string:

TEXT = "THEN THE CURTAIN FELL"

BREAK is a built-in SNOBOL function. It has one argument, which is a character set. BREAK returns a pattern that matches a string up to, but not including, the first character that belongs to the character set. E.g.,

BREAK("AEIOU")

returns a pattern that matches characters up to but not including a vowel. This pattern:

BREAK("AEIOU") LEN(1)

matches characters up to a vowel, plus the vowel.

Note: LEN(N) means, match any N-length character string. It is SNOBOL's version of the regex .{N}

The following statement applies the pattern to TEXT, replacing the string plus vowel with null:

TEXT BREAK("AEIOU") LEN(1) =
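
For comparison, here is a rough XPath counterpart of this one statement, offered as a sketch. BREAK("AEIOU") corresponds approximately to the regex [^AEIOU]* followed by a vowel that must be present, and LEN(1) consumes that vowel. Anchoring with ^ is an assumption made here to mimic a match that starts at the beginning of TEXT:

replace('THEN THE CURTAIN FELL', '^[^AEIOU]*[AEIOU]', '')

This evaluates to "N THE CURTAIN FELL". Because the replacement string is empty, no capturing group is needed.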

To incrementally strip away the string, put the statement inside a loop:

LOOP TEXT BREAK("AEIOU") LEN(1) = :F(END)
OUTPUT = TEXT :(LOOP)


Here is the output from running the SNOBOL program:

THEN THE CURTAIN FELL
N THE CURTAIN FELL
 CURTAIN FELL
RTAIN FELL
IN FELL
N FELL
LL

Nice.

Below is my XSLT solution. It uses the replace idea that Liam provided a few weeks back, which is neat. Whereas the SNOBOL solution takes only 2 lines of code, the XSLT solution requires many lines of code. Is there a simpler, shorter solution?

Lesson Learned: when designing a new language, it might be useful for the language to provide something like the SNOBOL BREAK function.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:f="function"
                exclude-result-prefixes="#all"
                version="3.0">

    <!-- Returns the sequence of successively shortened strings -->
    <xsl:function name="f:remove-up-to-vowel" as="xs:string*">
        <xsl:param name="TEXT" as="xs:string"/>
        <xsl:choose>
            <!-- end of string? -->
            <xsl:when test="$TEXT eq ''"/>
            <xsl:otherwise>
                <!-- strip everything up to and including the first vowel -->
                <xsl:variable name="substring-after-vowel"
                              select="replace($TEXT, '^[^AEIOU]*[AEIOU](.*)$', '$1')"
                              as="xs:string*"/>
                <xsl:sequence select="$substring-after-vowel"/>
                <xsl:choose>
                    <!-- no vowel left: stop recursing -->
                    <xsl:when test="not(matches($substring-after-vowel,'[AEIOU]'))"/>
                    <xsl:otherwise>
                        <xsl:sequence select="f:remove-up-to-vowel($substring-after-vowel)"/>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:function>


    <xsl:template match="/*">
        <xsl:variable name="result"
                   select="f:remove-up-to-vowel('THEN THE CURTAIN FELL')"
                  as="xs:string*"/>
        <xsl:for-each select="$result">
            <xsl:message>
                <xsl:value-of select="."/>
            </xsl:message>
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>
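
One possibly shorter formulation, offered as a sketch rather than a definitive answer: instead of stripping the text repeatedly, find every position whose character belongs to the character set and return the substring that follows it. The function name f:tails-after-charset and its parameter names are hypothetical, and $charset is assumed to be a regex matching exactly one character.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:f="function"
                exclude-result-prefixes="#all"
                version="3.0">

    <!-- Hypothetical helper: one tail per character found in the set -->
    <xsl:function name="f:tails-after-charset" as="xs:string*">
        <xsl:param name="text" as="xs:string"/>
        <xsl:param name="charset" as="xs:string"/>
        <!-- keep the positions holding a matching character,
             then map each position to the text that follows it -->
        <xsl:sequence select="(1 to string-length($text))
                                [matches(substring($text, ., 1), $charset)]
                              ! substring($text, . + 1)"/>
    </xsl:function>

    <xsl:template match="/*">
        <xsl:for-each select="f:tails-after-charset('THEN THE CURTAIN FELL', '[AEIOU]')">
            <xsl:message select="."/>
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>

For the sample text the vowels sit at positions 3, 8, 11, 14, 15 and 19, so this should yield the same six strings as f:remove-up-to-vowel, from "N THE CURTAIN FELL" down to "LL". Like that function, it does not emit the original, unshortened text.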



--
Contact info, blog, articles, etc. http://www.CraneSoftwrights.com/s/ |
Check our site for free XML, XSLT, XSL-FO and UBL developer resources |
Streaming hands-on XSLT/XPath 2 training class @US$125 (5 hours free) |
Essays (UBL, XML, etc.) http://www.linkedin.com/today/author/gkholman |
