Re: [xsl] Need an XPath expression which returns all xs:pattern elements containing a regex that permits an unbounded number of characters

From: "C. M. Sperberg-McQueen cmsmcq@xxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 4 Apr 2024 18:05:08 -0000

You seem to be close to a reasonably good solution already.

Unless I'm missing something, you've identified the only three ways that
a regular expression can match an unbounded number of characters: the *
and + operators, and a quantifier with a comma but no second argument.
That's a good start, I think.  Either of the first two can be escaped
with a single backslash, and none of them has a meaning as a quantifier
within a square-bracketed character-class expression.

The simplest first approximation would be very like the one you have
already tried: search for "*" or '+' or ',}' (it's a mistake to search
for '{1,}' or '{0,}' because an expression like "a{4,}" also matches
strings of unbounded length; I am assuming you don't know in advance
that the only minimum values used in numeric quantifiers are 0 and 1).

So something like:

    xs:pattern[matches(@value, "[*+]|,\}")]

As you have noticed, that pulls up false positives like '\*'.
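
For anyone who wants to try this first approximation outside an XPath
processor, here is a minimal sketch in Python.  It assumes that Python's
re module reads this particular regex the same way XPath's matches()
does; the name first_approx and the sample list are illustrative only,
with values modeled on the examples in this thread.

```python
import re

# First approximation: a star, a plus, or ",}" anywhere in the value.
FIRST = re.compile(r"[*+]|,\}")


def first_approx(value: str) -> bool:
    """Mimic matches(@value, ...): true if the regex matches anywhere."""
    return FIRST.search(value) is not None


samples = ["A*", "A+", "A{0,}", "A{4,}", "A{2,3}", r"A\*", r"A\+"]
for s in samples:
    # The last two samples are escaped, so a True there is a false positive.
    print(f"{s!r:10} {first_approx(s)}")
```

The escaped samples are reported as hits, which is the false positive
described above.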

A better approximation would be to search for any of:

  - '*' when not preceded by a backslash    
  - '+' when not preceded by a backslash
  - ',}'

I believe that, outside a character class, the string ",}" can appear in
a legal XSD regular expression only as part of a quantifier:  "\,}" would
escape the comma, but the right brace is not allowed without an escape,
so an escaped form would be ",\}", which won't match the string ",}".

So something like:

    xs:pattern[matches(@value, "[^\\][*+]|,\}")]

This second approximation will eliminate some false positives, but it
will still return a false positive on a pattern like "[?*+{,}]?", since
the characters of interest to us need not be escaped within a character
class expression.  It also will produce a false negative on "\\*", which
matches any number of backslash characters.
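
The behavior of this second approximation can be checked with a small
Python sketch.  As before, it assumes Python's re module agrees with
XPath's matches() on this regex; second_approx is a hypothetical helper
name, not part of any library.

```python
import re

# Second approximation: star or plus preceded by a non-backslash, or ",}".
SECOND = re.compile(r"[^\\][*+]|,\}")


def second_approx(value: str) -> bool:
    """Mimic matches(@value, ...): true if the regex matches anywhere."""
    return SECOND.search(value) is not None


cases = {
    "A*": "unbounded",
    r"A\*": "escaped star, bounded",
    "A{4,}": "unbounded",
    "[?*+{,}]?": "character class, bounded (false positive)",
    r"\\*": "escaped backslash then star, unbounded (false negative)",
}
for value, note in cases.items():
    print(f"{value!r:14} {second_approx(value)!s:6} {note}")
```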

A third approximation would ensure that we don't match * or + after a
single backslash, or between (unescaped) left and right square brackets,
by first imagining a simple finite state automaton and then translating
it into a regular expression.

  - in the NORMAL state:
    . a star or plus takes us to state MATCH
    . a comma takes us to state COMMA
    . a backslash takes us to state ESC
    . a left bracket takes us to state LB
    . anything else leaves us in state NORMAL
  - in state COMMA
    . a right brace takes us to MATCH
    . anything else takes us to NORMAL
  - in state ESC
    . any character takes us to NORMAL
  - in state LB
    . a right bracket takes us to NORMAL
    . a backslash takes us to state LBESC
    . anything else leaves us in state LB
  - in state LBESC
    . any character takes us to state LB
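
The automaton above can also be run directly, without translating it
into a regex.  The following Python sketch follows the state list
literally; the names State and reaches_match are illustrative
assumptions, and the function simply reports whether state MATCH is ever
reached while scanning the pattern value from left to right.

```python
from enum import Enum, auto


class State(Enum):
    NORMAL = auto()
    COMMA = auto()
    ESC = auto()
    LB = auto()
    LBESC = auto()


def reaches_match(value: str) -> bool:
    """Return True if the automaton reaches MATCH on this pattern value."""
    state = State.NORMAL
    for ch in value:
        if state is State.NORMAL:
            if ch in "*+":
                return True          # MATCH
            if ch == ",":
                state = State.COMMA
            elif ch == "\\":
                state = State.ESC
            elif ch == "[":
                state = State.LB
            # anything else: stay in NORMAL
        elif state is State.COMMA:
            if ch == "}":
                return True          # MATCH
            state = State.NORMAL
        elif state is State.ESC:
            state = State.NORMAL
        elif state is State.LB:
            if ch == "]":
                state = State.NORMAL
            elif ch == "\\":
                state = State.LBESC
            # anything else: stay in LB
        else:                        # State.LBESC
            state = State.LB
    return False


for value in ["A*", r"A\*", r"\\*", "[?*+{,}]?", "A{4,}", "A{2,3}"]:
    print(f"{value!r:14} {reaches_match(value)}")
```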

So: the regex, anchored at the start of the value, should allow any
number of ordinary characters and of excursions to state COMMA, ESC, or
LB, followed by one of the strings we are looking for:

  "^([^*+,\\\[]|(,[^}])|(\\.)|(\[(\\.|[^\\\]])*\]))*([*+]|,\})"

Since character class expressions can nest in XSD, you can have
expressions like

  [\p{L}-[a-z]]

which means that in principle square brackets can nest arbitrarily deep,
and you would have to keep a stack in order to know reliably when you
get back to the normal state, outside of all square bracket pairs.  But
since a nested character class expression can occur only as the last
child of its parent, you don't need to keep track in practice:  as soon
as you see the first unescaped right bracket in state LB, you will in
any well formed expression see a series of right brackets. None of them
will match star, plus, or comma-right-brace, so there is no need to keep
a stack.
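
To illustrate why no stack is needed, here is a Python sketch that
compares two scanners: one leaves the character class at the first
unescaped right bracket, the other counts nesting depth.  Both function
names are hypothetical, and the comparison is only meant for well-formed
expressions.

```python
def hits_flat(value: str) -> list[int]:
    """Offsets of * or + outside brackets; leave the class at the first ]."""
    hits, in_class, escaped = [], False, False
    for i, ch in enumerate(value):
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch in "*+":
            hits.append(i)
    return hits


def hits_with_depth(value: str) -> list[int]:
    """Same search, but tracking how deeply the brackets are nested."""
    hits, depth, escaped = [], 0, False
    for i, ch in enumerate(value):
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0 and ch in "*+":
            hits.append(i)
    return hits


for value in [r"[\p{L}-[a-z]]*", "[a-z*]+x*", r"[\p{L}-[a-z]]"]:
    print(f"{value!r:20} {hits_flat(value)} {hits_with_depth(value)}")
```

On these samples the two scanners report the same offsets, because the
extra right brackets that the flat scanner meets in the normal state are
never mistaken for a star or plus.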

Note, however, that matching braces in XPath is complicated by the fact
that they often have special meaning in XPath.  If you can find a good
explanation of the escaping rules, read it before you try to make the
expression above work.  (If it were me, I'd place a bet on the sequence
comma plus right brace never occurring within a character class
expression, and use regex matching to deal with the escaping of * and +
and just use contains() to look for occurrences of ",}".  Of course, if
it were me, the cost of a false positive here and there would be low --
your mileage may vary.)
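
A minimal Python sketch of that simpler bet follows.  The regex used for
the escaping of * and + is my own guess at one way to do it, not
something taken from a specification, and Python's "in" operator stands
in for contains().  The name hybrid_check is hypothetical.  The sketch
makes no attempt to skip character classes.

```python
import re

# An unescaped star or plus: skip over ordinary characters and
# backslash-escaped pairs from the start, then require * or +.
UNESCAPED_STAR_OR_PLUS = re.compile(r"^([^\\*+]|\\.)*[*+]")


def hybrid_check(value: str) -> bool:
    """Regex for escaped */+, plain substring test for ',}'."""
    if UNESCAPED_STAR_OR_PLUS.match(value):
        return True
    return ",}" in value   # stands in for contains(@value, ',}')


for value in ["A*", r"A\*", r"\\*", "A{4,}", "A{2,3}", "[?*+{,}]?"]:
    print(f"{value!r:14} {hybrid_check(value)}")
```

The last sample is still reported as a hit, which is the kind of
occasional false positive this approach accepts.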

I hope this helps.

"Roger L Costello costello@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> writes:

> Hi Folks,
>
> I want to find, in an XML Schema, all xs:pattern elements containing a regex that permits an unbounded number of characters.
>
> Here are examples of xs:pattern elements that I want to find:
>
> <xs:pattern value="A*"/>
> <xs:pattern value="A+"/>
> <xs:pattern value="A{0,}"/>
> <xs:pattern value="A{1,}"/>
>
> I do not want either of the following xs:pattern elements because -- due to the escape symbol -- they do not permit an unbounded number of characters:
>
> <xs:pattern value="A\*"/>
> <xs:pattern value="A\+"/>
>
> I created an XPath 2.0 expression to find the desired xs:pattern elements:
>
> xs:pattern[
>         contains(@value, '*') or 
>         contains(@value, '+') or 
>         contains(@value, '{1,}') or 
>         contains(@value, '{0,}')
>     ]
>
> Eek! That is not correct. It incorrectly returns the xs:pattern elements with escaped asterisk and escaped plus symbols:
>
> <xs:pattern value="A\*"/>
> <xs:pattern value="A\+"/>
>
> How to fix my XPath expression? Is the solution to add a second predicate:
>
> xs:pattern[
>         contains(@value, '*') or 
>         contains(@value, '+') or 
>         contains(@value, '{1,}') or 
>         contains(@value, '{0,}')
>     ][
>         not(contains(@value, '\*')) and
>         not(contains(@value, '\+'))
>     ]
>
> Is that correct? Is that the best approach? Is there a better approach?
>
> Bonus points if you can answer this question: Is my XPath expression catching all xs:pattern elements that have a regex that permits an unbounded number of characters?
>
> Note: For reasons that I will not explain, the XPath expression must be an XPath 2.0 expression.
>
> /Roger


-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
