Subject: Re: [xsl] Need an XPath expression which returns all xs:pattern elements containing a regex that permits an unbounded number of characters
From: "C. M. Sperberg-McQueen cmsmcq@xxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 4 Apr 2024 18:05:08 -0000

You seem to be close to a reasonably good solution already.
Unless I'm missing something, you've identified the only three ways that
a regular expression can match an unbounded number of characters: the *
and + operators, and a quantifier with a comma but no second argument.
That's a good start, I think. Either of the first two can be escaped
with a single backslash, and none of them has a meaning as a quantifier
within a square-bracketed character-class expression.
The simplest first approximation would be very like the one you have
already tried: search for '*' or '+' or ',}' (it's a mistake to search
for '{1,}' or '{0,}' because an expression like "a{4,}" also matches
strings of unbounded length; I am assuming you don't know in advance
that the only minimum values used in numeric quantifiers are 0 and 1).
So something like:
xs:pattern[matches(@value, "[*+]|,\}")]
As you have noticed, that pulls up false positives like '\*'.
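To see the false positive concretely, here is a minimal Python sketch of
the same search. Python's re module is only a stand-in for the XPath
matches() function (for a regex this simple the two dialects behave
alike), and the function name and sample list are hypothetical; the
sample values are taken from this thread.

```python
import re

# First approximation: any star, any plus, or a comma followed by a right brace.
FIRST = re.compile(r"[*+]|,\}")

def first_approximation(value):
    """Return True if the naive search flags this pattern value."""
    return FIRST.search(value) is not None

samples = ["A*", "A+", "A{4,}", r"A\*", r"A\+", "A{2,5}"]
flagged = [v for v in samples if first_approximation(v)]

# The escaped forms A\* and A\+ are flagged too: those are the false positives.
print(flagged)
```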
A better approximation would be to search for any of:
- '*' when not preceded by a backslash
- '+' when not preceded by a backslash
- ',}'
I believe the string ",}" can appear in a legal XSD regular expression
(outside a character class expression) only as part of a quantifier:
"\,}" would escape the comma, but the right brace is not allowed without
an escape, so an escaped form would be ",\}", which won't match the
string ",}".
So something like:
xs:pattern[matches(@value, "[^\\][*+]|,\}")]
This second approximation will eliminate some false positives, but it
will still return a false positive on a pattern like "[?*+{,}]?", since
the characters of interest to us need not be escaped within a character
class expression. It also will produce a false negative on "\\*", which
matches any number of backslash characters.
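Both failure modes are easy to reproduce. The sketch below again uses
Python's re module as a stand-in for matches(); the function name is
hypothetical.

```python
import re

# Second approximation: star or plus not preceded by a backslash, or ",}".
SECOND = re.compile(r"[^\\][*+]|,\}")

def second_approximation(value):
    """Return True if the second approximation flags this pattern value."""
    return SECOND.search(value) is not None

escaped_star = second_approximation(r"A\*")         # correctly not flagged
inside_class = second_approximation("[?*+{,}]?")    # flagged: false positive
backslash_run = second_approximation(r"\\*")        # not flagged: false negative

print(escaped_star, inside_class, backslash_run)
```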
A third approximation would ensure that we don't match * or + after a
single backslash, or between (unescaped) left and right square brackets,
by first imagining a simple finite state automaton and then translating
it into a regular expression.
- in the NORMAL state:
  . a star or plus takes us to state MATCH
  . a comma takes us to state COMMA
  . a backslash takes us to state ESC
  . a left bracket takes us to state LB
  . anything else leaves us in state NORMAL
- in state COMMA
  . a right brace takes us to MATCH
  . anything else is handled as it would be in state NORMAL
    (so ",+" still reaches MATCH and ",\*" does not)
- in state ESC
  . any character takes us to NORMAL
- in state LB
  . a right bracket takes us to NORMAL
  . a backslash takes us to state LBESC
  . anything else leaves us in state LB
- in state LBESC
  . any character takes us to state LB
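Before translating the automaton into a regex, it can be useful to run it
directly. The following Python sketch walks a pattern value one character
at a time; the function name is hypothetical. In state COMMA it processes
any character other than a right brace as it would in NORMAL, so that
",+" is flagged and ",\*" is not.

```python
def permits_unbounded(value):
    """Scan an XSD pattern value; return True if state MATCH is reached."""
    state = "NORMAL"
    for ch in value:
        if state == "COMMA":
            if ch == "}":
                return True        # ",}" ends an open-ended quantifier
            state = "NORMAL"       # otherwise handle ch as in NORMAL below
        if state == "NORMAL":
            if ch in "*+":
                return True        # unescaped star or plus
            elif ch == ",":
                state = "COMMA"
            elif ch == "\\":
                state = "ESC"
            elif ch == "[":
                state = "LB"
        elif state == "ESC":
            state = "NORMAL"       # the escaped character is skipped
        elif state == "LB":
            if ch == "]":
                state = "NORMAL"
            elif ch == "\\":
                state = "LBESC"
        elif state == "LBESC":
            state = "LB"           # escaped character inside brackets
    return False

for sample in ["A*", r"A\*", "A{4,}", "A{2,5}", "[?*+{,}]?", r"\\*"]:
    print(sample, permits_unbounded(sample))
```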
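So: the regex should be anchored at the start of the value (matches()
would otherwise be free to start its match just after a backslash or
inside a pair of brackets) and allow any number of ordinary characters
and excursions to state ESC or LB, followed by one of the strings we are
looking for. A comma can be consumed as an ordinary character; the regex
engine backs up when the comma turns out to begin ",}".
"^([^*+\\\[]|\\.|\[(\\.|[^\\\]])*\])*([*+]|,\})"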
Since character class expressions can nest in XSD, you can have
expressions like
[\p{L}-[a-z]]
which means that in principle square brackets can nest arbitrarily deep,
and you would have to keep a stack in order to know reliably when you
get back to the normal state, outside of all square bracket pairs. But
since a nested character class expression can occur only as the last
child of its parent, you don't need to keep track in practice: as soon
as you see the first unescaped right bracket in state LB, you will in
any well formed expression see a series of right brackets. None of them
will match star, plus, or comma-right-brace, so there is no need to keep
a stack.
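The no-stack argument can be checked with a small Python sketch. The
regex below is one possible anchored translation of the automaton, not
the only one. It treats a lone comma as an ordinary character, and it
leaves a bracketed expression at the first unescaped right bracket. Any
further right brackets are then consumed as ordinary characters. Python's
re module stands in for matches(), and the names are hypothetical.

```python
import re

UNBOUNDED = re.compile(
    r"^([^*+\\\[]"            # an ordinary character (a lone comma included)
    r"|\\."                   # ESC: a backslash and the character it escapes
    r"|\[(\\.|[^\\\]])*\]"    # LB: everything up to the first unescaped ]
    r")*"
    r"([*+]|,\})"             # MATCH: unescaped star, plus, or ",}"
)

def permits_unbounded_regex(value):
    """Return True if the anchored regex reaches a star, plus, or ',}'."""
    return UNBOUNDED.search(value) is not None

# Nested character classes: the surplus right brackets cause no trouble.
print(permits_unbounded_regex(r"[\p{L}-[a-z]]"))    # bounded: one character
print(permits_unbounded_regex(r"[\p{L}-[a-z]]*"))   # unbounded
print(permits_unbounded_regex("[a-z-[*+]]"))        # star and plus are in brackets
```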
Note, however, that matching braces in XPath is complicated by the fact
that they often have special meaning in the surrounding language (in an
XSLT attribute value template, for example). If you can find a good
explanation of the escaping rules, read it before you try to make the
expression above work. (If it were me, I'd place a bet on the sequence
comma plus right brace never occurring within a character class
expression, and use regex matching to deal with the escaping of * and +
and just use contains() to look for occurrences of ",}". Of course, if
it were me, the cost of a false positive here and there would be low --
your mileage may differ.)
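The bet described in that parenthesis can be sketched in Python as
follows. The regex for the star and plus half is an assumption of this
sketch, not something given above: it tracks backslash escapes only and
ignores character classes, so a value like "[*]" is still flagged. The
substring test plays the role of contains(); the names are hypothetical.

```python
import re

# A star or plus reached after skipping ordinary characters and escape pairs.
UNESCAPED_STAR_OR_PLUS = re.compile(r"^([^\\*+]|\\.)*[*+]")

def hybrid(value):
    """Substring test for ',}' plus a regex for unescaped star or plus."""
    return ",}" in value or UNESCAPED_STAR_OR_PLUS.search(value) is not None

for sample in ["A*", r"A\*", "A{4,}", "A{2,5}", r"\\*", "[*]"]:
    print(sample, hybrid(sample))
```

The last sample shows the price of the shortcut: characters inside a
character class are not recognized as harmless.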
I hope this helps.
"Roger L Costello costello@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> writes:
> Hi Folks,
>
> I want to find, in an XML Schema, all xs:pattern elements containing a regex that permits an unbounded number of characters.
>
> Here are examples of xs:pattern elements that I want to find:
>
> <xs:pattern value="A*"/>
> <xs:pattern value="A+"/>
> <xs:pattern value="A{0,}"/>
> <xs:pattern value="A{1,}"/>
>
> I do not want either of the following xs:pattern elements because -- due to the escape symbol -- they do not permit an unbounded number of characters:
>
> <xs:pattern value="A\*"/>
> <xs:pattern value="A\+"/>
>
> I created an XPath 2.0 expression to find the desired xs:pattern elements:
>
> xs:pattern[
> contains(@value, '*') or
> contains(@value, '+') or
> contains(@value, '{1,}') or
> contains(@value, '{0,}')
> ]
>
> Eek! That is not correct. It incorrectly returns the xs:pattern elements with escaped asterisk and escaped plus symbols:
>
> <xs:pattern value="A\*"/>
> <xs:pattern value="A\+"/>
>
> How to fix my XPath expression? Is the solution to add a second predicate:
>
> xs:pattern[
> contains(@value, '*') or
> contains(@value, '+') or
> contains(@value, '{1,}') or
> contains(@value, '{0,}')
> ][
> not(contains(@value, '\*')) and
> not(contains(@value, '\+'))
> ]
>
> Is that correct? Is that the best approach? Is there a better approach?
>
> Bonus points if you can answer this question: Is my XPath expression catching all xs:pattern elements that have a regex that permits an unbounded number of characters?
>
> Note: For reasons that I will not explain, the XPath expression must be an XPath 2.0 expression.
>
> /Roger
>
>
>
>
--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com