Subject: Re: [xsl] Need an XPath expression which returns all xs:pattern elements containing a regex that permits an unbounded number of characters
From: "C. M. Sperberg-McQueen cmsmcq@xxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 4 Apr 2024 18:05:08 -0000

You seem to be close to a reasonably good solution already.

Unless I'm missing something, you've identified the only three ways that a regular expression can match an unbounded number of characters: the * and + operators, and a quantifier with a comma but no second argument. That's a good start, I think.

Either of the first two can be escaped with a single backslash, and none of them has a meaning as a quantifier within a square-bracketed character-class expression.

The simplest first approximation would be very like the one you have already tried: search for '*' or '+' or ',}'. (It's a mistake to search for '{1,}' or '{0,}' because an expression like "a{4,}" also matches strings of unbounded length; I am assuming you don't know in advance that the only minimum values used in numeric quantifiers are 0 and 1.) So something like:

    xs:pattern[matches(@value, "[*+]|,\}")]

As you have noticed, that pulls up false positives like '\*'.

A better approximation would be to search for any of:

- '*' when not preceded by a backslash
- '+' when not preceded by a backslash
- ',}'

I believe the string ",}" can appear in a legal XSD regular expression only as part of a quantifier: "\,}" would escape the comma, but the right brace is not allowed without an escape, so an escaped form would be ",\}", which won't match the string ",}".

So something like:

    xs:pattern[matches(@value, "[^\\][*+]|,\}")]

This second approximation will eliminate some false positives, but it will still return a false positive on a pattern like "[?*+{,}]?", since the characters of interest to us need not be escaped within a character class expression. It will also produce a false negative on "\\*", which matches any number of backslash characters.

A third approximation would ensure that we don't match * or + after a single backslash, or between (unescaped) left and right square brackets, by first imagining a simple finite state automaton and then translating it into a regular expression.

- in the NORMAL state:
  . a star or plus takes us to state MATCH
  . a comma takes us to state COMMA
  . a backslash takes us to state ESC
  . a left bracket takes us to state LB
  . anything else leaves us in state NORMAL
- in state COMMA
  . a right brace takes us to MATCH
  . anything else takes us to NORMAL
- in state ESC
  . any character takes us to NORMAL
- in state LB
  . a right bracket takes us to NORMAL
  . a backslash takes us to state LBESC
  . anything else leaves us in state LB
- in state LBESC
  . any character takes us to state LB

So: the regex should allow any number of excursions to state COMMA, ESC, or LB, followed by one of the strings we are looking for:

    "((,[^}])|(\\.)|(\[(\\.)*[^\]]\]))*([*+]|,\})"

Since character class expressions can nest in XSD, you can have expressions like

    [\p{L}-[a-z]]

which means that in principle square brackets can nest arbitrarily deep, and you would have to keep a stack in order to know reliably when you get back to the normal state, outside of all square bracket pairs. But since a nested character class expression can occur only as the last child of its parent, you don't need to keep track in practice: as soon as you see the first unescaped right bracket in state LB, you will in any well-formed expression see a series of right brackets. None of them will match star, plus, or comma-right-brace, so there is no need to keep a stack.

Note, however, that matching braces in XPath is complicated by the fact that they often have special meaning in XPath. If you can find a good explanation of the escaping rules, read it before you try to make the expression above work.

(If it were me, I'd place a bet on the sequence comma plus right brace never occurring within a character class expression, and use regex matching to deal with the escaping of * and + and just use contains() to look for occurrences of ",}". Of course, if it were me, the cost of a false positive here and there would be low -- your mileage may differ.)

I hope this helps.

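For anyone who wants to experiment with the automaton outside XPath, the state list above can be transcribed almost literally into a small program. What follows is only a rough Python sketch, not part of the XPath solution: the function name has_unbounded_quantifier is hypothetical, and it assumes the input is a well-formed XSD regular expression. It follows the state list as written, so it shares its simplifications (for instance, the character following a comma is consumed without further inspection).

```python
def has_unbounded_quantifier(pattern: str) -> bool:
    """Scan an XSD regex with the five-state automaton described above.

    Returns True as soon as the MATCH state is reached, i.e. when an
    unescaped '*' or '+' or the sequence ',}' is seen outside a
    character class expression. (Hypothetical helper, for illustration.)
    """
    state = "NORMAL"
    for ch in pattern:
        if state == "NORMAL":
            if ch in "*+":
                return True            # MATCH
            elif ch == ",":
                state = "COMMA"
            elif ch == "\\":
                state = "ESC"
            elif ch == "[":
                state = "LB"
            # anything else: stay in NORMAL
        elif state == "COMMA":
            if ch == "}":
                return True            # MATCH: quantifier with no upper bound
            state = "NORMAL"
        elif state == "ESC":
            state = "NORMAL"           # the escaped character is skipped
        elif state == "LB":
            if ch == "]":
                state = "NORMAL"
            elif ch == "\\":
                state = "LBESC"
            # anything else (including a nested '['): stay in LB
        elif state == "LBESC":
            state = "LB"
    return False


if __name__ == "__main__":
    samples = ["A*", "A+", "a{4,}", r"A\*", r"A\+", "[?*+{,}]?", r"\\*",
               r"[\p{L}-[a-z]]"]
    for s in samples:
        print(s, has_unbounded_quantifier(s))
```

On the examples from this thread, the sketch flags "A*", "A+", "a{4,}" and "\\*", and passes over "A\*", "A\+", "[?*+{,}]?" and "[\p{L}-[a-z]]".
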
"Roger L Costello costello@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> writes: > Hi Folks, > > I want to find, in an XML Schema, all xs:pattern elements containing a regex that permits an unbounded number of characters. > > Here are examples of xs:pattern elements that I want to find: > > <xs:pattern value="A*"/> > <xs:pattern value="A+"/> > <xs:pattern value="A{0,.}"/> > <xs:pattern value="A{1,.}"/> > > I do not want either of the following xs:pattern elements because -- due to the escape symbol -- they do not permit an unbounded number of characters: > > <xs:pattern value="A\*"/> > <xs:pattern value="A\+"/> > > I created an XPath 2.0 expression to find the desired xs:pattern elements: > > xs:pattern[ > contains(@value, '*') or > contains(@value, '+') or > contains(@value, '{1,}') or > contains(@value, '{0,}') > ] > > Eek! That is not correct. It incorrectly returns the xs:pattern elements with escaped asterisk and escaped plus symbols: > > <xs:pattern value="A\*"/> > <xs:pattern value="A\+"/> > > How to fix my XPath expression? Is the solution to add a second predicate: > > xs:pattern[ > contains(@value, '*') or > contains(@value, '+') or > contains(@value, '{1,}') or > contains(@value, '{0,}') > ][ > not(contains(@value, '\*')) and > not(contains(@value, '\+')) > ] > > Is that correct? Is that the best approach? Is there a better approach? > > Bonus points if you can answer this question: Is my XPath expression catching all xs:pattern elements that have a regex that permits an unbounded number of characters? > > Note: For reasons that I will not explain, the XPath expression must be an XPath 2.0 expression. > > /Roger > > > > -- C. M. Sperberg-McQueen Black Mesa Technologies LLC http://blackmesatech.com