RE: [xsl] Using analyze-string to catch roman numerals?

From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 9 Oct 2008 23:05:57 +0100

The two things wrong with your solution are:

(a) you're matching any sequence of letters that could be a roman numeral,
without looking at the context, hence matching the IX in APPENDIX.

(b) you're only matching the first thing in each element that looks like a
roman numeral.

The second is easily fixed: don't use an anchored regex in analyze-string,
which can only match the whole string once, like this:

regex="^(.*?)([IVXL]+)(.*?)$"

Instead use an unanchored regex:

regex="([IVXL]+)"

and add an xsl:non-matching-substring element that copies unmatched
substrings across unchanged (or case-converted if you want).
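
A minimal sketch of that change, assuming an XSLT 2.0 processor and keeping
the match="head" template from the original post. The stylesheet wrapper is
only there to make the example self-contained, and lower-casing the unmatched
text is the optional case conversion mentioned above. This addresses (b)
only: the IX in APPENDIX is still matched.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="head">
    <!-- Unanchored: analyze-string visits every run of I, V, X, L in turn -->
    <xsl:analyze-string select="." regex="([IVXL]+)">
      <xsl:matching-substring>
        <!-- Candidate numeral: copy it as it stands -->
        <xsl:value-of select="regex-group(1)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Everything between candidates: case-convert and copy -->
        <xsl:value-of select="lower-case(.)"/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Because every part of the string now falls into one branch or the other, the
xsl:choose and the two matches() tests from the original template are no
longer needed.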

Problem (a) is much harder. You can get a fair way by requiring the sequence
of IVXL to have non-letters before and after it. But you'll still be
matching the word "ILL" as a roman numeral when it clearly isn't. Like all
up-conversion tasks, though, it's very much up to you how much time you want
to spend fine-tuning the patterns and rules that you define.
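
One possible way to get the "non-letters before and after" condition,
sketched here as an assumption rather than as the only approach: XSLT 2.0
regexes have no lookahead or word-boundary anchors, so instead of matching
the numeral itself, match every whole run of letters and test each run
inside the matching-substring. The [A-Za-z] class assumes ASCII-only
headings.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="head">
    <!-- Each match is a complete word, so non-letters (or the ends of
         the string) necessarily surround it -->
    <xsl:analyze-string select="." regex="[A-Za-z]+">
      <xsl:matching-substring>
        <!-- Keep the word only if it consists entirely of I, V, X, L;
             otherwise lower-case it -->
        <xsl:value-of
            select="if (matches(., '^[IVXL]+$')) then . else lower-case(.)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Spaces and punctuation: copy unchanged -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

On the ten sample heads this should give the wanted output, since APPENDIX,
CALVIN, PROBLEMA and ILLUSTRATION all contain other letters. A standalone
"ILL" is still kept as a numeral. Tightening the test to something like
'^(XL|L?X{0,3})(IX|IV|V?I{0,3})$' (well-formed numerals up to 89) would reject
it, though words such as "I" or "LIV" would still pass. Note that the braces
are fine inside matches(), but in the regex attribute of analyze-string they
would have to be doubled, because that attribute is an attribute value
template.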

Michael Kay
http://www.saxonica.com/ 

> -----Original Message-----
> From: Tony Zanella [mailto:tony.zanella@xxxxxxxxx] 
> Sent: 09 October 2008 20:18
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] Using analyze-string to catch roman numerals?
> 
> Hello all,
> 
> Given the following input:
> 
> <root>
>     <head>CHAPTER II. THE WRECKED FOUNDATIONS OF DOMESTICITY</head>
>     <head>PROBLEMA. HELOISE XXIX.</head>
>     <head>Selected Letters</head>
>     <head>The Second Part of Henry IV.</head>
>     <head>VIII</head>
>     <head>APPENDIX VII</head>
>     <head>Appendix VII</head>
>     <head>APPENDIX</head>
>     <head>CALVIN XVII</head>
>     <head>ILLUSTRATION</head>
> </root>
> 
> and the following template:
> 
> <xsl:template match="head">
>         <xsl:choose>
>             <xsl:when test="not(matches(.,'^(.*?)([IVXL]+)(.*?)$'))">
>                 <xsl:value-of select="lower-case(.)"/>
>             </xsl:when>
>             <xsl:when test="matches(.,'^(.*?)([IVXL]+)(.*?)$')">
>                 <xsl:analyze-string select="." 
> regex="^(.*?)([IVXL]+)(.*?)$">
>                     <xsl:matching-substring>
>                         <xsl:value-of 
> select="lower-case(regex-group(1))"/>
>                         <xsl:value-of 
> select="upper-case(regex-group(2))"/>
>                         <xsl:value-of 
> select="lower-case(regex-group(3))"/>
>                     </xsl:matching-substring>
>                 </xsl:analyze-string>
>             </xsl:when>
>             <xsl:otherwise/>
>         </xsl:choose>
>     </xsl:template>
> 
> I'm trying to use analyze-string to do the following:
> Test for a roman numeral. If there isn't one, lower-case(.). 
> If there is one, break (.) into its roman numeral and 
> non-roman numeral parts, lower-case()ing the latter.
> 
> The output I get is:
> 
>     chapter II. the wrecked foundations of domesticity
>     probLema. heloise xxix.
>     selected Letters
>     the second part of henry IV.
>     VIII
>     appendIX vii
>     appendix VII
>     appendIX
>     caLVIn xvii
>     ILLustration
> 
> When what I want is this:
> 
> 	chapter II. the wrecked foundations of domesticity
> 	problema. heloise XXIX.
> 	selected letters
> 	the second part of henry IV.
> 	VIII
> 	appendix VII
> 	appendix VII
> 	appendix
> 	calvin XVII
> 	illustration
> 
>  Between my relative inexperience with both regexes and XSLT, 
> thanks for any help!
> Tony
