Subject: [xsl] two <xsl:analyze-string> questions
From: "Birnbaum, David J" <djbpitt@xxxxxxxx>
Date: Sat, 22 Oct 2011 10:55:33 -0400

Dear XSLT-List,

I'd be grateful for advice about a two-part <xsl:analyze-string> problem. I'm post-processing messy OCR output, and the situation I'm trying to address involves patterns and patterned errors that can be identified through regex matching. Some of the patterns are traditional up-conversion (e.g., find a certain pattern of digits and punctuation and wrap markup around it); some of them are corrections (e.g., the digit "6" and the letter "b" are confused, but a digit "6" adjacent to a letter is probably an error and should be corrected automatically, while a digit "6" not adjacent to a letter probably isn't and should be left alone).

1. The first part of my problem involves general program logic. I'm currently using a strategy like the following:

    <xsl:template match="text()">
      <xsl:call-template name="editionLineNo">
        <xsl:with-param name="current" select="."/>
      </xsl:call-template>
    </xsl:template>

    <xsl:template name="editionLineNo">
      <!-- 1. check for digits plus period, \d+\., edition line no -->
      <xsl:param name="current"/>
      <xsl:analyze-string select="$current" regex="(\d+)\.">
        <xsl:matching-substring>
          <editionLineNo>
            <xsl:value-of select="regex-group(1)"/>
          </editionLineNo>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- pass on only the unmatched piece (.), not the whole input -->
          <xsl:call-template name="msFolioNo">
            <xsl:with-param name="current" select="."/>
          </xsl:call-template>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:template>

That is, at the beginning I grab a pristine text node and look for a pattern. Wherever the pattern matches, I wrap the match in markup and I'm done with it; each non-matching substring is passed to the next template, which looks for a different pattern. One template calls another, passing the unmatched substrings, until the end, when I just output the text. This works, but is it the best approach? Should I instead, for example, use a single callable template and pass it both the haystack string and the needle regex?

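To make the single-template alternative concrete, here is a minimal sketch of one shape it might take. It is an illustration only, not a tested solution. The names applyRules, haystack, index, $regexes, and $names are hypothetical, and the second regex (bare digits for msFolioNo) is an assumption, since only the first pattern is given above. Rules are tried in list order, so the digits-plus-period rule is consumed before the bare-digits rule sees the text.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical ordered rule list: the regex at position N pairs with the
       element name at position N. Earlier rules take precedence.
       The second regex is an assumed placeholder. -->
  <xsl:variable name="regexes" select="('(\d+)\.', '(\d+)')"/>
  <xsl:variable name="names" select="('editionLineNo', 'msFolioNo')"/>

  <xsl:template match="text()">
    <xsl:call-template name="applyRules">
      <xsl:with-param name="haystack" select="."/>
      <xsl:with-param name="index" select="1"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="applyRules">
    <xsl:param name="haystack"/>
    <xsl:param name="index"/>
    <xsl:choose>
      <!-- No rules left: emit the remaining text unchanged. -->
      <xsl:when test="$index gt count($regexes)">
        <xsl:value-of select="$haystack"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- The regex attribute is an attribute value template, so the
             pattern can come from a variable. -->
        <xsl:analyze-string select="$haystack" regex="{$regexes[$index]}">
          <xsl:matching-substring>
            <xsl:element name="{$names[$index]}">
              <xsl:value-of select="regex-group(1)"/>
            </xsl:element>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <!-- Hand only the unmatched piece to the next rule. -->
            <xsl:call-template name="applyRules">
              <xsl:with-param name="haystack" select="."/>
              <xsl:with-param name="index" select="$index + 1"/>
            </xsl:call-template>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

The trade-off in this sketch is that every rule must produce the same kind of output (one wrapper element around group 1). Rules that need different handling, such as the "6"/"b" correction, would still need their own template or an extra parameter.
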
My highest priorities are legibility and ease of development and maintenance; efficiency of operation is less important. In case it is relevant, the order in which the patterns are matched matters, at least in a few instances. For example, digits followed by a period get one kind of markup and digits not followed by a period get another, so I want to capture the first type first and get them out of the way before looking for the second.

2. The second part of my problem involves a particular type of regex, one that will, for example, identify a digit "6" that is adjacent to a letter and replace it with a letter "b". The adjacent letter could precede or follow the digit or both. If I make the preceding and following letter(s) optional in the pattern, I've made both optional, and I'll erroneously catch an isolated digit "6". If I use a disjunct pattern, it becomes harder to capture the pieces and output the ones I want to retain with regex-group(). I suspect that this is a common problem with a standard solution, but I haven't run into it before and no single, elegant but legible regex leaps to mind. Is there one?

Thanks for any advice,

David
djbpitt@xxxxxxxxx
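
For the second question, the disjunct pattern can be sketched as follows. This is one possible approach under stated assumptions, not a definitive answer: the template name fixSix is hypothetical, and "letter" is assumed to mean the Unicode category \p{L}. It relies on regex-group() returning a zero-length string for a group that did not take part in the match, so both groups can be output unconditionally around the replacement "b". Because regex is an attribute value template, the curly braces in \p{L} are doubled.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical template: replace a digit 6 that touches a letter with b. -->
  <xsl:template name="fixSix">
    <xsl:param name="current"/>
    <!-- Branch 1: a letter followed by 6 (letter captured in group 1).
         Branch 2: a 6 followed by a letter (letter captured in group 2). -->
    <xsl:analyze-string select="$current" regex="(\p{{L}})6|6(\p{{L}})">
      <xsl:matching-substring>
        <!-- Only one group participates in a given match; the other is ''. -->
        <xsl:value-of select="concat(regex-group(1), 'b', regex-group(2))"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- An isolated 6 never matches, so it passes through unchanged. -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With this pattern "x6y" becomes "xby", while "6" on its own and "16" are left alone. One limitation of the sketch: matches cannot overlap, so in "6a6" the letter is consumed by the first match and the trailing "6" stays uncorrected ("ba6"). A second pass over the result would be one way to catch such cases.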