Subject: Re: [xsl] two <xsl:analyze-string> questions
From: Brandon Ibach <brandon.ibach@xxxxxxxxxxxxxxxxxxx>
Date: Sat, 22 Oct 2011 12:43:20 -0400

The following might work for part 2.

<!-- The regex is held in a variable because the regex attribute is an
     attribute value template: braces written there directly must be doubled. -->
<xsl:variable name="regex" select="'(\p{L})6(\p{L}?)|(\p{L}?)6(\p{L})'"/>
<xsl:analyze-string select="." regex="{$regex}">
  <xsl:matching-substring>
    <xsl:value-of select="concat(regex-group(1), regex-group(3),
                                 'b', regex-group(2), regex-group(4))"/>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>
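
To see how the groups combine, here is a minimal, self-contained sketch that
wraps the same logic in a text() template. It assumes an XSLT 2.0 processor
(Saxon, for example), and the sample input in the comment is made up for
illustration, not taken from your data.

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Either: letter, 6, optional letter; or: optional letter, 6, letter.
       A "6" with no letter on either side matches neither alternative. -->
  <xsl:variable name="regex" select="'(\p{L})6(\p{L}?)|(\p{L}?)6(\p{L})'"/>

  <!-- Hypothetical input: <line>6ook 6 ta6le ro6</line>
       Output for that input: book 6 table rob -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$regex}">
      <xsl:matching-substring>
        <!-- Groups 1/2 belong to the first alternative, 3/4 to the second;
             the groups of the alternative that did not match are empty. -->
        <xsl:value-of select="concat(regex-group(1), regex-group(3),
                                     'b', regex-group(2), regex-group(4))"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>

Only one alternative can match at a time, and regex-group() returns a
zero-length string for a group that did not take part in the match, so
concatenating all four groups around the 'b' is safe.
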
-Brandon :)
On Sat, Oct 22, 2011 at 10:55 AM, Birnbaum, David J <djbpitt@xxxxxxxx> wrote:
> Dear XSLT-List,
>
> I'd be grateful for advice about a two-part <xsl:analyze-string> problem.
> I'm post-processing messy OCR output, and the situation I'm trying to
> address involves patterns and patterned errors that can be identified
> through regex matching. Some of the patterns are traditional up-conversion
> (e.g., find a certain pattern of digits and punctuation and wrap markup
> around it); some of them are corrections (e.g., the digit "6" and the
> letter "b" are confused, but a digit "6" adjacent to a letter is probably
> an error and should be corrected automatically, while a digit "6" not
> adjacent to a letter probably isn't and should be left alone).
>
> 1. The first part of my problem involves general program logic. I'm
> currently using a strategy like the following:
>
> <xsl:template match="text()">
>   <xsl:call-template name="editionLineNo">
>     <xsl:with-param name="current" select="."/>
>   </xsl:call-template>
> </xsl:template>
> <xsl:template name="editionLineNo">
>   <!-- 1. check for digits plus period, \d+\., edition line no -->
>   <xsl:param name="current"/>
>   <xsl:analyze-string select="$current" regex="(\d+)\.">
>     <xsl:matching-substring>
>       <editionLineNo>
>         <xsl:value-of select="regex-group(1)"/>
>       </editionLineNo>
>     </xsl:matching-substring>
>     <xsl:non-matching-substring>
>       <!-- pass on only this unmatched piece (the context item),
>            not the whole of $current -->
>       <xsl:call-template name="msFolioNo">
>         <xsl:with-param name="current" select="."/>
>       </xsl:call-template>
>     </xsl:non-matching-substring>
>   </xsl:analyze-string>
> </xsl:template>
>
> That is, at the beginning I grab a pristine text node and look for a
> pattern. If it's there, I'm done; if not, I pass the non-matching substring
> to the next template to look for a different pattern. One template calls
> another, passing the unmatched substrings, until the end, when I just
> output the text.
>
> This works, but is it the best approach? Should I instead, for example,
> use a single callable template and pass it both the haystack string and the
> needle regex? My highest priorities are legibility and ease of development
> and maintenance; efficiency of operation is less important. In case this is
> important, the order in which the patterns are matched matters, at least in
> a few instances. For example, digits followed by a period get one kind of
> markup and digits not followed by a period get another, so I want to
> capture the first type first and get them out of the way before looking for
> the second.
>
> 2. The second part of my problem involves a particular type of regex, one
> that will, for example, identify a digit "6" that is adjacent to a letter
> and replace it with a letter "b". The adjacent letter could precede or
> follow the digit or both. If I make the preceding and following letter(s)
> optional in the pattern, I've made both optional, and I'll erroneously
> catch an isolated digit "6". If I use a disjunct pattern, it becomes harder
> to capture the pieces and output the ones I want to retain with
> regex-group(). I suspect that this is a common problem with a standard
> solution, but I haven't run into it before and no single, elegant but
> legible regex leaps to mind. Is there one?
>
> Thanks for any advice,
>
> David
> djbpitt@xxxxxxxxx