Re: [xsl] text replacement with mixed content

Subject: Re: [xsl] text replacement with mixed content
From: Liam R E Quin <liam@xxxxxx>
Date: Wed, 31 Aug 2011 11:55:12 -0400
On Wed, 2011-08-31 at 13:23 +0200, Geert Bormans wrote:
[...]
> - there will be no tags inside words, though I have found non 
> breaking spaces and soft hyphens at unpleasant locations
> something I have to take into account when I dynamically generate the 
> regular expression
> - there will be no matching across paragraphs (I can rely on some 5 
> or 6 elements that can have patterns to be matched, but will bear 
> them completely)

In doing document up-conversion from plain text/OCR output to XML, I
tend to use a mix of languages - this particular problem is more about
the text than the markup, and I'd probably use Perl rather than XSLT.
However, I *would* run xml validation (or at least well-formedness
checking) on the output!

Some techniques that may help --

Temporarily removing markup:
    $text =~ s{
       (<[^>]+>)
    }{
       hide($1)
    }xeg;

where hide() is a function like this:

    my @stash;

    sub hide($)
    {
        my ($input) = @_;

        push @stash, $input;
        return "///" . $#stash . "___";
    }

And of course we can restore the markup like this:

    $text =~ s{
       ///(\d+)___
    }{
       $stash[$1]
    }xeg;

(It's a good idea to check first that nothing looking like ///123___
occurs in the input, and afterwards that no such marker is left in the
output!)
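
As a rough sketch of the hide/restore round trip with both checks in place: the wrapper names hide_tags() and restore_tags() below are made up for this illustration, and the tag pattern is the same simple <[^>]+> as above, so it assumes no ">" inside attribute values.

```perl
use strict;
use warnings;

my @stash;

# Replace one tag with a numbered placeholder and remember the tag.
sub hide {
    my ($tag) = @_;
    push @stash, $tag;
    return "///" . $#stash . "___";
}

# Hide every tag in $text; refuse input that already contains a marker.
sub hide_tags {
    my ($text) = @_;
    die "input already contains a placeholder\n" if $text =~ m{///\d+___};
    @stash = ();
    $text =~ s{(<[^>]+>)}{hide($1)}eg;
    return $text;
}

# Put the stashed tags back; complain if any marker is left over.
sub restore_tags {
    my ($text) = @_;
    $text =~ s{///(\d+)___}{$stash[$1]}g;
    die "unresolved placeholder in output\n" if $text =~ m{///\d+___};
    return $text;
}

my $input  = '<p>black <i>socks</i></p>';
my $hidden = hide_tags($input);      # tags replaced by ///0___ etc.
my $output = restore_tags($hidden);  # should equal $input again
print "$hidden\n$output\n";
```

If nothing is changed between the two calls, $output should be identical to $input, which makes a handy sanity test before adding any real substitutions in the middle.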

Now you can handle phrases easily, since you have the constraint that
tags don't occur in the middle of words:

    $text =~ s{
      (?<![^\W_]) # word start; \b fails right after a marker's final "_"
        (    # save in a group
          black
          (?: # non-capturing group
              (?:///\d+___)  # a hidden tag
             | \s
          )+  # so, any amount of space or hidden tags
          socks
         )\b
    }{
      elem("glamorous", $1)
    }xeg;

and then do the unhiding.

Here, elem is a function to make an XML element:
sub elem($$)
{
    my ($name, $content) = @_;

    return "<$name>" . $content . "</$name>";
}

You could use
    {<glamorous>$1</glamorous>} as the replacement instead, dropping the
"e" flag so that the replacement is treated as text rather than code.
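
Putting the pieces together, here is a minimal sketch of the whole pass: hide, match, wrap, unhide. The helpers phrase_pattern() and mark_phrase() are hypothetical names invented for this example. Two assumptions are worth flagging. First, the gap between words also allows a no-break space (U+00A0), which is a guess at what the dynamically generated expression would need. Second, the pattern uses a lookbehind instead of a leading \b, because the marker ends in "_", which Perl counts as a word character, so \b would not match directly after a hidden tag.

```perl
use strict;
use warnings;

my @stash;

sub hide {
    my ($tag) = @_;
    push @stash, $tag;
    return "///" . $#stash . "___";
}

sub elem {
    my ($name, $content) = @_;
    return "<$name>" . $content . "</$name>";
}

# Build a pattern for a phrase whose words may be separated by
# whitespace, no-break spaces, or hidden tags (hypothetical helper).
sub phrase_pattern {
    my ($phrase) = @_;
    my $gap   = qr{(?:///\d+___|[\s\x{A0}])+};
    my $words = join $gap, map { quotemeta($_) } split ' ', $phrase;
    # (?<![^\W_]) means "not preceded by a letter or digit".
    return qr{(?<![^\W_])($words)\b};
}

# Hide tags, wrap each occurrence of $phrase in <$name>, restore tags.
sub mark_phrase {
    my ($text, $phrase, $name) = @_;
    @stash = ();
    $text =~ s{(<[^>]+>)}{hide($1)}eg;
    my $re = phrase_pattern($phrase);
    $text =~ s{$re}{elem($name, $1)}eg;
    $text =~ s{///(\d+)___}{$stash[$1]}g;
    return $text;
}

my $result = mark_phrase('<p>She wore black<br/> socks.</p>',
                         'black socks', 'glamorous');
print "$result\n";
```

Note that a match can swallow only half of a tag pair: with "black <i>socks</i>" the new end tag would land between <i> and </i>, giving overlapping elements. That is the sort of thing the well-formedness check on the output is there to catch.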


The same techniques work in other languages, although Perl's regular
expressions have the advantage of (1) the "x" flag, allowing whitespace
and comments, and (2) the "e" flag, allowing expressions instead of
text.

The command, perldoc perlre, on Linux, OS X and Unix, gives (a lot) more
information, although you may need to install perl-doc.

I find that doing this sort of change in XSLT or XQuery can lead to a
lot of confusion, but I'm not as clear-thinking as some of the others on
this list, I suppose.

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
