Subject: Re: [xsl] text replacement with mixed content
From: Liam R E Quin <liam@xxxxxx>
Date: Wed, 31 Aug 2011 11:55:12 -0400
On Wed, 2011-08-31 at 13:23 +0200, Geert Bormans wrote:
[...]
> - there will be no tags inside words, though I have found non
>   breaking spaces and soft hyphens at unpleasant locations
>   something I have to take into account when I dynamically generate the
>   regular expression
> - there will be no matching across paragraphs (I can rely on some 5
>   or 6 elements that can have patterns to be matched, but will bear
>   them completely)

In doing document up-conversion from plain text/OCR output to XML, I tend to use a mix of languages - this particular problem is more about the text than the markup, and I'd probably use Perl rather than XSLT. However, I *would* run XML validation (or at least well-formedness checking) on the output!

Some techniques that may help --

Temporarily removing markup:

    $text =~ s{ (<[^>]+>) }{ hide($1) }xeg;

where hide() is a function like this:

    my @stash;
    sub hide($) {
        my ($input) = @_;
        push @stash, $input;
        # the placeholder carries the index of the tag just stashed
        return "///" . $#stash . "___";
    }

and of course we can restore the markup like this:

    $text =~ s{ ///(\d+)___ }{ $stash[$1] }xeg;

(it's a good idea to check that the ///NNN___ placeholder pattern does not occur in the input first, and also that it does not occur in the output!)

Now you can handle phrases easily, since you have the constraint that tags don't occur in the middle of words:

    $text =~ s{
        \b                    # match only at a word boundary
        (                     # save in a group
            black
            (?:               # non-capturing group
                (?:///\d+___) # a hidden tag
                |
                \s
            )+                # so, any amount of space or hidden tags
            socks
        )\b
    }{ elem("glamorous", $1) }xeg;

and then do the unhiding.

Here, elem is a function to make an XML element:

    sub elem($$) {
        my ($name, $content) = @_;
        return "<$name>" . $content . "</$name>";
    }

You could use {<glamorous>$1</glamorous>} as the replacement instead, dropping the "e" flag since that replacement is plain text rather than an expression.

The same techniques work in other languages, although Perl's regular expressions have the advantage of (1) the "x" flag, allowing whitespace and comments, and (2) the "e" flag, allowing expressions instead of text.
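Putting the hide, match and restore steps together, a minimal sketch might look like the following. The mark_phrase() wrapper and the sample input are made up for illustration, and the placeholder checks are only one way of doing the sanity check suggested above; treat it as a starting point rather than a finished tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @stash;    # hidden tags, indexed by placeholder number

# Swap a tag for a ///N___ placeholder and remember the tag.
sub hide {
    my ($tag) = @_;
    push @stash, $tag;
    return '///' . $#stash . '___';
}

# Wrap content in an element of the given name.
sub elem {
    my ($name, $content) = @_;
    return "<$name>$content</$name>";
}

# Hypothetical wrapper: hide tags, mark up the phrase, restore tags.
sub mark_phrase {
    my ($text) = @_;
    @stash = ();
    die "placeholder already present in input\n" if $text =~ m{///\d+___};

    # 1. hide every tag
    $text =~ s{ (<[^>]+>) }{ hide($1) }xeg;

    # 2. match the phrase, allowing spaces or hidden tags between words
    $text =~ s{
        \b
        (
            black
            (?: ///\d+___ | \s )+
            socks
        )
        \b
    }{ elem("glamorous", $1) }xeg;

    # 3. put the tags back
    $text =~ s{ ///(\d+)___ }{ $stash[$1] }xeg;

    die "placeholder left in output\n" if $text =~ m{///\d+___};
    return $text;
}

print mark_phrase('<p>He wore black<br/>socks today.</p>'), "\n";
# prints: <p>He wore <glamorous>black<br/>socks</glamorous> today.</p>
```

One thing to watch: the placeholder ends in underscores, which count as word characters, so the leading \b will not match when a tag comes immediately before "black" (as in <i>black</i> socks). A placeholder that ends in a non-word character would avoid that, at the cost of differing from the one shown above.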
The command perldoc perlre, on Linux, OS X and Unix, gives (a lot) more information, although you may need to install perl-doc.

I find that doing this sort of change in XSLT or XQuery can lead to a lot of confusion, but I'm not as clear-thinking as some of the others on this list, I suppose.

Liam

--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/