Subject: Re: [xsl] Text based stage play scripts to XML
From: Liam R E Quin <liam@xxxxxx>
Date: Mon, 24 Jan 2011 13:05:34 -0500

On Mon, 2011-01-24 at 14:37 +0200, Jacobus Reyneke wrote:
> Take any input file and output a similar output file. While doing so
> however, look for text located between identifiable patterns. Surround
> this text with tags.
>
> If input file contains:
> a b c d e f g h i j
>
> Pattern description:
> any string that follow after the string "c d" and is followed by the
> string "g h"
>
> If pattern found:
> Surround with <found-you>
>
> Result:
> a b c d<found-you> e f </found-you>g h i j

Others have mentioned some XSLT approaches, and that's generally a good
way to go.

Of course, if you don't mind learning a programming language, Perl is
the king (or at least a princess) of transformations where you don't
yet have XML, but want to add markup. Use XML-aware tools as early in
the process as possible, though!

    while (<>) { # for each line of input
        s{c d\K e f (?=g h)}{ # replace with the value of...:
            element(
                "found-you", # element name
                $&, # what was matched (" e f " here)
                # optional attributes:
                "rule" => "31",
                "before" => "c d"
            )
        }e; # "e" flag means the replacement is an expression, not text
        print; # print the line whether or not it was changed
    }

Given the input
    a b c d e f g h
this produces
    a b c d<found-you rule="31" before="c d"> e f </found-you>g h
(the order of the attributes may vary, since Perl hashes are unordered).

To process a whole file at once, you can use the rather odd Perl idiom,

    my $text;
    {
        local $/; # slurp mode
        $text = <>;
    };
    # and then do the substitution:
    $text =~ s{as before}{as before}gme;

At that point you might (or might not) want to use \s+ rather than a
space between the tokens in the input, to match one or more whitespace
characters. Start by normalizing the text though -- look for lines
ending with spaces, for example, and trim them.

Adding an attribute showing which pattern put a tag in place can
considerably aid debugging the process. It also helps to be consistent
in your markup, e.g. *always* use double quotes for attribute values.

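The whitespace advice above can be sketched as follows. This is only an illustration under stated assumptions: the helper names normalize_text and mark_found are hypothetical (they are not part of the script in this message), and the tag is written out literally rather than built with element() so that the fragment runs on its own. It needs Perl 5.10 or later for \K.

```perl
use strict;
use warnings;

# Hypothetical helper: trim spaces and tabs at the end of every line.
sub normalize_text {
    my ($text) = @_;
    $text =~ s/[ \t]+$//mg;    # /m lets $ match at each line end
    return $text;
}

# Hypothetical helper: same rule as before, but with \s+ between tokens,
# so tabs, runs of spaces and line breaks all count as separators.
sub mark_found {
    my ($text) = @_;
    $text =~ s{c\s+d\K(\s+e\s+f\s+)(?=g\s+h)}{<found-you rule="31">$1</found-you>}g;
    return $text;
}

# The tokens are separated by two spaces, a tab and a newline here,
# and the last line carries trailing spaces.
print mark_found(normalize_text("a b c  d\te f\ng h i j   \n"));
```

Normalizing first keeps the patterns simpler: stray trailing spaces are removed once instead of being allowed for in every rule.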
A simple definition of the "element" function follows - I have tried to
avoid "clever" Perl, and I have left a couple of items in place that
help debugging. For production it would probably also handle quoting
special characters (& < > in content) as well as (already done) " in
attribute values.

It's relatively straightforward using this approach to get files that
can be processed further with XML tools, although even then I sometimes
use Perl, e.g. because of its more powerful regular expressions, or
because I can more easily check for filenames...

You could have a separate file of patterns that are loaded and matched
against.

On Linux, run the command, perldoc perlre, for some documentation.

Liam

#! /usr/bin/perl -w
use warnings;
use strict;

sub element($$;%)
{
    my ($name, $content, %attributes) = @_;

    sub quotedattvalue($$)
    {
        my ($name, $value) = @_;
        # print STDERR "q $name, $value\n";
        $value =~ s/"/&quot;/g; # so we can safely use quotes
        return '"' . $value . '"';
    }

    # make a list of att="value" pairs, each with a leading space:
    # (could use join and map to do this too more succinctly,
    # see perldoc -f map)
    my $atts = "";
    if (%attributes) {
        foreach (keys %attributes) {
            $atts .= " " . $_ . '=' . quotedattvalue($_, $attributes{$_}) ;
        }
    }

    return "<${name}${atts}>${content}</${name}>";
}

my $text;
{
    local $/; # slurp mode: read all of the input at once
    $text = <>;
};

$text =~ s{c d\K e f (?=g h)}{
    element(
        "found-you",
        $&,
        "rule" => "31",
        "before" => "c d"
    )
}gme;

print $text;

# end

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org