Subject: Re: [xsl] Text based stage play scripts to XML
From: Liam R E Quin <liam@xxxxxx>
Date: Mon, 24 Jan 2011 13:05:34 -0500
On Mon, 2011-01-24 at 14:37 +0200, Jacobus Reyneke wrote:

> Take any input file and output a similar output file. While doing so
> however, look for text located between identifiable patterns. Surround
> this text with tags.
> 
> If input file contains:
> a b c d e f g h i j
> 
> Pattern description:
> any string that follow after the string "c d" and is followed by the
> string "g h"
> 
> If pattern found:
> Surround with <found-you>
> 
> Result:
> a b c d<found-you> e f </found-you>g h i j

Others have mentioned some XSLT approaches, and that's generally a good
way to go.  Of course, if you don't mind learning a programming
language, Perl is the king (or at least a princess) of transformations
where you don't yet have XML, but want to add markup. Use XML-aware
tools as early in the process as possible, though!

while (<>) { # for each line of input
    s{c d\K e f (?=g h)}{ # replace with the value of...:
	element(
	    "found-you",  # element name
	    $&,           # what was matched (" e f " here)
            # optional attributes:
	    "rule" => "31",
	    "before" => "c d"
	)
    }e;  # "e" flag means the replacement is an expression, not text

    print; # print the line whether or not it was changed
}

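Given the input a b c d e f g h
this produces
a b c d<found-you rule="31" before="c d"> e f </found-you>g h
(the two attributes may come out in either order, since Perl does not
keep hash keys in any particular order).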

To process a whole file at once, you can use the rather odd Perl idiom:

my $text;
{
    local $/; # slurp mode: no input record separator inside this block
    $text = <>;
}

# and then do the substitution:
$text =~ s{as before}{as before}gme;

At that point you might (or might not) want to use \s+ rather than a
space between the tokens in the input, to match one or more whitespace
characters.  Start by normalizing the text though -- look for lines
ending with spaces, for example, and trim them.
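
As a rough sketch of that normalization step followed by a
whitespace-tolerant match: the helper names normalize_text and
mark_found are made up for this example, and the pattern is the same
"c d" ... "g h" one used above. It needs Perl 5.10 or later for \K.

```perl
use strict;
use warnings;

# Hypothetical helper: convert CRLF line endings to LF and trim
# trailing spaces and tabs from every line.
sub normalize_text {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;       # CRLF -> LF
    $text =~ s/[ \t]+$//mg;     # /m makes $ match at the end of each line
    return $text;
}

# Hypothetical helper: wrap the text between "c d" and "g h", allowing
# any run of whitespace (newlines included) between the tokens.
sub mark_found {
    my ($text) = @_;
    $text =~ s{c\s+d\K(\s+e\s+f\s+)(?=g\s+h)}{<found-you>$1</found-you>}g;
    return $text;
}

# The match here spans a line break and a doubled space.
print mark_found(normalize_text("a b c  d e\nf g h   \n"));
```

Because \s+ also matches newlines, this only helps once the whole file
is in one string; in the line-at-a-time loop a match can never cross a
line boundary.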

Adding an attribute showing which pattern put a tag in place can
considerably aid debugging the process.  It also helps to be consistent
in your markup, e.g. *always* use double quotes for attribute values.
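
One way to keep the attributes consistent is to build them in a single
place. This is only a sketch, and attribute_string is a name invented
for the example: it always uses double quotes and emits the attributes
in sorted name order, so the output is the same from run to run.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of name => value pairs into
# ' name="value"' strings, each with a leading space.
sub attribute_string {
    my (%attributes) = @_;
    return join "", map {
        # copy the value, then escape any double quotes inside it
        (my $value = $attributes{$_}) =~ s/"/&quot;/g;
        qq{ $_="$value"};
    } sort keys %attributes;
}

print "<found-you", attribute_string(rule => "31", before => "c d"), ">\n";
```

Sorting puts "before" ahead of "rule" here; if the order matters to
you, pass the attributes as an ordered list instead of a hash.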

A simple definition of the "element" function follows - I have tried to
avoid "clever" Perl, and I have left a couple of items in place that
help debugging.  For production it would probably also need to escape
the special characters & < > in content, as it already does for " in
attribute values.
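
The escaping of special characters might look something like the sketch
below. The names quoted_content and quoted_att_value are hypothetical
and are not part of the script at the end of this message. It assumes
the text being escaped is raw input that contains no markup yet;
running it over text that already has tags or entities in it would
escape them a second time.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML
# element content.  "&" has to be done first, or the ampersands in
# &lt; and &gt; would themselves get escaped.
sub quoted_content {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: the same again for an attribute value, plus the
# double quote, with the surrounding quotes added.
sub quoted_att_value {
    my ($value) = @_;
    $value = quoted_content($value);
    $value =~ s/"/&quot;/g;
    return '"' . $value . '"';
}

print quoted_content("Tom & Jerry <aside>"), "\n";
print quoted_att_value('say "hi" & go'), "\n";
```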

It's relatively straightforward using this approach to get files that
can be processed further with XML tools, although even then I sometimes
use Perl, e.g. because of its more powerful regular expressions, or
because I can more easily check for filenames...

You could have a separate file of patterns that are loaded and matched
against. On Linux, run the command, perldoc perlre, for some
documentation.
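
A possible shape for such a pattern file, purely as an illustration:
the tab-separated layout (rule id, element name, "before" regex,
"after" regex) and the names parse_rules and apply_rules are all
assumptions made for this sketch. In practice the lines would come
from a file you open yourself; here they are given inline so the
example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical rule format, one rule per line, tab-separated:
#   rule-id <TAB> element-name <TAB> before-regex <TAB> after-regex
sub parse_rules {
    my (@lines) = @_;
    my @rules;
    foreach my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*(#|$)/;    # skip comments and blank lines
        my ($id, $name, $before, $after) = split /\t/, $line;
        push @rules, {
            id     => $id,
            name   => $name,
            before => qr/$before/,       # compile each regex once
            after  => qr/$after/,
        };
    }
    return @rules;
}

# Apply every rule in turn, recording the rule id in an attribute.
sub apply_rules {
    my ($text, @rules) = @_;
    foreach my $r (@rules) {
        my ($id, $name, $before, $after) = @{$r}{qw(id name before after)};
        $text =~ s{$before\K(.*?)(?=$after)}{<$name rule="$id">$1</$name>}gs;
    }
    return $text;
}

my @rules = parse_rules("# id, element, before, after\n",
                        "31\tfound-you\tc d\tg h\n");
print apply_rules("a b c d e f g h i j", @rules), "\n";
```

Later rules see the tags added by earlier ones, so the order of the
lines in the file matters.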

Liam

#! /usr/bin/perl -w
use warnings;
use strict;

sub element($$;%)
{
    my ($name, $content, %attributes) = @_;

    sub quotedattvalue($$)
    {
	my ($name, $value) = @_;

	# print STDERR "q $name, $value\n";
	$value =~ s/"/\&quot;/g; # so we can safely use quotes
	return '"' . $value . '"';
    }

    # make a list of att="value" pairs, each with a leading space:
    # (could use join and map to do this too more succinctly,
    # see perldoc -f map)
    my $atts = "";
    if (%attributes) {
	foreach (keys %attributes) {
	    $atts .= " " .
	        $_ . '=' .  quotedattvalue($_, $attributes{$_})
	    ;
	}
    }

    return "<${name}${atts}>${content}</${name}>";
}

my $text;
{
    local $/;
    $text = <>;
};

$text =~ s{c d\K e f (?=g h)}{
	element(
	    "found-you",
	    $&,
	    "rule" => "31",
	    "before" => "c d"
	)
}gme;

print $text;

# end

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org
