[xsl] Aligning/merging two sequences

Subject: [xsl] Aligning/merging two sequences
From: Markus Flatscher <markus.flatscher@xxxxxxxxxxxx>
Date: Thu, 30 Sep 2010 12:51:00 -0400

I'm banging my head against a sequence alignment problem. I have a feeling that this is straightforward, but I can't put my finger on what's missing from my attempts.

Suppose I have two inputs like so, where the words in input1//w always occur, in the same order, among the words in input2//w (i.e., input1 is a subsequence of input2):

<input1>
 <w n="1">I</w>
 <w n="2">am</w>
 <w n="3">a</w>
 <w n="4">sequence</w>
</input1>

<input2>
 <w>I</w>
 <w>am</w>
 <w>a</w>
 <w>longer</w>
 <w>longer</w>
 <w>sequence</w>
</input2>

I'd like to get output like so:

<output>
 <w n="1">I</w>
 <w n="2">am</w>
 <w n="3">a</w>
 <w n="skipped">longer</w>
 <w n="skipped">longer</w>
 <w n="4">sequence</w>
</output>

I.e., for each input1//w, @n should be copied to the next <w> in input2 (in document order, after the previous match) whose string value equals that of the input1 word; any <w> in input2 that has no counterpart in input1 should be flagged with n="skipped".
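
One way this could be approached, offered as a minimal sketch rather than a definitive solution: walk both sequences in parallel with a recursive named template, consuming a word from input1 only when it equals the current word of input2. The sketch makes several assumptions that are not part of the problem statement: both inputs are wrapped in a single hypothetical <inputs> root element, words match by exact string equality, and the template name "align" and its parameters "timed" and "full" are made up for illustration. It uses only XSLT 1.0 constructs.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: input1 and input2 are children of a hypothetical
       <inputs> wrapper element in one source document. -->
  <xsl:template match="/inputs">
    <output>
      <xsl:call-template name="align">
        <xsl:with-param name="timed" select="input1/w"/>
        <xsl:with-param name="full" select="input2/w"/>
      </xsl:call-template>
    </output>
  </xsl:template>

  <!-- Emits one <w> per word of $full. The head of $timed is consumed
       only when its text equals the text of the head of $full. -->
  <xsl:template name="align">
    <xsl:param name="timed"/>
    <xsl:param name="full"/>
    <xsl:if test="$full">
      <!-- True when there is a timestamped word left and it matches. -->
      <xsl:variable name="matched"
                    select="$timed and string($timed[1]) = string($full[1])"/>
      <w>
        <xsl:attribute name="n">
          <xsl:choose>
            <xsl:when test="$matched">
              <xsl:value-of select="$timed[1]/@n"/>
            </xsl:when>
            <xsl:otherwise>skipped</xsl:otherwise>
          </xsl:choose>
        </xsl:attribute>
        <xsl:value-of select="$full[1]"/>
      </w>
      <!-- Recurse: always advance $full; advance $timed only on a match. -->
      <xsl:call-template name="align">
        <xsl:with-param name="timed"
                        select="$timed[position() &gt; 1 or not($matched)]"/>
        <xsl:with-param name="full" select="$full[position() &gt; 1]"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

The matching here is greedy: each input1 word is attached to the first input2 word that equals it after the previous match. If an input1 word does not occur in input2 at all, the sketch never advances past it, so every remaining input2 word is flagged "skipped"; a more robust alignment would be needed for that case. The recursion is also one level deep per word, which may matter for very long transcripts in processors that do not optimize tail calls.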

P.S.: The use case is aligning an imperfect but timestamped transcription of an audio file (input1, machine-generated) with a perfect but not-timestamped one (input2, human-generated).

Thanks much for any help,

Markus

--
Markus Flatscher, Project Editor
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville VA 22904, USA
Courier: 211 Emmet Street South, Charlottesville VA 22903, USA
Email: markus.flatscher@xxxxxxxxxxxx
Web: http://rotunda.upress.virginia.edu/