Re: [xsl] text replacement with mixed content

From: Andrew Welch <andrew.j.welch@xxxxxxxxx>
Date: Mon, 5 Sep 2011 10:08:09 +0100
Heeeey, well done, on first reading that looks really good.
Converting the nested markup to empty elements seems to be the crucial
part, so you can move the start and end around.  You should write this
up and publish it somewhere...


On 5 September 2011 09:49, Geert Bormans <geert@xxxxxxxxxxxxxxxxxxx> wrote:
> For those interested in this thread
> Here is how I resolved this...
> works like a charm on all tests,
> and I am pleased with the robustness
>
> Let me first thank all who stepped in.
> I got some inspiration from the different posts,
> your contributions are highly appreciated
>
> Since the problem is contained in paragraphs, and I can quickly check
> per paragraph whether I have to bother with a revision or not,
> making multiple passes over the data does not really slow me down
> (too much)
>
> The thinking was the hard work. The actual XSLT implementation was not too
> bad once the algorithm was solid
>
> Let me show you what I did (simplified) taking test 7 as an example
>
>        <in original="this old foo is breaking"
>            revision="a new bar is building">
>            <p><b type="stronger">I <i>did not realize that this </i></b>old foo is breaking <i>this old foo</i></p>
>        </in>
>
> Pass 1. Take out the structure by turning each element tag into an empty
> element marker (with an id), and at the same time put offset markers at
> every location where a matching pattern could start or end
> (if "t" is the first character of @original, put a marker in front of
> every "t"; if "g" is the last character of @original, place a marker
> after every "g").
> The markers are potential start <ps/> and potential end <pe/>.
> This results in (simplified; I have namespaces, maintain attributes, etc.)
>           <p><start name="b" id="A"/>I <start name="i" id="B"/>did not
> realize <ps/>that <ps/>this <end name="i" id="B"/><end name="b" id="A"/>old
> foo is breaking<pe/> <start name="i" id="C"/><ps/>this old foo<end name="i"
> id="C"/></p>
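
A minimal sketch of the Pass 1 idea, in Python rather than XSLT, so it is
not Geert's stylesheet. The function name flatten, the tuple-based token
format and the integer ids are assumptions made for illustration, and
attributes are dropped for brevity. Unlike the simplified listing above,
it marks every occurrence of the first and last character.

```python
import xml.etree.ElementTree as ET
from itertools import count


def flatten(elem, original, ids=None):
    """Flatten the content of elem into a list of marker and text tokens.

    Tokens are tuples: ("start", tag, id), ("end", tag, id),
    ("text", char), ("ps",) and ("pe",). This format is hypothetical.
    """
    ids = ids if ids is not None else count(1)
    first, last = original[0], original[-1]
    tokens = []

    def add_text(text):
        # One token per character, so markers can sit between characters.
        for ch in text or "":
            if ch == first:
                tokens.append(("ps",))  # a match could start here
            tokens.append(("text", ch))
            if ch == last:
                tokens.append(("pe",))  # a match could end here

    add_text(elem.text)
    for child in elem:
        marker_id = next(ids)  # shared by the start and end marker
        tokens.append(("start", child.tag, marker_id))
        tokens.extend(flatten(child, original, ids))
        tokens.append(("end", child.tag, marker_id))
        add_text(child.tail)
    return tokens


p = ET.fromstring(
    '<p><b type="stronger">I <i>did not realize that this </i></b>'
    "old foo is breaking <i>this old foo</i></p>"
)
tokens = flatten(p, "this old foo is breaking")
```

The shared id is what later lets a start marker and its end marker be
treated as a pair even after one of them has been moved.
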
>
> now actually the hard work is done
>
> Pass 2.
> On each <ps/>, check whether the join of all following text nodes
> (normalized one way or another) starts with the normalized @original;
> if so, upgrade it to a revision start <rs/>.
> On each <pe/>, check whether the join of all preceding text nodes
> (normalized one way or another) ends with the normalized @original;
> if so, upgrade it to a revision end <re/>.
> Markers that fail the check are dropped.
> This results in
>          <p><start name="b" id="A"/>I <start name="i" id="B"/>did not
> realize that <rs/>this <end name="i" id="B"/><end name="b" id="A"/>old foo
> is breaking<re/> <start name="i" id="C"/>this old foo<end name="i"
> id="C"/></p>
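
The Pass 2 check can be sketched as follows, again in Python rather than
XSLT and only as an illustration. The tuple token format, the function
names and the whitespace-collapsing normalization are assumptions; Geert
only says "normalized one way or another".

```python
import re


def normalize(s):
    # Assumed normalization: collapse every run of whitespace to one space.
    return re.sub(r"\s+", " ", s)


def upgrade_markers(tokens, original):
    """Turn matching ("ps",)/("pe",) tokens into ("rs",)/("re",).

    Potential markers that do not match are dropped from the output.
    """
    target = normalize(original)
    out = []
    for i, tok in enumerate(tokens):
        if tok[0] == "ps":
            # Join of all following text, ignoring the structure markers.
            following = "".join(t[1] for t in tokens[i + 1:] if t[0] == "text")
            if normalize(following).startswith(target):
                out.append(("rs",))
        elif tok[0] == "pe":
            # Join of all preceding text, ignoring the structure markers.
            preceding = "".join(t[1] for t in tokens[:i] if t[0] == "text")
            if normalize(preceding).endswith(target):
                out.append(("re",))
        else:
            out.append(tok)
    return out
```

Because the structure markers carry no text, the two string tests see the
paragraph as if it had no inline markup at all, which is why plain
prefix and suffix checks are enough.
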
>
> Pass 3.
> Structure the revisions, turning each <rs/> ... <re/> pair into a real
> <rev> element
> results in
>          <p><start name="b" id="A"/>I <start name="i" id="B"/>did not
> realize that <rev>this <end name="i" id="B"/><end name="b" id="A"/>old foo
> is breaking</rev> <start name="i" id="C"/>this old foo<end name="i"
> id="C"/></p>
>
> Pass 4.
> For every end tag marker inside a revision whose corresponding start tag
> marker (hence the id) lies outside the revision, move it to right before
> the revision.
> Do something similar with start tag markers
> results in
>          <p><start name="b" id="A"/>I <start name="i" id="B"/>did not
> realize that <end name="i" id="B"/><end name="b" id="A"/><rev>this old foo
> is breaking</rev> <start name="i" id="C"/>this old foo<end name="i"
> id="C"/></p>
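
Passes 3 and 4 together can be sketched like this, in Python rather than
XSLT and not taken from Geert's stylesheet. The tuple token format, the
("rev", [...]) wrapper and the name hoist_markers are assumptions, and the
sketch handles exactly one <rs/> ... <re/> pair per token list.

```python
def hoist_markers(tokens):
    """Wrap the rs..re span in a ("rev", [...]) token and move out the
    markers whose partner lies outside that span.

    Assumes exactly one ("rs",) and one ("re",) token, in that order.
    """
    start = tokens.index(("rs",))
    end = tokens.index(("re",))
    inside = tokens[start + 1:end]

    started = {t[2] for t in inside if t[0] == "start"}
    ended = {t[2] for t in inside if t[0] == "end"}

    # End markers opened before the revision go right before it.
    before = [t for t in inside if t[0] == "end" and t[2] not in started]
    # Start markers closed after the revision go right after it.
    after = [t for t in inside if t[0] == "start" and t[2] not in ended]
    # Everything else, including fully enclosed pairs, stays inside.
    kept = [t for t in inside if t not in before and t not in after]

    return tokens[:start] + before + [("rev", kept)] + after + tokens[end + 1:]
```

Matching on the id rather than on the element name matters here: two
<i> elements can both touch the revision, and only the id says which
start belongs to which end.
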
>
> Pass 5.
> Clean up: make the actual replacement in the revision and make the markers
> into elements again
>          <p><b>I <i>did not realize that </i></b><rev>a new bar is
> building</rev> <i>this old foo</i></p>
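
The clean-up in Pass 5 can be sketched as a simple serialization, in
Python rather than XSLT and only as an illustration. The tuple token
format and the name serialize are assumptions; attributes are left out,
as in the simplified listings above.

```python
from xml.sax.saxutils import escape


def serialize(tokens, revision):
    """Turn a flat token list back into markup.

    The content of a ("rev", [...]) token is discarded and replaced by
    the revision text; start/end markers become real tags again.
    """
    parts = []
    for tok in tokens:
        kind = tok[0]
        if kind == "text":
            parts.append(escape(tok[1]))
        elif kind == "start":
            parts.append(f"<{tok[1]}>")
        elif kind == "end":
            parts.append(f"</{tok[1]}>")
        elif kind == "rev":
            parts.append(f"<rev>{escape(revision)}</rev>")
    return "".join(parts)
```

This only produces well-formed output if the earlier passes have left
every start marker and its end marker on the same side of the revision,
which is the point of moving them in Pass 4.
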
>
> The turning point for me was adding the offset markers.
> Before that I was auto-generating pretty complex regular expressions;
> now I get away with a simple ends-with() and starts-with()
>
> If anyone sees a possible improvement here or there, let me know please
>
> Me happy now, thanks for your help
>
> Geert
>
>



--
Andrew Welch
http://andrewjwelch.com
