Re: [xsl] text replacement with mixed content

Subject: Re: [xsl] text replacement with mixed content
From: Alex Muir <alex.g.muir@xxxxxxxxx>
Date: Wed, 31 Aug 2011 11:09:51 +0000
> <x>
> <p>my foo, zzz, my <bold>foo</bold>, zzz</p>
> <p>zzz my f<b/>oo zzz m<z>y f</z>oo zzz</p>
> <p>zzz my f<b/>oo zzz m<z>y f</z>o<j>o my foo</j> zzz</p>
> <p>zzz my<b> fjjj</b></p>
> <p>zzzzz<b>my </b></p><p><b>foo zzz</b></p>
> </x>

How about using a regex on it? Match with:

(my) *(</?\w+/?>)*(f) *(</?\w+/?>)*(o) *(</?\w+/?>)*(o)

and replace with:

$1 $3$5$7
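
A minimal sketch of that match and replace, written in Python rather than
XSLT so it can be tried quickly. The function name replace_phrase and the
sample string (one of the paragraphs quoted above) are only for
illustration.

```python
import re

# The pattern from this message: "my", then "f", "o", "o", each optionally
# separated by spaces and by simple tags such as <b>, </b> or <b/>.
PATTERN = re.compile(r"(my) *(</?\w+/?>)*(f) *(</?\w+/?>)*(o) *(</?\w+/?>)*(o)")

def replace_phrase(text: str) -> str:
    # Python writes $1 $3$5$7 as \1 \3\5\7. Groups 2, 4 and 6 (the tags)
    # are not referenced, so tags matched inside the phrase are dropped.
    return PATTERN.sub(r"\1 \3\5\7", text)

sample = "<p>zzz my f<b/>oo zzz m<z>y f</z>oo zzz</p>"
print(replace_phrase(sample))
# prints: <p>zzz my foo zzz m<z>y f</z>oo zzz</p>
```

The second occurrence is left alone because the word "my" itself is split
by a tag. Also note that dropping the matched tags can unbalance the
markup: "my <bold>foo</bold>" comes out as "my foo</bold>", so the result
would need checking before it is parsed as XML again.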

One solution is to read the input with unparsed-text() and write a
regex that identifies the pieces and replaces them as required. I've
done this type of thing with HTML and regex before.

You can first grep through all the documents and extract everything
that may be a match, using a regex like my.{1,30}foo
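
As a rough sketch of that extraction step (Python again, with a
hypothetical helper find_candidates and two of the sample paragraphs as
input), something like this collects the candidate strings:

```python
import re

# Candidate pattern from this message: "my", 1 to 30 arbitrary characters,
# then a literal "foo". The "." does not cross line breaks by default.
CANDIDATE = re.compile(r"my.{1,30}foo")

def find_candidates(lines):
    # Collect every substring that might be an occurrence of the phrase.
    found = []
    for line in lines:
        found.extend(m.group(0) for m in CANDIDATE.finditer(line))
    return found

lines = [
    "<p>zzz my<b> fjjj</b></p>",
    "<p>zzzzz<b>my </b></p><p><b>foo zzz</b></p>",
]
print(find_candidates(lines))
# prints: ['my </b></p><p><b>foo']
```

This exact pattern needs a literal "foo", so it does not pick up cases
like "my f<b/>oo"; a looser pattern would be needed to catch those.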

Problems you can get into with HTML are things like <b>m</b><b>y</b>
.. not that anyone would write that nonsense on purpose, but something
similar, where the individual words you are looking for are divided
into pieces.

Then use that extracted set of potential matches to create one or a
few regexes that identify the pieces you require. At that point you
have a good understanding of the problem set.
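
One possible shape for such a regex, sketched in Python: allow simple tags
between any two characters of the phrase, so that words split into pieces
are still found. The helper tolerant_pattern is hypothetical, and the tag
sub-pattern is the same simple one used earlier in this message (no
attributes).

```python
import re

# Zero or more simple tags such as <b>, </b> or <b/>; attributes are
# deliberately not handled in this sketch.
TAGS = r"(?:</?\w+/?>)*"

def tolerant_pattern(phrase):
    # Hypothetical helper: allow tags between any two characters of the
    # phrase, and treat a space in the phrase as optional whitespace.
    pieces = [r"\s*" if ch == " " else re.escape(ch) for ch in phrase]
    return re.compile(TAGS.join(pieces))

pattern = tolerant_pattern("my foo")
for text in ["m<z>y f</z>oo", "<b>m</b><b>y</b> foo", "my<b> fjjj</b>"]:
    match = pattern.search(text)
    print(text, "->", match.group(0) if match else None)
```

A pattern this loose also matches across element boundaries, for example
"my </b></p><p><b>foo", so whether such hits count as real matches still
has to be decided from the data.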

Also, I generally replace escaped characters, for example < and >,
with + and ; to make things easier visually when working with
unparsed-text.

Andrew's solution is probably better, although I think I would still
want to grep all potential cases and perhaps write some regex to look
for bizarre cases, to better understand the data.




--
Alex Muir
Instructor | Program Organizer - University Technology Student Work
Experience Building
University of the Gambia
http://sites.utg.edu.gm/alex/

Low budget software development benefiting development in the Gambia,
West Africa
Experience of a lifetime, come to Gambia and Join UTSWEB -
http://sites.utg.edu.gm/utsweb/
