[xsl] cleaning up ill-structured html

Subject: [xsl] cleaning up ill-structured html
From: Ole Sandum <osandum@xxxxxxxxxxx>
Date: Thu, 23 Jan 2003 21:54:43 +0100
I have the task of migrating a number of legacy html
pages that were authored without regard to proper
structuring. Body text paragraphs are delimited by any
combination of <p> (sometimes nested!) and runs of
<br>. I would like the result to consist of a flat list
of non-empty <p>'s.

I use JTidy to get the pages into well-formed XML, but still face the
challenge of flattening the nested <p>'s and converting
runs of <br>'s to <p>'s.

Example:

   <p>Some <i>stuff</i>
   that should be cleaned.<br/>
   More <b>stuff.</b>
   <p>
   Yet more.<br/>
   </p>
   Stuff.
   </p>

Should become:

   <p>Some <i>stuff</i> that should be cleaned.</p>
   <p>More <b>stuff.</b></p>
   <p>Yet more.</p>
   <p>Stuff.</p>

I assume it is easiest to do in two steps, first (step
1) convert into something like this:

   <break/>
   Some <i>stuff</i> that should be cleaned.
   <break/>
   More <b>stuff.</b>
   <break/>
   Yet more.
   <break/>
   <break/>
   Stuff.
   <break/>

and then (step 2) detect contiguous runs of
non-<break/> nodes and wrap each run in <p></p>.
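To pin down what I mean by step 1, here is a rough sketch in Python
rather than XSLT. It only illustrates the intended behaviour. The names
flatten and BLOCK_TAGS and the <body> wrapper are my own inventions, and
I assume that only <p> and <br/> delimit paragraphs and that inline
elements never contain them.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: only these two elements delimit paragraphs.
BLOCK_TAGS = {"p", "br"}


def flatten(container):
    """Return a flat <body>: <break/> markers mixed with text and inline nodes."""
    out = ET.Element("body")

    def emit_text(text):
        # ElementTree stores text either on the parent or as the tail of
        # the preceding sibling, so append to whichever applies.
        if not text:
            return
        if len(out):
            out[-1].tail = (out[-1].tail or "") + text
        else:
            out.text = (out.text or "") + text

    def walk(elem):
        if elem.tag not in BLOCK_TAGS:
            # Inline element (<i>, <b>, ...): copy it over unchanged.
            clone = copy.deepcopy(elem)
            clone.tail = None
            out.append(clone)
            return
        ET.SubElement(out, "break")      # an opening <p>, or the <br/> itself
        emit_text(elem.text)
        for child in elem:
            walk(child)
            emit_text(child.tail)
        if elem.tag == "p":
            ET.SubElement(out, "break")  # the closing </p>

    emit_text(container.text)
    for child in container:
        walk(child)
        emit_text(child.tail)
    return out


source = (
    "<body><p>Some <i>stuff</i>\nthat should be cleaned.<br/>\n"
    "More <b>stuff.</b>\n<p>\nYet more.<br/>\n</p>\nStuff.\n</p></body>"
)
flat = flatten(ET.fromstring(source))
print(ET.tostring(flat, encoding="unicode"))
```

Every <p> boundary and every <br/> turns into one <break/>, so nested
paragraphs and runs of line breaks both end up as consecutive markers.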

Do I make sense?

I can do step 1, but step 2 gives me trouble. To
formalise: how do I convert a structure like

<break/>+ { other+ <break/>+ }*

into

{ <p> other+ </p> }*
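In case the notation is unclear, this is the behaviour I am after,
sketched in Python (not the XSLT I actually need). The function name
wrap_runs and the <body> wrapper are made up for illustration, and I
assume that a run consisting only of whitespace counts as empty.

```python
import copy
import xml.etree.ElementTree as ET


def wrap_runs(container):
    """Return one <p> per non-empty run of nodes between <break/> markers."""
    paragraphs = []
    current = None  # the <p> being filled, or None directly after a break

    def add_text(text):
        nonlocal current
        if not text:
            return
        if current is None:
            if not text.strip():
                return  # whitespace alone does not start a paragraph
            current = ET.Element("p")
        if len(current):
            current[-1].tail = (current[-1].tail or "") + text
        else:
            current.text = (current.text or "") + text

    def close():
        nonlocal current
        if current is not None:
            paragraphs.append(current)
        current = None

    add_text(container.text)
    for child in container:
        if child.tag == "break":
            close()  # consecutive breaks close nothing, so no empty <p>
        else:
            if current is None:
                current = ET.Element("p")
            clone = copy.deepcopy(child)
            clone.tail = None
            current.append(clone)
        add_text(child.tail)
    close()
    return paragraphs


flat = ET.fromstring(
    "<body>\n<break/>\nSome <i>stuff</i> that should be cleaned.\n<break/>\n"
    "More <b>stuff.</b>\n<break/>\nYet more.\n<break/>\n<break/>\n"
    "Stuff.\n<break/>\n</body>"
)
for p in wrap_runs(flat):
    print(ET.tostring(p, encoding="unicode"))
```

The sketch keeps one paragraph open at a time and closes it at each
<break/>, which is trivial with mutable state but is exactly the part I
cannot see how to express in a stylesheet.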

I fear the solution is really simple. Any ideas?

Thanks,
Ole Sandum, osandum@xxxxxxxxxxx




XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


