Subject: Re: [xsl] cleaning up ill-structured html
From: S Woodside <sbwoodside@xxxxxxxxx>
Date: Thu, 23 Jan 2003 17:43:26 -0500
I have the task of migrating a number of legacy html pages that were authored without regard to proper structuring. Body text paragraphs are delimited by any combination of <p> (sometimes nested!) and runs of <br>. I would like the result to consist of a flat list of non-empty <p>'s.
I use JTidy to get into proper XML, but still face the challenge of flattening the nested <p>'s and converting runs of <br>'s to <p>'s.
Example:
<p>Some <i>stuff</i> that should be cleaned.<br/> More <b>stuff.</b> <p> Yet more.<br> </p> Stuff. </p>
Should become:
<p>Some <i>stuff</i> that should be cleaned.</p> <p>More <b>stuff.</b></p> <p>Yet more.</p> <p>Stuff.</p>
I assume it is easiest to do in two steps, first (step 1) convert into something like this:
<break/> Some <i>stuff</i> that should be cleaned. <break/> More <b>stuff.</b> <break/> Yet more. <break/> <break/> Stuff. <break/>
and then (step 2) detect contiguous runs of non-<break/> nodes and wrap these runs in <p></p>'s.
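To make step 1 concrete, here is a rough sketch of the flattening, written in Python with the standard xml.dom.minidom module rather than in XSLT. The function name flatten and the <body> wrapper element are assumptions made for illustration, and the sketch assumes <p> and <br> never occur inside inline elements such as <i> or <b>.

```python
from xml.dom import minidom


def flatten(xml_text):
    """Replace every <p> boundary and every <br> with a <break/> marker.

    Hypothetical helper: children of the root element end up as one flat
    sequence of <break/> markers, text nodes, and inline elements.
    """
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    flat = []

    def visit(node):
        is_element = node.nodeType == node.ELEMENT_NODE
        if is_element and node.tagName == "p":
            # A paragraph contributes a marker on each side of its content.
            flat.append(doc.createElement("break"))
            for child in list(node.childNodes):
                visit(child)
            flat.append(doc.createElement("break"))
        elif is_element and node.tagName == "br":
            flat.append(doc.createElement("break"))
        else:
            # Text and inline elements are kept as they are.
            flat.append(node)

    for child in list(root.childNodes):
        visit(child)

    # Empty the root, then re-attach the flattened sequence in order.
    while root.firstChild:
        root.removeChild(root.firstChild)
    for node in flat:
        root.appendChild(node)
    return root.toxml()
```

Called on the nested example from earlier in this message (with the stray <br> closed as <br/>, as it would be after tidying), this should yield the run of <break/> markers and inline content described in step 1.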
Do I make sense?
I can do step 1, but step 2 gives me trouble. To formalise: how do I convert a structure like
<break/>+ { other+ <break/>+ }*
into
{ <p> other+ </p> }*
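The grouping in step 2 can be illustrated outside XSLT. The following is only a sketch in Python, using itertools.groupby and xml.dom.minidom from the standard library. It is not an XSLT solution, and the names wrap_runs, is_break and trim_edges, as well as the <body> wrapper, are assumptions made for the example. Runs that contain nothing but whitespace are dropped so that no empty <p> is produced.

```python
from itertools import groupby
from xml.dom import minidom


def is_break(node):
    """True for a <break/> marker element."""
    return node.nodeType == node.ELEMENT_NODE and node.tagName == "break"


def is_blank(node):
    """True for a whitespace-only text node."""
    return node.nodeType == node.TEXT_NODE and not node.data.strip()


def trim_edges(run):
    """Strip whitespace at the outer edges of a run; drop emptied text nodes."""
    if run[0].nodeType == run[0].TEXT_NODE:
        run[0].data = run[0].data.lstrip()
    if run[-1].nodeType == run[-1].TEXT_NODE:
        run[-1].data = run[-1].data.rstrip()
    return [n for n in run if not (n.nodeType == n.TEXT_NODE and n.data == "")]


def wrap_runs(xml_text):
    """Wrap each contiguous run of non-<break/> siblings in a <p>."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    children = list(root.childNodes)
    for node in children:
        root.removeChild(node)

    # groupby splits the sibling list into alternating runs of
    # markers (key True) and everything else (key False).
    for marker, group in groupby(children, key=is_break):
        run = list(group)
        if marker or all(is_blank(n) for n in run):
            continue  # skip the markers and whitespace-only runs
        p = doc.createElement("p")
        for node in trim_edges(run):
            p.appendChild(node)
        root.appendChild(p)
    return root.toxml()
```

The same idea, grouping siblings by the marker that precedes them, is what an XSLT answer would need to express; the sketch only shows that the transformation itself is a single pass over the sibling list.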
I fear the solution is really simple. Any ideas?
Thanks, Ole Sandum, osandum@xxxxxxxxxxx
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
--- anti-spam: do not post this address publicly www.simonwoodside.com -- 99% Devil, 1% Angel