Re: [xsl] cleaning up ill-structured html

Subject: Re: [xsl] cleaning up ill-structured html
From: S Woodside <sbwoodside@xxxxxxxxx>
Date: Thu, 23 Jan 2003 17:43:26 -0500
It is probably best to start over with the HTML and use regular expressions, e.g. with Perl. This is not really an XSL question: the HTML you are starting with is not tree-structured, so XSL is no help.
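
To make the regular-expression idea concrete, here is a rough sketch. It is in Python rather than Perl, and the function name `flatten_paragraphs` and the `SAMPLE` string are made up for illustration. It assumes every `<p>`, `</p>` and `<br>` tag can be treated as a paragraph boundary and that collapsing whitespace is acceptable, which would not hold for content such as `<pre>` blocks.

```python
import re

# Any <p ...>, </p>, <br>, <br/> or <br /> tag counts as a paragraph boundary.
# \b keeps tags such as <pre> or <param> from matching the <p> alternative.
BOUNDARY = re.compile(r"</?p\b[^>]*>|<br\b[^>]*>", re.IGNORECASE)

def flatten_paragraphs(html):
    """Split on boundary tags and return a flat list of non-empty <p> strings."""
    paragraphs = []
    for chunk in BOUNDARY.split(html):
        # Collapse runs of whitespace (including newlines) to single spaces.
        text = " ".join(chunk.split())
        if text:  # drop empty paragraphs produced by adjacent boundaries
            paragraphs.append("<p>" + text + "</p>")
    return paragraphs

# Hypothetical test input, copied from the example in the question.
SAMPLE = """
   <p>Some <i>stuff</i>
   that should be cleaned.<br/>
   More <b>stuff.</b>
   <p>
   Yet more.<br>
   </p>
   Stuff.
   </p>
"""

if __name__ == "__main__":
    for p in flatten_paragraphs(SAMPLE):
        print(p)
```

Inline markup such as `<i>` and `<b>` passes through untouched because only the boundary tags are matched.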

simon

On Thursday, January 23, 2003, at 03:54 PM, Ole Sandum wrote:

I have the task of migrating a number of legacy html
pages that were authored without regard to proper
structuring. Body text paragraphs are delimited by any
combination of <p> (sometimes nested!) and runs of
<br>. I would like the result to consist of a flat list
of non-empty <p>'s.

I use JTidy to get into proper XML, but still face the
challenge of flattening the nested <p>'s and converting
runs of <br>'s to <p>'s.

Example:

   <p>Some <i>stuff</i>
   that should be cleaned.<br/>
   More <b>stuff.</b>
   <p>
   Yet more.<br>
   </p>
   Stuff.
   </p>

Should become:

   <p>Some <i>stuff</i> that should be cleaned.</p>
   <p>More <b>stuff.</b></p>
   <p>Yet more.</p>
   <p>Stuff.</p>

I assume it is easiest to do in two steps, first (step
1) convert into something like this:

   <break/>
   Some <i>stuff</i> that should be cleaned.
   <break/>
   More <b>stuff.</b>
   <break/>
   Yet more.
   <break/>
   <break/>
   Stuff.
   <break/>

and then (step 2) detect contiguous runs of
non-<break/> nodes and wrap each run in <p></p>.

Do I make sense?

I can do step 1, but step 2 gives me trouble. To
formalise: how do I convert a structure like

<break/>+ { other+ <break/>+ }*

into

{ <p> other+ </p> }*

I fear the solution is really simple. Any ideas?
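
The grouping asked for in step 2 can be sketched outside XSL as follows. This is only an illustration in Python, not an XSLT answer: it assumes the output of step 1 is already available as a flat list of serialized nodes, and the names `BREAK` and `wrap_runs` are hypothetical.

```python
from itertools import groupby

BREAK = "<break/>"  # hypothetical marker produced by step 1

def wrap_runs(nodes):
    """Wrap each contiguous run of non-break nodes in a single <p>...</p>."""
    result = []
    # groupby yields consecutive runs that share the same key, so runs of
    # breaks and runs of other nodes alternate.
    for is_break, run in groupby(nodes, key=lambda node: node == BREAK):
        if not is_break:
            result.append("<p>" + "".join(run) + "</p>")
    return result

if __name__ == "__main__":
    nodes = [
        BREAK, "Some ", "<i>stuff</i>", " that should be cleaned.",
        BREAK, "More ", "<b>stuff.</b>",
        BREAK, "Yet more.",
        BREAK, BREAK, "Stuff.",
        BREAK,
    ]
    for p in wrap_runs(nodes):
        print(p)
```

Because consecutive breaks collapse into one run that is simply skipped, no empty paragraphs are produced.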

Thanks,
Ole Sandum, osandum@xxxxxxxxxxx




XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list



---
     anti-spam: do not post this address publicly
www.simonwoodside.com -- 99% Devil, 1% Angel

