[xsl] cleaning up ill-structured html

Subject: [xsl] cleaning up ill-structured html
From: Jim_Albright@xxxxxxxxxxxx
Date: Fri, 24 Jan 2003 13:41:10 -0500

With this input:

<p>Some <i>stuff</i>
that should be cleaned.<br/>
More <b>stuff.</b>
<p>
Yet more.<br>
</p>
Stuff.
</p>

I get this XML output, which you can then clean up with XSLT:

<sample>
<p>Some <emphasis>stuff</emphasis> that should be cleaned.</p>
<paragraph>More <strong>stuff.</strong></paragraph>
<p>Yet more.</p>
<paragraph>Stuff.</paragraph>
</sample>

The output was produced with this XML control file:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE convert2xml SYSTEM "c:\d\xml\convert2xml.dtd" >

<!--

file:       HTML-cleanup.ctl
Purpose:    Control file for c2x program
Author:     jaa
Date:       20020124

            Clean up dirty HTML and make it into good XML
-->


<convert2xml>
<root-element name="sample">
      </root-element>
<recognize-element name="paragraph">
      <start-token>
            <pattern>\pp</pattern>
            <before>&#xa;</before>
      </start-token>
      <end-token>
            <pattern>&#xa;&lt;/p></pattern>
      </end-token>
      <allowed-child ref="emphasis"/>
      <allowed-child ref="strong"/>
</recognize-element>

<recognize-element name="p">
      <start-token>
            <pattern>&lt;p>&#xa;</pattern>
            <before>&#xa;</before>
      </start-token>
      <start-token>
            <pattern>&lt;p></pattern>
            <before>&#xa;</before>
      </start-token>
      <end-token>
            <pattern>&lt;/p></pattern>
      </end-token>
      <end-token>
            <pattern>&lt;b>&#xa;&lt;/p></pattern>
      </end-token>
      <end-token>
            <pattern>&lt;br/>&#xa;</pattern>
            <parsed-after>\pp</parsed-after>
      </end-token>
      <end-token>
            <pattern>&lt;br/>&#xa;&lt;/p></pattern>
            <parsed-after>\pp</parsed-after>
      </end-token>
      <end-token>
            <pattern>&lt;br>&#xa;&lt;/p>&#xa;</pattern>
            <parsed-after>\pp</parsed-after>
      </end-token>
      <end-token>
            <pattern>&lt;br/></pattern>
            <parsed-after>\pp</parsed-after>
      </end-token>
      <end-token>
            <pattern>&lt;br></pattern>
      </end-token>
      <end-token>
            <pattern>&#xa;&lt;/p></pattern>
      </end-token>

      <allowed-child ref="emphasis"/>
      <allowed-child ref="strong"/>
</recognize-element>

<recognize-element name="emphasis">
      <start-token>
            <pattern>&lt;i></pattern>
      </start-token>
      <end-token>
            <pattern>&lt;/i></pattern>
      </end-token>
      <end-token>
            <pattern>&lt;/i>&#xa;</pattern>
            <after> </after>
      </end-token>
</recognize-element>

<recognize-element name="strong">
      <start-token>
            <pattern>&lt;b></pattern>
      </start-token>
      <end-token>
            <pattern>&lt;/b></pattern>
      </end-token>
      <end-token>
            <pattern>&lt;/b>&#xa;</pattern>
      </end-token>
</recognize-element>

</convert2xml>

The control file is run by a free program called C2X (convert to XML).

Ask me off list if you want more info as C2X is off topic.
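
For the XSLT cleanup step mentioned above, here is a minimal sketch of what
it might look like. It assumes the C2X output shown earlier (a <sample> root
containing <p>, <paragraph>, <emphasis> and <strong>), and it assumes the
target names are the ones from the original question (<p>, <i>, <b>). The
mapping is my guess at what is wanted, not something C2X itself prescribes.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml"/>

  <!-- Identity template: copy anything not matched below unchanged
       (this keeps the <sample> root, the existing <p> elements, and text). -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumed mapping: <paragraph> becomes <p>. -->
  <xsl:template match="paragraph">
    <p><xsl:apply-templates select="@*|node()"/></p>
  </xsl:template>

  <!-- Assumed mapping: <emphasis> becomes <i>. -->
  <xsl:template match="emphasis">
    <i><xsl:apply-templates select="@*|node()"/></i>
  </xsl:template>

  <!-- Assumed mapping: <strong> becomes <b>. -->
  <xsl:template match="strong">
    <b><xsl:apply-templates select="@*|node()"/></b>
  </xsl:template>

</xsl:stylesheet>
```

Any XSLT 1.0 processor should be able to run this, for example
"xsltproc cleanup.xsl output.xml" (both file names are hypothetical). The
<sample> root is kept so that the result stays well-formed; drop or rename
it in a further template if you need a different wrapper.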

Date: Thu, 23 Jan 2003 21:54:43 +0100
From: Ole Sandum <osandum@xxxxxxxxxxx>
Subject: [xsl] cleaning up ill-structured html

Example:

    <p>Some <i>stuff</i>
    that should be cleaned.<br/>
    More <b>stuff.</b>
    <p>
    Yet more.<br>
    </p>
    Stuff.
    </p>

Should become:

    <p>Some <i>stuff</i> that should be cleaned.</p>
    <p>More <b>stuff.</b></p>
    <p>Yet more.</p>
    <p>Stuff.</p>





 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

