Subject: [xsl] cleaning up ill-structured html
From: Jim_Albright@xxxxxxxxxxxx
Date: Fri, 24 Jan 2003 13:41:10 -0500
With this input:
<p>Some <i>stuff</i>
that should be cleaned.<br/>
More <b>stuff.</b>
<p>
Yet more.<br>
</p>
Stuff.
</p>
I get this XML output, which you can then clean up with XSLT:
<sample>
<p>Some <emphasis>stuff</emphasis> that should be cleaned.</p>
<paragraph>More <strong>stuff.</strong></paragraph>
<p>Yet more.</p>
<paragraph>Stuff.</paragraph>
</sample>
Using this XML control file:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE convert2xml SYSTEM "c:\d\xml\convert2xml.dtd" >
<!--
file: HTML-cleanup.ctl
Purpose: Control file for c2x program
Author: jaa
Date: 20020124
Clean up dirty HTML and make it into good XML
-->
<convert2xml>
<root-element name="sample">
</root-element>
<recognize-element name="paragraph">
<start-token>
<pattern>\pp</pattern>
<before>
</before>
</start-token>
<end-token>
<pattern>
</p></pattern>
</end-token>
<allowed-child ref="emphasis"/>
<allowed-child ref="strong"/>
</recognize-element>
<recognize-element name="p">
<start-token>
<pattern><p>
</pattern>
<before>
</before>
</start-token>
<start-token>
<pattern><p></pattern>
<before>
</before>
</start-token>
<end-token>
<pattern></p></pattern>
</end-token>
<end-token>
<pattern><b>
</p></pattern>
</end-token>
<end-token>
<pattern><br/>
</pattern>
<parsed-after>\pp</parsed-after>
</end-token>
<end-token>
<pattern><br/>
</p></pattern>
<parsed-after>\pp</parsed-after>
</end-token>
<end-token>
<pattern><br>
</p>
</pattern>
<parsed-after>\pp</parsed-after>
</end-token>
<end-token>
<pattern><br/></pattern>
<parsed-after>\pp</parsed-after>
</end-token>
<end-token>
<pattern><br></pattern>
</end-token>
<end-token>
<pattern>
</p></pattern>
</end-token>
<allowed-child ref="emphasis"/>
<allowed-child ref="strong"/>
</recognize-element>
<recognize-element name="emphasis">
<start-token>
<pattern><i></pattern>
</start-token>
<end-token>
<pattern></i></pattern>
</end-token>
<end-token>
<pattern></i>
</pattern>
<after> </after>
</end-token>
</recognize-element>
<recognize-element name="strong">
<start-token>
<pattern><b></pattern>
</start-token>
<end-token>
<pattern></b></pattern>
</end-token>
<end-token>
<pattern></b>
</pattern>
</end-token>
</recognize-element>
</convert2xml>
The control file is for a free program called C2X (convert to XML).
Ask me off-list if you want more info, as C2X is off-topic here.
Date: Thu, 23 Jan 2003 21:54:43 +0100
From: Ole Sandum <osandum@xxxxxxxxxxx>
Subject: [xsl] cleaning up ill-structured html
Example:
<p>Some <i>stuff</i>
that should be cleaned.<br/>
More <b>stuff.</b>
<p>
Yet more.<br>
</p>
Stuff.
</p>
Should become:
<p>Some <i>stuff</i> that should be cleaned.</p>
<p>More <b>stuff.</b></p>
<p>Yet more.</p>
<p>Stuff.</p>
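
To make the requested transformation concrete, here is a minimal Python sketch of one way to get from the example input to the desired output. It is not the C2X approach and not an XSLT solution; it assumes that every `<p>`, `</p>`, `<br>` and `<br/>` can be treated as a paragraph boundary, and the names `BOUNDARY`, `SAMPLE` and `clean_paragraphs` are made up for illustration. It is regex-based rather than a real HTML parser, so it will not handle tags with attributes (e.g. `<p class="x">`) or inline elements left open across a boundary.

```python
import re

# Assumption: each of these tags marks a paragraph boundary in the dirty input.
BOUNDARY = re.compile(r"</?p\s*>|<br\s*/?>", re.IGNORECASE)

# The example input from the message above.
SAMPLE = """<p>Some <i>stuff</i>
that should be cleaned.<br/>
More <b>stuff.</b>
<p>
Yet more.<br>
</p>
Stuff.
</p>
"""


def clean_paragraphs(html):
    """Split on boundary tags and wrap each non-empty chunk in <p>...</p>."""
    paragraphs = []
    for chunk in BOUNDARY.split(html):
        # Collapse line breaks and runs of whitespace into single spaces.
        text = " ".join(chunk.split())
        if text:  # skip chunks that held only whitespace between boundaries
            paragraphs.append("<p>" + text + "</p>")
    return paragraphs


if __name__ == "__main__":
    print("\n".join(clean_paragraphs(SAMPLE)))
```

Run on the sample, this prints the four `<p>` lines shown under "Should become:". Inline markup such as `<i>` and `<b>` is passed through untouched.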
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list