Subject: Re: html to xml
From: tra@xxxxxxxxxxxxxxx (Thorbjoern Ravn Andersen)
Date: Fri, 27 Oct 2000 14:17:07 +0200

* Sebastian Rahtz <sebastian.rahtz@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> [Oct 27. 2000 12:58]:
> Lisa van Gelder writes:
> > The basic problem is that the html you are getting is not structured enough
> > for your purposes.
> >
> > I had the same problem, and solved it by setting rules for how the html
> > could be structured, so it could be converted into xml more easily. I do not
> > allow any text that is not surrounded by tags.
>
> I was afraid someone would say that. My problem is that the task is to
> convert our existing web pages (6196 documents, at last count) to (TEI DTD)
> XML. So I have no control over the original coding. So the conclusion
> is, I guess, "clean up the HTML minimally even before running tidy".

Could you introduce an XSLT step that says that all text() nodes with an
h1..h6 tag as their immediate parent should be enclosed in <p> tags?

--
Thorbjørn Ravn Andersen     "...sound of... Tubular Bells!"
http://bigfoot.com/~thunderbear

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
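
A minimal sketch of what such an XSLT step might look like, offered only as an illustration of the idea and not as anything actually posted in the thread. It assumes XSLT 1.0, that the input has already been made well-formed (for example by tidy), and that the heading elements are lowercase and in no namespace; if the input uses the XHTML namespace, the match patterns would need a namespace prefix.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute unchanged
       unless a more specific template below applies. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text nodes whose immediate parent is h1..h6 are wrapped in <p>.
       The predicate skips whitespace-only text nodes, which fall
       through to the identity template and are copied as-is. -->
  <xsl:template match="h1/text()[normalize-space()]
                     | h2/text()[normalize-space()]
                     | h3/text()[normalize-space()]
                     | h4/text()[normalize-space()]
                     | h5/text()[normalize-space()]
                     | h6/text()[normalize-space()]">
    <p><xsl:value-of select="."/></p>
  </xsl:template>

</xsl:stylesheet>
```

The heading patterns have a higher default priority than the generic `@*|node()` pattern, so no explicit priority attribute should be needed. Text that sits inside an inline child of a heading (say, `<h2><b>Title</b></h2>`) is not an immediate child of the heading and is left untouched by this sketch.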