Subject: Re: [xsl] HTML section headings to XML document sections
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Thu, 09 Aug 2001 11:55:11 -0400
h1 p p h3 p p p
Enjoy, Wendell
I have a lot of XHTML documents (mostly HTML sanitized with tidy and saved with the -asxml option) that I would like to transform into XML (e.g., DocBook). The structure of HTML is, however, drastically different: standard HTML does not mark up the hierarchical subdivisions of a document, apart from indicating the start of each level with <h1>, <h2>, <h3>, etc. My problem is therefore to find a general algorithm, probably using recursion, to transform an HTML document into a valid XML equivalent that makes its hierarchical structure explicit. For instance, suppose I have an HTML source like this:
<html>
<h1>...</h1>....
<h2>...</h2>....
<h2>...</h2>....
<h3>...</h3>....
<h1>...</h1>....
<h2>...</h2>....
<h3>...</h3>....
<h3>...</h3>....
<h2>...</h2>....
</html>
This should become something like
<html>
<sect1><title>...</title> ....
  <sect2><title>...</title> ....
  </sect2>
  <sect2><title>...</title> ....
    <sect3><title>...</title> ....
    </sect3>
  </sect2>
</sect1>
<sect1><title>...</title> ....
  <sect2><title>...</title> ....
    <sect3><title>...</title> ....
    </sect3>
    <sect3><title>...</title> ....
    </sect3>
  </sect2>
  <sect2><title>...</title> ....
  </sect2>
</sect1>
</html>
So the question is how to know, each time an <hx> (h1, h2, h3, ...) element is encountered, which "open" sections are at the same level as the current element or deeper (that is, opened by a heading whose number is greater than or equal to the current one), so that we can "close" them. In particular, before exiting the document we should also close the complete hierarchy correctly.
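To make the "open levels" bookkeeping concrete, here is a minimal sketch in Python rather than XSLT. It is only an illustration of the state described above, not the stylesheet solution being asked for. It assumes the headings and other content are direct children of the root element, and the names sectionize and open_sections are hypothetical.

```python
import copy
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"^h([1-6])$")

def sectionize(flat_root):
    """Return a new tree in which each hN child starts a nested sectN."""
    out_root = ET.Element(flat_root.tag)
    # Stack of (level, element) pairs; level 0 stands for the document root.
    open_sections = [(0, out_root)]
    for child in flat_root:
        match = HEADING.match(child.tag)
        if match:
            level = int(match.group(1))
            # "Close" every open section at the same level or deeper.
            while open_sections[-1][0] >= level:
                open_sections.pop()
            sect = ET.SubElement(open_sections[-1][1], "sect%d" % level)
            title = ET.SubElement(sect, "title")
            # Assumption: only the heading's text is kept, inline markup is dropped.
            title.text = "".join(child.itertext())
            open_sections.append((level, sect))
        else:
            # Any other element belongs to the innermost open section.
            open_sections[-1][1].append(copy.deepcopy(child))
    return out_root

if __name__ == "__main__":
    src = ET.fromstring(
        "<html><h1>A</h1><p>a</p><h2>B</h2><h2>C</h2><h3>D</h3><h1>E</h1></html>"
    )
    print(ET.tostring(sectionize(src), encoding="unicode"))
```

Nothing special is needed at the end of the document here, because the nesting is built as a tree and every section is closed when the tree is serialized. A heading that skips a level (an h3 directly under an h1) would produce a sect3 directly inside a sect1 in this sketch, which a DocBook DTD would presumably reject.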
I have read with interest an article by Benoit Marchal mentioned here recently: "recurse, not divide, to conquer", where he describes the use of recursion for "hierarchising" a flat document, but I cannot really see how to apply his approach in the present case without somehow also knowing the "state" (hierarchical level) at the given point in the document. Reading the discussion of recursion in MK's book or in "Professional XSL" did not make me a lot wiser on how to solve this in an elegant way. Therefore, all suggestions are very welcome. Thanks in advance. mg
Dr. Michel Goossens Phone:(+41 22) 767-4902 CERN, IT Division Fax: (+41 22) 767-8630 CH-1211 Geneva 23, Switzerland Email: michel.goossens@xxxxxxx
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
====================================================================== Wendell Piez mailto:wapiez@xxxxxxxxxxxxxxxx Mulberry Technologies, Inc. http://www.mulberrytech.com 17 West Jefferson Street Direct Phone: 301/315-9635 Suite 207 Phone: 301/315-9631 Rockville, MD 20850 Fax: 301/315-8285 ---------------------------------------------------------------------- Mulberry Technologies: A Consultancy Specializing in SGML and XML ======================================================================