Re: [xsl] HTML section headings to XML document sections

Subject: Re: [xsl] HTML section headings to XML document sections
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Thu, 09 Aug 2001 11:55:11 -0400
Michel,

The best solutions to this currently (IMHO) are Jeni's (references already posted). She and I kind of leap-frogged each other in developing a solution. (I've called it "levitation", and I'll bet you'll find my contributions in the list archives if you search for that, though that name for the problem doesn't seem to have stuck :-).) But of course Jeni writes great code *and* documents it. The solution is to treat the problem as a special case of grouping, driving it all with keys that associate each node with the node that indicates its proper place in the hierarchy (generally the heading of the implicit section it's in).
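
To make the grouping idea concrete, here is a minimal sketch in Python rather than XSLT. It is not Jeni's code or mine from the archives, only an illustration of the same principle: first compute, for every node, the heading that "owns" it, then build the tree by looking children up under their owner (the role a key plays in a stylesheet). The names `heading_level`, `build_key` and `to_sections`, the `(tag, text)` input shape, and the `para` output element are all assumptions made for this example, and the text is assumed to be already escaped.

```python
from collections import defaultdict


def heading_level(tag):
    """Return 1-6 for 'h1'..'h6', or None for any other tag."""
    if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
        return int(tag[1])
    return None


def build_key(nodes):
    """Map each node index to the index of the heading that owns it.

    A node owned by None sits at the top level of the document.
    """
    owner = {}
    open_heads = []  # indices of headings whose sections are still open
    for i, (tag, _) in enumerate(nodes):
        level = heading_level(tag)
        if level is not None:
            # A heading ends every open section at the same or a deeper level.
            while open_heads and heading_level(nodes[open_heads[-1]][0]) >= level:
                open_heads.pop()
        owner[i] = open_heads[-1] if open_heads else None
        if level is not None:
            open_heads.append(i)
    return owner


def to_sections(nodes):
    """Turn a flat list of (tag, text) pairs into nested sectN markup."""
    owner = build_key(nodes)
    children = defaultdict(list)  # owner index -> child indices, document order
    for i, parent in owner.items():
        children[parent].append(i)

    def render(i):
        tag, text = nodes[i]
        level = heading_level(tag)
        if level is None:
            return f"<para>{text}</para>"
        inner = "".join(render(c) for c in children[i])
        return f"<sect{level}><title>{text}</title>{inner}</sect{level}>"

    return "".join(render(i) for i in children[None])
```

For instance, `to_sections([("h1", "A"), ("p", "x"), ("h2", "B")])` returns `<sect1><title>A</title><para>x</para><sect2><title>B</title></sect2></sect1>`. Note that this sketch keeps a skipped level as it finds it (an h3 directly under an h1 becomes a sect3 inside a sect1).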

But I think you'll run into problems, since your incoming HTML is not likely to be very regular. For example, if (when) you get something like...

h1
 p
 p
  h3
   p
   p
   p

you need to decide whether to interpolate a missing level (one that would be headed by an h2 but just happens to have no header; these things do happen in structured text), or to promote the h3 and its following p elements to the second level. Unfortunately, which of these ways is "correct" will depend on the documents: it may vary, and from the purist's point of view it might demand interpretation on a case-by-case basis. Not good.
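
The two choices can be spelled out as policies applied to the sequence of heading levels. The following Python sketch is only an illustration under assumed names (`normalize_levels` and the policy strings "promote" and "interpolate" are invented here); it returns each emitted level together with a flag saying whether that heading is synthetic, that is, an untitled section inserted to fill a gap.

```python
def normalize_levels(levels, policy):
    """Return (emitted_level, is_synthetic) pairs with no skipped levels.

    levels: heading levels in document order, e.g. [1, 3] for h1 then h3.
    policy: "promote" lifts a too-deep heading to one below its parent;
            "interpolate" inserts untitled sections for the missing levels.
    """
    if policy not in ("promote", "interpolate"):
        raise ValueError(f"unknown policy: {policy!r}")
    result = []
    open_sections = []  # (original level, emitted level), outermost first
    for level in levels:
        # A heading closes every open section at the same or a deeper level.
        while open_sections and open_sections[-1][0] >= level:
            open_sections.pop()
        depth = open_sections[-1][1] if open_sections else 0
        if policy == "interpolate":
            # Fill each missing level with a synthetic, untitled section.
            for missing in range(depth + 1, level):
                result.append((missing, True))
                open_sections.append((missing, missing))
            emitted = level
        else:
            # Promote: the heading becomes a direct child of its parent.
            emitted = depth + 1
        result.append((emitted, False))
        open_sections.append((level, emitted))
    return result
```

With the h1/h3 case above, `normalize_levels([1, 3], "promote")` gives `[(1, False), (2, False)]`, while `normalize_levels([1, 3], "interpolate")` gives `[(1, False), (2, True), (3, False)]`. Neither answer is right in general; the sketch only makes the decision explicit so it can be made per document set.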

So it will come down to (a) how good (bad) your data actually is, and (b) how brutal you can afford to be.

Enjoy,
Wendell


At 03:01 AM 8/9/01, you wrote:
I have a lot of XHTML documents (mostly sanitized HTML with tidy and saved
with the -asxml option) that I would like to transform into XML (e.g.,
DocBook). The structure of HTML is however drastically different in
that standard HTML does not mark up the hierarchical subdivisions of a
document apart from indicating the start of each level by <h1>, <h2>,
<h3>, etc. Therefore my problem is to find a general algorithm, probably
using recursion, to transform an HTML document into a valid XML equivalent,
in particular indicating its hierarchical structure. For instance, suppose
I have an HTML source like this:

<html>
<h1>...</h1>....
<h2>...</h2>....
<h2>...</h2>....
<h3>...</h3>....
<h1>...</h1>....
<h2>...</h2>....
<h3>...</h3>....
<h3>...</h3>....
<h2>...</h2>....
</html>

this should become something like

<html>
<sect1><title>...</title>
....
<sect2><title>...</title>
....
</sect2>
<sect2><title>...</title>
....
<sect3><title>...</title>
....
</sect3>
</sect2>
</sect1>
<sect1><title>...</title>
....
<sect2><title>...</title>
....
<sect3><title>...</title>
....
</sect3>
<sect3><title>...</title>
....
</sect3>
</sect2>
<sect2><title>...</title>
....
</sect2>
</sect1>
</html>

So the question is how to know each time a <hx> (h1, h2, h3, ...) element
is encountered what are the "open h" levels less than or equal to that
of the current element, so that we can "close" them. In particular, before
exiting the document we should also close the complete hierarchy correctly.
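
The bookkeeping described in this paragraph amounts to a stack of open levels. The Python sketch below is illustrative only: XSLT has no mutable stack, so a stylesheet would reach the same result through recursion or keys. The name `section_tags` is hypothetical, and titles and body content are left out so that only the opening and closing of sections is shown.

```python
def section_tags(levels):
    """Return the sectN open/close tags for heading levels in document order."""
    open_levels = []  # stack of section levels that are currently open
    out = []
    for level in levels:
        # Close every open section at the same or a deeper level.
        while open_levels and open_levels[-1] >= level:
            out.append(f"</sect{open_levels.pop()}>")
        out.append(f"<sect{level}>")
        open_levels.append(level)
    # At the end of the document, close whatever is still open.
    while open_levels:
        out.append(f"</sect{open_levels.pop()}>")
    return out
```

Calling `section_tags([1, 2, 2, 3, 1, 2, 3, 3, 2])` on the heading sequence of the example above yields the nesting of the target document, with each sect3 inside its sect2 and each sect2 inside its sect1.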

I have read with interest an article by Benoit Marchal mentioned here
recently: "recurse, not divide, to conquer", where he describes the use of
recursion for "hierarchising" a flat document, but I cannot really see how
to apply his approach in the present case without somehow also knowing the
"state" (hierarchical level) at the given point in the document. Reading
the discussion of recursion in MK's book or in "Professional XSL" did not
make me a lot wiser on how to solve this in an elegant way. Therefore, all
suggestions are very welcome. Thanks in advance. mg

Dr. Michel Goossens              Phone:(+41 22) 767-4902
CERN, IT Division                Fax:  (+41 22) 767-8630
CH-1211 Geneva 23, Switzerland   Email: michel.goossens@xxxxxxx


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

