RE: [xsl] HTML section headings to XML document sections

Subject: RE: [xsl] HTML section headings to XML document sections
From: DPawson@xxxxxxxxxxx
Date: Thu, 9 Aug 2001 10:03:19 +0100
See http://www.dpawson.co.uk/xsl/sect2/flatfile.html#d130e254

I think this answers it well.
Thanks to JT.

Regards DaveP

> -----Original Message-----
> From: Michel Goossens [mailto:Michel.Goossens@xxxxxxx]
> Sent: 09 August 2001 08:01
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Cc: Michel Goossens
> Subject: [xsl] HTML section headings to XML document sections
> 
> 
> I have a lot of XHTML documents (mostly sanitized HTML with tidy and saved
> with the -asxml option) that I would like to transform into XML (e.g.,
> DocBook). The structure of HTML is however drastically different in
> that standard HTML does not mark up the hierarchical subdivisions of a
> document apart from indicating the start of each level by <h1>, <h2>,
> <h3>, etc. Therefore my problem is to find a general algorithm, probably
> using recursion, to transform an HTML document into a valid XML equivalent,
> in particular indicating its hierarchical structure. For instance, suppose
> I have an HTML source like this:
> 
> <html>
> <h1>...</h1>....
> <h2>...</h2>....
> <h2>...</h2>....
> <h3>...</h3>....
> <h1>...</h1>....
> <h2>...</h2>....
> <h3>...</h3>....
> <h3>...</h3>....
> <h2>...</h2>....
> </html>
> 
> this should become something like
> 
> <html>
> <sect1><title>...</title>
> ....
> <sect2><title>...</title>        
> ....
> </sect2>
> <sect2><title>...</title>
> ....
> <sect3><title>...</title>
> ....
> </sect3>
> </sect2>
> </sect1>
> <sect1><title>...</title>
> ....
> <sect2><title>...</title>
> ....
> <sect3><title>...</title>
> ....
> </sect3>
> <sect3><title>...</title>
> ....
> </sect3>
> </sect2>
> <sect2><title>...</title>
> ....
> </sect2>
> </sect1>
> </html>
> 
> So the question is how to know, each time an <hx> (h1, h2, h3, ...) element
> is encountered, which "open h" levels are at the same depth as or deeper
> than the current element, so that we can "close" them. In particular, before
> exiting the document we should also close the complete hierarchy correctly.
> 
> I have read with interest an article by Benoit Marchal mentioned here
> recently: "recurse, not divide, to conquer", where he describes the use of
> recursion for "hierarchising" a flat document, but I cannot really see how
> to apply his approach in the present case without somehow also knowing the
> "state" (hierarchical level) at the given point in the document. Reading
> the discussion of recursion in MK's book or in "Professional XSL" did not
> make me a lot wiser on how to solve this in an elegant way. Therefore, all
> suggestions are very welcome. Thanks in advance. mg
> 
> Dr. Michel Goossens              Phone:(+41 22) 767-4902
> CERN, IT Division                Fax:  (+41 22) 767-8630
> CH-1211 Geneva 23, Switzerland   Email: michel.goossens@xxxxxxx
> 
> 
>  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
> 
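The "state" the question asks about (which heading levels are open when
the next <hx> arrives) can be illustrated with an explicit stack. What
follows is only a minimal sketch in Python, not XSLT, and not the approach
described at the URL above. The function name nest_sections is made up for
the illustration. It assumes the headings are direct children of the root
element, that element names carry no namespace prefix, and that each
heading contains plain text only.

```python
import copy
import xml.etree.ElementTree as ET

def nest_sections(flat_root):
    """Build a new tree in which each h1..h6 child of flat_root opens a sectN."""
    out = ET.Element(flat_root.tag)
    # Stack of (level, element) pairs for the currently "open" sections.
    # Level 0 stands for the output root, which is never closed.
    stack = [(0, out)]
    for child in flat_root:
        tag = child.tag
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            level = int(tag[1])
            # "Close" every open section at the same level or deeper.
            while stack[-1][0] >= level:
                stack.pop()
            sect = ET.SubElement(stack[-1][1], "sect%d" % level)
            title = ET.SubElement(sect, "title")
            title.text = child.text  # assumption: heading holds plain text only
            stack.append((level, sect))
        else:
            # Non-heading content goes into the innermost open section.
            stack[-1][1].append(copy.deepcopy(child))
    return out

if __name__ == "__main__":
    src = ET.fromstring(
        "<html><h1>A</h1><p>a</p><h2>B</h2><h2>C</h2><h3>D</h3>"
        "<h1>E</h1><h2>F</h2></html>"
    )
    print(ET.tostring(nest_sections(src), encoding="unicode"))
```

Because the result is built as a tree, nothing special is needed at the end
of the document: whatever is still on the stack is already properly nested.
A heading that skips a level (an <h3> directly after an <h1>) simply becomes
a sect3 inside the sect1; whether that is acceptable depends on the target
DTD.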




