Re: HTML->DocBook? (Re: HTML -> RTF)

Subject: Re: HTML->DocBook? (Re: HTML -> RTF)
From: Oisin McGuinness <oisin@xxxxxxxx>
Date: Wed, 24 May 2000 12:11:47 -0400

Since Gary indicated he had no archives for last year
for comp.text.sgml, I'm posting a copy of
Christopher Browne's posting from last year.

Please excuse the length, it seems to be wanted...

Everything between <quote> and </quote> is his.

I have not tested this extensively.
I didn't find any newer version of this on his web site (in the .sig at the end).

Oisin McGuinness

>From - Fri Jun 11 12:57:55 1999
Xref: comp.text.sgml:15395
From: cbbrowne@xxxxxxxxxxxx (Christopher Browne)
Newsgroups: comp.text.sgml
Subject: Re: Search for Holy Grail: {html,ps,text}2sgml
References: <7jobbo$2urc$1@xxxxxxx> <7joh0t$337$2@xxxxxxxxxxxxxxxxxxxx> <7joif7$ok$1@xxxxxxx>
Reply-To: cbbrowne@xxxxxxx
X-Newsreader: slrn ( Windows)
Lines: 188
Message-ID: <TBY73.3916$_m4.78408@xxxxxxxxxxxxxxxxxx>
NNTP-Posting-Date: Thu, 10 Jun 1999 19:18:59 CDT
Organization: Giganews.Com - Premium News Outsourcing
X-Trace: sv1-sR9dUz5KKnfUgtyW/83eo35REq2btQPNoz54fJRvop128On8quxwsI2oWRUyCF7C9EL954AEHFrduWO!UEMnKV2D7tk=
X-Complaints-To: abuse@xxxxxxxxxxxx
X-Abuse-Info: Please be sure to forward a copy of ALL headers
X-Abuse-Info: Otherwise we will be unable to process your complaint properly
Date: Fri, 11 Jun 1999 00:18:59 GMT

On 10 Jun 1999 14:35:19 GMT, Marc G. Fournier <scrappy@xxxxxxx> wrote:
>jdassen@xxxxxxxxxxxxxxxx (J.H.M. Dassen (Ray)) writes:
>>Marc G. Fournier <scrappy@xxxxxxx> wrote:
>>>	I swear, I'm searching for the Holy Grail here, its about as
>>It is impossible, in any meaningful sense.
>>SGML is about document structure. PostScript, plain ASCII and to some
>How is it that the html2sgml(linuxdoc) converter works then?  

It's impossible to come up with a provably complete system that will make the
SGML document "colloquial" for its DTD.

I use the DSSSL listed below to turn HTML that uses a small subset of
the available HTML tags into something that's pretty easy to integrate
into DocBook. 

It's useful enough for expressing the very limited structuring that HTML
provides, essentially being aware of:
a) Headings <H1>, <H2>, ...
b) Paragraphs
c) Some modifiers (<TT>, <B>)
d) Itemized lists
e) URLs

That's a tiny subset of HTML, and is mapped onto a tiny subset of what
DocBook offers.  It happens to be enough to be fairly useful.  But I'd
not call it a complete "conversion."  

And to convert documents (say) in PostScript, where it may not even be
possible to group more than lines of text together, into SGML *or any
other structured system* is nigh unto impossible. 

<!doctype style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">

(define debug
  (external-procedure "UNREGISTERED::James Clark//Procedure::debug"))

(declare-flow-object-class element
  "UNREGISTERED::James Clark//Flow Object Class::element")
(declare-flow-object-class empty-element
  "UNREGISTERED::James Clark//Flow Object Class::empty-element")
(declare-flow-object-class document-type
  "UNREGISTERED::James Clark//Flow Object Class::document-type")
(declare-flow-object-class processing-instruction
  "UNREGISTERED::James Clark//Flow Object Class::processing-instruction")

(declare-characteristic preserve-sdata?
  "UNREGISTERED::James Clark//Characteristic::preserve-sdata?"
  #f)

;; Return the attributes of nd as a list of (name value) pairs,
;; skipping attributes that have no value.
(define (copy-attributes #!optional (nd (current-node)))
  (let loop ((atts (named-node-list-names (attributes nd))))
    (if (null? atts)
        '()
        (let* ((name (car atts))
               (value (attribute-string name nd)))
          (if value
              (cons (list name value)
                    (loop (cdr atts)))
              (loop (cdr atts)))))))

(default (if (node-property 'momitend (current-node))
		(make empty-element attributes: (copy-attributes))
		(make element attributes: (copy-attributes))))

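(element HTML
    (make sequence
	(make document-type 
		name: "ARTICLE" 
		public-id: "-//Davenport//DTD DocBook V3.0//EN")
	;; Assumed reconstruction: the rest of this rule was lost in the
	;; archive; wrapping the content in an Article element is a guess.
	(make element gi: "Article")))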

(element article (make element))

(element title (make element))

(element head
    (make element gi: "Artheader"))

(element BODY
        (make element gi: "Para"))

(element h1
    (make element gi: "Sect1" ))

(element h2
    (make element gi: "Sect2" ))

(element h3
    (make element gi: "Sect3" ))

(element h4
    (make element gi: "Sect4" ))

(element h5
    (make element gi: "Sect5" ))

(element heading
    (make element gi: "Title"))

(element p
    (make element gi: "Para"))

(element tt
    (make element gi: "Literal"
	attributes: `(("remap" "tt")))) ;; fixme

(element tscreen (process-children)) ; FIXME

(element ul
    (make element gi: "ItemizedList"))

(element li
   (make element gi: "ListItem" 
	(make element gi: "Para")))

(element URL
    (make element gi: "ULink"
	  attributes: `(("URL" ,(attribute-string "URL")))
	  (if (attribute-string "NAME")
		(literal (attribute-string "NAME"))
		(literal (attribute-string "URL")))))

(element IMG
   (make element gi: "Inlinegraphic"
        attributes: `(("Fileref" ,(attribute-string "SRC")))))

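;; An A with HREF becomes a ULink; otherwise it is treated as a named
;; anchor.  The "if" and the HREF argument were lost in the archive and
;; are restored here from the surrounding structure.
(element A
    (if (attribute-string "HREF")
        (make element gi: "Ulink"
               attributes: `(("URL" ,(attribute-string "HREF"))))
        (make element gi: "Anchor"
               attributes: `(("ID" ,(attribute-string "NAME"))))))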

(element label 
   (make empty-element gi: "Anchor"
	attributes: (copy-attributes)))

(element ol
    (make element gi: "OrderedList"))

(element em
    (make element gi: "Emphasis"))

(element bf
    (make element gi: "Literal"
		  attributes: `(("remap" "bf"))))

(element pre
    (make element gi: "ProgramListing"))

(element quotep (process-children))

(element dl
   (make element gi: "GlossList"
	(process-matching-children "DT")))

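;; Process the siblings that follow the current node, stopping at the
;; next DT or at a node with no GI.  The "accum" result branch and the
;; process-node-list call are reconstructed; the archived copy lacked them.
(define (get-sibs)
    (let loop ( (rest (follow (current-node)))
		(accum (empty-sosofo)))
	(let ( (tag (gi (node-list-first rest))))
	    (if (or (not tag)
		    (string=? tag "DT"))
		accum
		(loop (node-list-rest rest)
		    (sosofo-append accum 
			    (process-node-list (node-list-first rest))))))))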

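(element DT
   (make element gi: "GlossEntry"
        (make element gi: "GlossTerm")
        (make element gi: "GlossDef" 
              ;; Assumed reconstruction: the end of this rule was lost in
              ;; the archive; get-sibs (above) is the likely content.
              (get-sibs))))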

(element BR
    (make element gi: "Emphasis"))
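
;; The list of handled modifiers earlier in this posting mentions <B>,
;; but the rules here cover only the linuxdoc-style bf.  What follows is
;; a minimal, untested sketch of a matching rule, modelled on the tt and
;; bf rules.  The choice of Emphasis and the remap value "b" are
;; assumptions, not part of the original posting.
(element B
    (make element gi: "Emphasis"
          attributes: `(("remap" "b"))))
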
OS/2: Why marketing matters more than technology...
cbbrowne@xxxxxxxx <>

