Date: Wed, 16 Jan 2008 15:10:16 -0800
I'm still a little confused, mainly because of the generalized cases you keep alluding to.

The tool will convert HTML pages into an XML file that contains the content and its structure, and will also produce an XSLT file. Is that correct? The "arbitrary data" is the content of such pages.

Most HTML pages are not well-formed XML. What am I missing?

Any set of data marked up as well-formed XML can be used as the data set. The documentation is stored in XML in a regular structure, as is any XML. Otherwise, the incoming data set has no special properties.
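
To make the well-formedness point concrete, here is a minimal sketch of how one might test whether a piece of markup qualifies as an input data set. It uses Python's standard xml.etree.ElementTree parser; the helper name is_well_formed and the two sample fragments are made up for illustration and are not part of the tool under discussion.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical HTML: <br> and <p> are never closed, so an XML parser rejects it.
html_fragment = "<html><body><p>First line<br>Second line</body></html>"

# The same content with every element closed is well-formed XML.
xhtml_fragment = "<html><body><p>First line<br/>Second line</p></body></html>"

print(is_well_formed(html_fragment))
print(is_well_formed(xhtml_fragment))
```

Under these assumptions, only the second fragment would be usable as a data set; ordinary HTML would first have to be cleaned up into well-formed XML by some other means.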

