Re: [xsl] Running XSLT from Python

Subject: Re: [xsl] Running XSLT from Python
From: "Martin Honnen martin.honnen@xxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 17 Jan 2025 23:23:52 -0000
On 18/01/2025 00:15, dvint@xxxxxxxxx wrote:
First off, is anyone aware of a good way to merge a bunch of HTML
techdoc pages into a single HTML file so a PDF can be generated with
something like Prince or WeasyPrint? I didn't find anything, so I went
down the following path.

For this effort I decided to see what Copilot would come up with for
the task. It has been an interesting experiment for the proof-of-concept
effort, but now I need to get this production ready. I was also
initially trying to avoid XSLT, as I'm the only one on the team
who likes XSLT and I was processing HTML that isn't well-formed.

Copilot initially created some Python using BeautifulSoup. My first
discovery is that BeautifulSoup seems to be good for extracting
content from the HTML, but I haven't found a way to process it like
XSLT - maybe my mind has been warped by XSL and tools like OmniMark
and I just don't see the path. Anyway, after trying to do the job with
BeautifulSoup, I started looking for a way to integrate XSLT, and
Copilot took me to lxml/etree.

With etree I was able to start developing the core part of the
processing. Here is the general flow of the program:
1) Extract the navigation/TOC from one of the HTML files. I did this
with BeautifulSoup because the HTML is not well-formed and I just
needed to extract a single element.
2) I processed all the HTML and made a new copy in a subfolder. Using
BeautifulSoup again, I extracted the body of the HTML pages. The body
content is well-formed, the head content isn't.
3) Using the extracted TOC/navigation from step 1 to drive the
processing, I created an XSLT that takes that information and
processes the extracted content. I've been able to get a
single HTML file with all of the content. I had to create unique IDs
for all the sections and modify the cross references, changing them
from file references to links to anchors in the new file.
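
The three steps above can be condensed into a sketch like this (it
assumes beautifulsoup4 and lxml are installed; the sample page and the
stylesheet are stand-ins, not the poster's actual code):

```python
# Sketch only (beautifulsoup4 and lxml assumed installed); the page and
# the stylesheet are stand-ins for the real files and the real merge XSLT.
from bs4 import BeautifulSoup
from lxml import etree

page = ('<html><head><meta charset="utf-8"><title>Ch 1</title></head>'
        '<body><p id="intro">Intro text</p></body></html>')

# Steps 1/2: tolerant parse with BeautifulSoup, keep only the body,
# which (unlike the head) is well-formed.
soup = BeautifulSoup(page, "html.parser")
body_markup = str(soup.body)

# Step 3: the well-formed body can be re-parsed as real XML and handed
# to an XSLT 1.0 stylesheet via lxml.
body_doc = etree.fromstring(body_markup)
transform = etree.XSLT(etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="body">
    <section><xsl:copy-of select="p"/></section>
  </xsl:template>
</xsl:stylesheet>"""))
print(str(transform(body_doc)))
```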

All of that works great until there are errors in the HTML. This
HTML is generated with AsciiDoc. Occasionally a writer will put
quotes in the alt text for an image. This results in mangled image
references that don't affect the visual rendering of the HTML, but
XSLT trips up on them. Other bad AsciiDoc has created other
mangled HTML which, again, isn't reported and doesn't affect the visual
result. When the XSLT hits this in oXygen, I get reasonable error
messages that tell me what the problem is. From Python I just get a
message that it failed, plus the filename.

Can you confirm my understanding that there isn't a way to get the
XSLT errors and the xsl:message strings I've created? Or maybe Saxon
in oXygen is just providing better information than lxml can?



Perhaps post minimal but complete samples of XML and XSLT and the "reasonable error messages" that you get. I am afraid it is not clear what you are doing in XSLT and how that fails in Python to produce a reasonable error message. Currently it sounds like your input is not well-formed and the XML parser fails, although that doesn't explain why you would be able to output an xsl:message, unless you are using xsl:try on fn:parse-xml and use xsl:message in xsl:catch.
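
On the Python side of the question: lxml does collect libxslt errors
(and xsl:message output) in an error log on the transform object, which
is usually much more informative than the one-line exception text. A
minimal sketch, assuming lxml is installed (the stylesheet here is a
stand-in):

```python
# Minimal sketch (lxml assumed installed; the stylesheet is a stand-in).
# lxml collects XSLT errors *and* xsl:message output in an error log on
# the transform object (and on a raised XSLTApplyError).
from lxml import etree

transform = etree.XSLT(etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:message>processing started</xsl:message>
    <done/>
  </xsl:template>
</xsl:stylesheet>"""))

try:
    result = transform(etree.XML("<doc/>"))
except etree.XSLTApplyError as err:
    # On failure, the detail (including xsl:message text) is on the
    # exception's error_log, not in the terse exception message.
    for entry in err.error_log:
        print(entry.line, entry.domain_name, entry.message)
else:
    # xsl:message output from a successful run lands here as well.
    for entry in transform.error_log:
        print(entry.message)
```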

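For completeness, the xsl:try on fn:parse-xml with xsl:message in
xsl:catch pattern would look roughly like the fragment below. Note
that it requires an XSLT 3.0 processor such as Saxon; lxml/libxslt
only implements XSLT 1.0 and has no xsl:try. The "file" element and
"href" attribute are placeholders for whatever the TOC actually uses.

```xml
<!-- Illustrative fragment only: requires an XSLT 3.0 processor such as
     Saxon. "file" and "@href" are placeholders, not the real TOC names. -->
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:err="http://www.w3.org/2005/xqt-errors">

  <xsl:template match="file">
    <xsl:try>
      <xsl:copy-of select="parse-xml(unparsed-text(@href))"/>
      <xsl:catch>
        <xsl:message select="'Not well-formed: ' || @href
                             || ' - ' || $err:description"/>
      </xsl:catch>
    </xsl:try>
  </xsl:template>

</xsl:stylesheet>
```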