Re: [xsl] Running XSLT from Python

Subject: Re: [xsl] Running XSLT from Python
From: "dvint dvint@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 22 Jan 2025 01:21:31 -0000
Ok, here are the messages I've been getting. This is from my program and
lxml:

Traceback (most recent call last):
  File "/Users/danvint/pubsrc/adoc2PDF/02-write-single-html-to-xml.py", line 213, in <module>
    doc.body.append(BeautifulSoup(str(build_content(soup_nav)), features="xml"))
                                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/danvint/pubsrc/adoc2PDF/02-write-single-html-to-xml.py", line 178, in build_content
    result = transform(xml_doc)
             ^^^^^^^^^^^^^^^^^^
  File "src/lxml/xslt.pxi", line 583, in lxml.etree.XSLT.__call__
  File "src/lxml/etree.pyx", line 332, in lxml.etree._ExceptionContext._raise_if_stored
lxml.etree.XSLTApplyError: Cannot resolve URI /Users/danvint/_stash/pingone-cloud-docs2/target/build/site/pingone/_pdf_build/strong_authentication_mfa/p1_pid_configuring_android_work_for_workspace_one_uem.html

All I get is the name of the file, with no indication of what the problem is.
The attachment shows what oXygen reported when running Saxon on the same
input. There are other benefits to running in oXygen, but at least there I
know which element has the issue, and the file is open in the editor with the
cursor on the image reference that is causing the problem.
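
For reference, lxml keeps a log of libxslt messages (including xsl:message
output) on the transformer object and on the raised exception, and that log
can carry more detail than the exception text alone. The following is only a
minimal sketch, not the code from the program above: the stylesheet, the input
document, and the helper name run_with_log are made up for illustration, and
whether the log pinpoints the failing element for a given input depends on
what libxslt and libxml2 choose to report.

```python
from lxml import etree

# Hypothetical stylesheet, only here to show where xsl:message output ends up.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:message>processing root</xsl:message>
    <out><xsl:value-of select="name(*)"/></out>
  </xsl:template>
</xsl:stylesheet>"""


def run_with_log(xslt_bytes, xml_bytes):
    """Run a transform and return (result, list of logged message strings)."""
    transform = etree.XSLT(etree.XML(xslt_bytes))
    try:
        result = transform(etree.XML(xml_bytes))
    except etree.XSLTApplyError as exc:
        # On failure, the exception carries its own copy of the error log.
        return None, [entry.message for entry in exc.error_log]
    # On success, xsl:message output is collected on the transformer.
    return result, [entry.message for entry in transform.error_log]


result, messages = run_with_log(STYLESHEET, b"<doc/>")
for line in messages:
    print(line)
```

Each log entry also has line and filename attributes, which may help narrow
down the location when the underlying parser supplies them.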

I was able to get Saxon running in Python (with some help from the list) and
as I suspected the message was more useful than what was provided by lxml.

SaxonC-HE 12.5 from Saxonica
Error on line 64 column 6 of p1_pid_configuring_android_work_for_workspace_one_uem.html:
  SXXP0003   Error reported by XML parser: Element type "img" must be followed by either
  attribute specifications, ">" or "/>".
Error at char 10 in expression in xsl:apply-templates/@select on line 104 column 68 of 02a-build_content.xslt:
  FODC0002  SXXP0003   Error reported by XML parser: Element type "img" must be followed by
  either attribute specifications, ">" or "/>".
  In template rule with match="a" on line 61 of 02a-build_content.xslt
     invoked by built-in template rule (text-only)
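
The failure reported here is a well-formedness error in the input file, so a
line and column can also be obtained before any XSLT runs by parsing each
extracted file with Python's standard-library expat bindings. This is a rough
sketch under assumptions: the function name find_xml_error and the two sample
strings are invented for illustration (the second imitates an alt text with
embedded quotes), and expat will also reject HTML-only entities such as
&nbsp; unless they are declared.

```python
import xml.parsers.expat


def find_xml_error(text):
    """Return None if text is well-formed XML, else (line, column, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as exc:
        # lineno is 1-based; offset is the 0-based column within that line.
        message = xml.parsers.expat.ErrorString(exc.code)
        return exc.lineno, exc.offset, message
    return None


# Hypothetical samples: quotes inside the alt text break the attribute syntax.
good = '<p><img src="a.png" alt="logo"/></p>'
bad = '<p><img src="a.png" alt="the "new" logo"/></p>'

for sample in (good, bad):
    print(find_xml_error(sample))
```

Running such a check over the extracted files before the transform would give
a location for each broken file, whichever XSLT processor is used afterwards.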

lxml/etree is limited to XSLT 1.0, so I'll be switching over to Saxon, as it
will allow me to use XSLT 3.0. I'll also be taking another look at Tidy with
the added element list; I may be able to avoid some of the hoops and bring
most of this processing into XSLT by using XProc.

..dan


https://dannyvintphotography.com <https://dannyvintphotography.com/>
https://dvint.com <https://dvint.com>



On 1/17/25, 3:23 PM, "Martin Honnen martin.honnen@xxxxxx
<mailto:martin.honnen@xxxxxx>" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx
<mailto:xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>> wrote:




On 18/01/2025 00:15, dvint@xxxxxxxxx <mailto:dvint@xxxxxxxxx> wrote:
> First off, is anyone aware of a good way to merge a bunch of HTML
> techdoc pages into a single HTML so a PDF file can be generated with
> something like Prince or Weasyprint? I didn't find anything so I went
> down the following path.
>
> For this effort I decided to see what coPilot would come up with for
> this task. It has been an interesting experiment for the proof of
> concept effort but now I need to get this production ready. I was also
> initially trying to avoid using XSLT as I'm the only one on the team
> that likes XSLT and I was processing HTML that isn't well-formed.
>
> CoPilot created some Python using BeautifulSoup initially. My first
> discovery is that BeautifulSoup seems to be good for extracting
> content from the HTML, but I haven't found a way to process it like
> XSLT - maybe my mind has been warped by XSL and tools like Omnimark
> and I just don't see the path. Anyway after trying to do the job with
> BeautifulSoup, I started looking for a way to integrate XSLT and
> coPilot took me to lxml/etree.
>
> With etree I was able to start developing the core part of the
> processing. Here is the flow of the general program:
> 1) Extract the navigation/TOC from one of the HTML files. I did this
> with BeautifulSoup because the HTML is not well-formed and I just
> needed to extract a single element.
> 2) I processed all the HTML and made a new copy in a subfolder. Using
> BeautifulSoup again, I extracted the body of the HTML pages. The body
> content is well-formed, the head content isn't.
> 3) Using the extracted TOC/navigation from step 1 to drive the
> processing, I created an XSLT that took that information and then
> started processing the extracted content. I've been able to get a
> single HTML file with all of the content. I had to create unique IDs
> for all the sections and modify the cross references to change them
> from file references to links to anchors in the new file.
>
> All of that is working great until there are errors in the HTML. This
> HTML is generated with asciidoc. Occasionally, a writer will put
> quotes in an alt text for an image. This results in mangled image
> references that don't affect the visual rendering of the HTML, but
> XSLT trips up on them. Other bad asciidoc has produced other
> mangled HTML, which again isn't reported and doesn't affect the visual
> result. When the XSLT hits this I get reasonable error messages that
> tell me what the problem is when I run in oXygen. From Python I just
> get a message that tells me it failed, with the filename.
>
> Can you confirm my understanding and that there isn't a way to get the
> XSLT error and xsl:message strings I've created? Maybe Saxon in oXygen
> is providing better information than lxml can?
>
>


Perhaps post minimal but complete samples of XML and XSLT and the
"reasonable error messages" that you get. I am afraid it is not clear
what you are doing in XSLT and how that fails in Python to produce a
reasonable error message. Currently it sounds like your input is not
well-formed and the XML parser fails, although that doesn't explain why
you would be able to output an xsl:message, unless you are using xsl:try
on fn:parse-xml and use xsl:message in xsl:catch.

[demime 1.01d removed an attachment of type image/png which had a name of image.png"; x-mac-creator="4F50494D"; x-mac-type="504E4766]
