Re: [xsl] Running XSLT from Python

Subject: Re: [xsl] Running XSLT from Python
From: "dvint dvint@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 20 Jan 2025 16:34:40 -0000
Thanks for the update. Until this last year I didn't really have anywhere I
could have used xproc, so I didn't really keep up with what was going on.

Sent from my Verizon, Samsung Galaxy smartphone

-------- Original message --------
From: "Wendell Piez wapiez@xxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: 1/20/25 6:15 AM (GMT-08:00)
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] Running XSLT from Python

Dan,

This is XSL-List and there are plenty of XProc links to be found.

However, your experience warrants a brief summary of the big differences
between XProc 3 (today) and XProc 1.0 (ten years ago and more):

While XProc 1 could be made to work effectively, there was only one processor
(tmk) and it was cumbersome for end users or 'home developers' to set up and
run under distribution models of the day. This was at a period when XSLT
pipelining was also already pretty common using batch scripts, GNU make, Ant
etc. etc. So people were getting the job done.

XProc 1 handled only XML, meaning if you had HTML, JSON or other you had to do
it the hard way anyway - all the mishmash around HTML handling, for example,
is now built into XProc. This is a very nice problem for it to handle for us,
especially since we also have JSON routinely (etc.) and it does all that. It
makes XSLT work on anything (in principle and surprisingly well in fact), not
just XML. (Amaze your JSON friends when you XSLT their JSON!)

XProc 3.0 supports a more concise and dare one say "elegant" syntax, which can
be lightweight when the problem is lightweight. XProc 1.0 was always a beast
to get working, especially the first time.

XProc 3.0 embeds XPath 3.1 and works well with latest-generation XQuery and
XSLT.

There are already useful libraries for XProc 3.0/3.1 for some specialized
needs.

I hope that isn't too far off topic ...!

Thanks, Wendell

On Sun, Jan 19, 2025 at 1:22 PM dvint dvint@xxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

Personally, I tried xproc
about 10yrs ago. I was successful after much stumbling. I was building in
oxygen and setting it up to run within that environment. I documented the
setup and tried to have a coworker run the xproc. For some reason I was never
able to get it to work in their environment. I don't remember if it worked
inconsistently or just never worked.

Most of my work since hasn't lent itself to that approach so I haven't had a
reason. I suppose some of the work I've done in the last year or so would have
worked with this approach but it was one time use and I just used some shell
script for the automation or oxygen transformation.

Someone mentioned xproc handles html, I might give it a try with my current
problem after I experiment a little with Saxon and Python...

dan

Sent from my Verizon, Samsung Galaxy smartphone

-------- Original message --------
From: "Wendell Piez wapiez@xxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: 1/19/25 9:56 AM (GMT-08:00)
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] Running XSLT from Python

Hello Dan,

What you describe (devs skittish about XSLT) is a widespread
problem, and a severe one.

From what I have observed, XProc adoption faces headwinds. Keep in mind that a
common use case for XProc is as a 'shell' around XSLT-based operations and
workflows. (Things like merging and indexing.) But the people who are already
using XSLT have something that works well enough for them today, and they
would rather not think about it.

And to the extent there are pressures to modernize and upgrade systems, teams
would rather move away from XML/XSLT altogether, if only because it terrifies
them. (Maybe I exaggerate or maybe I don't.)

This faces us with the paradox of no one trying XSLT in new environments and
architectures because no one is trying XSLT in new environments and
architectures.

Oh - I should qualify - *in public* - we don't know what people are doing who
are not talking about it.

The biggest benefit of XProc 3.0 in my view is that it promises sustainability
(assuming we do our work) beyond the sustainability of a particular toolchain.

But even bigger than 'the biggest benefit' at this moment, it is also possible
to build and deploy XDM-based processes (XProc with embedded XSLT/XQuery as
needed) that are deterministic, verifiable, and testable. Rigorous testability
is not a requirement for every system at every level. But if some kinds of
systems require rigorous testability, the technology as a whole needs to be
able to support it.

The accessibility (wrt both openness and sustainability) and testability of
XProc and XSLT stand in marked contrast to the kinds of black box processes
that are being entrusted these days with various kinds of vital and
not-so-vital operations.

Yet at the same time, outdated information and myths persist and even
late-generation XSLT, XQuery and XProc are regarded as not worth the trouble,
while developers think about which hot new technology they should be looking
at.

It seems to me there are opportunities here for those bold enough to bear down
against the grain.

https://github.com/usnistgov/oscal-xproc3

Regards, Wendell

On Sat, Jan 18, 2025 at 8:26 PM dvint dvint@xxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

I hadn't but part of my
problem is the team is not xml aligned any more. I was trying to avoid xslt by
using Python when that seemed to fail me.

Sent from my Verizon, Samsung Galaxy smartphone

-------- Original message --------
From: "Wendell Piez wapiez@xxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: 1/18/25 2:44 PM (GMT-08:00)
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] Running XSLT from Python

Dan,

Have you considered XProc 3.0? It is able
to read HTML the same as it does XML. While bad inputs are bad inputs, it is
also good for detecting and/or repairing them. It can embed and use Schematron
and XSLT; you might also find many of the things you need to do are
achievable by XProc alone.

Two XProc implementations are now available, Morgana XProc III, and XML
Calabash 3.0.

More references can be provided --

Regards, Wendell

On Fri, Jan 17, 2025 at 6:16 PM dvint@xxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

First off, is anyone aware of
a good way to merge a bunch of HTML
techdoc pages into a single HTML so a PDF file can be generated with
something like Prince or WeasyPrint? I didn't find anything, so I went
down the following path.

For this effort I decided to see what Copilot would come up with for
this task. It has been an interesting experiment for the proof of
concept effort but now I need to get this production ready. I was also
initially trying to avoid using XSLT as I'm the only one on the team
that likes XSLT and I was processing HTML that isn't well-formed.

Copilot created some Python using BeautifulSoup initially. My first
discovery is that BeautifulSoup seems to be good for extracting content
from the HTML, but I haven't found a way to process it like XSLT - maybe
my mind has been warped by XSL and tools like Omnimark and I just don't
see the path. Anyway, after trying to do the job with BeautifulSoup, I
started looking for a way to integrate XSLT and Copilot took me to
lxml/etree.

With etree I was able to start developing the core part of the
processing. Here is the flow of the general program:
1) Extract the navigation/TOC from one of the HTML files. I did this
with BeautifulSoup because the HTML is not well-formed and I just needed
to extract a single element.
2) I processed all the HTML and made a new copy in a subfolder. Using
BeautifulSoup again, I extracted the body of the HTML pages. The body
content is well-formed, the head content isn't.
3) Using the extracted TOC/navigation from step 1 to drive the
processing, I created an XSLT that took that information and then
started processing the extracted content. I've been able to get a single
HTML file with all of the content. I had to create unique IDs for all
the sections and modify the cross references to change them from file
references to links to anchors in the new file.
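
The ID and cross-reference rewriting in step 3 could be sketched roughly as
below. This is only an illustration, not the code from the post: the function
names, the "--" separator and the file names in the usage note are all
assumptions, and the real work in the post was done in XSLT rather than Python.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def section_id(filename, fragment=""):
    """Build an ID that stays unique after merging: file stem plus in-page ID."""
    stem = PurePosixPath(filename).stem
    return f"{stem}--{fragment}" if fragment else stem


def rewrite_href(href, current_file, merged_files):
    """Turn a link to a merged page into an in-document anchor.

    External links and links to files that are not part of the merge
    are returned unchanged.
    """
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        return href  # http:, mailto:, etc. are left alone
    # A bare "#frag" link points into the page currently being processed.
    target = PurePosixPath(parts.path or current_file).name
    if target not in merged_files:
        return href  # e.g. images or downloads that are not merged pages
    return "#" + section_id(target, parts.fragment)
```

With hypothetical pages `install.html` and `usage.html` in the merge set,
`rewrite_href("usage.html#options", "install.html", {"install.html", "usage.html"})`
yields `#usage--options`, provided the merged sections carry IDs built with the
same `section_id` rule.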

All of that is working great until there are errors in the HTML. This
HTML is generated with asciidoc. Occasionally, a writer will put quotes
in an alt text for an image. This results in mangled image references
that don't affect the visual rendering of the HTML, but XSLT trips up
on this. Other bad asciidoc has created some other mangled HTML
which again isn't reported and doesn't affect the visual result. When
the XSLT hits this I get reasonable error messages that tell me what the
problem is when I run in oXygen. From Python, I only get a message that
tells me it failed, along with the filename.
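
One way to surface this kind of breakage before the XSLT runs is a
well-formedness pre-check on each extracted body. The sketch below uses only
the Python standard library and assumes the extracted content is meant to be
well-formed XML; the function name and the sample markup are made up for
illustration, and real asciidoc output may be mangled differently (HTML named
entities such as `&nbsp;` would also be rejected by this check).

```python
import xml.etree.ElementTree as ET


def check_well_formed(name, markup):
    """Return None if markup parses as XML, otherwise a readable error string."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        # str(exc) already includes the line and column of the first problem.
        return f"{name}: {exc}"
    return None


# Hypothetical sample: unescaped quotes inside an alt attribute.
sample = '<div><img src="a.png" alt="a "quoted" word"/></div>'
problem = check_well_formed("page1.html", sample)
if problem:
    print(problem)
```

Running a check like this over every file in the subfolder gives a list of
files and positions to fix at the source, instead of a single failure partway
through the merge.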

Can you confirm my understanding that there isn't a way to get the
XSLT error and xsl:message strings I've created? Maybe Saxon in oXygen
is providing better information than lxml can?
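
For reference, the lxml documentation describes an `error_log` property on the
compiled `etree.XSLT` object that collects xsl:message output and libxslt
errors from the last run. A minimal sketch of reading it follows; the
stylesheet and the function name are hypothetical, and the messages will be
terser than what Saxon reports, since lxml wraps libxslt (XSLT 1.0).

```python
from lxml import etree

# Hypothetical stylesheet for illustration: reports each image it copies.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <merged>
      <xsl:for-each select="//img">
        <xsl:message>checking image: <xsl:value-of select="@src"/></xsl:message>
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </merged>
  </xsl:template>
</xsl:stylesheet>"""


def transform_with_log(xml_bytes):
    """Run the stylesheet; return (result or None, list of log lines)."""
    transform = etree.XSLT(etree.XML(STYLESHEET))
    try:
        source = etree.fromstring(xml_bytes)
    except etree.XMLSyntaxError as exc:
        # Input that is not well-formed never reaches the stylesheet.
        return None, [f"parse error: {exc}"]
    failure = []
    try:
        result = transform(source)
    except etree.XSLTApplyError as exc:
        result = None
        failure = [f"transform error: {exc}"]
    # error_log holds xsl:message text and processor errors from this run.
    log = [f"line {entry.line}: {entry.message}" for entry in transform.error_log]
    return result, log + failure
```

Calling `transform_with_log(b'<body><img src="a.png"/></body>')` returns the
result tree together with the logged xsl:message text, which can then be
printed or written next to the filename that the script already reports.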

I'm looking into switching to Saxon HE to see if that helps.

..dan


-- ...Wendell Piez... ...wendell -at- nist -dot- gov......wendellpiez.com...
...pellucidliterature.org... ...pausepress.org......github.com/wendellpiez...
...gitlab.coko.foundation/wendell...



XSL-List info and archive

EasyUnsubscribe
(by email)






