Re: [xsl] Verifying large XSL transform output

Subject: Re: [xsl] Verifying large XSL transform output
From: Paul Tyson <phtyson@xxxxxxxxxxxxx>
Date: Tue, 11 Feb 2014 20:35:09 -0600

Hi Matthew,

Schematron is your best friend for validating XML content.

1. Sketch out some Schematron rules that would validate your output
(with reference to the source).
2. Write one or more XSLT stylesheets to generate Schematron rules from
your input, to validate specific content in the target document.
3. Run your production transformation and your Schematron rule generator
over the source documents.
4. Compile the Schematron files into XSLT.
5. Run the Schematron XSLT files against your output to get SVRL
(Schematron Validation Report Language) files (or whatever format you
like).
6. (Optional) Transform SVRL files to a readable report form (e.g.
HTML).
7. (Optional) Put it all together in an XProc pipeline and automate!
8. Iterate steps 1-6 until there are no further improvements to be made
and you are satisfied with the validation.
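
To make steps 1 and 2 concrete, here is a rough sketch of a rule
generator. It is written in Python rather than XSLT only so that it
can be run stand-alone; the idea carries over. The source layout
(record/@id, field/@name) and the target layout (item/@ref, prop/@key)
are invented for illustration and are not from any real format, and
build_schema and quote are hypothetical names.

```python
import xml.etree.ElementTree as ET

SCH = "http://purl.oclc.org/dsdl/schematron"  # ISO Schematron namespace


def quote(value):
    """Wrap value as an XPath string literal using whichever quote it lacks."""
    if '"' not in value:
        return '"' + value + '"'
    if "'" not in value:
        return "'" + value + "'"
    # Values with both quote kinds need concat(); left out of this sketch.
    raise ValueError("value contains both quote characters")


def build_schema(source_xml):
    """Generate Schematron rules from a (hypothetical) source document.

    Assumes every record has an id attribute and every field a name.
    """
    ET.register_namespace("sch", SCH)
    schema = ET.Element("{%s}schema" % SCH)
    pattern = ET.SubElement(schema, "{%s}pattern" % SCH)
    # Rule on the root element: every source record must appear in the output.
    presence = ET.SubElement(pattern, "{%s}rule" % SCH, context="/*")
    for rec in ET.fromstring(source_xml).iter("record"):
        rid = rec.get("id")
        check = ET.SubElement(presence, "{%s}assert" % SCH,
                              test="//item[@ref=%s]" % quote(rid))
        check.text = "Record %s is missing from the output." % rid
        # Rule on the matching output item: each field value must match exactly.
        rule = ET.SubElement(pattern, "{%s}rule" % SCH,
                             context="item[@ref=%s]" % quote(rid))
        for field in rec.iter("field"):
            name = field.get("name")
            test = "prop[@key=%s] = %s" % (quote(name), quote(field.text or ""))
            check = ET.SubElement(rule, "{%s}assert" % SCH, test=test)
            check.text = "Field %s of record %s differs from the source." % (name, rid)
    return ET.tostring(schema, encoding="unicode")
```

Calling build_schema on the text of one source file returns a
Schematron schema as a string, which then goes through steps 4 and 5
as usual. The root-level rule is there because a rule whose context
is a missing item never fires, so absence has to be tested from a
node that is known to exist.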

Have fun.

Regards,
--Paul

P.S. I did this a while back to validate several thousand XML documents
that were generated by a sausage-grinder conversion (not XSLT) from flat
files (think spreadsheets). The requirement was to check
hundreds--sometimes thousands--of data fields in each file for exact
match with input. The process worked very well. There are a few gotchas
to watch out for. You'll have to be careful with quoting, variables,
curly braces, namespaces, and XPath expressions since you're writing
XSLT to generate a file (in Schematron language) that will itself be
turned into XSLT. Character entities may also be a problem, so you'll
have to preserve those through all the transformation steps. But once
you get the hang of it, it goes very well.
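
Two of those gotchas can be shown in a few lines. This is only a
sketch, in Python for brevity, and the function names xpath_literal
and avt_escape are made up. It assumes the compiled rules are
evaluated as XPath 1.0, which has no escape syntax inside string
literals, so a value containing both quote characters has to be
assembled with concat(). It also assumes the test expression passes
through an XSLT attribute value template on the way, where literal
curly braces must be doubled.

```python
def xpath_literal(value):
    """Return an XPath 1.0 expression that evaluates to the string value."""
    if "'" not in value:
        return "'" + value + "'"
    if '"' not in value:
        return '"' + value + '"'
    # Both quote kinds present: split on apostrophes and glue the
    # pieces back together with concat(), quoting each apostrophe
    # inside double quotes.
    pieces = []
    for i, part in enumerate(value.split("'")):
        if i:
            pieces.append('"\'"')
        if part:
            pieces.append("'" + part + "'")
    return "concat(" + ", ".join(pieces) + ")"


def avt_escape(text):
    """Double curly braces so an attribute value template keeps them literal."""
    return text.replace("{", "{{").replace("}", "}}")
```

For example, xpath_literal applied to the five characters a'b"c gives
concat('a', "'", 'b"c'), and avt_escape("{x}") gives "{{x}}".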

P.P.S.
I'm not in the regular business of writing XSLT to transform documents,
but it seems this approach would be a good way to implement test-driven
stylesheet development. You could co-develop the validation rules and
the transformation, and test as you go, using real input.
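
For the "test as you go" part, something like the following could
serve as the pass/fail check after each run. It is a sketch in Python
(standard library only) that reads a Schematron validation report and
lists the failed assertions; failed_asserts is a hypothetical name,
and the report is assumed to use the standard
http://purl.oclc.org/dsdl/svrl namespace.

```python
import xml.etree.ElementTree as ET

SVRL = "{http://purl.oclc.org/dsdl/svrl}"  # namespace of the report elements


def failed_asserts(report_xml):
    """Return (location, message) pairs for each failed assertion in a report."""
    root = ET.fromstring(report_xml)
    failures = []
    for failed in root.iter(SVRL + "failed-assert"):
        message = failed.findtext(SVRL + "text", default="")
        # Collapse runs of whitespace so messages compare and print cleanly.
        failures.append((failed.get("location"), " ".join(message.split())))
    return failures
```

An empty list means the output passed every generated rule; anything
else tells you where in the output to look.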

Best,
--Paul

On Tue, 2014-02-11 at 10:36 -0500, Matthew Stoeffler wrote:
> I have a question about verifying XSL transform output.  I'm moving
>  somewhat large XML docs --digital books-- from one format into another
>  (archival) format, with lots of pulling and pushing.  The source
>  format is, euphemistically speaking, 'interesting', and not the kind
>  of thing you'd necessarily want to emulate: lots of too-loose content
>  models granting multiple structural variations for the same
>  intellectual object; much cross-document referencing via PIs, etc. 
>  The transform scripts are large.  I know my results are valid in the
>  new format; I'm now trying to confirm that I'm capturing all the
>  content.  I've done analysis of ID's from source to output.  I have
>  contemplated ways of counting text nodes, or text string length, as
>  another possible approach.  I'd love some feedback from the list on
>  metrics others have tried and what seems to work best.  Thanks in
>  advance.
> 
> m./
