Re: [xsl] use xsl to generate statistics of collection of XML documents.

Subject: Re: [xsl] use xsl to generate statistics of collection of XML documents.
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Mon, 19 Oct 2009 13:20:18 -0400

At 11:54 AM 10/19/2009, you wrote:
I am wondering if someone has written some XSL to generate statistics
of a collection of XML documents.  Thus it would provide per-node
statistics (usage), and node-relationship statistics (order/nesting).

Sure, I've done this and so, I imagine, have others on this list.
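
For illustration only (this is a fresh sketch, not the stylesheet mentioned above), a survey of this kind might look roughly like the following. It assumes XSLT 2.0, and it assumes Saxon's way of resolving a directory through collection(); the docs-uri parameter, its placeholder path, and the report vocabulary (statistics, element, parent, preceded-by) are all made up for the example.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: a placeholder location. The "?select=" part of the
       URI is Saxon-specific; other processors resolve collection()
       differently. -->
  <xsl:param name="docs-uri"
      select="'file:///path/to/docs?select=*.xml'"/>

  <xsl:output method="xml" indent="yes"/>

  <!-- Named entry point, since there is no single source document. -->
  <xsl:template name="main">
    <xsl:variable name="docs" select="collection($docs-uri)"/>
    <statistics documents="{count($docs)}">
      <!-- Usage: one group per element name across all documents. -->
      <xsl:for-each-group select="$docs//*" group-by="name()">
        <xsl:sort select="current-grouping-key()"/>
        <element name="{current-grouping-key()}"
                 count="{count(current-group())}">
          <!-- Nesting: the parents this element was found in. -->
          <xsl:for-each-group select="current-group()"
              group-by="(parent::*/name(), '#root')[1]">
            <parent name="{current-grouping-key()}"
                    count="{count(current-group())}"/>
          </xsl:for-each-group>
          <!-- Order: the element immediately preceding this one. -->
          <xsl:for-each-group select="current-group()"
              group-by="(preceding-sibling::*[1]/name(), '#first')[1]">
            <preceded-by name="{current-grouping-key()}"
                         count="{count(current-group())}"/>
          </xsl:for-each-group>
        </element>
      </xsl:for-each-group>
    </statistics>
  </xsl:template>

</xsl:stylesheet>
```

With Saxon this would typically be started from the named template (the -it:main option on the command line). Dividing each parent or preceded-by count by the element's total count would turn the raw tallies into relative frequencies.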

My goal would be to generate new sample XML documents from
the statistics.  This would be similar to generating XML documents
from probabilistic production rules--but the generated documents
should pass either a DTD or Schema validator.  I do realize that there
are semantics that need to be accounted for.  That would be a future
goal.  I've tried generating sample documents from a schema using
XMLSpy--does it have some way of recording probability into the schema?

This is a tougher nut to crack, but I don't see why it couldn't be done. A stylesheet could process a report generated in XML to generate arbitrary samples. (The archives of this list would provide help with the randomizing aspect.) In doing so you would probably want to limit, for example, how deeply a result document can nest, since content models that allow recursion could otherwise expand without bound.
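
As a rough sketch of the depth-limiting idea, assuming XSLT 2.0: the stylesheet below reads a hypothetical report of the form <statistics><element name="section"><child name="title" p="1.0"/><child name="section" p="0.3"/></element></statistics>, where p is the observed probability that a child appears. The report format, the parameter names, the f:next function and its namespace are all invented for the example, and the linear congruential generator is only a stand-in for whatever randomizing technique you prefer.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:x-example:functions"
    exclude-result-prefixes="xs f">

  <!-- Assumptions: hypothetical parameters and report vocabulary. -->
  <xsl:param name="root-name" select="'section'"/>
  <xsl:param name="max-depth" as="xs:integer" select="4"/>
  <xsl:param name="seed" as="xs:integer" select="42"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:key name="element-by-name" match="element" use="@name"/>

  <!-- One step of a linear congruential generator (pseudo-random). -->
  <xsl:function name="f:next" as="xs:integer">
    <xsl:param name="s" as="xs:integer"/>
    <xsl:sequence select="($s * 1103515245 + 12345) mod 2147483648"/>
  </xsl:function>

  <!-- The source document is the statistics report. -->
  <xsl:template match="/">
    <xsl:call-template name="generate">
      <xsl:with-param name="name" select="string($root-name)"/>
      <xsl:with-param name="depth" select="1"/>
      <xsl:with-param name="s" select="$seed"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="generate">
    <xsl:param name="name" as="xs:string" required="yes"/>
    <xsl:param name="depth" as="xs:integer" required="yes"/>
    <xsl:param name="s" as="xs:integer" required="yes"/>
    <xsl:element name="{$name}">
      <!-- The depth guard stops recursive content models from
           expanding without bound. -->
      <xsl:if test="$depth lt $max-depth">
        <xsl:for-each select="key('element-by-name', $name)/child">
          <xsl:variable name="child-seed"
              select="f:next($s + position())"/>
          <!-- Use the higher bits of the seed; emit the child when
               the draw falls below its recorded probability. -->
          <xsl:if test="($child-seed idiv 65536) mod 1000
                        lt xs:double(@p) * 1000">
            <xsl:call-template name="generate">
              <xsl:with-param name="name" select="string(@name)"/>
              <xsl:with-param name="depth" select="$depth + 1"/>
              <xsl:with-param name="s" select="$child-seed"/>
            </xsl:call-template>
          </xsl:if>
        </xsl:for-each>
      </xsl:if>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
```

This treats each child as an independent optional item and ignores ordering constraints, repetition and attributes, and cutting off at max-depth can drop children that a content model requires. The output would therefore still have to be run through a DTD or schema validator, and invalid samples discarded or repaired.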

I don't know of any commercial tools that do this, quite.

If you work on this, consider also the stylesheet that would make a "pathological" variant document -- one that had at least one example of every attested construct from a set of documents. Such a tool would be very useful.


Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.      
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
