Re: [xsl] Generic stylesheet to flatten XML hierarchy

Subject: Re: [xsl] Generic stylesheet to flatten XML hierarchy
From: Sara Mitchell <samitchell6@xxxxxxxxx>
Date: Mon, 7 Dec 2009 10:49:01 -0800 (PST)
I know that this may not work in every case. Basically the rules are: 

* every attribute on an element becomes a column in a row
* every element that has data content becomes a column in a row
* repeating elements define a row
  -- with the further restriction that if there are hierarchical levels of
  repeating elements (nested), the final lowest level of repeating elements
  defines a row and ancestor levels get repeated
* hierarchical relationships get flattened
* siblings at any level that don't repeat get repeated in each row
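
To make the rules above concrete, here is a rough sketch of them in Python
rather than XSLT, only because it is quick to run and check. It is not the
stylesheet under discussion. The function name flatten and the
path-qualified column names (for example rss-channel-item-title) are
assumptions made for illustration, and taking the cross product when two
different element names repeat at the same level is an arbitrary choice
that the rules do not settle. Tail text in mixed content is ignored.

```python
import xml.etree.ElementTree as ET
from itertools import product


def flatten(elem, path=""):
    """Return a list of rows (dicts of column name -> value) for elem.

    Hypothetical helper: column names are built from the element path,
    which is an assumption, not something the rules specify.
    """
    name = f"{path}-{elem.tag}" if path else elem.tag

    # Rule: every attribute becomes a column.
    base = {f"{name}-{attr}": value for attr, value in elem.attrib.items()}

    # Rule: every element with data content becomes a column.
    text = (elem.text or "").strip()
    if text:
        base[name] = text

    # Group child elements by tag so repeating names can be detected.
    groups = {}
    for child in elem:
        groups.setdefault(child.tag, []).append(child)

    # Start with one row for this element, then multiply it by the rows
    # of each child group. A non-repeating child contributes its columns
    # to every row; a repeating child contributes one or more rows per
    # occurrence, so ancestor columns get repeated in each of them.
    rows = [base]
    for children in groups.values():
        child_rows = [row for child in children for row in flatten(child, name)]
        rows = [{**left, **right} for left, right in product(rows, child_rows)]
    return rows


if __name__ == "__main__":
    sample = ET.fromstring(
        '<rss version="2.0"><channel><title>News</title>'
        '<item id="1"><title>A</title></item>'
        '<item id="2"><title>B</title></item>'
        '</channel></rss>'
    )
    for row in flatten(sample):
        print(row)
```

Because the recursion multiplies rows on the way back up, the lowest level
of repeating elements ends up defining the rows, as in the third rule. Rows
can still come out with different column sets when occurrences of a
repeating element differ in their attributes or children.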

I'm going to try one last possible solution using keys and XPath, I
think, and if that does not work I may move on to Michael Kay's suggestion of
a meta-stylesheet. 

Thanks to everyone for the ideas.

--- On Fri, 12/4/09, C. M. Sperberg-McQueen <cmsmcq@xxxxxxxxxxxxxxxxx> wrote:

> From: C. M. Sperberg-McQueen <cmsmcq@xxxxxxxxxxxxxxxxx>
> Subject: Re: [xsl] Generic stylesheet to flatten XML hierarchy
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Cc: "C. M. Sperberg-McQueen" <cmsmcq@xxxxxxxxxxxxxxxxx>
> Date: Friday, December 4, 2009, 6:35 PM
> On 4 Dec 2009, at 12:37 , Sara Mitchell wrote:
>
> > ...
> > 
> > With input like this:
> > <rss ...some attributes>
> >   ...
> > </rss>
> > 
> > I would like XML output like this:
> > 
> > <root>
> > <row>
> >  <rss-attr1>value</rss-attr1>
> > ...
> > </row>
> > <row>...again rss attributes, channel attributes, non-repeating
> > children of channel followed by fields for second item </row>
> > ...more rows ...
> > </root>
>
> I'm having trouble seeing exactly what should be going on here,
> because I can't see anything in your sample input (elided here
> without loss of generality) that gives rise to the name
> 'rss-attr1'.  It's hard to correlate input with output if
> all the values are spelled 'value' and some details in one
> half of the input / output pair correspond to ellipses in
> the other.
> 
> 
> 
> > 
> > This example is for a single level of repeating
> > descendants, but my solution has to be able to handle any
> > level of repeating descendants. Moreover, the stylesheet
> > has no knowledge of the structure of the input document.
> 
> My very strong gut reaction here is to suspect that such an
> absolutely generic transformation is unlikely to produce helpful
> (or: meaningful) output in some unknown but possibly large
> percentage of cases.
>
> Perhaps the transformation you have in mind is intended to
> work generically on all XML documents that follow certain
> conventions in structuring the information they represent?
> Can you say what those conventions are?
>
> Perhaps you have a very clear understanding of the transform you
> want, but so far this discussion has not elicited a clear
> description from you.  The following questions are intended to
> try to elicit some more clarity.
> 
> In a generic XML document, there are elements with parents,
> left and right siblings, children, descendants, and
> attributes.
>
> In a generic table, there are rows and columns.  Each row but
> the first or last has a predecessor and a successor, and ditto
> each column but the first or last.
>
> What is the relationship between the elements, attributes,
> containment and sibling relations in the input, and the
> rows and columns and their sequence relations in the
> output?
> 
> Given your output table, should I expect to have all the
> information present in the XML?  Can I recreate the XML from
> your table?
>
> Do all your rows have the same number of columns?  (I suppose
> they must, or it's not much of a table, but perhaps I'd
> better check?)
>
> When does an XML document give rise to a single row in the output
> table?  When does it give rise to exactly three rows?  When
> does the resulting table have exactly one column?
> 
> What information do the labels of columns convey?
> 
> What tables would you want to produce for the documents
>
> (1) <e/>
> (2) <e><e n="23"/><e n="45">Pax</e></e>
> (3) <table>
>     <row a="1" b="2" c="34">998</row>
>     <row a="2" b="22" c="34">999</row>
>     <row a="3" b="2" c="3">1000</row>
>     <row a="4" b="24" c="">1001</row>
>     <row a="5" x="Viva Villa!" c="34">998</row>
>     </table>
> (4) <p>This isn't mixed content, because the schema says I'm a string.</p>
> 
> ?
> 
> 
> > 
> > I have a solution that works ok by traversing the
> > input document in doc order -- but it does not handle the
> > siblings of repeating nodes that are not themselves
> > repeating.
> >
> > I have thought of doing this the opposite way, get a
> > key of all repeating nodes and process only those at the
> > lowest depth to generate rows.  I haven't actually
> > written the logic.
> 
> I gather that the tables you want to generate have something
> to do with multiple occurrences of elements with the same name.
> Does adjacency matter, or would
> 
> 
> <a><b/><b/><b/><c/><c/><c/></a>
> 
> be treated differently from
> 
> 
> <a><b/><c/><b/><c/><b/><c/></a>
> 
> ?  (Assume if you like, for purposes of discussion,
> that the b and c and a elements all have interesting attributes.)
> 
> > 
> > Any better ideas would be welcome.
> 
> Your example reminds me of the contortions I've seen people
> go to trying to represent structured information in RFC 822
> attribute-value pairs.  So the best idea I have at the moment
> is:  Save yourself!  Don't do it!
>
> But probably you know exactly what you're doing, there is a perfectly
> reasonable algorithm for what you want, and I just haven't understood.
>
> hth
> 
> --****************************************************************
> * C. M. Sperberg-McQueen, Black Mesa Technologies LLC
> * http://www.blackmesatech.com
> * http://cmsmcq.com/mib
> * http://balisage.net
> ****************************************************************
