[xsl] finding a minimum set of references

Subject: [xsl] finding a minimum set of references
From: Graydon <graydon@xxxxxxxxx>
Date: Tue, 8 Jan 2013 20:37:30 -0500
So I have a lot (~600) tree elements that look like:

    <tree parent="attribution">
        <count count="1"/>
        <locations>
            <loc area="loiprop" cite="2010-07-prop10"/>
        </locations>
    </tree>
    <tree parent="block">
        <count count="10"/>
        <locations>
            <loc area="loiprop" cite="2004-09-bud2004"/>
            <loc area="loiprop" cite="2010-07-prop10"/>
            <loc area="loiprop" cite="2010-08-allprop"/>
        </locations>
    </tree>
    <tree parent="block">
        <key>section</key>
        <count count="6689"/>
        <locations>
            <loc area="CRCc945-en" cite="2606"/>
            <loc area="CRCc945-en" cite="402"/>
            <loc area="CRCc945-en" cite="6804"/>
            <loc area="CRCc945-en" cite="8301"/>
            <loc area="CRCc945-en" cite="8308.1"/>
        ....
        </locations>
        <child>section</child>
    </tree>

The @area,@cite pairs represent documents with the locations of particular patterns (via the @parent and the child elements of the tree elements) of documentation content, one tree element per pattern.  (So there are lots of tree/@parent="block" elements, etc.)

What's wanted is the minimum number of documents that contain _all_ the patterns, for output testing purposes.

So I need to produce the, or at least _a_, shortest list of <loc/> elements so that every (all ~600) tree element contains at least one loc element on the list.

This has turned out to be much less straightforward than I thought.  I can't shake the feeling that there's a grouping solution, but I'm not seeing it if there is.  So I'd appreciate any algorithm hints anybody has got; unfortunately, efficiency is a concern.

Demonstration that this is really an NP-complete problem also grateful accepted!

Thanks!
Graydon

Current Thread