Subject: Aw: Re: [xsl] Using 'collection'
From: "Syd Bauman s.bauman@xxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 30 Aug 2015 15:31:08 -0000

Mark --

Just to make sure I'm understanding the problem right: you want to extract two particular elements (<foo> and <bar>) from a large number of large files. If I've got the problem wrong, you can stop reading here :-)

First, nothing I say is to suggest that the previous suggestions (XQuery against an XML database, or collection() with a Saxon extension or streaming to avoid memory gobble) are bad in any way. They may well be better solutions than the following. But especially if this is a one-time extraction, you may find a command-line XPath tool (or XMLsh) very helpful.

XMLsh is a pretty complete XML processing environment that works like a typical unix shell. I have not used it much, but I know it includes the capability needed here. See http://www.xmlsh.org/ if interested.

There are a variety of command-line XPath utilities available that run in your normal shell (e.g., bash). My favorite is `xmlstarlet` (invoked as `xml` on some systems), so I'll use it as an example, here inside the bash shell:

  $ cd /path/to/dir/with/8000/xml/files/
  $ xmlstarlet sel -t -m "//foo|//bar" -c "." -n *.xml > all-foo-and-bar.txt

That command says "run an XSLT program that has a template (-t) that matches all <foo> and <bar> (-m) and, for each, spit out a copy of the element you matched (-c) followed by a newline (-n)".

Notice the output file is ".txt". That's because it's not XML: it has multiple elements at the top-most level, and thus is not well-formed. If you just add a wrapper element by hand, you get XML.

It is easy to get xmlstarlet to wrap the <foo>s and <bar>s from a given file with an element, even one that gives you the filename:

  $ xmlstarlet sel -t -e file -a fn -f -b -m "//foo|//bar" -c "." -n *.xml > afab.txt

which adds "start with an element <file> (-e) that has an attribute @fn (-a) that has a value of the current input file's path (-f)". (The '-b' is a break that says "this is the end of the attribute definition".)
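
Adding the wrapper "by hand" can also be done in the shell itself, outside xmlstarlet. A minimal sketch, assuming the extracted elements are already sitting in all-foo-and-bar.txt; the root element name <extract>, the scratch directory, and the two sample lines are made up for illustration:

```shell
# Scratch directory and sample data are hypothetical stand-ins for the real
# extraction output: several elements at the top level, so not well-formed.
workdir=$(mktemp -d)
printf '%s\n' '<foo xml:id="f1">one</foo>' '<bar xml:id="b1">two</bar>' \
  > "$workdir/all-foo-and-bar.txt"

# Wrap everything in one root element (<extract> is an arbitrary name).
{
  printf '%s\n' '<extract>'
  cat "$workdir/all-foo-and-bar.txt"
  printf '%s\n' '</extract>'
} > "$workdir/all-foo-and-bar.xml"

cat "$workdir/all-foo-and-bar.xml"
```

This only yields well-formed XML if the extracted elements themselves are well-formed and carry their own namespace declarations, which copies made with -c normally do.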

But if xmlstarlet itself can add a wrapper element around all the output, I don't know how.

The program also has namespace support:

  $ xmlstarlet sel -N me=http://www.example.edu/SB/ns -N you=http://www.example.org/MW/ns -t -m "//me:foo|//you:bar" -c "." -n *.xml > afab.txt

But (AFAIK) there is no default namespace. (I.e., you're in XSLT 1.0 land here. Which is, in fact, the case -- I think xmlstarlet just converts the command line into a small XSLT 1.0 program and runs it.)

And, of course, you have full XPath 1.0 power in there. So if a <bar> might be inside a <foo>, and you don't want duplicates:

  $ xmlstarlet sel -t -m "//foo|//bar[not(ancestor::foo)]" -c "." -n *.xml > afab.txt

And if, instead of getting a copy, you just want the ID followed by a colon, a space, and the text value:

  $ xmlstarlet sel -t -m "//foo|//bar" -v "@xml:id" -o ": " -v "normalize-space(.)" -n *.xml > afab.txt

You get the idea. Just the way I used to use Perl on the command line constantly for throw-away one-liners to manipulate plain text (and still do, occasionally), I can use an XPath command-line tool to manipulate XML.

HTH.

P.S. The output usually has namespace declarations on every element, which I often don't want. Thus I often pipe the output through

  | perl -pe 's; xmlns(:[A-Za-z0-9._-]+)?=[^ \t\n\r>]+;;g;'

In fact I do that so often that I have that perl step aliased to the simple-to-write "nons" in my .bashrc file.