RE: [xsl] How to mark every 5th output record.

From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Tue, 11 Mar 2008 10:54:49 -0400
Patrick,

Just because your logic is currently at 2900 lines of code doesn't mean it has to be. In fact, if its approach to processing is as imperative as what you've suggested you "should" be able to do, chances are reasonably good that someone who's familiar and comfortable with the XSLT processing model could reduce it radically by refactoring.

Nor is pipelining (the term of art for processing your output as input) inherently such a bad thing. Indeed, in XSLT 2.0, it can be done transparently in one stylesheet. Depending on your architecture and implementation, it need not be inefficient.
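
To illustrate what pipelining within one XSLT 2.0 stylesheet can look like, here is a minimal sketch. The input shape (a "records" element containing "record" children), the "line" wrapper element, the "build" mode and the "-----" marker are all assumptions made up for the example; the real record logic would replace the stand-in template. Note that the temporary tree does occupy memory, so this shows the technique and does not claim it suits a constrained platform.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed input: <records><record>...</record>...</records> -->
  <xsl:template match="/">
    <!-- Pass 1: build the output records as a temporary tree. -->
    <xsl:variable name="lines">
      <xsl:apply-templates select="records/record" mode="build"/>
    </xsl:variable>
    <!-- Pass 2: write each line, adding a marker after every 5th. -->
    <xsl:for-each select="$lines/line">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
      <xsl:if test="position() mod 5 = 0">
        <xsl:text>-----&#10;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>

  <!-- Hypothetical stand-in for the real, complex record logic. -->
  <xsl:template match="record" mode="build">
    <line><xsl:value-of select="normalize-space(.)"/></line>
  </xsl:template>
</xsl:stylesheet>
```

In XSLT 1.0 the same variable would be a result tree fragment, and reading it back would need a vendor extension function such as exsl:node-set(), where the processor supports it.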

As Mike said, the details of what you are trying to do are critical. For one thing, if your logic is complex, that's an indication that the process you are designing involves upconversion. If so, you should tell us right off whether you can use XSLT 2.0 or whether you are limited to 1.0. XSLT 1.0 wasn't designed for upconversion (its general assumption is that the dataset is clean and optimally structured and ordered going in, and transformations are geared mainly to presentation, not data processing), which isn't to say that it can't be done. Rather, it's to say that when consulting the experts on how to do things, you will constantly hear the refrain "It's easier in 2.0".

As you have learned, XSLT is declarative and functional, not imperative. Variables are variables in the sense they are in algebra -- values defined in relation to other values in a processing context -- not just labels for memory registers, which you can reassign at will (a dangerous and destructive practice, since it means that any bug is at risk of infecting parts of the system far beyond where it does its immediate damage). For you at this moment, this fact may present an impediment to using XSLT well, but it's not really a problem: once you know how to take advantage of it, it offers numerous advantages at many layers of the system, including yours, especially as complexity scales up.
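
As a sketch of how a running count can be carried without reassigning a variable, the usual functional idiom is a recursive named template that passes the count along as a parameter. Everything specific here is an assumption for illustration: the "records"/"record" input shape, the "skip" attribute that decides whether a record is written, the template and parameter names, and the "-----" marker.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/records">
    <xsl:call-template name="emit">
      <xsl:with-param name="todo" select="record"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Processes the first node of $todo, then recurses on the rest.
       $written is the number of lines written so far. -->
  <xsl:template name="emit">
    <xsl:param name="todo"/>
    <xsl:param name="written" select="0"/>
    <xsl:if test="$todo">
      <xsl:variable name="rec" select="$todo[1]"/>
      <!-- Hypothetical rule: records flagged skip="yes" are not written. -->
      <xsl:variable name="keep" select="not($rec/@skip = 'yes')"/>
      <!-- number() turns the boolean into 1 or 0. -->
      <xsl:variable name="count" select="$written + number($keep)"/>
      <xsl:if test="$keep">
        <xsl:value-of select="$rec"/>
        <xsl:text>&#10;</xsl:text>
        <xsl:if test="$count mod 5 = 0">
          <xsl:text>-----&#10;</xsl:text>
        </xsl:if>
      </xsl:if>
      <xsl:call-template name="emit">
        <xsl:with-param name="todo" select="$todo[position() &gt; 1]"/>
        <xsl:with-param name="written" select="$count"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

The count is never updated in place; each call simply sees a new value. Be aware that on a large input this recurses once per record, so a processor that does not optimize tail calls may run out of stack, and re-selecting the rest of the list at each step adds its own cost.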

I know this is a defense, not a solution. But if your platform resources are really so tight, maybe you need something with a different processing model than XSLT (maybe a SAX filter or series of them, or a Perl or Python script), at least for part of your problem. If things are that difficult, there's a reason. Either you are trying to use the language for something it wasn't designed for and doesn't do well, or you are approaching it wrong. Or both. My guess, from your description, is that the specification itself is a monster, and that taming it would be difficult in any language.

As far as that goes, in general, there's filtering, grouping and sorting. Sometimes any or all of these require additional processing to determine the criteria for them. Also, sometimes sorting has to happen before grouping (that is, logically prior, if not necessarily temporally), and sometimes after. Both are reordering or rearranging operations (as is filtering, strictly speaking).

In my experience, the sequence (1) data analysis followed by (2) filtering followed by (3) reordering has made sense. Often (1) and (2) can be collapsed. If (1) is done well, usually (3) can be done in one pass. Your requirement is tricky because you want grouping to occur after filtering and sorting, which is often (though not always) impractical in one pass.
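
For the simple case where each output record comes from exactly one input node, filtering, sorting and marking every 5th can be done in one pass even in XSLT 1.0, because inside xsl:for-each the position() function reflects the filtered, sorted order and not the document order. The sketch below assumes a hypothetical input with "record" elements carrying "skip" and numeric "key" attributes, and a "-----" marker; it does not cover records that are merged or deferred.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/records">
    <!-- Filter in the select expression, reorder with xsl:sort. -->
    <xsl:for-each select="record[not(@skip = 'yes')]">
      <xsl:sort select="@key" data-type="number"/>
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
      <!-- position() counts within the sorted, filtered list. -->
      <xsl:if test="position() mod 5 = 0">
        <xsl:text>-----&#10;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Once one output line no longer maps to one selected node, this shortcut stops working, which is where a second pass or the XSLT 2.0 facilities come in.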

As Mike indicated before, XSLT 2.0 provides features that make the facilities needed for (1) available, in the general case, during later operations. This frequently reduces the need for pipelining, since analysis can be done on the fly. On the other hand, when you need to pipeline, XSLT 2.0 makes that easier too.

Cheers,
Wendell

At 10:03 AM 3/11/2008, you wrote:
As I said the rules under which I process my list are quite complex. So much
so that my XSLT stylesheet is over 2900 lines of code (and yes, that's just
nuts).

Different records (and types of records) are processed using different
rules, other records are deferred for later processing, others merged
together to produce a final one, some are skipped altogether, some complex
operations are performed on yet another set of records, etc. The output file
format is crazy, and the spec for the file format is about as obscure and
obtuse as I have ever seen in 20 years programming.

But in the end, I end up with a text file that has 1 line per "output
record", but these "output records" have almost nothing to do with the input
records, and I need to separate them with a marker every 5th.

I can't really do (position() mod 5) on my original input data because it
has no correlation to the order of the output records, and it's impossible
to create an expression that would select them properly in the order I need.

Is my only option to create another tree that contains all of my output
record results, and then iterate over that tree once again, and output the
same data verbatim, only this time insert a marker every 5th?

Gheesh, talk about using a tank to shoot a bird.

I'm trying to avoid doing this for other reasons:

1) My input data set is quite large.
2) The XSLT processor is running on an embedded platform with limited
memory.
3) I'm already paying the price of doing a copy of the data in an earlier
pass, I'd like to not pay the price again.

Is there really, really, really _any_ other way of doing this without making
a 3rd copy of my data set?


======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
