Hi Alain,
You find yourself in a typical legacy-heritage entanglement. It is this
kind of trouble that old legacy can give us and that costs companies
zillions in time & material.
See my comments below.
Cheers,
-- Abel
Alain wrote:
Personally I would prefer Saxon: XSLT 2.0 makes things so much easier.
Indeed.
But at work, the only thing that has been authorized for now is
Xalan-C. It is running in batches (jobs) on AIX machines.
The reason why they are not considering another transformation engine,
at the moment, is performance. Even for a small transformation, if you
run Saxon or Xalan-J, you will have to set up and run a JVM in your
Unix batch.
Launching the JVM has a cost in memory and time.
And even if you don't count the JVM cost, Saxon is Java code, so it
has to pay the Java overhead compared to code written in C++...
Although Saxon may perform faster on some specific templates where it
has better optimisations, on an "average" template it will still be
slower because it's Java versus C++.
You are mixing things up a bit. If you want your apps to run at
dazzling speed, you should code in C++, or ASM for that matter. But
that's not what you are doing. You are using XSLT, and that is an
interpreted language. In terms of speed, Saxon-J runs much faster than
Xalan-C. It might be that Xalan-J runs a bit slower than Xalan-C, but
only marginally so (and if it is not marginally so, then the port has
been done badly).
Yes, starting the JVM has a cost. If you have many small batches, then
that's a problem. If they are large, then it is negligible. But it is
easy to work around: let the JVM stay in memory and you are done.
But all of this is a useless discussion, of course, if "authorization" by the
AIX team is an issue. If you can use any XSLT 2.0 processor, it is
likely that your speed increases by an order of magnitude (I'm not talking
percentages, I am talking factors). The reason I dare say that is
that you seem to use many recursive templates that are called
repeatedly. If you want me to help you port it (once you've convinced
the team that using a JVM on AIX for XSLT will increase the batches' speed
by an order of magnitude) you can contact me off-list for that.
The goal is to be able to run against a base of 5 million customers, so
we have to count every second in our batch process.
Just for comparison: I've done a job for KPN (largest phone company in
Holland) that sends 8 million invoices each month in 14 batches. Each
batch processes between 2 and 4 GB of data. Using XSLT 1.0 this was a
nightmare, with a batch taking up to 14 hours. Using XSLT 2.0 this has
become a breeze and a batch runs in about one to two hours (there's more
to it than only this, of course: another process creates the AFP
files for the printer, and PDF is output for WORM tape, all at the same
time).
If you have to code for speed, there's no other option than to switch to
XSLT 2.0 and the JVM.
So they are definitely running a JVM inside the main batch,
so, what are you waiting for? Let it run Saxon as well ;)
substring(concat(myString,$padding),1,$N) to pad it correctly
In XSLT 2.0 you can do:
$myString, for $i in 1 to $FieldLen - string-length($myString) return ' '
(the comma is intentional) or anything similar. But you are right, the
concat-trick is just as easy.
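The two approaches differ in one respect that matters for fixed-length records: the concat-trick also cuts an overlong value down to the field width, while the for-expression only ever adds spaces. A minimal sketch of that difference, written in Python rather than XSLT so it can be run stand-alone; the function names are made up for illustration, and it assumes the padding string holds at least as many spaces as the field is wide:

```python
def pad_concat_trick(s: str, width: int) -> str:
    """Mimic substring(concat($s, $padding), 1, $N): pad first, then cut."""
    padding = " " * width         # assumption: $padding is at least `width` spaces
    return (s + padding)[:width]  # substring(..., 1, N) keeps the first N characters


def pad_for_expression(s: str, width: int) -> str:
    """Mimic ($s, for $i in 1 to $width - string-length($s) return ' ')."""
    # When the value is already too long the range is empty, so nothing is
    # added and, unlike the concat-trick, nothing is cut off either.
    return s + " " * max(0, width - len(s))


print(repr(pad_concat_trick("abcdefg", 5)))
print(repr(pad_for_expression("abcdefg", 5)))
```

Both give the same result for values that fit; they only part ways once a value exceeds the field width.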
I think I saw a padding function in EXSLT, but it doesn't seem to have
been made standard in 2.0.
Indeed, it is not.
Or we could probably write (or buy) "generic" patterns to transform to
fixed-length.
I have them on the shelf, I use them regularly. If you are interested.... ;)
The last bit of headache is the "UTF-8" problem!
Because fixed-length is fixed-length in *bytes*.
Aha, of course. The eternal legacy problem: back in the 70s they didn't
think international yet...
For that, with XSLT 1.0, I agree with you, I had to build insane
recursive templates to calculate the length in bytes of a
UTF-8 string.
This is practically impossible because you don't know exactly how the
serializer will serialize, i.e., when it will use &lt; and when a numeric
character reference such as &#60;.
Furthermore, Unicode allows different representations of one single
character (precomposed, or a base character plus combining marks), and
each encodes to a different number of UTF-8 bytes. In XSLT 2.0 you can
cover this with the normalization-form attribute of xsl:output; in
XSLT 1.0 you cannot, and I haven't found a note on how to treat it.
If you have XSLT 1.0 and you want to know the exact size in bytes, use
UTF-32, where every character takes exactly four bytes, and you can
(almost) be certain of the correct length (apart from &lt; / &quot; etc.).
The drawback is the almost 4-fold increase in size (you can use UTF-16
if all you need are characters from plane 0, the Basic Multilingual Plane).
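To see why the byte count depends on both the normalization form and the encoding, here is a small sketch in Python (not XSLT, so it says nothing about what a particular serializer does). The helper name encoded_length and the sample word are only illustrations:

```python
import unicodedata


def encoded_length(text: str, encoding: str, form: str) -> int:
    """Bytes needed for `text` after Unicode normalisation and encoding."""
    normalised = unicodedata.normalize(form, text)
    return len(normalised.encode(encoding))


word = "caf\u00e9"  # "café", with é as the single code point U+00E9

print(encoded_length(word, "utf-8", "NFC"))      # composed form
print(encoded_length(word, "utf-8", "NFD"))      # decomposed: e + U+0301
print(encoded_length(word, "utf-32-be", "NFC"))  # four bytes per code point, no BOM
print(encoded_length(word, "utf-16-be", "NFC"))  # two bytes per BMP code point
```

The same four visible characters take 5 bytes in UTF-8 when composed and 6 when decomposed, while UTF-32 gives a flat 16 bytes and UTF-16 gives 8, which is the predictability (and the size penalty) described above.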
[...]
or is there a function I didn't notice that can return a string length in
bytes and not in chars?
Yes and no. But there's a simple trick. And this will solve your
problems 100%, I believe, as long as you can get your bosses to move
to Saxon, because that's the only processor I found that can do it
correctly. Forget serializing + reading back as unparsed-text, use this
instead:
<xsl:output name="output-def" encoding="UTF-8" normalization-form="NFD"
    omit-xml-declaration="yes" />
<snip ... />
<xsl:variable name="serialized"
    select="saxon:serialize($my-result-tree, 'output-def')" />
<xsl:variable name="hexBin"
    select="saxon:string-to-hexBinary($serialized, 'UTF-8')" />
<xsl:variable name="length"
    select="string-length(xs:string($hexBin)) div 2" />
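The reason the last step divides by two is that a hexBinary value is written with two hex digits per byte. The same arithmetic can be checked outside Saxon with a few lines of Python; this is only an illustration of the idea, and the function name byte_length_via_hex is made up:

```python
def byte_length_via_hex(serialized: str, encoding: str = "utf-8") -> int:
    """Encode the text, render the bytes as hex digits, and halve the count."""
    hex_text = serialized.encode(encoding).hex().upper()  # two hex digits per byte
    return len(hex_text) // 2


sample = "a&lt;b \u20ac"  # an escaped '<' as it appears once serialized, plus a euro sign
print(byte_length_via_hex(sample))
print(len(sample.encode("utf-8")))  # direct byte count, for comparison
```

Both lines print the same number, because the escaped &lt; counts as four one-byte characters and the euro sign as three bytes.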
I tested it, and it works so well that it even returns different amounts
when you choose different normalization forms (i.e., composed versus
decomposed will give radically different results). It also correctly
counts &lt; as 4 characters when it is part of a text node or an
attribute. It *does not* correctly interpret cdata-section-elements on
the xsl:output definition, but that's only a minor inconvenience (and an
insignificant little bug in Saxon). It does correctly interpret
omit-xml-declaration yes/no.
You must be careful that the selected encodings match. If they don't,
the result of string-to-hexBinary will be misleading (logically so).
All-in-all, this is by far the easiest way to calculate the length of a
node in bytes. And you can use the resulting string to put into your
fixed-length system as you want:
<xsl:function name="f:padding" as="xs:string">
  <xsl:param name="string" as="xs:string" />
  <xsl:param name="width" as="xs:integer" />
  <xsl:value-of select="$string, for $i in 1 to $width -
      string-length($string) return ' ' " separator="" />
</xsl:function>
<snip ... />
<xsl:sequence select="f:padding($columnData1, 20)" />
<xsl:sequence select="f:padding($columnData2, 4)" />
<xsl:sequence select="f:padding($serialized,4096)" />
<xsl:sequence select="f:padding($columnData3, 400)" />
<xsl:sequence select="f:padding($columnData4, 2)" />
<xsl:sequence select="f:padding($columnData5, 12)" />
..... etc
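Note that f:padding counts characters with string-length(), so in a record whose widths are in bytes the number of spaces has to come from the byte length instead. A rough sketch of byte-based padding, again in Python for illustration only; the name pad_to_bytes and the choice to raise an error on overlong input are my assumptions, not something your format prescribes:

```python
def pad_to_bytes(text: str, width: int, encoding: str = "utf-8") -> bytes:
    """Right-pad `text` with spaces so the encoded field is exactly `width` bytes."""
    data = text.encode(encoding)
    if len(data) > width:
        # Assumption: an overlong field is an error. Cutting it instead could
        # split a multi-byte character in the middle.
        raise ValueError(f"field needs {len(data)} bytes, only {width} available")
    # A space is the single byte 0x20 in UTF-8 and other ASCII-compatible encodings.
    return data + b" " * (width - len(data))


field = pad_to_bytes("\u00e9t\u00e9", 8)  # "été": 3 characters, 5 bytes in UTF-8
print(field, len(field))
```

Padding by characters would have added five spaces here and produced a 10-byte field; padding by bytes adds three.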
Convinced that things *can* be easier in XSLT 2.0?
And I have only shown you very few XSLT 2.0 specific things. Your major
gain from switching to Saxon is that you can use the saxon:serialize()
function. Otherwise it will be quite hard to guarantee that your
recursive templates are correct (I think that it is not so hard to
prove that they are incorrect, unless you really rewrite the
serialization algorithm of your processor in XSLT 1.0).
You came to the same conclusion, your advice being to separate
the variable part (e.g. HTML) in a temporary file, even if your
templates are smarter, and to put every piece together again.
See above: using the right tools for the job, you will not need these
hard-to-maintain solutions anymore.
But as I'm on holidays now, I'll have to check the project
status when I'm back in September!
Enjoy your holidays!
Cheers,
-- Abel Braaksma