Re: [xsl] help with random numbers

Subject: Re: [xsl] help with random numbers
From: "C. M. Sperberg-McQueen cmsmcq@xxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 17 Nov 2022 02:25:54 -0000
Thanks for your reply.

"Michael Kay michaelkay90@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
writes:

> ...
>
> Worth noting in passing that there's a bug in the spec: it says that
> every xs:double value in the range 0 to 1 should be equally likely to
> appear, but that's not what you actually want, because there are about
> as many xs:double values in the range 0 to 0.1 as there are in the
> range 0.1 to 1.0. Fortunately implementors are unlikely to have taken
> much notice of that provision; but it does illustrate the dangers.

You're right!  I was so confident that what it said was what it should
say that I misread

    The value of the number entry should be such that all eligible
    xs:double values are equally likely to be chosen.

as saying that all real values in the interval are equally likely to be
chosen, i.e. that the values would be selected from a uniform
distribution over the interval.

At some point I will doubtless also want to select values from other
distributions, too, but I suspect that there are ways to make that
happen.

> If you want a completely uniform distribution, consider using the
> permute() option. But of course a uniform distribution is very far
> from random (it's well known that a truly random sequence contains far
> more duplicates than a typical person will expect).

> I think the idea of fn:random-number-generator() is that you can
> choose whether you want a repeatable sequence of random numbers or a
> different sequence each time. For the latter, use current-dateTime() as
> a seed.

My situation, I regret to say, is that if possible I want both: I would
like a given set of a hundred or a thousand or ten thousand simulation
runs to be repeatable, and I would also like each run in such a set to
have a different sequence.  (And, if possible, I would like each set to
vary, in a reproducible way.)

But if I use current-dateTime() or some other method to introduce
variation into the initial seed, I can of course record the seed used
and use it again if I want to replicate the simulations.  I suspect that
for my current code, which tends to use just the first two values of the
sequence of numbers generated by a given seed, I will want more
variation in the seed that current-dateTime() will give: a minimally
conforming processor has about 3.1E14 different dateTime values (and the
simulation runs in a set will vary only in minute, second, and
millisecond, so maybe 1E5 different values), which is a lot less than
4.5E18, which the Web tells me is roughly the number of double precision
values in the interval [0,1].

(I believe I saw a story once about a flawed online poker system whose
card-shuffling routine a 32-bit random number to shuffle the cards, with
the result that there were about 2^32 b	 4E9 possible hands, instead of
52! b	 8e67, and worse yet used milliseconds-since-midnight as the seed,
so those who broke the system just needed to try a few hundred or a few
thousand clock values to find the value that produced the hand they were
holding, at which point they could see everyone else's hand, too.  This
has led me to believe that -- especially if one is taking the first
number generated from a given seed -- it pays if the range of possible
seeds is about the same size as the range of possible results of the
random number generator, and/or the range of possible phenomena being
selected.)


> I'm sure you're right that you want a single "flat" sequence of random
> numbers, you don't want a branching sequence; and achieving a "flat"
> sequence when you're doing recursive tree traversal isn't
> straightforward. Using an accumulator is an interesting idea. I've
> previously used xsl:number level="any" to index into a sequence of
> pre-alllocated random numbers.

That seems like a feasible idea:  before calling apply-templates for a
given time slice I can generate a sequence of random numbers (it's easy
to know in advance how many are needed) and index into it in the way you
describe.

> You seem to be doing all the right things. You say the results are
> "disappointing", and I feel I'd like to know more about what that
> means.

That's a good question.  The main symptom so far is that the first
hundred times I ran the simulation, 80 of the runs produced the same
trivial result: a birth-and-death process in which the initial
individual died before reproducing - the runs differed on how long that
individual lived, so they were not completely identical, but the family
trees they produced were isomorphic: one node with birth and death
dates.

It is possible, of course, that that form of result is more probable
than I had expected.  So with a little effort I calculated the
probability that the birth and death rates I was using should produce
the result.  The probability *is* higher than I had expected, but if my
calculation is accurate it's about 0.32, not about 0.80.

> ...
> Specifically: a small change in the seed results in only a small
> change in the first value in the sequence. (The Saxon implementation
> calls Java's Random class passing a seed which is the hash code of the
> seed passed at the XPath level. For integers, the hash code of a small
> integer is the integer itself, which may well have something to do
> with it.)
>
> If I multiply the supplied seed by 987654321 before passing it to
> Java, the pattern looks a lot more "random":
>
> ...
>
> Alternatively, discard the first couple of items in the sequence.

An experiment I just ran suggests that discarding two values and taking
the third for each key does indeed lead to a much wider spread in the
resulting numbers; even discarding a single value helps a lot.  Perhaps
that will help.  Discarding ten appears to be more than is needed.

I will try one or more of these approaches and see what happens.  If and
when the simulation starts to produce results that match the
probabilities I am able to calculate for some simple cases (like the
one-individual case described above), then I will begin to have more
confidence in my simulation.

--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Current Thread