
Subject: Re: [xsl] An efficient XSLT program that searches a large XML document for all occurrences of a string?
From: "Dimitre Novatchev dnovatchev@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 3 May 2024 15:57:10 -0000
On Fri, May 3, 2024 at 1:31 AM Michael Kay michaelkay90@xxxxxxxxx <
xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

>
>
> On 3 May 2024, at 00:25, Dimitre Novatchev dnovatchev@xxxxxxxxx <
> xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>
> If I were related to any activity that collects and structures such large
> quantities of data, I would envisage splitting and keeping this data into
> smaller, manageable chunks, wherever possible.
>
>
> That's a good recommendation, but it's a workaround for the fact that the
> technology isn't as scalable as we would like.
>
> If a system offers you the opportunity to get an XML report of all the
> transactions occurring between two dates at a range of locations, then
> sooner or later someone is going to submit a query that delivers a 5Gb
> report, and in an ideal world, they wouldn't have to do things differently
> just because the amount of data has exceeded some arbitrary threshold.
>
> Growth in data size tends to creep up on you. The log files that we keep
> of licenses issued to Saxon users are now much larger than we ever
> envisaged when we started. You don't want to have to change the design just
> because things have grown incrementally. We did change the design: we
> switched to one XML file per year. But it would be nice if we weren't
> forced into that by technology limitations.
>
>
Agreed, but this sounds like a good excuse to leave things as they are.

As a single source of data grows beyond any practical size, its sequential
processing takes proportionally longer and longer. It might require many
hours of processing (streaming, in this particular case) only to find data
that sits at the very end of this sequential resource.

A proactive approach would be to split the data periodically (imagine a
nightly background job) into manageable chunks, and to index them.
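
A minimal sketch of such a splitting job in XSLT 3.0 (the transactions/txn
element names and the date attribute are hypothetical stand-ins for whatever
the real log actually uses):

    <xsl:stylesheet version="3.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Group the records by day and write each group to its own
           small, independently addressable chunk file -->
      <xsl:template match="/transactions">
        <xsl:for-each-group select="txn" group-by="@date">
          <xsl:result-document href="chunks/{current-grouping-key()}.xml">
            <transactions date="{current-grouping-key()}">
              <xsl:copy-of select="current-group()"/>
            </transactions>
          </xsl:result-document>
        </xsl:for-each-group>
      </xsl:template>

    </xsl:stylesheet>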

So, on one hand we have streaming of a monolithic resource, which could take
many hours, days, or weeks.

On the other hand, we have almost instantaneous processing that focuses
immediately on just the small chunk of data we are interested in.
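
For instance, given the hypothetical per-day chunk files produced by the
sketch above, searching one day's data is a single query over one small
document:

    doc('chunks/2024-05-03.xml')//txn[contains(., 'the-search-string')]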

Even if we need to process a large set of such data chunks, we can employ a
map/reduce approach (sketched below): with enough workers processing the
chunks in parallel, the total elapsed time never exceeds the time needed to
process the single largest chunk, plus an inexpensive reduce step that
combines the results. This is not possible with purely sequential processing.
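
A minimal sketch of this in XSLT 3.0, again assuming the hypothetical chunk
files above. The "map" step searches each chunk as an independent document;
the "reduce" step simply concatenates the matches. Because every iteration
is independent, a processor such as Saxon-EE can run them in parallel. (The
?select= syntax in the collection URI is Saxon's convention.)

    <xsl:stylesheet version="3.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <xsl:param name="search-string" as="xs:string" select="'needle'"/>

      <xsl:template name="xsl:initial-template">
        <hits>
          <!-- "map": search each chunk independently;
               "reduce": concatenate the per-chunk matches -->
          <xsl:for-each select="uri-collection('chunks/?select=*.xml')">
            <xsl:copy-of select="doc(.)//txn[contains(., $search-string)]"/>
          </xsl:for-each>
        </hits>
      </xsl:template>

    </xsl:stylesheet>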

And this is why horizontal scaling is most often the preferred, and more
efficient, solution compared with vertical scaling.

Thanks,
Dimitre
