Subject: Re: OT: XML Server dream
From: "Liam R. E. Quin" <liamquin@xxxxxxxxxxxx>
Date: Mon, 25 Oct 1999 13:23:29 -0400 (EDT)

I've lost count of the number of systems I have seen storing SGML or XML
in a database.

I have seen three basic approaches.  Of these, one almost never
works, or if it does, it dramatically increases sales of snacks and
coffee while people wait for a response.

The approaches are:

(1) decompose every element into a field in a relational or OO database.
    With a relational database, this always seems to end up sad:
    a wait of 30 seconds to a minute or more on half a million dollars'
    worth of server hardware is pathetic.  (There's a sketch of this
    kind of decomposition just after the list.)

(2) decompose down to paragraphs, but store mixed content as blobs.
    Much faster, for most people, but you lose the ability to find
    things like embedded part numbers that might have been the reason
    for using the database in the first place.

(3) store documents in flat files.  Use the database to manage them,
    and to store metadata.  Use external text retrieval.
    Fast performance (sub-second response on a million-document
    database with a middling SPARC server is plausible, or a few
    seconds for a more complex text search).
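To make (1) concrete, here is a minimal sketch of one-row-per-element
storage.  Python and SQLite are stand-ins of my own choosing -- the
systems I've actually seen used commercial databases -- and the table
layout is only illustrative:

    import sqlite3
    import xml.etree.ElementTree as ET

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE element (
        id     INTEGER PRIMARY KEY,
        parent INTEGER,        -- NULL for the document element
        name   TEXT,
        text   TEXT            -- text directly inside the element
    )""")

    def store(elem, parent=None):
        # One row per element; text "tails" in mixed content are
        # dropped here for brevity, which is itself part of the pain.
        cur = conn.execute(
            "INSERT INTO element (parent, name, text) VALUES (?, ?, ?)",
            (parent, elem.tag, (elem.text or "").strip()))
        for child in elem:
            store(child, cur.lastrowid)

    store(ET.fromstring("<para>See part "
                        "<partnum>X-42</partnum>.</para>"))

    # Even "part numbers inside paragraphs" is already a self-join;
    # real document queries need one join per level of nesting.
    for (num,) in conn.execute(
            """SELECT c.text FROM element AS c
               JOIN element AS p ON c.parent = p.id
               WHERE c.name = 'partnum' AND p.name = 'para'"""):
        print(num)

The joins are what kill you: a query three or four elements deep
multiplies them, and that is where the 30-second waits come from.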

There are many variations on these.  It is possible to use an
object-oriented database in such a way as to give good response, but
it is difficult.  It's possible to get good response for a specific
application with a relational database too.  But if you compare the
performance of Oracle (the market leader) with that of MySQL (free;
does not support transactions, rollback or cursors), you see that you
are not paying for performance.

Marc Rochkind mentions [1] a database that did 40,000 or more transactions
per second on a PDP-11, but I doubt there was locking or rollback or
journalling.  My own text retrieval package can do several million
database operations a second, but again without locking.

Luckily, you don't need locking and rollback below the "file" level
for most XML applications -- where "file" is the granularity at which
documents are saved and/or edited.
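
What file-level "transactions" can look like in practice -- a sketch,
assuming Unix, with Python's fcntl purely as illustration:

    import fcntl, os

    def save_document(path, xml_text):
        # One advisory lock per document is usually all the
        # concurrency control an XML application needs.
        with open(path + ".lock", "w") as lock:
            fcntl.flock(lock, fcntl.LOCK_EX)   # wait for other savers
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                f.write(xml_text)
                f.flush()
                os.fsync(f.fileno())           # force bytes to disk
            os.rename(tmp, path)               # atomic replace on POSIX
        # lock is released when the lock file is closed

Write-then-rename gives you crash safety at the whole-document level;
nothing finer-grained is needed.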

If performance isn't a major issue, though, I agree that using the
database is often the simplest way.  Some databases even support
searching of text fields and BLOBs these days, which makes that
approach more attractive.
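
For illustration only (SQLite below stands in for whatever product
you actually have): store whole documents as text and let the
database search them.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE doc (path TEXT, body TEXT)")
    conn.execute("INSERT INTO doc VALUES (?, ?)",
                 ("widgets.xml",
                  "<para>See part <partnum>X-42</partnum>.</para>"))

    # A crude substring scan; real products add proper text indexes
    # over character fields and BLOBs so this doesn't read every row.
    for (path,) in conn.execute(
            "SELECT path FROM doc WHERE body LIKE ?", ("%X-42%",)):
        print(path)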

Don't underestimate grep, by the way -- I've seen a good version of
grep search over 50 megabytes a second, on a fairly low-end
SPARC system (an SS10, you can't buy them that slow now).  You won't
get that performance on a PC, usually, because the I/O just isn't
there, even with a SCSI PCI system, but it's coming.  And two PCs
in parallel are less than half the price of a SPARC Ultra.

You have to look at what staff you have.

If you have Unix programmers, the grep solution may be a good one,
once you deal with normalising white-space.
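
By normalising white-space I mean something like the following filter
(Python here is just a stand-in; any language will do).  grep matches
one line at a time, so a phrase broken across a line break is
otherwise invisible to it:

    import re, sys

    # Collapse each document onto a single line with uniform single
    # blanks, so "part  number" and "part\nnumber" both match.
    text = sys.stdin.read()
    sys.stdout.write(re.sub(r"\s+", " ", text).strip() + "\n")

Run each file through that once, keep the flattened copies, and point
grep at those.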

If you have SQL programmers, then any problem will seem to have a
solution involving a database :-) and that's the way to go.

It's better to have a slow system that works, and that you can fix
and extend, than a super-duper quantum rocket-science thingy that is
99% there but where no-one can finish the last 1% unless you hire a
wizard.  There _are_ no wizards, only people.

The best solution is the one that works and can be supported in
your environment.

Lee

-- 
Liam Quin, Barefoot Computing, Toronto;  The barefoot agitator
l i a m q u i n     at    i n t e r l o g    dot   c o m
Ankh on irc.sorcery.net, ankh5/Ankh{MD} on DALnet
