Re: [xsl] Using XSLT to build an index

Subject: Re: [xsl] Using XSLT to build an index
From: "G. Ken Holman" <gkholman@xxxxxxxxxxxxxxxxxxxx>
Date: Sun, 30 Oct 2011 18:07:32 -0400
At 2011-10-30 14:47 -0700, Mark wrote:
The list archives did not seem to contain an
XSLT stylesheet that could index an XML file,
but I may have missed it. Is it practical to
write my own XSLT 2 indexing stylesheet? If so,
I have a bilingual XML file that I want to index.

If you simply want every word except your stop words collected automatically to generate the index, be aware that I've never been successful with automated indexing myself. For my books I've authored the components of the index, and then pointed to those components from within the code.

My assumptions are that I must get rid of the
punctuation properly, then isolate the words,
sort them, remove stop words, and so on. To get
started, I need a bit of help. All of the
phrases are found in two attributes: @czech and @eng.

Three questions:
(1) I am aware from Michael's book that regex
expressions may be used in the replace()
function, but I do not know how to write that
regex expression. I would like to remove all the
punctuation from a phrase as follows: for
everything except a hyphen [-], replacement
should be with an empty string; the hyphen
should be replaced with a single space.

Simple character removal can be done with translate() in XSLT 1 or 2 rather than using a regular expression:

translate($inValue,'-,#.$%',' ')

... where the first argument is your input, the
second starts with a "-" and then you put
anything else in there as characters to remove,
the third indicates the hyphen becomes a space and the rest are to be
removed.
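
As a minimal sketch of that call in context (the sample phrase in $inValue is made up for illustration, and the stylesheet can be run against any input document):

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical sample phrase, not taken from the real data -->
  <xsl:variable name="inValue" select="'well-known, 100%.'"/>

  <xsl:template match="/">
    <!-- '-' maps to the single space in the third argument;
         ',', '#', '.', '$' and '%' have no counterpart there,
         so they are deleted -->
    <xsl:value-of select="translate($inValue,'-,#.$%',' ')"/>
  </xsl:template>
</xsl:stylesheet>
```

This should write out "well known 100". Any punctuation not listed in the second argument (parentheses, for example) passes through untouched, so the list has to cover everything that actually occurs in your data.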

(2) I assume that to get rid of extra spaces (if
any), I can use a construct like:
normalize-space(replace(@czech, 'some regex expression')).

That will reduce all sequences of white-space characters to a single space.
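
A small sketch of why the order of the two calls matters, assuming translate() is used for the punctuation instead of replace(): turning a hyphen into a space can itself create a run of spaces, so normalize-space() is best applied last. The sample string is made up.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- translate() first: '-' becomes a space and ',' is deleted,
         leaving several adjacent spaces.
         normalize-space() second: trims both ends and collapses
         each run of white-space to one space. -->
    <xsl:value-of
        select="normalize-space(translate('  a - b,  c ','-,',' '))"/>
  </xsl:template>
</xsl:stylesheet>
```

This should write out "a b c" with no leading or trailing space.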


(3) I assume that
tokenize(normalize-space(replace(@czech, 'some
regex expression'))) will permit me to write out
a list of the words found in those attributes to
an XML document. I am not completely clear as to
what tokenize() returns, or how to access that return.

tokenize() returns a sequence of strings, one per token. But its input is only a single string, not a set of attributes.
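
As for how to access what comes back: a sketch, assuming you want one element per word. The element names words and word are invented for illustration. Note that in XSLT 2 tokenize() takes the separator pattern as a required second argument.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <words>
      <!-- Each item in the returned sequence is a string; inside
           xsl:for-each the context item "." is that string -->
      <xsl:for-each select="tokenize('jedna dva tri',' ')">
        <word><xsl:value-of select="."/></word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
```

This should produce a words element holding three word children, one each for "jedna", "dva" and "tri".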


Actually, you want to turn the expression
inside-out. To get a list of words from the entire
document, something along these lines should work:

distinct-values(
(//@czech)/tokenize(normalize-space(translate(.,'-,$%.#',' ')),' ')  )

That gives you a sequence of unique words.  Can
you work from that in order to do the
hyperlinking, or do you need help there as
well?  Remember you will have to do the same
translation when creating your links, so perhaps
you should have a user function:

mark:words($arg) as tokenize(normalize-space(translate($arg,'-,$%.#',' ')),' ')

... then use:

(//@czech)/mark:words(.)

... then when creating your links you'll have the
function available to ensure the same tokenizing is done at that point
as well.
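
A sketch of how such a function might be declared and used in XSLT 2. The namespace URI bound to the mark: prefix, the parameter name arg, and the index and word element names are all placeholders chosen for illustration; only the @czech attribute comes from your description.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mark="urn:x-example:mark"
    exclude-result-prefixes="xs mark">

  <xsl:output method="xml" indent="yes"/>

  <!-- Split one phrase into words: hyphen to space, other listed
       punctuation removed, white-space collapsed, then tokenized
       on the single spaces that remain -->
  <xsl:function name="mark:words" as="xs:string*">
    <xsl:param name="arg" as="xs:string?"/>
    <xsl:sequence
        select="tokenize(normalize-space(translate($arg,'-,$%.#',' ')),' ')"/>
  </xsl:function>

  <xsl:template match="/">
    <index>
      <!-- Unique words from every @czech attribute, sorted -->
      <xsl:for-each select="distinct-values(//@czech/mark:words(.))">
        <xsl:sort select="."/>
        <word><xsl:value-of select="."/></word>
      </xsl:for-each>
    </index>
  </xsl:template>
</xsl:stylesheet>
```

Stop-word removal is not shown. One option would be a predicate on the distinct-values() result that compares each word against a sequence of stop words. For Czech you may also want a collation attribute on xsl:sort so that accented letters sort where you expect them.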

I hope this helps.

. . . . . . . . . . Ken


--
Contact us for world-wide XML consulting and instructor-led training
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/s/
G. Ken Holman mailto:gkholman@xxxxxxxxxxxxxxxxxxxx
Google+ profile: https://plus.google.com/116832879756988317389/about
Legal business disclaimers: http://www.CraneSoftwrights.com/legal
