RE: [xsl] How to parse text into words, phrases, clauses, sentences, and paragraphs

From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 7 Jun 2007 15:20:22 +0100
> This is my first problem. How to apply a template match using 
> the tokenize() function. And which order to apply (from 
> paragraph -> word or word -> paragraph).

It's generally easiest to do it top-down, I think.

Something like this:

<xsl:for-each select="tokenize(., $sentence-delimiter)">
  <sentence id="{position()}">
    <xsl:for-each select="tokenize(., $phrase-delimiter)">
      <phrase id="{position()}">
        <xsl:for-each select="tokenize(., $word-delimiter)">
          <word id="{position()}">
            <xsl:value-of select="."/>
          </word>
        </xsl:for-each>
      </phrase>
    </xsl:for-each>
  </sentence>
</xsl:for-each>
> 
> > (d) doing the output numbering.
> 

I think you just need position() as shown above.
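
To make the top-down idea concrete, here is a minimal sketch of a complete XSLT 2.0 stylesheet. The three delimiter regular expressions, the input element name "para", and the wrapper element "text" are assumptions chosen for illustration, not something given in the question. The [. ne ''] filter is also an addition: tokenize() returns an empty final token when the input ends with a delimiter, and filtering it out keeps position() from numbering an empty sentence.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumed delimiters (regular expressions); adjust to the real text. -->
  <xsl:variable name="sentence-delimiter" select="'[.!?]+\s*'"/>
  <xsl:variable name="phrase-delimiter" select="'[,;:]\s*'"/>
  <xsl:variable name="word-delimiter" select="'\s+'"/>

  <!-- "text" and "para" are hypothetical element names. -->
  <xsl:template match="/">
    <text>
      <xsl:apply-templates select="//para"/>
    </text>
  </xsl:template>

  <xsl:template match="para">
    <paragraph id="{position()}">
      <!-- Each nested for-each changes the context item to one token,
           so "." and position() always refer to the current level. -->
      <xsl:for-each select="tokenize(normalize-space(.), $sentence-delimiter)[. ne '']">
        <sentence id="{position()}">
          <xsl:for-each select="tokenize(., $phrase-delimiter)[. ne '']">
            <phrase id="{position()}">
              <xsl:for-each select="tokenize(., $word-delimiter)[. ne '']">
                <word id="{position()}">
                  <xsl:value-of select="."/>
                </word>
              </xsl:for-each>
            </phrase>
          </xsl:for-each>
        </sentence>
      </xsl:for-each>
    </paragraph>
  </xsl:template>

</xsl:stylesheet>
```

With these assumed delimiters the punctuation itself is discarded, and a token such as "3.14" is split at the full stop, which is exactly the kind of case where the top-down approach falls short.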

Sometimes you need to work bottom-up if the "sentences" can't be recognized
until you've identified the "words", for example if you want to avoid
treating "." as ending a sentence if it appears in a number. You're then
sometimes in the domain of positional grouping: create a long flat list of
words, and then group it into sentences using some kind of test applied to
the individual words.
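
As a rough sketch of that bottom-up approach in XSLT 2.0: first build a flat list of word elements, then use xsl:for-each-group with group-ending-with to close a sentence at each word that passes the test. The element names "para" and "w", the abbreviation list, and the test itself (a word ends a sentence if it ends in ".", "!" or "?" and is not a listed abbreviation) are assumptions for illustration only. The words are wrapped in elements because group-ending-with takes a pattern, and XSLT 2.0 patterns match nodes rather than strings.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumed list of tokens that end in "." without ending a sentence. -->
  <xsl:variable name="abbreviations" select="('Dr.', 'Mr.', 'e.g.', 'i.e.')"/>

  <!-- "para" is a hypothetical input element name. -->
  <xsl:template match="para">
    <!-- Step 1: a long flat list of words, split on whitespace only,
         so "3.14" stays one token with its full stop inside it. -->
    <xsl:variable name="words">
      <xsl:for-each select="tokenize(normalize-space(.), ' ')">
        <w><xsl:value-of select="."/></w>
      </xsl:for-each>
    </xsl:variable>

    <paragraph>
      <!-- Step 2: positional grouping. A group ends at each word that
           ends with sentence punctuation and is not an abbreviation. -->
      <xsl:for-each-group select="$words/w"
          group-ending-with="w[matches(., '[.!?]$')
                               and not(. = $abbreviations)]">
        <sentence id="{position()}">
          <xsl:for-each select="current-group()">
            <word id="{position()}">
              <xsl:value-of select="."/>
            </word>
          </xsl:for-each>
        </sentence>
      </xsl:for-each-group>
    </paragraph>
  </xsl:template>

</xsl:stylesheet>
```

Here the words keep their trailing punctuation; stripping it, or adding a phrase level with a second grouping pass, would follow the same pattern.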

Michael Kay
http://www.saxonica.com/
