Hi Jim,
At 10:35 AM 4/19/01, Mike wrote:
> I am trying to output the first n sentences of a node. I have tried using
> for-each with a conditional to stop output but have had no luck.
>
> Given the following XML fragment, what is the best way to output only the
> first n sentences? Note that the node has both text and child nodes.
>
Write a recursive template that takes the text and n as parameters. In this
template, if n > 0, output the first sentence (using substring-before), then
make a recursive call to the same template, passing the remaining text (using
substring-after) and n-1 as the parameters.
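
A minimal XSLT 1.0 sketch of that recursion might look like the following.
The template name "first-sentences", the match on a <summary> element, and
the fixed n of 2 are my own assumptions for illustration, not anything from
the original question. It treats "." as the only sentence delimiter and
flattens the node to its string value, so any child element tags are lost.

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumption: the text to truncate lives in a <summary> element. -->
  <xsl:template match="summary">
    <xsl:call-template name="first-sentences">
      <!-- normalize-space(.) takes the string value of the node, which
           discards child element tags and collapses line breaks -->
      <xsl:with-param name="text" select="normalize-space(.)"/>
      <!-- Assumption: n is hard-coded to 2 for the example -->
      <xsl:with-param name="n" select="2"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Hypothetical named template: emits up to $n sentences of $text,
       where a sentence is whatever precedes the next "." -->
  <xsl:template name="first-sentences">
    <xsl:param name="text"/>
    <xsl:param name="n" select="1"/>
    <!-- Stop when the count is used up or no delimiter remains -->
    <xsl:if test="$n &gt; 0 and contains($text, '.')">
      <xsl:value-of select="substring-before($text, '.')"/>
      <xsl:text>.</xsl:text>
      <xsl:call-template name="first-sentences">
        <xsl:with-param name="text" select="substring-after($text, '.')"/>
        <xsl:with-param name="n" select="$n - 1"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Note that trailing text with no "." after it is silently dropped by this
version; whether that is acceptable depends on your data.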
This will work assuming you have identified some dependable way to delimit
sentences in your data. You might assume that the presence of a character
"." will indicate the end of a sentence. This is fine ... but what about
sentences that happen to contain the string "...", or that end with a
question mark? (Or what about sentences that appear with other kinds of
punctuation?!)
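
One way to accept "?" and "!" as well as "." in XSLT 1.0 is to locate the
break in a copy of the text where translate() has mapped all three onto ".",
then cut the original text at that position. This is only a sketch under my
own assumptions (the template name "first-sentences", the <summary> match,
and n fixed at 2 are all hypothetical), and it does nothing for "...", which
it will count as three sentence ends.

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumption: the text to truncate lives in a <summary> element. -->
  <xsl:template match="summary">
    <xsl:call-template name="first-sentences">
      <xsl:with-param name="text" select="normalize-space(.)"/>
      <xsl:with-param name="n" select="2"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Hypothetical named template: emits up to $n sentences of $text,
       where a sentence ends at ".", "?" or "!" -->
  <xsl:template name="first-sentences">
    <xsl:param name="text"/>
    <xsl:param name="n" select="1"/>
    <!-- Same length as $text, with "?" and "!" replaced by ".";
         used only to find where the first sentence ends -->
    <xsl:variable name="probe" select="translate($text, '?!', '..')"/>
    <xsl:if test="$n &gt; 0 and contains($probe, '.')">
      <!-- Length of the first sentence, including its terminator -->
      <xsl:variable name="len"
          select="string-length(substring-before($probe, '.')) + 1"/>
      <!-- Cut the ORIGINAL text so the real punctuation is kept -->
      <xsl:value-of select="substring($text, 1, $len)"/>
      <xsl:call-template name="first-sentences">
        <xsl:with-param name="text" select="substring($text, $len + 1)"/>
        <xsl:with-param name="n" select="$n - 1"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Because translate() replaces characters one for one, positions in the probe
string line up exactly with positions in the original text.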
Identifying what counts as a "sentence" is a genuinely difficult problem
in text processing, not easily tractable, which is why applications that
process text sentence by sentence are much easier to build if the input has
embedded markup that tells you what's a sentence and what's not. Your
problem would be fairly trivial in XSLT if your input were something like:
<summary>
<s>It is best to start a new <span class="highlight">message</span> for a
new thread.</s>
<s>Do not start a new thread by replying to an unrelated <span
class="highlight">message</span> and just changing the subject line, since
the header of your <span class="highlight">message</span> will contain
references to the previous <span class="highlight">message</span> and your
new <span class="highlight">message</span> will appear in the archive as
one of the replies to the original <span class="highlight">message</span>.</s>
</summary>
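
With input marked up like that, a sketch of the whole job could be as short
as the following. The global parameter name "n" and its default of 1 are my
own choices for illustration.

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: the number of sentences is passed in as a global
       parameter named "n", defaulting to 1 -->
  <xsl:param name="n" select="1"/>

  <xsl:template match="summary">
    <summary>
      <!-- Copy the first $n <s> children whole, so the <span> elements
           inside each sentence survive intact -->
      <xsl:copy-of select="s[position() &lt;= $n]"/>
    </summary>
  </xsl:template>
</xsl:stylesheet>
```

Since each sentence is copied as a complete element, the mixed content
inside it needs no special handling.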
If you don't have the option of changing the way your input is structured,
Mike's solution of processing text content recursively is the only option
-- and might be "good enough for government work" (as is sometimes said).
But the presence of element nodes in mixed content (such as your embedded
<span> elements) makes this much harder, unless you can just throw them
away. In theory I suppose it could be done, but the code is going to be
pretty ugly, especially if you allow for the possibility that a "sentence"
could end *inside* one of the <span> children....
Any intrepid XSLT coders want to tackle that?
Good luck,
Wendell
======================================================================
Wendell Piez mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list