RE: [xsl] Truncating output of a node

Subject: RE: [xsl] Truncating output of a node
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Thu, 19 Apr 2001 14:56:31 +0100
Hi Jim,

At 10:35 AM 4/19/01, Mike wrote:
> I am trying to output the first n sentences of a node. I have tried using
> for-each with a conditional to stop output but have had no luck.
> Given the following XML fragment, what is the best way to output only the
> first n sentences? Note that the node has both text and child nodes.
Write a recursive template that takes the text and n as parameters; in this
template, if n>0, output the first sentence (using substring-before), then
make a recursive call on the same template, passing the remaining text (using
substring-after) and n-1 as the parameters.

This will work assuming you have identified some dependable way to delimit sentences in your data. You might assume that the presence of a character "." will indicate the end of a sentence. This is fine ... but what about sentences that happen to contain the string "...", or that end with a question mark? (Or what about sentences that appear with other kinds of punctuation?!)
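
To make the recursion concrete, here is a minimal sketch of the kind of template Mike describes, written for XSLT 1.0. It rests on assumptions that are mine, not from your data: the container element is called "para", the parameter and template names (n, text, count, first-sentences) are made up for illustration, and a "." is taken to end every sentence. Note that taking the string value of the node throws away any child element markup.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumption: how many sentences to keep -->
  <xsl:param name="n" select="2"/>

  <!-- Assumption: the text lives in an element named "para".
       Its string value flattens (discards) any child element markup. -->
  <xsl:template match="para">
    <xsl:call-template name="first-sentences">
      <xsl:with-param name="text" select="normalize-space(.)"/>
      <xsl:with-param name="count" select="$n"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Emit one sentence, then recurse on the remaining text with count - 1 -->
  <xsl:template name="first-sentences">
    <xsl:param name="text"/>
    <xsl:param name="count"/>
    <!-- Stop when the count is used up or no "." delimiter is left -->
    <xsl:if test="$count &gt; 0 and contains($text, '.')">
      <xsl:value-of select="substring-before($text, '.')"/>
      <xsl:text>.</xsl:text>
      <xsl:call-template name="first-sentences">
        <xsl:with-param name="text" select="substring-after($text, '.')"/>
        <xsl:with-param name="count" select="$count - 1"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

As written, any trailing text with no closing "." is silently dropped, and an ellipsis or an abbreviation such as "e.g." will be miscounted as sentence ends, which is exactly the weakness of delimiting on punctuation.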

Identifying what is actually a "sentence" is a difficult problem in text processing and not easily tractable, which is why applications that process text sentence by sentence are much easier to build if the markup itself tells you what is a sentence and what is not. Your problem would be fairly trivial in XSLT if your input were something like:

<s>It is best to start a new <span class="highlight">message</span> for a
new thread.</s>
<s>Do not start a new thread by replying to an unrelated
<span class="highlight">message</span> and just changing the subject line,
since the header of your <span class="highlight">message</span> will contain
references to the previous <span class="highlight">message</span> and your
new <span class="highlight">message</span> will appear in the archive as one
of the replies to the original <span class="highlight">message</span>.</s>
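
With sentence markup like that, a sketch of the "fairly trivial" case might look like the following. Again the wrapper element name "para" and the parameter name n are assumptions of mine for illustration; only the <s> and <span> elements come from the sample above.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: how many sentences to keep -->
  <xsl:param name="n" select="2"/>

  <!-- Assumption: the <s> elements are children of an element named "para" -->
  <xsl:template match="para">
    <xsl:copy>
      <!-- Copy the first n sentences whole, child elements included -->
      <xsl:copy-of select="s[position() &lt;= $n]"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Because xsl:copy-of copies each selected <s> with its entire subtree, the embedded <span> elements survive intact and no string handling is needed at all.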

If you don't have the option of changing the way your input is structured, Mike's solution of processing text content recursively is the only option -- and might be "good enough for government work" (as is sometimes said). But the presence of element nodes in mixed content (such as your embedded <span> elements) makes this much harder, unless you can just throw them away. In theory I suppose it could be done, but the code is going to be pretty ugly, especially if you allow for the possibility that a "sentence" could end *inside* one of the <span> children....

Any intrepid XSLT coders want to tackle that?

Good luck,

Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.      
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
