Subject: Re: [xsl] Splitting a paragraph into sentences and keep markup
From: "Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 24 Nov 2019 19:38:28 -0000

I think there are two basic approaches to this kind of problem. One is to convert the punctuation into tags, and then manipulate the resulting tree structure; the other is to turn the embedded tags into punctuation (like "[emphasis]two[/emphasis]") and then manipulate the content as a character string.

My instinct, like Martin Honnen's, is to do the first.

There are still complications, of course. For example, if you're detecting end-of-sentence as [.?!] followed by a space or end-of-paragraph, then it's challenging to handle the case where the [.?!] is the last character in a text node but the text node isn't the last thing in the paragraph. (For example, "sentence.<footnote>x</footnote> ".)

There's no easy answer to this (and natural language being what it is, there is no right answer either).

Michael Kay
Saxonica

> On 24 Nov 2019, at 13:34, Rick Quatro rick@xxxxxxxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi All,
>
> I have a situation where I want to split a short paragraph into sentences and use them in different parts of my output. I am using <xsl:analyze-string> because I want to account for a sentence ending with a . or ?. This will work except if there are any children of the paragraph, like the <emphasis> in the second sentence. Can I split a paragraph into sentences and still keep the markup?
>
> Here is my input document:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
>     <p>This has one sentence? Actually, it has <emphasis>two</emphasis>. No, it has three.</p>
> </root>
>
> My stylesheet:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>     xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     xmlns:rq="http://www.frameexpert.com"
>     exclude-result-prefixes="xs rq"
>     version="2.0">
>
>     <xsl:output indent="yes"/>
>     <xsl:strip-space elements="root"/>
>
>     <xsl:template match="/root">
>         <xsl:copy>
>             <xsl:apply-templates/>
>         </xsl:copy>
>     </xsl:template>
>
>     <xsl:template match="p">
>         <xsl:variable name="sentences" select="rq:splitParagraphIntoSentences(.)"/>
>         <p><xsl:value-of select="$sentences[1]"/></p>
>         <note>Something in between.</note>
>         <p><xsl:value-of select="$sentences[position()>1]"/></p>
>     </xsl:template>
>
>     <xsl:function name="rq:splitParagraphIntoSentences">
>         <xsl:param name="paragraph"/>
>         <xsl:analyze-string select="$paragraph" regex=".+?[\.\?](\s+|$)">
>             <xsl:matching-substring>
>                 <sentence><xsl:value-of select="replace(.,'\s+$','')"/></sentence>
>             </xsl:matching-substring>
>         </xsl:analyze-string>
>     </xsl:function>
> </xsl:stylesheet>
>
> My output:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
>     <p>This has one sentence?</p>
>     <note>Something in between.</note>
>     <p>Actually, it has two. No, it has three.</p>
> </root>
>
> What I want is this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
>     <p>This has one sentence? </p>
>     <note>Something in between.</note>
>     <p>Actually, it has <emphasis>two</emphasis>. No, it has three. </p>
> </root>
>
> Any suggestions will be appreciated.
>
> Rick
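
The first approach from the reply (turn the punctuation into tags, then work on the tree) can be sketched in XSLT 2.0 roughly as follows. This is only a minimal sketch under stated assumptions, not the code anyone in the thread posted: the marker element <eos/>, the <sentence> wrapper, and the mode name "mark" are made up for illustration, and sentence ends are only looked for in text nodes that are direct children of <p>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:strip-space elements="root"/>

  <xsl:template match="/root">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p">
    <!-- Pass 1: copy the paragraph content into a temporary tree,
         inserting a hypothetical <eos/> marker after each sentence end. -->
    <xsl:variable name="marked">
      <xsl:apply-templates select="node()" mode="mark"/>
    </xsl:variable>

    <!-- Pass 2: each group ends at an <eos/> marker; the markup inside
         the group (e.g. <emphasis>) is copied along with the text. -->
    <xsl:variable name="sentences" as="element(sentence)*">
      <xsl:for-each-group select="$marked/node()" group-ending-with="eos">
        <sentence>
          <xsl:copy-of select="current-group()[not(self::eos)]"/>
        </sentence>
      </xsl:for-each-group>
    </xsl:variable>

    <p><xsl:copy-of select="$sentences[1]/node()"/></p>
    <note>Something in between.</note>
    <p><xsl:copy-of select="$sentences[position() gt 1]/node()"/></p>
  </xsl:template>

  <!-- Assumption: a sentence ends at [.?!] followed by whitespace or by
       the end of the text node. Treating end-of-text-node as
       end-of-sentence is exactly the weak spot mentioned in the reply:
       in "sentence.<footnote>x</footnote> " the footnote would land in
       the following sentence. -->
  <xsl:template match="p/text()" mode="mark">
    <xsl:analyze-string select="." regex="[.?!](\s+|$)">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <eos/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Everything else (elements such as <emphasis>, comments, and so on)
       is copied unchanged, so punctuation inside it is not inspected. -->
  <xsl:template match="node()" mode="mark">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Grouping with group-ending-with keeps the trailing whitespace of each sentence with that sentence and leaves inline elements intact, which is the part the string-based <xsl:analyze-string> function in the original stylesheet loses. Punctuation nested inside inline elements, abbreviations, and the footnote case above would all need extra rules.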