Subject: RE: [xsl] segmenting a paragraph
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Tue, 2 Oct 2007 09:36:58 +0100
When you need to apply regex matching to text that crosses node boundaries, in the past two approaches have been proposed:

(a) create a string in which the node boundaries are represented by some recognizable textual markup (you could use saxon:serialize()), then apply the regex processing, then reinstate the node structure (e.g. by using saxon:parse()).

(b) do a deep copy, while processing each of the text nodes to replace the significant features (such as end of sentence) by nodes (e.g. an <end-of-sentence/> element). Then apply positional grouping techniques to transform this into your target structure.

Neither is particularly easy, I'm afraid.

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: Christian Wittern [mailto:cwittern@xxxxxxxxx]
> Sent: 02 October 2007 09:05
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] segmenting a paragraph
>
> Dear XSL-list readers,
>
> In trying to solve the following problem I am seeking your help:
> I want to segment paragraphs in a text, so that sentences are
> enclosed in a <s> element and within the sentences, words
> between interpunction are within <seg> elements.
>
> So far, I have been capturing the content of <p> in a string
> and then using two nested <xsl:analyze-string> blocks with
> regexes, which work nicely and do what I want. Now I
> discovered that there are <note> elements with additional
> markup in some paragraphs, which get lost in this process.
> However, I really want to leave these notes alone, as they are. So:
>
> <p>Some text. Some more text, with a comma. <note>This
> stuff, how boring</note></p>
>
> should look like:
>
> <p><s><seg>Some text.</seg></s><s><seg>Some more
> text,</seg><seg> with a comma.</seg></s><note>This stuff, how
> boring</note></p>
>
> I wonder how I tell the processor to leave the note stuff alone?
>
> Any help appreciated,
>
> Christian
>
> --
> Christian Wittern
> Institute for Research in Humanities, Kyoto University
> 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN
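
For illustration, approach (b) from the reply might be sketched in XSLT 2.0 roughly as follows. This is only a sketch under stated assumptions, not a worked-out solution from the thread: the sentence regex is a naive guess (it will split on abbreviations such as "e.g."), the mode name "mark" is made up for the example, and only the <s> level is handled. The <seg> level could presumably be added the same way with a second marker element.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity copy for everything not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p">
    <!-- Pass 1: deep copy of the content, with an <end-of-sentence/>
         marker inserted after each sentence found in a text node -->
    <xsl:variable name="marked">
      <xsl:apply-templates select="node()" mode="mark"/>
    </xsl:variable>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Pass 2: positional grouping on the markers -->
      <xsl:for-each-group select="$marked/node()"
                          group-ending-with="end-of-sentence">
        <xsl:variable name="content"
                      select="current-group()[not(self::end-of-sentence)]"/>
        <xsl:choose>
          <!-- A group that contains real text becomes a sentence -->
          <xsl:when test="$content[self::text()][normalize-space()]">
            <s><xsl:copy-of select="$content"/></s>
          </xsl:when>
          <!-- Otherwise (e.g. a trailing <note>) copy the elements
               through unwrapped; whitespace-only text is dropped -->
          <xsl:otherwise>
            <xsl:copy-of select="$content[self::*]"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Naive assumption: a sentence ends at a run of . ! or ?
       Whitespace following the punctuation is discarded. -->
  <xsl:template match="text()" mode="mark">
    <xsl:analyze-string select="." regex="([^.!?]*[.!?]+)\s*">
      <xsl:matching-substring>
        <xsl:value-of select="regex-group(1)"/>
        <end-of-sentence/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Elements such as <note> are copied untouched, so their text
       is never seen by the regex -->
  <xsl:template match="*|comment()|processing-instruction()" mode="mark">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

With the sample paragraph from the original question, the note falls into a group of its own after the last marker and so ends up outside any <s>. A note that sits in the middle of a sentence would stay inside that sentence's <s>, which may or may not be what is wanted.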