RE: [xsl] How to parse text into words, phrases, clauses, sentences, and paragraphs

Subject: RE: [xsl] How to parse text into words, phrases, clauses, sentences, and paragraphs
From: mark bordelon <markcbordelon@xxxxxxxxx>
Date: Thu, 7 Jun 2007 07:04:41 -0700 (PDT)
--- Michael Kay <mike@xxxxxxxxxxxx> wrote:
> You don't really make it clear where you are having
> difficulty. There seem
> to be four separate problems here:

Mike, thanks for helping me break this down. This
is definitely something I can and want to do myself;
I just need some initial hints.

> (a) translating your concepts, such as "words" and
> "sentences" into precise
> specifications
> (b) translating these specifications into regular
> expressions

Got these. 
E.g. the specification for "word" could be [^ '-]*
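
To illustrate how a "word" specification can be used once it is written as a regular expression, here is a minimal XSLT 2.0 sketch built on xsl:analyze-string. The pattern is an assumption chosen for the example, not the one above: it treats a word as a run of ASCII letters that may contain internal apostrophes or hyphens, so "unravish'd" and "foster-child" each count as one word. The input element name `text` and the output element names are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml"/>

  <!-- Assumed input: <text>THOU still unravish'd bride of quietness, ...</text> -->
  <xsl:template match="text">
    <words>
      <!-- Assumed word pattern: letters, optionally joined by ' or - -->
      <xsl:analyze-string select="." regex="[A-Za-z]+(['\-][A-Za-z]+)*">
        <xsl:matching-substring>
          <!-- Each match becomes a word element -->
          <word><xsl:value-of select="."/></word>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- Whitespace and external punctuation are copied unchanged -->
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </words>
  </xsl:template>

</xsl:stylesheet>
```

Because the regex attribute is an attribute value template, any curly braces in a pattern (for example in \p{L}) would have to be doubled; the ASCII character class sidesteps that here.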

> 
> (c) using these regular expressions within a
> stylesheet, for example as an
> argument to the tokenize() function or the
> xsl:analyze-string instruction.
> 

This is my first problem: how to apply a template
match using the tokenize() function, and in which order
to apply it (from paragraph -> word or word ->
paragraph).
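
One possible way to approach this is to work from the outside in: split the text into paragraphs first, then split each paragraph into sentences, and so on down to words. The following is only a sketch under stated assumptions. It assumes XSLT 2.0, a hypothetical input element `text` holding the raw text, and hypothetical output element names; the phrase and clause levels are left out for brevity but would nest in the same way.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumed input: <text>...raw text with line breaks...</text> -->
  <xsl:template match="text">
    <doc>
      <!-- Paragraphs: split on line breaks, skipping blank pieces -->
      <xsl:for-each select="tokenize(., '\n')[normalize-space()]">
        <para>
          <!-- Sentences: split on . ? or ! (tokenize() discards the delimiter) -->
          <xsl:for-each select="tokenize(., '[.?!]')[normalize-space()]">
            <sent>
              <!-- Words: split the normalized sentence on single spaces -->
              <xsl:for-each select="tokenize(normalize-space(.), ' ')">
                <word><xsl:value-of select="."/></word>
              </xsl:for-each>
            </sent>
          </xsl:for-each>
        </para>
      </xsl:for-each>
    </doc>
  </xsl:template>

</xsl:stylesheet>
```

Note that tokenize() throws away the separators, so the punctuation marks that end each sentence are lost in this sketch. If they need to be kept, xsl:analyze-string is probably the better tool.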

> (d) doing the output numbering.

I haven't a clue how this would be done, either way.
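
One way the numbering might be handled is as a second pass over the tree that the tokenizing step has built, using xsl:number. The sketch below assumes XSLT 2.0, a hypothetical input element `text`, and only two levels (sentences and phrases). It writes both numbering styles side by side: `id` restarts within each parent, and `seq` (a hypothetical attribute name) runs continuously per element name.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="text">
    <!-- Pass 1: build an unnumbered temporary tree -->
    <xsl:variable name="tree">
      <para>
        <xsl:for-each select="tokenize(., '[.?!]')[normalize-space()]">
          <sent>
            <xsl:for-each select="tokenize(., ',')[normalize-space()]">
              <phrase><xsl:value-of select="normalize-space(.)"/></phrase>
            </xsl:for-each>
          </sent>
        </xsl:for-each>
      </para>
    </xsl:variable>
    <!-- Pass 2: copy the tree, adding the numbers -->
    <xsl:apply-templates select="$tree" mode="number"/>
  </xsl:template>

  <xsl:template match="*" mode="number">
    <xsl:copy>
      <!-- Restarts at 1 within each parent (counts same-named siblings) -->
      <xsl:attribute name="id"><xsl:number/></xsl:attribute>
      <!-- Runs continuously through the whole tree for this element name -->
      <xsl:attribute name="seq"><xsl:number level="any"/></xsl:attribute>
      <xsl:apply-templates mode="number"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

If only the restarting style is needed, position() inside each xsl:for-each would give the same numbers in a single pass.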

> 
> The fourth problem seems quite unrelated to the
> others. Of the other three,
> I'm reluctant to launch into answering without
> knowing which of the three
> steps you need help with. (Generally I think most
> people answering on this
> list adopt the approach of trying to help you solve
> your problem, rather
> than doing the work for you.)

After some initial hints, I would be able to
do the rest of the work myself.

> 
> Incidentally, regular expressions are an XSLT 2.0
> feature so I assume you're
> looking for XSLT 2.0 solutions.
> 

That is an issue. Is there any way to do this without
regular expressions?
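
For what it is worth, a common regex-free technique in XSLT 1.0 is a recursive named template built on substring-before() and substring-after(). The sketch below is a minimal illustration, not a full answer: it splits only on single spaces, the input element `text` and the template name `split-words` are hypothetical, and a counter parameter is passed along to show one way of numbering the pieces.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumed input: <text>THOU still unravish'd bride of quietness</text> -->
  <xsl:template match="text">
    <sent>
      <xsl:call-template name="split-words">
        <!-- normalize-space() collapses runs of whitespace to single spaces -->
        <xsl:with-param name="str" select="normalize-space(.)"/>
      </xsl:call-template>
    </sent>
  </xsl:template>

  <!-- Emits one word element per space-separated piece, numbered from 1 -->
  <xsl:template name="split-words">
    <xsl:param name="str"/>
    <xsl:param name="n" select="1"/>
    <xsl:choose>
      <xsl:when test="contains($str, ' ')">
        <word id="{$n}"><xsl:value-of select="substring-before($str, ' ')"/></word>
        <!-- Recurse on the remainder with the counter incremented -->
        <xsl:call-template name="split-words">
          <xsl:with-param name="str" select="substring-after($str, ' ')"/>
          <xsl:with-param name="n" select="$n + 1"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:when test="$str != ''">
        <!-- Last piece: no more spaces left -->
        <word id="{$n}"><xsl:value-of select="$str"/></word>
      </xsl:when>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

The same pattern can be repeated with other delimiters (comma, semicolon, period) for the higher levels, although handling several alternative delimiters at once gets clumsy without regular expressions, and very long inputs may run into recursion depth limits on some processors.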


> Michael Kay
> http://www.saxonica.com/
> 
> > -----Original Message-----
> > From: mark bordelon
> [mailto:markcbordelon@xxxxxxxxx] 
> > Sent: 06 June 2007 22:52
> > To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> > Subject: [xsl] How to parse text into words,
> phrases, 
> > clauses, sentences, and paragraphs
> > 
> > Hey XML gurus,
> > 
> > Still somewhat new to XML/XSL and need some help
> getting 
> > started on how to use regular expressions and
> tokens in 
> > English text to transform it into an XML document
> marked up for:
> > 
> > 1. words (delimited by WS, excluding any external
> >    punctuation, but allowing internal punctuation)
> > 2. phrases (delimited by the comma)
> > 3. clauses (delimited by colon or semicolon)
> > 4. sentences (delimited by the period, question mark,
> >    or exclamation mark)
> > 5. paragraphs (delimited by a line break)
> > 
> > Also ideal would be to assign sequenced id's to
> every tag, 
> > either in a running consecutive style from
> beginning to end, 
> > or repeating from 1 for every level of nesting. 
> > 
> > In more concrete terms,
> > 
> > To transform this text:
> > 
> > THOU still unravish'd bride of quietness,  Thou
> foster-child 
> > of Silence and slow Time, Sylvan historian, who
> canst thus 
> > express  A flowery tale more sweetly than our
> rhyme:
> > What leaf-fringed legend haunts about thy shape  Of
> deities or 
> > mortals, or of both,  In Tempe or the dales of
> Arcady?
> >  What men or gods are these? What maidens loth?
> > What mad pursuit? What struggle to escape?
> >  What pipes and timbrels? What wild ecstasy?
> > 
> > into this XML: (using indexing that renumbers for
> each
> > sub-group)
> > 
> > <para id="1">
> >  <sent id="1">
> >   <clause id="1">
> >    <phrase id="1">THOU still unravish'd bride of quietness,</phrase>
> >    <phrase id="2">Thou foster-child of Silence and slow Time,</phrase>
> >    <phrase id="3">Sylvan historian,</phrase>
> >    <phrase id="4"> who canst thus express A flowery tale more sweetly than our rhyme</phrase>:
> >   </clause>
> >   <clause id="2">
> >    <phrase id="1">What leaf-fringed legend haunts about thy shape Of deities or mortals,</phrase>
> >    <phrase id="2"> or of both,</phrase>
> >    <phrase id="3"> In Tempe or the dales of Arcady?</phrase>
> >   </clause>
> >  </sent>
> >  <sent id="2">What men or gods are these?</sent>
> >  <sent id="3">What maidens loth?</sent>
> >  <sent id="4">What mad pursuit?</sent>
> >  <sent id="5">What struggle to escape?</sent>
> >  <sent id="6">What pipes and timbrels?</sent>
> >  <sent id="7">What wild ecstasy?</sent>
> > </para>
> > 
> > 
> > or into this XML: (using indexing that is
> continuous per tag)
> > 
> > <para id="1">
> >  <sent id="1">
> >   <clause id="1">
> >    <phrase id="1">THOU still unravish'd bride of quietness,</phrase>
> >    <phrase id="2">Thou foster-child of Silence and slow Time,</phrase>
> >    <phrase id="3">Sylvan historian,</phrase>
> >    <phrase id="4"> who canst thus express A flowery tale more sweetly than our rhyme</phrase>:
> >   </clause>
> >   <clause id="2">
> >    <phrase id="5">What leaf-fringed legend haunts about thy shape Of deities or mortals,</phrase>
> >    <phrase id="6"> or of both,</phrase>
> >    <phrase id="7"> In Tempe or the dales of Arcady?</phrase>
> >   </clause>
> >  </sent>
> >  <sent id="2">What men or gods are these?</sent>
> >  <sent id="3">What maidens loth?</sent>
> >  <sent id="4">What mad pursuit?</sent>
> >  <sent id="5">What struggle to escape?</sent>
> >  <sent id="6">What pipes and timbrels?</sent>
> >  <sent id="7">What wild ecstasy?</sent>
> > </para>
> > 
> > Surely this has been done before. I have searched
> through 
> > archives and have not found anything, probably
> since I am 
> > searching using the wrong terminology.
> > 
> > Would really appreciate the help as it would give
> me insight 
> > into using regular expressions and sequencing in
> XSL.
> > 
> > Thanks in advance
> > 
> > Mark Bordelon
