Re: [xsl] XSLT match with regex what's the best current solution?

Subject: Re: [xsl] XSLT match with regex what's the best current solution?
From: Gunther Schadow <gunther@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 14 Jan 2002 22:11:34 -0500
Steven Noels wrote:

as you can read in the regular expression thread
http://www.biglist.com/lists/xsl-list/archives/200201/msg00488.html and
further on, we are working on a tool which might be helpful for your
purposes.

It is some mixture between regexes and an XSLT-like language, and we
have called it regexslt. ...


Steven, thanks for the pointer to that thread. I knew there was something.
But I'm really looking for something that is XSLT (or an XSLT extension)
rather than just 'like' XSLT.

Of course I would not use regexes to match XML/HTML tags, but
I would want to use regexes in template match patterns.

My model is really AWK, and it's interesting how similar the basic
approach of XSLT is to AWK (forget about global variables and the
sequential flow of AWK rules for a moment).

A common example would be:

Some heading, with subphrases:
  An item without a bullet.
    Name = value pair.
    Property: value.
    Score = 7 (a = 1, b =3, c = 4).
    A full sentence that has so many words that it spans
        multiple lines.
    Sometimes we can't even trust whether people get the
indention consistent.

This should be marked up as

<entry>
  <heading>
     Some heading
     <subheading>with subphrases:</subheading>
  </heading>
  <item>
    <heading>An item without a bullet.</heading>
       <pair name='name' value='value pair.'/>
       <pair name='property' value='value.'/>
       <pair name='Score' value='7'>
           <pair name='a' value='1'/>
           <pair name='b' value='3'/>
           <pair name='c' value='4'/>
       </pair>
       <sentence>A full sentence that has so many words that it spans
        multiple lines.</sentence>
       <sentence>Sometimes we can't even trust whether people get the
indention consistent.</sentence>
  </item>
</entry>

In AWK I would match line by line (or record by record, whatever RS
is set to) to find the colons and the indentation. I guess I could
prime the XSLT process by using sed to wrap each record in an element:

<rec>Some heading, with subphrases:</rec>
<rec>  An item without a bullet.</rec>
<rec>    Name = value pair.</rec>
<rec>    Property: value.</rec>
<rec>    Score = 7 (a = 1, b =3, c = 4).</rec>
<rec>    A full sentence that has so many words that it spans </rec>
<rec>        multiple lines.</rec>
<rec>    Sometimes we can't even trust whether people get the</rec>
<rec>indention consistent.</rec>
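
The priming step is simple enough to sketch. The following is a minimal illustration in Python rather than sed; the function name wrap_records and the default tag name are assumptions made for this example. It wraps each input line in a <rec> element and escapes XML special characters so the result stays well-formed.

```python
from xml.sax.saxutils import escape


def wrap_records(text, tag="rec"):
    """Wrap every line of text in <tag>...</tag>, escaping &, < and >."""
    return "\n".join(
        f"<{tag}>{escape(line)}</{tag}>" for line in text.splitlines()
    )


sample = (
    "Some heading, with subphrases:\n"
    "  An item without a bullet.\n"
    "    Score = 7 (a = 1, b =3, c = 4)."
)
print(wrap_records(sample))
```

Leading whitespace is kept inside each record, since the indentation is one of the things the later rules need to look at.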

And then I could use templates just like AWK rules:

<xsl:template match="rec[regex:test(text(),'^\(.+\), \(.+\):$')]">
  <entry>
     <heading>
        <xsl:value-of select='$1'/>
        <subheading><xsl:value-of select='$2'/></subheading>
     </heading>
     <xsl:apply-templates/>
  </entry>
</xsl:template>

O.K., that wouldn't work, because apply-templates could not reach
beyond the first line to pull the following records into the content
of the entry element that I just synthesized. So I guess maybe I'm on
completely the wrong track now.
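
To make the intended rule concrete, here is a rough Python sketch of the same AWK-style rule: one regex with two capture groups, applied to a single record. The regex:test() call and the $1/$2 references in the template above are wished-for syntax, not an existing XSLT feature; the names HEADING and heading_rule below are likewise made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as in the template, written with unescaped groups.
HEADING = re.compile(r"^(.+), (.+):$")


def heading_rule(record):
    """Return an <entry> element for 'heading, subheading:' or None."""
    m = HEADING.match(record)
    if m is None:
        return None
    entry = ET.Element("entry")
    heading = ET.SubElement(entry, "heading")
    heading.text = m.group(1)
    ET.SubElement(heading, "subheading").text = m.group(2)
    return entry


entry = heading_rule("Some heading, with subphrases:")
print(ET.tostring(entry, encoding="unicode"))
```

The sketch has the same limitation as the template: the rule sees one record only, so it cannot gather the following lines into the entry it creates.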

Maybe the initial <rec> elements put me on the wrong track. But
if you have read this far you might see what I'm doing wrong.
Maybe I should just stick with AWK, or maybe I should do some
kind of call-out from XSLT to AWK.

What I want to do is incremental structure induction, i.e. the
first run might only find the entries (e.g., blocks separated
by blank lines), the next run would find the items, the next
run would find the pairs, the next would find the pairs inside
the parentheses of a pair's value, etc.

So, the AWK-callouts would work on the text nodes of certain
elements (beginning with the full file, then with the entries,
items, pairs, etc.)
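
As a rough sketch of what such incremental passes could look like, here is a Python version with three of them: entries, then pairs, then pairs nested in parentheses. All function names and both regexes are assumptions made for this illustration. It simplifies in several ways: it skips the item pass, accepts only single-word names, drops the trailing period from values, and appends pairs after the remaining text, so the original line order is not preserved.

```python
import re
import xml.etree.ElementTree as ET

PAIR = re.compile(r"^\s*(\w+)\s*[=:]\s*(.*?)\.?\s*$")
NESTED = re.compile(r"^(.*?)\s*\((.*)\)$")


def find_entries(text):
    """Pass 1: one <entry> per block of lines separated by blank lines."""
    root = ET.Element("entries")
    for block in re.split(r"\n\s*\n", text.strip()):
        ET.SubElement(root, "entry").text = block
    return root


def find_pairs(entry):
    """Pass 2: move 'name = value' and 'name: value' lines into <pair>."""
    rest = []
    for line in (entry.text or "").splitlines():
        m = PAIR.match(line)
        if m:
            ET.SubElement(entry, "pair", name=m.group(1), value=m.group(2))
        else:
            rest.append(line)
    entry.text = "\n".join(rest)


def find_nested_pairs(pair):
    """Pass 3: split 'v (a = 1, b = 2)' into value v plus nested <pair>s."""
    m = NESTED.match(pair.get("value"))
    if m:
        pair.set("value", m.group(1))
        for part in m.group(2).split(","):
            inner = PAIR.match(part)
            if inner:
                ET.SubElement(pair, "pair",
                              name=inner.group(1), value=inner.group(2))


sample = (
    "Some heading, with subphrases:\n"
    "  An item without a bullet.\n"
    "    Name = value pair.\n"
    "    Score = 7 (a = 1, b =3, c = 4).\n"
)
root = find_entries(sample)
for entry in list(root.iter("entry")):   # snapshot: the tree is modified
    find_pairs(entry)
for pair in list(root.iter("pair")):
    find_nested_pairs(pair)
print(ET.tostring(root, encoding="unicode"))
```

Each pass only looks at the text or attributes of the elements found by the previous one, which is the same shape as running call-outs on the text nodes of entries, then items, then pairs.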

any more ideas appreciated,
-Gunther


PS: I looked at this OmniMark thing, and I'm a bit put off by how
different it is from anything we know from sed, awk, perl, lex,
yacc, etc.

--
Gunther Schadow, M.D., Ph.D.                    gschadow@xxxxxxxxxxxxxxx
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org



XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list

