Subject: Re: [xsl] XSLT match with regex what's the best current solution?
From: Gunther Schadow <gunther@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 14 Jan 2002 22:11:34 -0500
as you can read in the regular expression thread http://www.biglist.com/lists/xsl-list/archives/200201/msg00488.html and further on, we are working on a tool which might be helpful for your purposes.
It is some mixture between regexes and an XSLT-like language, and we have called it regexslt. ...
Steven, thanks for the pointer to the thread. I knew there was something. But I'm really looking for something that is XSLT (an extension) rather than just 'like' XSLT.
Of course I would not use regexes to match XML/HTML tags, but I would want to use regexes in template match tests.
My model is really AWK, and it's interesting how similar the basic approach of XSLT is to AWK (forget about global variables and the sequential flow of AWK rules for a moment).
Some heading, with subphrases:
 An item without a bullet.
 Name = value pair.
 Property: value.
 Score = 7 (a = 1, b =3, c = 4).
 A full sentence that has so many words that it spans
 multiple lines.
 Sometimes we can't even trust whether people get the
indention consistent.
<entry>
  <heading>
    Some heading
    <subheading>with subheading:</subheading>
  </heading>
  <item>
    <heading>An item without a bullet.</heading>
    <pair name='name' value='value pair.'/>
    <pair name='property' value='value.'/>
    <pair name='Score' value='7'>
      <pair name='a' value='1'/>
      <pair name='b' value='3'/>
      <pair name='c' value='4'/>
    </pair>
    <sentence>A full sentence that has so many words that it spans multiple lines.</sentence>
    <sentence>Sometimes we can't even trust whether people get the indention consistent.</sentence>
  </item>
</entry>
In AWK I would match line by line (or by whatever RS is set to) to find the colons and the indentation. I guess I could prime the XSLT process by using sed to wrap each record in an element, like this:
<rec>Some heading, with subphrases:</rec>
<rec> An item without a bullet.</rec>
<rec> Name = value pair.</rec>
<rec> Property: value.</rec>
<rec> Score = 7 (a = 1, b =3, c = 4).</rec>
<rec> A full sentence that has so many words that it spans </rec>
<rec> multiple lines.</rec>
<rec> Sometimes we can't even trust whether people get the</rec>
<rec>indention consistent.</rec>
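The record-wrapping step could be sketched as follows, in Python rather than sed so that the markup characters in the text get escaped along the way. This is only an illustration under that assumption; the function name wrap_records is made up for the example.

```python
from xml.sax.saxutils import escape


def wrap_records(text: str) -> str:
    """Wrap every line of `text` in a <rec> element.

    Leading whitespace is kept, because the indentation carries structure.
    Characters such as '<' and '&' are escaped so the result stays well-formed.
    """
    return "\n".join(f"<rec>{escape(line)}</rec>" for line in text.splitlines())


if __name__ == "__main__":
    print(wrap_records("Some heading, with subphrases:\n Name = value pair."))
```

The output still needs a single root element around the <rec> lines before an XSLT processor will accept it.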
<xsl:template match="rec[regex:test(text(),'^\(.+\), \(.+\):$')]">
  <entry>
    <heading>
      <xsl:value-of select='$1'/>
      <subheading><xsl:value-of select='$2'/></subheading>
    </heading>
    <xsl:apply-templates/>
  </entry>
</xsl:template>
O.K., that wouldn't work, because apply-templates could not reach beyond the first line to pull the following records into the content of the entry element that I just synthesized. So I guess maybe I'm on the completely wrong track now.
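The grouping that the template cannot express is easy to state sequentially, AWK-style: a heading line opens an entry, and every later line is attached to the most recently opened entry. A minimal sketch in Python, assuming the heading pattern from the template above; the name group_under_headings and the dictionary layout are made up for illustration.

```python
import re

# Same idea as the match test in the template: "<heading>, <subheading>:"
HEADING = re.compile(r"^(.+), (.+):$")


def group_under_headings(lines):
    """Attach each non-heading line to the most recent heading line."""
    entries = []
    for line in lines:
        m = HEADING.match(line)
        if m:
            # A heading line starts a new entry.
            entries.append({"heading": m.group(1),
                            "subheading": m.group(2),
                            "body": []})
        elif entries:
            # Any other line belongs to the entry opened last.
            entries[-1]["body"].append(line)
        # Lines before the first heading are silently dropped in this sketch.
    return entries
```

In XSLT 1.0 the same effect needs sibling grouping (for example via following-sibling tests or keys), which is exactly the state that a sequential AWK rule keeps for free.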
Maybe the initial <rec> elements put me on the wrong track. If you have read this far, you might see what I'm doing wrong. Maybe I should just stick with AWK, or maybe I should do some call-out from XSLT to AWK.
What I want to do is incremental structure induction, i.e., the first run might only find the entries (e.g., blocks separated by blank lines), the next run would find the items, the next run would find the pairs, the next would find the pairs inside the parentheses of a pair's value, etc.
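To make the passes concrete, here is a rough sketch in Python (standing in for AWK or an XSLT extension). It assumes that entries are separated by blank lines, that the first line of an entry is its heading, and that a pair looks like "name = value" or "name: value". All function names and the dictionary layout are hypothetical, and continuation lines are not joined.

```python
import re

# "name = value" or "name: value"; the name may not contain '=', ':' or parentheses.
PAIR = re.compile(r"^\s*(?P<name>[^=:()]+?)\s*[=:]\s*(?P<value>.+?)\s*$")
# "value (inner pairs)" with an optional trailing period.
NESTED = re.compile(r"^(?P<value>[^()]*?)\s*\((?P<inner>.*)\)\.?$")


def split_entries(text):
    """Pass 1: entries are blocks separated by one or more blank lines."""
    return [b for b in re.split(r"\n\s*\n", text.strip()) if b]


def parse_pair(line):
    """Passes 3 and 4: recognise a pair, then the pairs inside its parentheses."""
    m = PAIR.match(line)
    if not m:
        return None
    pair = {"name": m.group("name"), "value": m.group("value"), "pairs": []}
    n = NESTED.match(pair["value"])
    if n:
        pair["value"] = n.group("value")
        parts = (parse_pair(p) for p in n.group("inner").split(","))
        pair["pairs"] = [p for p in parts if p]
    return pair


def induce(text):
    """Run the passes in order: entries, then items, then pairs."""
    entries = []
    for block in split_entries(text):
        heading, *lines = block.splitlines()
        # Pass 2: every remaining line is an item; non-pairs become sentences.
        items = [parse_pair(l) or {"sentence": l.strip()} for l in lines]
        entries.append({"heading": heading.strip(), "items": items})
    return entries
```

Each pass only looks at the text left over by the previous one, so a new rule can be added without touching the earlier passes.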
So, the AWK-callouts would work on the text nodes of certain elements (beginning with the full file, then with the entries, items, pairs, etc.)
any more ideas appreciated, -Gunther
PS: I looked at this OmniMark thing, and I'm a bit turned away by how different it is from anything that we know from sed, awk, perl, lex, yacc, etc.
-- 
Gunther Schadow, M.D., Ph.D.                    gschadow@xxxxxxxxxxxxxxx
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org