RE: [xsl] Parsing plain text - xml application specifying parser

Subject: RE: [xsl] Parsing plain text - xml application specifying parser
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Sun, 11 Sep 2005 19:59:07 +0100
There are plenty of people working on tools for marking up natural language
text in XML, and plenty of tools available: for an example, see
http://www.ltg.ed.ac.uk/software/pos/index.html. XSLT might well be used
within such tools, but not in a central role. Apart from anything else, the
grammar of natural languages is much richer than anything that regular
expressions can handle.
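
To make the point concrete, here is a minimal sketch (not a parser) of what
XSLT 2.0 regex handling can do with the plain.txt from the original post. It
assumes an XSLT 2.0 processor. The parameter name text-uri, the initial
template name main, and the sentence wrapper element are hypothetical names
chosen for illustration; the list of entity words is copied from the example
specification. It reads the file with unparsed-text(), splits it at full
stops, and wraps known entity words in entity elements:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical parameter: URI of the plain text input. A relative
       URI is resolved against the base URI of this stylesheet. -->
  <xsl:param name="text-uri" as="xs:string" select="'plain.txt'"/>

  <!-- Entity words taken from the example specification. -->
  <xsl:variable name="entity-names" as="xs:string*"
      select="('cow', 'herd', 'farm', 'herd-owner', 'farm-owner')"/>

  <xsl:output method="xml" indent="no"/>

  <!-- Hypothetical entry point: there is no XML source document, so the
       transformation is started at this named template. -->
  <xsl:template name="main">
    <entities>
      <!-- unparsed-text() returns the file as a string; document()
           would try to parse it as XML. -->
      <xsl:variable name="text" as="xs:string"
          select="normalize-space(unparsed-text($text-uri))"/>
      <!-- Split at each full stop and skip the empty pieces. -->
      <xsl:for-each select="tokenize($text, '\.\s*')[normalize-space(.)]">
        <sentence>
          <!-- Visit every run of word characters and hyphens. -->
          <xsl:analyze-string select="." regex="[\w\-]+">
            <xsl:matching-substring>
              <xsl:choose>
                <!-- Tag the word only if it is a known entity name. -->
                <xsl:when test=". = $entity-names">
                  <entity><xsl:value-of select="."/></entity>
                </xsl:when>
                <xsl:otherwise>
                  <xsl:value-of select="."/>
                </xsl:otherwise>
              </xsl:choose>
            </xsl:matching-substring>
            <!-- Spaces and punctuation are copied through unchanged. -->
            <xsl:non-matching-substring>
              <xsl:value-of select="."/>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
          <!-- Restore the full stop consumed by tokenize(). -->
          <xsl:text>.</xsl:text>
        </sentence>
      </xsl:for-each>
    </entities>
  </xsl:template>

</xsl:stylesheet>
```

A sketch like this tags individual words, but it cannot tell an identifier
listing from a descriptor listing, or decide from context whether "name" is
an attribute or an entity. That step needs a grammar, not a regular
expression.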

I'd suggest you do some reading on natural language analysis if you are
interested in this area. As far as I can tell, the use of XML for marking up
linguistic texts is very widespread, though many of the biggest projects
actually predate XML. There are people on this list who know far more about
it than I do.

Michael Kay
http://www.saxonica.com/

 

> -----Original Message-----
> From: Noah Scales [mailto:noahjscales@xxxxxxxxx] 
> Sent: 11 September 2005 08:23
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] Parsing plain text - xml application specifying parser
> 
> Hi.
> 
> Is it feasible to use an XML application to specify a parser that, when
> translated into XSLT 2.0, turns plain text into xml
> according to the specification? Is something like this
> expected for XSL 3.0, skipping the use of a separate
> XML application?
> 
> My searches on Google and through this list's archives
> didn't provide me any information on this approach. My
> next step is to hack at it myself, but my knowledge of
> how parsers work is minimal. If XSLT can mimic a
> parser, though, this might work as a two-step process:
> 
> parser_specification.xml + parser_application.xsl ->
> parser.xsl 
> 
> parser.xsl + plain.txt -> fully_parsed.xml
> 
> Maybe the plain.txt is accessed through XSLT's
> document() function applied to a parameter passed to
> the parser.xsl file when it's processed. 
> 
> This idea was sparked for me by:
> - reading an online article (that I've lost) that
> discusses how an xml file preserves the parse tree
> using its tags
> - Michael Kay's writing (in his XSLT 2.0 Programmer's
> Reference) about analyzing plain text for hidden
> structure using XSLT 2.0 regex.
> 
> It seems like a natural fit to me that XSLT could do
> this directly, turning plain text into XML without
> difficulty. I wouldn't be surprised if this approach
> (or something much better) is slated for a later XSLT
> release.
> 
> In case it helps explain what I mean, below is an
> artificial example parser_specification.xml file that
> transforms an input plain.txt file into a
> fully_parsed.xml file. I'm just a student, not a
> programming expert. If the example is raw or just
> plain awful, sorry.
> 
> Anyway, I'll appreciate any information that anyone
> can provide.
> 
> -Noah
> -------------------------------------------------------
> 
> 
> -----------parser_specification.xml--------------------
> <?xml version="1.0"?>
> <specification ignore-white-space="yes">
> 
> <first-rule name="entities">
> <either_or><rule name="identifier_listing"
> /><or/><rule name="descriptor_listing" /></either_or>
> </first-rule>
> 
> <rule name="identifier_listing">Each <rule
> name="entity" /> is identified by <optional>the
> combination of</optional><rule name="descriptors"
> /><optional> and <rule name="descriptors"
> /></optional>
> </rule>
> 
> <rule name="descriptor_listing">About each <rule
> name="entity" />, we can remember <rule
> name="descriptors" count="1+" /><optional> and <rule
> name="descriptors" /></optional>
> </rule>
> 
> <rule name="descriptors" tag-output="no">
> its <rule name="descriptor" count="1"
> tag-output="yes"/><either_or>,<or/>.</either_or>
> </rule>
> 
> <rule name="descriptor">
> <either_or><rule name="entity" or-preference="1"
> /><or/><rule name="attribute" or-preference="2"
> /></either_or>
> </rule>
> 
> <rule name="entity"><either_or>cow<or/>herd<or/>farm<or/>herd-owner<or/>farm-owner</either_or>
> </rule>
> 
> <rule name="attribute"><regex value="\w[[:alnum:]]*\w"
> />
> </rule>
> 
> </specification>
> -------------------------------------------------------
> 
> 
> ----------------------plain.txt------------------------
> About each cow, we can remember its name, its breed,
> its weight, and its herd.
> Each cow is identified by the combination of its name,
> and its herd.
> About each herd, we can remember its name, its
> herd-owner, and its farm.
> Each herd is identified by the combination of its
> name, and its farm.
> About each farm we can remember its farm-owner, its
> name, and...
> .
> .
> .
> -------------------------------------------------------
> 
> 
> ------------------fully_parsed.xml---------------------
> <?xml version="1.0"?>
> <entities>
> 
> <descriptor_listing>About each <entity>cow</entity>,
> we can remember its <attribute>name</attribute>, its
> <attribute>breed</attribute>, its
> <attribute>weight</attribute>, and its
> <entity>herd</entity>.
> </descriptor_listing>
> 
> <identifier_listing>Each <entity>cow</entity> is
> identified by the combination of its
> <attribute>name</attribute>, and its
> <entity>herd</entity>.</identifier_listing>
> 
> <descriptor_listing>About each <entity>herd</entity>,
> we can remember its <attribute>name</attribute>, its
> <entity>herd-owner</entity>, and its
> <entity>farm</entity>.</descriptor_listing>
> 
> <identifier_listing>Each <entity>herd</entity> is
> identified by the combination of its
> <attribute>name</attribute>, and its <entity>farm
> </entity>.</identifier_listing>
> 
> <descriptor_listing>About each <entity>farm</entity>
> we can remember its <entity>farm-owner</entity>, its
> <attribute>name</attribute>, and 
> .
> .
> .
> </entities>
> -------------------------------------------------------
> 
> 
