Re: [xsl] The Oxford Comma - A Gift Worth Atleast 5 Cents
Subject: Re: [xsl] The Oxford Comma - A Gift Worth Atleast 5 Cents
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Fri, 20 Jun 2008 16:46:53 -0400
As it relates to XSL, the topic of natural language processing is in
scope. In the previous messages in this thread, it appeared to me
that the discussion had quickly and decisively moved away from XSLT
towards more philosophical questions -- which themselves weren't
actually being engaged with much of the sort of discipline that might
warrant the hope that they would come back to XSLT. (I admit we
digress sometimes, but we really shouldn't, and mostly when we do
it's with at least an eye on why we're here.)
At 01:43 PM 6/20/2008, you wrote:
Thank you for ensuring that we keep in scope. I instigated this
with an almost caricaturistic remark on human languages, at the end
of my initial post. What I had in mind, though it was not clear at
the time, was that we have to process human languages a lot. I did
not check, but it is possible that most of the processing and issues
that come through this list are language-processing related, and it
does not seem like it is going to get any better soon. There are also
many documents, in many languages, and we need to process them, all
of the time, more and more.
I think it's necessary to make a distinction between processing
natural language inputs and generating natural language.
A:
<item>Cisco IP phone end user training</item>
<item>Cisco attendant console operator training</item>
<item>Cisco call center agent training</item>
B:
"Cisco IP phone end user training, Cisco attendant console operator
training, and Cisco call center agent training"
It makes a big difference whether A is to be transformed to B, or B
is to be transformed to A.
B to A is indeed the concern of an entire subdiscipline of
A to B is tractable in XSLT, and remains so (although it becomes more
complex) when working with arbitrary sets of items. But depending on
the requirements, one might have to adjust not only for different
numbers of items, but for anomalous inputs of various kinds. (For
example, one might want to defend against duplicate items in the set.
Or, if one avoids the Oxford comma in generating English, does one
restore it when given a final or penultimate item containing the word "and"?)
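To make the A-to-B direction concrete, here is a minimal sketch, not a definitive solution. It assumes the items sit inside a wrapper element named list (a hypothetical name; the example above shows only the item elements), that duplicates should be dropped by exact string match while keeping first-occurrence order, and that the Oxford comma is always wanted for three or more items. It does not attempt the harder case of items that themselves contain "and".

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- "list" is an assumed wrapper element, not taken from the thread -->
  <xsl:template match="list">
    <!-- keep only the first occurrence of each item, in document order -->
    <xsl:variable name="items"
        select="item[not(. = preceding-sibling::item)]"/>
    <xsl:variable name="n" select="count($items)"/>
    <xsl:for-each select="$items">
      <xsl:value-of select="."/>
      <xsl:choose>
        <!-- last item: no separator -->
        <xsl:when test="position() = $n"/>
        <!-- exactly two items: "X and Y", no comma -->
        <xsl:when test="$n = 2"> and </xsl:when>
        <!-- penultimate of three or more: the Oxford comma -->
        <xsl:when test="position() = $n - 1">, and </xsl:when>
        <xsl:otherwise>, </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the three items in A wrapped in a list element, this should produce the string in B. Nothing here depends on XSLT 2.0 features, so the same templates should also run under a 1.0 processor; other languages would need different separator rules rather than just different literals.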
XSLT2 is possibly our best tool so far.
For A to B, probably (though I note that your rewrite might as well
be XQuery). For B to A, probably not outside simple cases (though
without looking into it, I couldn't actually tell you what the
state-of-the-art NLP parsers are using). Someone could be ready to
prove me wrong, which would be very cool.
On this track, we are primarily looking at processing the
Oxford comma, in English, with XSLT. As here, by law, we have to
support and process at least both English and French, back and
forth too, on input and output, and the rules are different for
each case, I may be a bit sensitive on the subject. I am sure that
Ronnie took some serious time to resolve the stylesheet like he did
and I only tried to optimize it further, for English in XSLT2. I
am not sure if I succeeded or, better, how we can further improve on
this case, but I am sure that I spent some time on it and I am
frightened by all that is left to do, wondering how the members of
this list cope (e.g. in the EU) with so many documents and languages.
Probably with libraries, but we have list members with better
information on this topic.
Do we have the tools that we need for the job at hand? Yet, do
not worry, I tame my fears, and that is why I also try to optimize
the logic and processing for the Oxford comma, hoping that the
solution is good. Better still, and especially if it is, I hope that
others can optimize it further so that we can someday settle this
logic and move beyond the Oxford comma. What do you think?
On balance, I think that even if we etched the solution in stone,
we'd continue to get questions about it from time to time. There are
some questions that are far more common; just recently the thread
recurred on how to parse raw markup in XSLT. And even when questions
repeat, sometimes answers change.
But fortunately for everyone, this isn't up to me. If the task of
publishing a general solution is worth the overhead of doing it, you
or someone else can put it on the web, and maybe in time (maybe soon)
it will get as many hits as, say, Jeni Tennison's page explaining
Muenchian grouping. There is actually a FAQ where Dave Pawson has
included many such nuggets.
OT on the general question: sometimes structured data maps into
natural language quite readily. More often, not. But that's what
prose is for, and prose plus typography when prose alone doesn't
suffice. (Or, if spoken language is what we need, we can add a
whiteboard and hand gestures. Or promise to send email.) The number
of ways natural language can be refactored for greater clarity or
elegance of expression can't be counted, but there isn't a machine
that can do any of it. Writing a good prose paragraph requires
judgement, which is more than any algorithm. It is harder than chess.
Wendell Piez mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
Mulberry Technologies: A Consultancy Specializing in SGML and XML