Re: [xsl] The Oxford Comma - A Gift Worth Atleast 5 Cents

Subject: Re: [xsl] The Oxford Comma - A Gift Worth Atleast 5 Cents
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Fri, 20 Jun 2008 16:46:53 -0400

As it relates to XSL, the topic of natural language processing is in scope. In the previous messages in this thread, it appeared to me that the discussion had quickly and decisively moved away from XSLT towards more philosophical questions -- which themselves weren't actually being engaged with much of the sort of discipline that might warrant the hope that they would come back to XSLT. (I admit we digress sometimes, but we really shouldn't, and mostly when we do it's with at least an eye on why we're here.)

At 01:43 PM 6/20/2008, you wrote:
Thank you for ensuring that we stay in scope. I started this with an almost caricatural remark on human languages at the end of my initial post. What I had in mind, though it was not clear at the time, was that we have to process human languages a lot. I did not check, but it is possible that most of the processing issues that come through this list are related to language processing, and the situation does not seem likely to improve soon. There are also many documents, in many languages, and we need to process them all of the time, more and more.

I think it's necessary to make a distinction between processing natural language inputs and generating natural language.

For example,

A:

<item>Cisco IP phone end user training</item>
<item>Cisco attendant console operator training</item>
<item>Cisco call center agent training</item>

B:

"Cisco IP phone end user training, Cisco attendant console operator training, and Cisco call center agent training"

It makes a big difference whether A is to be transformed to B, or B is to be transformed to A.

B to A is indeed the concern of an entire subdiscipline of computational linguistics.

A to B is tractable in XSLT, and remains so (although it becomes more complex) when working with arbitrary sets of items. But depending on the requirements, one might have to adjust not only for different numbers of items, but for anomalous inputs of various kinds. (For example, one might want to defend against duplicate items in the set. Or, if one avoids the Oxford comma in generating English, does one restore it when given a final or penultimate item containing the word "and"?)
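To make the A-to-B direction concrete, here is a minimal sketch of the joining logic, written in Python only to show the decisions involved; it is not the stylesheet discussed in this thread. The function name join_items and its parameters are hypothetical, and the rule for restoring the serial comma (a whole-word "and" in either of the last two items) is just one assumed answer to the question raised above.

```python
def join_items(items, oxford=True, dedupe=True):
    """Join phrases into an English series such as "a, b, and c".

    Hypothetical helper: the name, parameters, and the "and" heuristic
    are assumptions made for illustration.
    """
    # Normalize internal whitespace and drop empty entries.
    cleaned = [" ".join(s.split()) for s in items]
    cleaned = [s for s in cleaned if s]
    if dedupe:
        # dict.fromkeys keeps the first occurrence of each item, in order.
        cleaned = list(dict.fromkeys(cleaned))
    n = len(cleaned)
    if n == 0:
        return ""
    if n == 1:
        return cleaned[0]
    if n == 2:
        return f"{cleaned[0]} and {cleaned[1]}"
    # Assumed rule: even when the serial comma is switched off, restore it
    # if the final or penultimate item itself contains the word "and".
    has_and = any(" and " in f" {s.lower()} " for s in cleaned[-2:])
    separator = ", and " if (oxford or has_and) else " and "
    return ", ".join(cleaned[:-1]) + separator + cleaned[-1]


if __name__ == "__main__":
    print(join_items([
        "Cisco IP phone end user training",
        "Cisco attendant console operator training",
        "Cisco call center agent training",
    ]))
    print(join_items(["red", "black and white", "blue"], oxford=False))
```

The branches on the number of items are the part that carries over to any language. In XSLT 2.0 the comma-separated part would most likely be a call to string-join() over all but the last item. The deduplication step and the "and" check are requirement-dependent and could be dropped or replaced.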

XSLT2 is possibly our best tool so far.

For A to B, probably (though I note that your rewrite might as well be XQuery). For B to A, probably not outside simple cases (though without looking into it, I couldn't actually tell you what the state-of-the-art NLP parsers are using). Someone could be ready to prove me wrong, which would be very cool.

On this track, we are primarily looking at processing the Oxford comma, in English, with XSLT. Here, by law, we have to support and process at least both English and French, in both directions, on input and output, and the rules are different in each case, so I may be a bit sensitive on the subject. I am sure that Ronnie spent serious time working out his stylesheet, and I only tried to optimize it further, for English, in XSLT2. I am not sure whether I succeeded or, better, how we can further improve on this case, but I did spend some time on it, and I am frightened by all that is left to do, wondering how the members of this list (e.g., in the EU) cope with so many documents and languages.

Probably with libraries, but we have list members with better information on this topic.

Do we have the tools that we need for the job at hand? Do not worry, though: I tame my fears, and that is why I also try to optimize the logic and processing for the Oxford comma. I hope that the solution is good, and better still, I hope that others can optimize it further, so that we can some day settle this logic and move beyond the Oxford comma. What do you think?

On balance, I think that even if we etched the solution in stone, we'd continue to get questions about it from time to time. There are some questions that are far more common; just recently the thread recurred on how to parse raw markup in XSLT. And even when questions repeat, sometimes answers change.

But fortunately for everyone, this isn't up to me. If the task of publishing a general solution is worth the overhead of doing it, you or someone else can put it on the web, and maybe in time (maybe soon) it will get as many hits as, say, Jeni Tennison's page explaining Muenchian grouping. There is actually a FAQ where Dave Pawson has included many such nuggets.

OT on the general question: sometimes structured data maps into natural language quite readily. More often, not. But that's what prose is for, and prose plus typography when prose alone doesn't suffice. (Or, if spoken language is what we need, we can add a whiteboard and hand gestures. Or promise to send email.) The number of ways natural language can be refactored for greater clarity or elegance of expression can't be counted, but there isn't a machine that can do any of it. Writing a good prose paragraph requires judgement, which is more than any algorithm. It is harder than chess.


Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.      
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
