Subject: Re: [xsl] XSLT function for title capitalization?
From: "Liam R. E. Quin liam@xxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 10 Apr 2018 06:19:32 -0000
On Mon, 2018-04-09 at 20:52 +0000, David Sewell dsewell@xxxxxxxxxxxx
wrote:
> Wondering if anyone has a serviceable function (preferably in XSLT
> 2/3 but v1 is 
> fine if it works) that takes a string as input and returns it with
> title 
> capitalization according to English-language editorial practice (for
> example, 
> Chicago Manual of Style). 

I'd use replace() probably, rather than tokenizing, so as to change as
little as possible & facilitate regression tests.

Some test cases should include
* words that do and don't change at the start and at the end of input;
* words like o'clock and don't that include apostrophes, both as '
  and as ’ (it doesn't matter whether they are input as entities
  or literally or numeric character references though, as they all
  end up the same after XML parsing)
* hyphenated proper names like Rees-Mogg
* exceptions like Ladies-in-Waiting
* punctuation such as em dashes, quotes, commas, semicolons

Unfortunately XSLT doesn't give us Perl's wonderful /e modifier on
substitution, and neither does XQuery (where it'd be more useful), but
XSLT does give us xsl:analyze-string. I'd start with David Carlisle's
approach and add a lot of test cases and fix the regexp to be something
more like
   (\w)(\w*(?:'\w+)?)
maybe.
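
A minimal sketch of the xsl:analyze-string idea, assuming XSLT 2.0. The
function name f:title-case, its namespace URI and the stop list are made
up for illustration, and this is not David Carlisle's code. XPath 2.0
regular expressions have no (?:...) non-capturing groups (XPath 3.0
added them), so the sketch uses a plain capturing group instead; it also
accepts U+2019 as an apostrophe.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:title-case"
    exclude-result-prefixes="xs f">

  <!-- Illustrative stop list only; take the real one from your style guide. -->
  <xsl:variable name="small-words" as="xs:string*"
      select="('a', 'an', 'and', 'at', 'by', 'for', 'in', 'of', 'on', 'or', 'the', 'to')"/>

  <!-- f:title-case is a hypothetical name, not a standard function. -->
  <xsl:function name="f:title-case" as="xs:string">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="pieces" as="xs:string*">
      <!-- A word is a run of \w characters, optionally followed by one
           apostrophe (straight or U+2019) and more \w characters. -->
      <xsl:analyze-string select="$input" regex="(\w)(\w*(['&#x2019;]\w+)?)">
        <xsl:matching-substring>
          <xsl:choose>
            <!-- Words and gaps alternate, so the first word sits at
                 position 1 or 2 and the last at last() or last() - 1.
                 Small words are lower-cased only strictly in between. -->
            <xsl:when test="lower-case(.) = $small-words
                            and position() gt 2
                            and position() lt last() - 1">
              <xsl:sequence select="lower-case(.)"/>
            </xsl:when>
            <xsl:otherwise>
              <!-- Upper-case the first letter, leave the rest untouched. -->
              <xsl:sequence select="concat(upper-case(regex-group(1)), regex-group(2))"/>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- Spaces and punctuation pass through unchanged. -->
          <xsl:sequence select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="string-join($pieces, '')"/>
  </xsl:function>

</xsl:stylesheet>
```

With this sketch f:title-case('the lady of the lake') should give
"The Lady of the Lake". Each part of a hyphenated compound is treated as
a separate word, and there is no override table yet, so the test cases
above still matter.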

An alternative is to replace (\w)'(\w) with $1E$2 everywhere, where E
is some Unicode upper-case letter or sequence of letters that
definitely doesn't occur in your input, and change it back at the end.
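
A sketch of that placeholder trick, again assuming XSLT 2.0. The
function names and namespace are hypothetical, and U+01A9 (LATIN CAPITAL
LETTER ESH) is only an assumed stand-in for E; check that it really
never occurs in your data. An upper-case letter still counts as \w, so
the word is not split, and upper-case() leaves it alone.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:title-case"
    exclude-result-prefixes="xs f">

  <!-- Kept in variables to avoid quoting trouble inside XPath strings. -->
  <xsl:variable name="apos" as="xs:string">'</xsl:variable>
  <!-- Assumed placeholder letter; must not occur in the input. -->
  <xsl:variable name="marker" as="xs:string" select="'&#x1A9;'"/>

  <!-- Replace an apostrophe between two word characters with the marker. -->
  <xsl:function name="f:hide-apostrophes" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="replace($s,
                                  concat('(\w)', $apos, '(\w)'),
                                  concat('$1', $marker, '$2'))"/>
  </xsl:function>

  <!-- Turn every marker back into an apostrophe at the end. -->
  <xsl:function name="f:restore-apostrophes" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="translate($s, $marker, $apos)"/>
  </xsl:function>

</xsl:stylesheet>
```

Usage would be something like
f:restore-apostrophes(my:capitalize(f:hide-apostrophes($title))), where
my:capitalize stands for whatever capitalization step you already have.
If that step lower-cases anything, the marker gets lower-cased too, so
restore it before doing so or pick a caseless letter instead.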

In XSLT 1 I'd cry for a while and then write something recursive that
split its input using translate() and substring-before() to find where
to split.
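
Roughly what that recursion could look like in XSLT 1.0; a sketch under
assumptions, not tested production code. The template name and the
separator list are made up, case mapping via translate() covers only
ASCII letters, and there is no stop list. translate() maps every
separator to "|" without changing string length, so substring-before()
on the mapped copy gives the length of the next word in the original.

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <!-- Assumed separator set: space, hyphen, em dash, common punctuation,
       straight and curly double quotes. Apostrophes are left out on
       purpose so that don't and o'clock stay single words. -->
  <xsl:variable name="seps"
      select="' -&#x2014;,;:.!?()&quot;&#x201C;&#x201D;'"/>
  <!-- Needs at least as many bars as there are separators;
       extra characters are ignored by translate(). -->
  <xsl:variable name="bars" select="'||||||||||||||||||||'"/>

  <xsl:template name="title-case">
    <xsl:param name="text"/>
    <xsl:if test="$text != ''">
      <!-- Same length as $text, with every separator turned into '|';
           the trailing '|' guarantees substring-before() finds one. -->
      <xsl:variable name="flat"
          select="concat(translate($text, $seps, $bars), '|')"/>
      <xsl:variable name="n"
          select="string-length(substring-before($flat, '|'))"/>
      <xsl:choose>
        <xsl:when test="$n = 0">
          <!-- Leading separator: copy it and move on. -->
          <xsl:value-of select="substring($text, 1, 1)"/>
          <xsl:call-template name="title-case">
            <xsl:with-param name="text" select="substring($text, 2)"/>
          </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
          <!-- Upper-case the first letter, copy the rest of the word. -->
          <xsl:value-of
              select="translate(substring($text, 1, 1), $lower, $upper)"/>
          <xsl:value-of select="substring($text, 2, $n - 1)"/>
          <xsl:call-template name="title-case">
            <xsl:with-param name="text" select="substring($text, $n + 1)"/>
          </xsl:call-template>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

It recurses once per word and once per separator and re-translates the
remaining text each time, which is fine for titles but not something to
run over whole chapters.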

For https://words.fromoldbooks.org/Chalmers-Biography/ I use Perl, with
a table of manual overrides, as the input isn't well-formed XML at
first; there are fewer than 10,000 entries, I think. Once it's in XML,
my script/Makefile for conversion does use XSLT, taking 46 seconds to
process 43 MB of XML into 9771 separate XML files with Saxon.

Liam


-- 
Liam Quin, W3C, http://www.w3.org/People/Quin/
Staff contact for Verifiable Claims WG, SVG WG, XQuery WG
Improving Web Advertising: https://www.w3.org/community/web-adv/
Personal: awesome vintage art: http://www.fromoldbooks.org/
