Re: [jats-list] checking URIs

Subject: Re: [jats-list] checking URIs
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Wed, 20 Jun 2012 15:56:53 -0400

Hi Kevin,

On 6/14/2012 3:34 PM, Bruce Rosenblum wrote:
With respect to part b), you'll want to check the well-formedness of
a URI as a starting point. If you're looking at this from a
preservation perspective, you're probably getting content in that was
tagged by others, and we've certainly seen cases of bad URIs, which
are often an artifact of a PDF-to-XML conversion process (e.g. URIs
with incorrect special characters such as ndashes instead of
hyphens).

Note that checking the syntax of URIs can be done straightforwardly in
XSLT 2.0, XQuery, or Schematron with the xslt2 query binding. As
long as you are confident that your XSLT/XQuery platform implements the relevant RFC faithfully, you can test whether the string value is castable as xs:anyURI.
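
For anyone working outside an XSLT/XQuery toolchain, here is a minimal sketch of a comparable check in Python. It is a character-level test only, not a full RFC 3986 grammar, and it is not equivalent to casting to xs:anyURI. The function names looks_like_uri and offending_chars are hypothetical, chosen for illustration.

```python
import re

# Characters that may appear literally in a URI per RFC 3986:
# unreserved and reserved (gen-delims plus sub-delims).
_CHAR = r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]"
_CHAR_RE = re.compile(_CHAR)
# A whole value: literal characters or well-formed percent-escapes.
_URI_RE = re.compile("(?:" + _CHAR + "|%[0-9A-Fa-f]{2})*")


def looks_like_uri(value):
    """Rough check: non-empty, only URI characters, valid %XX escapes.

    This does not verify the overall structure (scheme, authority, path).
    """
    return bool(value) and _URI_RE.fullmatch(value) is not None


def offending_chars(value):
    """Return the characters that can never appear literally in a URI."""
    return sorted({ch for ch in value if ch != "%" and not _CHAR_RE.fullmatch(ch)})
```

Calling offending_chars on a value in which a conversion step replaced a hyphen with an en dash returns that en dash, which is handy for reporting why the value was rejected.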


Where there are other constraints on URIs, such as a ban on relative URIs, these can generally be checked too (depending on the details, of course). But as for "URIs that work but are not correct syntactically" -- I think you have a can of worms there, largely due to the softness (context-dependency) of the definition of "work".
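
As one illustration of such a constraint, a rule against relative URIs could be sketched in Python as below. This assumes that "absolute" simply means "has a scheme"; the helper names and the default list of allowed schemes are hypothetical and would depend on local policy.

```python
from urllib.parse import urlsplit


def is_absolute(value):
    """True if the value carries a scheme (http:, mailto:, doi:, ...)."""
    return bool(urlsplit(value).scheme)


def has_allowed_scheme(value, allowed=("http", "https")):
    """True if the value's scheme is in the allowed set.

    urlsplit lowercases the scheme, so the comparison is case-insensitive.
    """
    return urlsplit(value).scheme in allowed
```

Note that a network-path reference such as //example.org/a has no scheme, so it counts as relative under this definition.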

Once you've checked for well-formedness, you can then check to see if
they resolve. We've done this and the work is non-trivial, and
you'll probably see at least 10% link-rot. You'll also see the same
URI come and go on a sometimes-daily basis, as we see in our
automated testing, so it's not 100% reliable, but it's a good start.
You also will see some URIs are redirected, and sometimes even
redirected with a different result every single time you query
because you're being tracked by the web site.

Indeed. This is yet a third and harder layer of "correctness". The grey area turns out to be hard to map, to say nothing of getting rid of it.
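
To make the resolution layer concrete, here is a minimal sketch in Python using only the standard library. It handles HTTP(S) only. The outcome labels ("ok", "redirected", "broken", "unreachable") and the function names are assumptions for illustration, not anything Bruce described. Because the same URI can come and go from day to day, the sketch also combines outcomes from repeated runs instead of trusting a single probe.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Probe a URL; return (HTTP status or None, final URL after redirects)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code, url
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or an unusable URL.
        return None, url


def classify(url, status, final_url):
    """Turn one probe result into a coarse outcome label."""
    if status is None:
        return "unreachable"
    if status >= 400:
        return "broken"
    if final_url != url:
        return "redirected"
    return "ok"


def summarize(observations):
    """Combine outcome labels gathered for one URL over several runs."""
    kinds = set(observations)
    if not kinds:
        return "unchecked"
    if kinds <= {"ok", "redirected"}:
        return "resolves"
    if kinds & {"ok", "redirected"}:
        return "intermittent"
    return "dead"
```

Some servers reject HEAD requests, so a real checker would likely need to fall back to GET. And a "redirected" outcome says nothing about whether the target is the intended one, which is exactly the grey area described above.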

Finally, Kevin writes:
My thinking at this point is that if the source document happens to
include a URI (or a mangled URI) as character data, this is not worth
checking for validity.  In character data, I can't tell apart a URI
which for some reason isn't clickable and a fragment of a URI not
meant to be clickable but which is included simply to illustrate a
technical point.  But I can assume that if @xlink:href is present,
the URI is meant to be actionable.

I don't think that's an unfair assumption. Yet @xlink:href is allowed in many places in JATS on the simple grounds that someone, somewhere, needs it. Since it's permitted, you can't rule out that it will be deployed provisionally or as a matter of policy without ever being put to use and tested "under load" (for example, by projects that think they need it but don't, or that may need it but never get around to using it).


I guess this makes knowing which ones actually do work all the more interesting and useful.
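
Following Kevin's distinction, collecting only the @xlink:href values, and leaving URI-like strings in character data alone, might be sketched in Python as below. The sample fragment and the function name actionable_uris are made up for illustration. The XLink namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced attribute names as "{namespace-uri}local".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical fragment: one tagged link, one bare URI in character data.
SAMPLE = """<p xmlns:xlink="http://www.w3.org/1999/xlink">See
<ext-link ext-link-type="uri" xlink:href="http://example.org/data">the data</ext-link>
or type http://example.org/raw into a browser.</p>"""


def actionable_uris(xml_text):
    """Return (element name, href) pairs for every @xlink:href found."""
    root = ET.fromstring(xml_text)
    return [
        (element.tag, element.get(XLINK_HREF))
        for element in root.iter()
        if element.get(XLINK_HREF) is not None
    ]
```

Run on SAMPLE, this reports the ext-link target and ignores the bare URI in the running text. The resulting list is what would then be checked for syntax and resolution.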

Cheers,
Wendell

--
======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
