Subject: Re: [jats-list] checking URIs
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Wed, 20 Jun 2012 15:56:53 -0400
With respect to part b), you'll want to check the well-formedness of each URI as a starting point. If you're looking at this from a preservation perspective, you're probably receiving content that was tagged by others, and we've certainly seen cases of bad URIs, which are often an artifact of a PDF-to-XML conversion process (e.g. URIs with incorrect special characters such as en dashes instead of hyphens).
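A minimal sketch of such a well-formedness pass, in Python, might look like the following. The function name `uri_problems` and the particular set of checks are assumptions made for illustration, not anything described in this thread; the character set is the one RFC 3986 allows in a URI, and the check is deliberately stricter than what browsers tolerate.

```python
import re
import string
from urllib.parse import urlsplit

# Characters RFC 3986 allows in a URI: unreserved, reserved, and "%".
ALLOWED_CHARS = set(string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%")

# A "%" that is not followed by two hexadecimal digits.
BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def uri_problems(uri):
    """Return a list of well-formedness problems; an empty list means none found.

    Hypothetical helper: it checks syntax only, not whether the URI resolves.
    """
    problems = []

    # Catches conversion artifacts such as an en dash where a hyphen belongs.
    bad = sorted(set(uri) - ALLOWED_CHARS)
    if bad:
        problems.append("disallowed characters: " + " ".join(repr(c) for c in bad))

    if BAD_PERCENT.search(uri):
        problems.append("malformed percent-escape")

    try:
        parts = urlsplit(uri)
    except ValueError as err:
        problems.append("unparseable: " + str(err))
        return problems

    if not parts.scheme:
        problems.append("missing scheme")
    elif parts.scheme in ("http", "https") and not parts.netloc:
        problems.append("missing host")

    return problems
```

For example, `uri_problems("http://example.org/pages/10\u201320")` reports the en dash as a disallowed character, while a bare `www.example.org/index.html` is reported as missing its scheme.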
Once you've checked for well-formedness, you can then check whether the URIs resolve. We've done this and the work is non-trivial, and you'll probably see at least 10% link rot. You'll also see the same URI come and go on a sometimes-daily basis, as we see in our automated testing, so a single check is not 100% reliable, but it's a good start. You will also see that some URIs are redirected, and sometimes even redirected with a different result every single time you query, because you're being tracked by the web site.
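Because a single request says little about a URI that comes and goes, one way to approach this is to record several observations over time and classify them afterwards. The sketch below is an assumption about how that could be done with the Python standard library, not the tooling referred to above; `check_once`, `summarize`, and the verdict labels are hypothetical names. Note that some servers reject `HEAD` requests, so a production checker would likely fall back to `GET`.

```python
import urllib.error
import urllib.request


def check_once(uri, timeout=10.0):
    """Request uri once, following redirects.

    Returns (status, final_url). Status is None when no HTTP response
    was obtained (DNS failure, timeout, unsupported scheme, and so on).
    """
    try:
        request = urllib.request.Request(uri, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code, None
    except (urllib.error.URLError, OSError, ValueError):
        return None, None


def summarize(observations):
    """Classify repeated (status, final_url) observations of one URI."""
    if not observations:
        return {"verdict": "unchecked", "redirect_varies": False}

    successes = [s is not None and 200 <= s < 400 for s, _ in observations]
    if all(successes):
        verdict = "ok"
    elif not any(successes):
        verdict = "dead"
    else:
        verdict = "intermittent"  # the URI came and went between checks

    # More than one distinct landing URL suggests session or tracking redirects.
    finals = {final for _, final in observations if final is not None}
    return {"verdict": verdict, "redirect_varies": len(finals) > 1}
```

A scheduled job could append the result of `check_once(uri)` to a per-URI list each day and report only those URIs whose `summarize` verdict is `dead` across the whole window.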
Indeed. This is yet a third and harder layer of "correctness". The grey area turns out to be hard to map, to say nothing of getting rid of it.
My thinking at this point is that if the source document happens to include a URI (or a mangled URI) as character data, it is not worth checking for validity. In character data, I can't distinguish a URI that for some reason isn't clickable from a fragment of a URI that was never meant to be clickable and is included simply to illustrate a technical point. But I can assume that if @xlink:href is present, the URI is meant to be actionable.
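As a rough sketch of that rule, the following Python collects only the values of @xlink:href and ignores anything that merely looks like a URI in character data. The function name `actionable_uris` and the sample document are made up for illustration; the namespace URI is the standard XLink one.

```python
import xml.etree.ElementTree as ET

# ElementTree spells a namespaced attribute as "{namespace-uri}local-name".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def actionable_uris(xml_text):
    """Return (element name, URI) pairs for every element carrying @xlink:href."""
    root = ET.fromstring(xml_text)
    return [
        (element.tag, element.get(XLINK_HREF))
        for element in root.iter()
        if element.get(XLINK_HREF) is not None
    ]


# Made-up sample: one tagged link, one URI-like string in plain text.
SAMPLE = """\
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <p>See <ext-link ext-link-type="uri"
      xlink:href="http://www.mulberrytech.com">Mulberry</ext-link>;
      the string example.org/foo in running text is not checked.</p>
  </body>
</article>
"""

print(actionable_uris(SAMPLE))
```

Only the `ext-link` element is reported; the untagged string in the paragraph text never enters the list of URIs to validate.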
Cheers, Wendell
--
======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD 20850                                  Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================