Re: [jats-list] checking URIs

Subject: Re: [jats-list] checking URIs
From: Bruce Rosenblum <bruce@xxxxxxxxx>
Date: Thu, 14 Jun 2012 15:34:41 -0400
Kevin,

With respect to part b), you'll want to check the well-formedness of each URI as a starting point. If you're looking at this from a preservation perspective, you're probably receiving content that was tagged by others, and we've certainly seen cases of bad URIs, often an artifact of a PDF-to-XML conversion process (e.g., URIs containing incorrect special characters such as en dashes instead of hyphens).
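
A rough sketch of such a first-pass check follows, in Python. It is a character-level screen only, not a full RFC 3986 grammar validation, and the name uri_problems is hypothetical. It flags characters outside the RFC 3986 reserved and unreserved sets (which would catch an en dash), malformed percent-escapes, and anything the standard library parser rejects outright.

```python
import re
import string
from urllib.parse import urlsplit

# Characters RFC 3986 permits somewhere in a URI reference:
# unreserved + reserved, plus "%" as the percent-escape introducer.
ALLOWED_CHARS = frozenset(
    string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
)
# A "%" that is not followed by exactly two hex digits.
BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def uri_problems(uri):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper: a coarse screen, not a complete RFC 3986 validator.
    """
    if not uri:
        return ["empty URI"]
    problems = []
    bad = sorted({ch for ch in uri if ch not in ALLOWED_CHARS})
    if bad:
        codes = ", ".join(f"U+{ord(ch):04X}" for ch in bad)
        problems.append(f"disallowed characters: {codes}")
    if BAD_ESCAPE.search(uri):
        problems.append("malformed percent-escape")
    try:
        urlsplit(uri)  # raises ValueError for e.g. unbalanced IPv6 brackets
    except ValueError as exc:
        problems.append(f"unparseable: {exc}")
    return problems
```

Relative URIs such as graphics/fig1.tif pass this screen, since it only looks at characters and escapes; whether they point at something real is a separate question.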

Once you've checked for well-formedness, you can then check whether the URIs resolve. We've done this, and the work is non-trivial; you'll probably see at least 10% link rot. You'll also see the same URI come and go, sometimes on a daily basis, as we see in our automated testing, so a single check is not 100% reliable, but it's a good start. You will also see that some URIs are redirected, sometimes with a different result every single time you query, because the web site is tracking you.
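
A minimal sketch of a single resolution attempt follows, again in Python and using only the standard library. It is not the checker we run; the function name check_resolves and the outcome labels are made up for illustration. It records whether the final URL differs from the one requested, so redirects can be reviewed separately, and the opener parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def check_resolves(uri, opener=urllib.request.urlopen, timeout=10):
    """Make one fetch attempt and return (outcome, status, final_url).

    Hypothetical helper. outcome is one of 'ok', 'redirected',
    'http-error', or 'unreachable'.
    """
    try:
        # Request() itself raises ValueError for relative URIs, which
        # must be resolved against a base before they can be fetched.
        request = urllib.request.Request(uri)
        with opener(request, timeout=timeout) as response:
            final_url = response.geturl()
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404.
        return ("http-error", exc.code, uri)
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, or unfetchable URI.
        return ("unreachable", None, uri)
    if final_url != uri:
        return ("redirected", status, final_url)
    return ("ok", status, final_url)
```

Because the same URI can come and go, a single 'unreachable' or 'http-error' result is better treated as a reason to retry on another day than as proof of link rot.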

Bruce

At 05:30 PM 6/13/2012, Kevin Hawkins wrote:
I am interested in checking all URIs in JATS files. Well, not all URIs. I am happy to ignore a URI that might for some reason be mentioned in the prose of the article, but I'm interested in any actionable URIs, even relative URIs giving the location of, say, a graphic.

Given that scope ...

a) Is it true that all such URIs are found in @xlink:href? Are there any other attributes whose value is a URI? Or possibly an element whose content is supposed to be a URI?

b) Do people have suggestions on whether it's better to test that they resolve (with a link checker), to test for well-formedness as a URI, or both? That is, does anyone know of URIs that resolve using some software but which aren't valid according to RFC 3986? If so, for preservation purposes, I think I would want to catch these.
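
As a starting point for gathering the URIs in scope, here is a small Python sketch. It assumes the attribute of interest is href in the XLink namespace, it does not cover any other attributes or elements that may carry URIs, and the function name collect_hrefs is hypothetical. It also assumes the file parses without external entity resolution.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced attributes as "{namespace-uri}localname".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def collect_hrefs(xml_text):
    """Return (element tag, href value) pairs in document order.

    Hypothetical helper: looks only at xlink:href attributes.
    """
    root = ET.fromstring(xml_text)
    return [
        (element.tag, element.attrib[XLINK_HREF])
        for element in root.iter()
        if XLINK_HREF in element.attrib
    ]
```

Keeping the element name alongside each value makes it possible to treat, say, a graphic location differently from an external link.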

--Kevin

Bruce D. Rosenblum
Inera Inc.
19 Flett Road
Belmont, MA 02478
617-932-1932 (office)
bruce@xxxxxxxxx
