Re: [jats-list] Why is archiving JATS with a DOI not common?

Subject: Re: [jats-list] Why is archiving JATS with a DOI not common?
From: "Alexander Schwarzman aschwarzman@xxxxxxxxx" <jats-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 1 May 2022 02:40:23 -0000

Hi Castedo,

Let me try to clarify confusing terminology that surrounds the notion of
DOI and also comment on the difference between the Accepted Manuscript and
the Published Article.

The dirty little secret is that DOI -- which, in principle, stands for
Digital Object Identifier -- does not identify any digital objects (sigh).
Instead, it identifies a set of objects. A DOI does not resolve to any
particular representation of the content; instead, a DOI resolves to
something called "a response page" or "a landing page", which is a web page
that contains links to *some* manifestations of the content in question
(more on that later).

A little historical digression: in the early days of electronic publishing,
there were efforts to establish an ID that would identify the content,
rather than its manifestations. There were also efforts to assign different
IDs to different manifestations, e.g., to the PDF, HTML, and EPUB
manifestations, so that different digital objects would get different IDs.
Both efforts, unfortunately, have failed, and now the so-called "Digital
Object Identifier" identifies a whole bunch of digital objects, e.g., an
XML source, its PDF, HTML, and EPUB manifestations, as well as a
pre-published manuscript, whose content is different from the final
published article.

Now, when a DOI is assigned, there are many Registration Agencies
<https://www.doi.org/registration_agencies.html> (RAs) that can register
the DOI. You can follow the links at the site to find out more. Crossref is
just one of many RAs.

So, when you ask about resolving a DOI to the XML format of the content in
question, this is not, technically speaking, correct: a DOI resolves to a
response page, which may or may not provide a link to the XML format of the
content you are interested in.
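
To make this concrete, here is a minimal Python sketch, not any publisher's
or Registration Agency's API. It only builds the resolver URL for a DOI and
then scans landing-page HTML (which you would have to fetch yourself) for
links that look like XML. The names doi_to_url and find_xml_links are
hypothetical, and the rule "the href ends in .xml or the type attribute is
an XML media type" is purely my assumption; real landing pages vary and may
expose no XML link at all.

```python
from html.parser import HTMLParser
from urllib.parse import quote, urljoin, urlparse

# Common ways a DOI is written; all are reduced to the bare DOI name.
_PREFIXES = ("https://doi.org/", "http://doi.org/",
             "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ resolver URL for a DOI (hypothetical helper)."""
    doi = doi.strip()
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return "https://doi.org/" + quote(doi, safe="/")


class _XmlLinkFinder(HTMLParser):
    """Collect <a>/<link> targets that look like XML (assumed heuristic)."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.xml_links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        media_type = (attr.get("type") or "").lower()
        looks_like_xml = (
            urlparse(href).path.lower().endswith(".xml")
            or media_type in ("application/xml", "text/xml")
        )
        if looks_like_xml:
            # Landing pages often use relative links; make them absolute.
            self.xml_links.append(urljoin(self.base_url, href))


def find_xml_links(landing_page_html: str, base_url: str) -> list:
    """Return absolute URLs of XML-looking links found on a landing page."""
    finder = _XmlLinkFinder(base_url)
    finder.feed(landing_page_html)
    return finder.xml_links


print(doi_to_url("doi:10.1016/j.tpb.2018.03.006"))
# -> https://doi.org/10.1016/j.tpb.2018.03.006
```

An empty list from find_xml_links is the common case: the DOI still
resolves, but the response page simply offers no XML.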

The logical question then is to ask: Why don't all publishers provide the
XML source of their content?
Some publishers do (e.g., PLOS, The Royal Society), while most don't. You
see, articles are published under different licences: some are proprietary;
and others, even though allowing free sharing, such as CC BY-ND 4.0, allow
no derivatives. If all XML were made publicly available, then it would be
very easy for unscrupulous actors to create a different manifestation of
that XML and publish it, to compile a collection of existing articles, etc.
A publisher, especially a not-for-profit or a small one, or a publisher
that has subscription journals, simply doesn't have the wherewithal or
financial resources to police that kind of nefarious activity, not to
mention to engage in expensive lawsuits, especially if the culprit is
located outside the jurisdiction of the publisher's country. And thus, the
XML is not always available. For example, I don't believe you can get the
XML of your article https://doi.org/10.1016/j.ic.2006.10.007, which you
published with Elsevier in 2007, because it is under the Elsevier user
license.

You've also asked about the difference between the Accepted Manuscript (AM)
(a.k.a. "ahead-of-print manuscript", "author manuscript", etc.) and the
final published article. There are differences between the version that was
peer-reviewed and scientifically accepted and the final version (Version of
Record, VoR). I'll refer you to the list of 102 Things Journal Publishers Do
<https://scholarlykitchen.sspnet.org/2018/02/06/focusing-value-102-things-journal-publishers-2018-update/>
for the complete list; some of the things worth mentioning in the context
of highlighting the differences between the AM and VoR are 34.
Copy-editing, proofreading, and styling; 35. Language and substantive
editing; 37. Art handling; 39. Layout and composition; 41. XML generation
and DTD migration; 44. Tagging; 45. DOI registration; 57. Depositing
content and data; and 60. Hosting and archiving; to mention just a few. The
publishers add value to the peer-reviewed content, and that is why, in your
example, Elsevier requests $25 for the final version of the article.

Returning to what I alluded to at the beginning of my message, the same
DOI, unfortunately, refers not only to the various manifestations of the
Version of Record, but also to the Accepted Manuscript, whose content is
different from the VoR. In my opinion, this is a bloody mess, but this is
what the educated consumer should be aware of ("buyer beware"). If it is
any consolation, at least a preprint (which may or may not become a journal
article) has a different DOI.

Finally, DOI is not the only identifier out there. PubMed ID (PMID), for
example, is a different identifier.

If this is clear as mud, I'm sorry.

--Sasha
Alexander ('Sasha') Schwarzman
Content Technology Architect
tel: +1.202.416.1979
aschwarzman@xxxxxxxxxx







On Sat, Apr 30, 2022 at 5:06 PM Castedo Ellerman castedo@xxxxxxxxxxx <
jats-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> On 4/29/22 07:26, David Haber wrote:
> > After reading your question a few times more, are you asking why the
> > specific XML component or format of a given article does not have its
> > own unique DOI?
>
> More or less, yes, that is what I was wondering, thank you.
>
> > So, perhaps a publisher HTML version would have a DOI, maybe the PDF
> > would have a DOI, perhaps an ePub would have a DOI, and maybe the XML?
> > And all these dois would be unique?
> >
> > If that is your question, then the reason is that the article is the
> > unit of measure in scholarly publishing, and those other versions are
> > just that, versions or different formats. The content is not unique to
> > the format so therefore would not get a separate doi. It is true that
> > different formats may display a piece of an article differently (or
> > maybe not at all) but that does not make the format unique because the
> > DOI represents the entire published object and all its formats because
> > that is the unique piece we as publishers are shepherding to the world.
>
> I have some clarifications to ask on a few of the terms you've used. I
> ask specifically about the DOI 10.1016/j.tpb.2018.03.006. Here are three
> ways I can resolve that DOI to three different digital objects:
>
> 1) Via doi.org I am sent to a web page where Elsevier requests $25 to
> view a PDF file.
>
> 2) In Zotero I can enter the doi and I get a free PDF (which is labeled
> Author manuscript)
>
> 3) I can enter the DOI on PubMed Central and freely see an HTML page
> (also labeled Author manuscript)
>
> I assume 1) resolves to different content than 2) and 3) because
> Elsevier wants $25.
>
> So we have one DOI which is representing two different sets of content
> here? Or does the DOI represent only the $25 article and not the author
> manuscript?
>
> What is the unit of measure in scholarly publishing in this case?
>
> Is the Author manuscript provided by PubMed Central and Zotero part of
> the entire published object or not part?
>
> Is the PubMed Central web page content here not a published object?
>
> Thank you,
>    Castedo
