Re: [jats-list] Does Blue need a Lite version, to counter its creeping aquafication?

Subject: Re: [jats-list] Does Blue need a Lite version, to counter its creeping aquafication?
From: "Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx" <jats-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 23 Feb 2021 14:46:57 -0000

Hi Jeff,

Thanks for your encouraging words!

Martin Latterner has also made us aware (off-list) of the vast collection of full-text OA articles that can be downloaded from PMC.

After struggling with shaky internet connections and with ncftp apparently ignoring my 'binary' instruction (the downloaded .tar.gz files were always a bit larger than the purported sizes, and they were corrupted), I have now downloaded https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/comm_use.A-B.xml.tar.gz using a Web browser, gunzipped it, and found 379,835 articles in the listing. So this looks really exciting. We will probably pick a random sample of journals/articles from this bulk download and analyze it. Or we will generate a configuration for the whole lot and process it overnight, main memory permitting. Of course we can process the input in batches, too, or feed it into an XML database first.
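
For illustration only, here is a minimal Python sketch of the two steps mentioned above: listing the articles in a bulk archive and drawing a random sample from them. It assumes Python's standard tarfile and random modules. The function names, the sampling fraction, and the fixed seed are hypothetical choices for this example, not a description of the setup actually used.

```python
import random
import tarfile


def list_articles(archive_path):
    """Return the names of all .xml members in a tar archive.

    Mode "r:*" lets tarfile detect compression itself, so this works
    for both the .tar.gz download and the gunzipped .tar file.
    """
    with tarfile.open(archive_path, "r:*") as tar:
        return [m.name for m in tar if m.isfile() and m.name.endswith(".xml")]


def sample_articles(names, fraction, seed=0):
    """Pick a reproducible random subset of roughly `fraction` of the names."""
    if not names:
        return []
    rng = random.Random(seed)  # fixed seed so the sample can be repeated
    k = max(1, round(len(names) * fraction))
    return sorted(rng.sample(names, k))
```

Reading only the member list this way avoids unpacking the whole archive to disk before deciding which articles to analyze.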

I can currently only speak for myself, but I'd be glad to present some of the intermediate results at the JATS-Con Open Session. I think Nina might be interested, too, albeit maybe a bit awe-struck to speak at an international conference for the first time, in a foreign language ;)

And we will see whether the intermediate results look promising enough to justify a Balisage submission.

Gerrit

On 22.02.2021 20:42, Beck, Jeff (NIH/NLM/NCBI) [E] beck@xxxxxxxxxxxxxxxx wrote:
Hi Nina and Gerrit,

This looks like really interesting work.

You may have investigated this already, but we have a number of JATS and NLM XML files available for text mining use from the PMC corpus. You can grab them by FTP and do what you like with them. You will probably want the Open Access Subset:

https://www.ncbi.nlm.nih.gov/pmc/tools/textmining/#oasubset

But you can supplement that with XML from the NIH Author Manuscript collection if you need a few more articles.

JATS-Con will be on April 27 and 28 this year. We will be having an Open Session on Wednesday. I hope you can give everyone who has been following along on the JATS List an update of your progress.

Also, this work might be particularly interesting to the greater markup community. Balisage just posted its Call for Participation for the meeting in early August: http://www.balisage.net/Call4Participation.html

I know others there would be interested in hearing about this.

Good luck!

Jeff

------------------------------------------------------------------------
*From:* Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx <jats-list-service@xxxxxxxxxxxxxxxxxxxxxx>
*Sent:* Tuesday, February 16, 2021 12:12 PM
*To:* jats-list@xxxxxxxxxxxxxxxxxxxxxx <jats-list@xxxxxxxxxxxxxxxxxxxxxx>
*Cc:* nina_linn.reinhardt@xxxxxxxxxxxxxxxxxxxx <nina_linn.reinhardt@xxxxxxxxxxxxxxxxxxxx>
*Subject:* [jats-list] Does Blue need a Lite version, to counter its creeping aquafication?

Dear JATS Community,


As announced in a previous message to this list [1], Nina Reinhardt is
currently working on her master's thesis in which she tries to find a
consensus customization for the (estimated) 90% of JATS users that only
need about half of Blue's available elements and attributes.

My role in this is that I am co-supervising the thesis and that I came
up with the idea after another discussion on this list last year, in
which Tommie suggested that "a dozen different people (or small groups)
each craft[ed] a 'JATS Lite' and we compare[d] them" [2].

This was our first idea: to provide a form with a list of available
elements and attributes, from which people could put together their
favorite Lite customization interactively.

But then we thought that we should also offer a way for people to upload
representative JATS content from their production or repositories and
treat these collections as expressions of tagging preferences, or as
"de-facto customizations". Nina then skipped the interactive form part
and focused entirely on analyzing these collections and on which metrics
can be applied to them in order to identify consensus customizations.
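
To make the idea of a consensus over "de-facto customizations" concrete, here is a minimal Python sketch of one possible metric: keep every element or attribute name that occurs in at least a given share of the contributed collections. This is an assumption made for illustration only. The metrics actually examined in the thesis are described in Nina's paper, and the function name and the default threshold below are hypothetical.

```python
from collections import Counter


def consensus_names(collections, threshold=0.9):
    """Return the names used by at least `threshold` of the collections.

    `collections` is a list of iterables, one per contributor, each holding
    the element/attribute names found in that contributor's JATS files.
    """
    # Count each name at most once per collection (document frequency).
    counts = Counter(name for names in collections for name in set(names))
    needed = threshold * len(collections)
    return {name for name, n in counts.items() if n >= needed}
```

For example, with three collections that use {p, sec, fig}, {p, sec}, and {p}, a threshold of 0.6 keeps p and sec, while a threshold of 1.0 keeps only p.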

Nina has written a paper in which she describes her approach and what is
needed to find this lean consensus customization (your data!):
https://docs.google.com/document/d/1jYDT0TkYP9Tg31Ldd9gFmdwSiu98Q2mg_qOuhgnxpRc/


You may skip most technical discussions for the time being and navigate
right to the last section called "Data Collection". It is a call to
action that asks you to donate some of your valuable JATS files to
research. Or you can use some XSLT [3] in order to extract
element/attribute name lists from the JATS files yourselves so you need
not send potentially proprietary data to someone else.
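
The XSLT in the repository [3] is the tool intended for this. Purely to illustrate what such a name-list extraction amounts to, here is a rough Python equivalent using the standard xml.etree.ElementTree module. It is a sketch and not a port of that stylesheet: the function name and the "element/@attribute" output format are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def collect_names(xml_text):
    """Return (element_names, attribute_names) used in one XML document.

    Attribute names are recorded together with their parent element as
    "element/@attribute". Namespaced names appear in ElementTree's
    "{namespace-uri}local-name" form.
    """
    elements, attributes = set(), set()
    root = ET.fromstring(xml_text)
    for node in root.iter():  # the root element and all its descendants
        elements.add(node.tag)
        for attr in node.attrib:
            attributes.add(f"{node.tag}/@{attr}")
    return elements, attributes
```

Only the names leave the function, never the text content, which is the point of extracting the lists locally. Files that rely on named entities declared in an external DTD would probably need a DTD-aware parser instead.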

Please donate generously, and if possible do it by March 1st. Nina's
thesis needs to be completed by June.

You are welcome to add comments and suggestions to the Google doc, you
may of course file issues and pull requests in the GitHub repo, and you
can contact Nina and/or me via this list or by direct email if you
have questions or suggestions.

On behalf of Nina (and myself),

Gerrit

[1] https://www.biglist.com/lists/lists.mulberrytech.com/jats-list/archives/202009/msg00019.html
[2] https://www.biglist.com/lists/lists.mulberrytech.com/jats-list/archives/202004/msg00030.html
[3] https://github.com/nreinhar/JATS_Customizing_Analysis/


--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer / Managing Directors:
Gerrit Imsieke, Svea Jelonek, Thomas Schmidt



