Subject: Re: [jats-list] Does Blue need a Lite version, to counter its creeping aquafication?
From: "Pieter Lamers pieter.lamers@xxxxxxxxxxxx" <jats-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 21 Feb 2021 07:40:16 -0000
Hi Gerrit,
That's a nice Sunday morning exercise. I wrote the following XQuery to
summarize our full-text articles:
xquery version '3.1';
(: requiring 'body' leaves out metadata-only records :)
let $coll as item()+ := collection('/db/data/journals.benjamins.com/')[article/body]
let $article-count := count($coll)
return
  <articles count="{$article-count}">{
    for $element-group in $coll//*
    group by $namespace := $element-group/node-name() => prefix-from-QName()
    return
      <elements>{
        if (exists($namespace)) then attribute prefix { $namespace } else (),
        for $element in $element-group
        group by $element-name := $element/local-name()
        order by $element-name
        return
          element { $element-name } {
            for $attribute in $element/@*
            group by $attribute-name := $attribute/local-name()
            order by $attribute-name
            return
              attribute { $attribute-name } { count($attribute) },
            count($element)
          }
      }</elements>
  }</articles>
I added the counts as the attribute value or element text because they
show the extent of use for each element and attribute. Please note that
we use Green (1.1) rather than Blue because of Blue's ordering
restrictions in references and other things we really needed (I forgot
which). I abstracted away from namespace prefixes in attribute names, so
xml:lang shows up as lang and xlink:href as href. It results in the
following:
<articles count="9171">
<elements>
<abstract id="1" lang="64">5699</abstract>
<ack id="292">2876</ack>
<addr-line content-type="12713" lang="4">20414</addr-line>
<address content-type="5" lang="109" specific-use="2">10909</address>
<aff id="10796" lang="391" specific-use="3">11356</aff>
<aff-alternatives id="281">281</aff-alternatives>
<alt-text>4</alt-text>
<alt-title alt-title-type="5532" lang="10" specific-use="93">5631</alt-title>
<alternatives>4</alternatives>
<app id="1940" lang="4" specific-use="36">2143</app>
<app-group id="10" lang="2" specific-use="6">1367</app-group>
<array content-type="33525" id="23383" lang="16059" orientation="49">40477</array>
<article article-type="8919" dtd-version="182" lang="9171">9171</article>
<article-categories>625</article-categories>
<article-id pub-id-type="18300">18300</article-id>
<article-meta>9175</article-meta>
<article-title lang="3185">159101</article-title>
<attrib>14340</attrib>
<author-comment content-type="8">31</author-comment>
<author-notes>3</author-notes>
<award-group id="232" specific-use="1">287</award-group>
<award-id id="43" rid="324">701</award-id>
<back>8534</back>
<bio id="538" lang="4">2746</bio>
<body>9179</body>
<bold toggle="1">121991</bold>
<book-part-id book-part-id-type="3">3</book-part-id>
<boxed-text content-type="286" id="206" position="7">503</boxed-text>
<break>8129</break>
<caption content-type="713">31905</caption>
<chapter-title lang="1256">69047</chapter-title>
<citation-alternatives>11</citation-alternatives>
<city specific-use="20">3227</city>
<code code-type="274" language="3">1143</code>
<col span="1106" style="1107" width="75">1192</col>
<colgroup span="2" width="2">379</colgroup>
<collab collab-type="1" lang="15" type="1">8459</collab>
<comment lang="16">128866</comment>
<conf-date iso-8601-date="5">1171</conf-date>
<conf-loc>1896</conf-loc>
<conf-name>2698</conf-name>
<conf-sponsor>38</conf-sponsor>
<conference>5</conference>
<contrib contrib-type="14024" corresp="2017" deceased="1" id="3">14095</contrib>
<contrib-group content-type="530">9240</contrib-group>
<contrib-id authenticated="1" content-type="1" contrib-id-type="24952" specific-use="1133">24954</contrib-id>
<copyright-holder>287</copyright-holder>
<copyright-statement>9189</copyright-statement>
<copyright-year>1318</copyright-year>
<country country="18715" specific-use="67">19273</country>
<counts>7</counts>
<custom-meta specific-use="3618">3625</custom-meta>
<custom-meta-group>3605</custom-meta-group>
<data-title>5</data-title>
<date date-type="5495" iso-8601-date="7">5495</date>
<date-in-citation content-type="7075">7075</date-in-citation>
<day>24339</day>
<def>11766</def>
<def-head>1</def-head>
<def-item>11771</def-item>
<def-list id="10">567</def-list>
<degrees>198</degrees>
<disp-formula id="347">999</disp-formula>
<disp-quote content-type="1578" id="6300" lang="438">18246</disp-quote>
<edition>5248</edition>
<email content-type="4">10580</email>
<etal>1807</etal>
<ext-link ext-link-type="213" href="197">234</ext-link>
<fax>5</fax>
<fig fig-type="1" id="15454" orientation="112" position="381">15533</fig>
<fig-group id="660" orientation="12" position="33">673</fig-group>
<fn fn-type="1" id="54540" lang="1">54561</fn>
<fn-group content-type="3">6974</fn-group>
<fpage id="3">204014</fpage>
<front>9175</front>
<funding-group specific-use="6">640</funding-group>
<funding-source id="379" lang="2" rid="47">839</funding-source>
<funding-statement>535</funding-statement>
<given-names>650325</given-names>
<glossary id="25">431</glossary>
<graphic alt="1" content-type="2" href="18582" id="19" mime-subtype="222" orientation="910" position="56" specific-use="3">18582</graphic>
<history>1916</history>
<inline-formula>6</inline-formula>
<inline-graphic href="2274" id="225" mime-subtype="69">2274</inline-graphic>
<inline-supplementary-material href="89" specific-use="73" title="89">89</inline-supplementary-material>
<institution content-type="8241" lang="61">29050</institution>
<institution-id institution-id-type="9938" specific-use="32">9938</institution-id>
<institution-wrap>11219</institution-wrap>
<isbn publication-format="109">1389</isbn>
<issn pub-type="16082" publication-format="16030">16112</issn>
<issue>86781</issue>
<issue-id pub-id-type="43">43</issue-id>
<issue-title>23</issue-title>
<italic toggle="1184">910040</italic>
<journal-id journal-id-type="25060">25060</journal-id>
<journal-meta>9148</journal-meta>
<journal-subtitle>3410</journal-subtitle>
<journal-title>9146</journal-title>
<journal-title-group>9146</journal-title-group>
<kwd>32653</kwd>
<kwd-group kwd-group-type="6" lang="571" specific-use="1">5854</kwd-group>
<label id="1" lang="1">251539</label>
<license license-type="825">825</license>
<license-p>825</license-p>
<list continued-from="2" id="78134" lang="142" list-content="60583" list-type="95606" type="1">96236</list>
<list-item id="832" lang="5">180293</list-item>
<lpage>202060</lpage>
<media href="70" id="70" mime-subtype="69" mimetype="70" orientation="1" position="70" specific-use="69">70</media>
<meta-name>3625</meta-name>
<meta-value>3625</meta-value>
<mixed-citation lang="9" publication-format="72" publication-type="332425">332438</mixed-citation>
<monospace>1534</monospace>
<month>25375</month>
<name content-type="214" lang="245" name-style="1244">16657</name>
<name-alternatives>3167</name-alternatives>
<named-content content-type="43532" id="18" lang="8848">43532</named-content>
<note>141</note>
<notes notes-type="311">312</notes>
<object-id pub-id-type="60" specific-use="60">60</object-id>
<overline>2</overline>
<p content-type="8068" id="394943" lang="15470">787195</p>
<page-count count="7">7</page-count>
<page-range>817</page-range>
<part-title>9</part-title>
<permissions>9147</permissions>
<person-group person-group-type="391649">393168</person-group>
<phone>17</phone>
<postal-code>2972</postal-code>
<prefix>68</prefix>
<preformat lang="3" orientation="3" position="3" space="1795" specific-use="1849">2020</preformat>
<price>808</price>
<principal-award-recipient>195</principal-award-recipient>
<principal-investigator>11</principal-investigator>
<product id="798" lang="10" product-type="1">1610</product>
<pub-date date-type="15365" iso-8601-date="15010" pub-type="16434" publication-format="15333">16526</pub-date>
<pub-id pub-id-type="164988" specific-use="11218">164988</pub-id>
<publisher>9148</publisher>
<publisher-loc lang="56">165961</publisher-loc>
<publisher-name lang="67">186936</publisher-name>
<rb>1040</rb>
<ref content-type="11" id="332449" lang="25">332449</ref>
<ref-list content-type="5" id="9" lang="1">9424</ref-list>
<related>1</related>
<related-article elocation-id="3" ext-link-type="4241" href="4237" issue="3" page="154" related-article-type="4241" vol="155">4241</related-article>
<related-object content-type="83293" object-id="82847" specific-use="82847">83293</related-object>
<role lang="3">76908</role>
<roman lang="7">12</roman>
<rt id="1">1040</rt>
<ruby content-type="862" id="2">1040</ruby>
<sc>199595</sc>
<season>2</season>
<sec disp-level="152" id="79856" lang="10" sec-type="16269">88627</sec>
<sec-meta>25</sec-meta>
<self-uri content-type="8002" href="8002">8002</self-uri>
<series>5461</series>
<sig>142</sig>
<sig-block content-type="1">141</sig-block>
<size units="620">620</size>
<source content-type="1" lang="5847">317600</source>
<speaker>6126</speaker>
<speech id="2">6126</speech>
<state>1289</state>
<std>2</std>
<std-organization>2</std-organization>
<strike>1259</strike>
<string-name content-type="1" lang="3401" name-style="4754" specific-use="36">638764</string-name>
<styled-content lang="3220" specific-use="174" style="1" style-type="11317">14599</styled-content>
<sub arrange="98">43367</sub>
<sub-article article-type="4" lang="4">4</sub-article>
<subj-group subj-group-type="669">692</subj-group>
<subject content-type="6">692</subject>
<subtitle lang="1">3845</subtitle>
<suffix>688</suffix>
<sup arrange="98">40502</sup>
<supplement>3</supplement>
<surname>652032</surname>
<table border="1" cellpadding="4" content-type="36" frame="18448" id="2" rules="19032" style="1" width="8">19589</table>
<table-wrap id="19590" lang="2" orientation="366" position="289" specific-use="3">19801</table-wrap>
<table-wrap-foot>3266</table-wrap-foot>
<table-wrap-group id="99" orientation="2" position="7">103</table-wrap-group>
<target id="5083" target-type="5083">5083</target>
<tbody align="4" id="7">63240</tbody>
<td align="883394" char="213" colspan="16266" content-type="11483" id="573" rowspan="19067" style="207" valign="16942">1427330</td>
<term id="11381">11771</term>
<term-head>1</term-head>
<textual-form>1</textual-form>
<tfoot>76</tfoot>
<th align="106660" colspan="13207" content-type="1419" id="12" rowspan="6840" style="31" valign="2027">107372</th>
<thead align="2">17495</thead>
<title id="206">152022</title>
<title-group>9094</title-group>
<tr align="3" content-type="3552" id="46" style="85" valign="16893">300606</tr>
<trans-abstract base="1" lang="1167" specific-use="1">1177</trans-abstract>
<trans-source content-type="1773" lang="6216">8322</trans-source>
<trans-subtitle lang="3">90</trans-subtitle>
<trans-title content-type="1320" lang="4864">8162</trans-title>
<trans-title-group lang="1751">1916</trans-title-group>
<underline underline-style="222">32511</underline>
<uri content-type="2" href="27093" lang="1" type="1">27158</uri>
<verse-group content-type="9" id="320" lang="157">1025</verse-group>
<verse-line id="23" lang="52">5600</verse-line>
<volume lang="1">145921</volume>
<x lang="72" space="6477">3402649</x>
<xref alt="30" ref-type="793018" rid="793533" specific-use="189">793543</xref>
<year>359953</year>
</elements>
<elements prefix="mml">
<math display="6" overflow="6">16</math>
<mfrac>3</mfrac>
<mi mathvariant="6">90</mi>
<mn>63</mn>
<mo>101</mo>
<mrow>14</mrow>
<msub>31</msub>
<msubsup>16</msubsup>
<msup>16</msup>
</elements>
<elements prefix="ali">
<free_to_read end_date="608" start_date="608">622</free_to_read>
<license_ref start_date="24">249</license_ref>
</elements>
</articles>
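For readers without an XQuery database at hand, the same kind of tally can
be approximated in Python. This is only a minimal sketch and not part of
the query above: it uses the standard library's xml.etree.ElementTree, the
names tally, local_name and SAMPLE are made up for illustration, and unlike
the XQuery it does not split the result by namespace prefix, because
ElementTree does not keep prefixes.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def local_name(qname: str) -> str:
    """Drop the '{namespace-uri}' part that ElementTree puts before names."""
    return qname.rsplit("}", 1)[-1]


def tally(xml_text: str):
    """Count elements, and attributes per element, by local name."""
    element_counts = Counter()
    attribute_counts = defaultdict(Counter)
    for el in ET.fromstring(xml_text).iter():
        name = local_name(el.tag)
        element_counts[name] += 1
        for attr in el.attrib:
            attribute_counts[name][local_name(attr)] += 1
    return element_counts, attribute_counts


# Hypothetical miniature article, only for demonstration.
SAMPLE = """<article xml:lang="en" article-type="research-article">
  <front><article-meta><abstract xml:lang="en"><p>One</p></abstract></article-meta></front>
  <body><p id="p1">Two</p><p>Three</p></body>
</article>"""

elements, attributes = tally(SAMPLE)
for name in sorted(elements):
    # Attribute counts become attribute values, element counts become text.
    attrs = " ".join(f'{a}="{n}"' for a, n in sorted(attributes[name].items()))
    print(f"<{name}{' ' + attrs if attrs else ''}>{elements[name]}</{name}>")
```

Run over a whole collection, the two counters would be accumulated across
files before printing. The printed lines have the same shape as the summary
above: attribute counts as attribute values, element counts as text.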
I hope this is of help.
All the best,
Pieter
On 18/02/2021 13:11, Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx wrote:
> Dear Mark,
>
> Thank you so much for taking the time to run the analysis and for
> filing the pull request.
>
> We will try to reproduce, using the cache files that you sent, under
> which circumstances the division by zero occurs. Then we'll see
> whether there is something else that we should do about it or whether
> your fix addresses the problem without distorting the results.
>
> To all others that submitted files to Nina already: Thank you, too!
>
> To everyone else who sits on tons of JATS and hasn't sent anything
> yet: There's still 10 days left to put something together.
>
> Gerrit
>
> On 18.02.2021 12:07, DUNN, Mark wrote:
>> Dear Gerrit and Nina,
>>
>> I am happy to try and help with this project and I wish you both
>> every success.
>>
>> OUP is unable to supply the JATS XML unfortunately, but I've been
>> able to run the pipeline over a representative sample (with a small
>> fix which I've put into my Git fork) to produce some statistics.
>>
>> The output report and cache for 176 articles across our subject areas
>> are attached. The articles are all from the last 2 years of publishing.
>>
>> If you would like more, please let me know. OUP publishes in all the
>> areas you are looking at (STEM, HUM, ECON) so if you need more from a
>> particular area, I'll be happy to get some.
>>
>> Kind regards,
>> Mark Dunn
>> Lead Content Architect, Oxford University Press
>>
>>
>>
>> -----Original Message-----
>> From: Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx
>> <jats-list-service@xxxxxxxxxxxxxxxxxxxxxx>
>> Sent: 16 February 2021 17:13
>> To: jats-list@xxxxxxxxxxxxxxxxxxxxxx
>> Cc: nina_linn.reinhardt@xxxxxxxxxxxxxxxxxxxx
>> Subject: [jats-list] Does Blue need a Lite version, to counter its
>> creeping aquafication?
>>
>> Dear JATS Community,
>>
>> As announced in a previous message to this list [1], Nina Reinhardt
>> is currently working on her master's thesis in which she tries to
>> find a consensus customization for the (estimated) 90% of JATS users
>> that only need about half of Blue's available elements and attributes.
>>
>> My role in this is that I am co-supervising the thesis and that I
>> came up with the idea after another discussion on this list last
>> year, in which Tommie suggested that "a dozen different people (or
>> small groups) each craft[ed] a 'JATS Lite' and we compare[d] them" [2].
>>
>> This was our first idea: To provide a form with a list of available
>> elements and attributes, and people would be able to put together
>> their favorite Lite customization interactively.
>>
>> But then we thought that we should also offer a way for people to
>> upload representative JATS content from their production or
>> repositories and treat these collections as expressions of tagging
>> preferences, or as "de-facto customizations". And then she skipped
>> the interactive form part and focused entirely on analyzing these
>> collections and which metrics are applicable to them in order to
>> identify consensus customizations.
>>
>> Nina has written a paper in which she describes her approach and what
>> is needed to find this lean consensus customization (your data!):
>> https://docs.google.com/document/d/1jYDT0TkYP9Tg31Ldd9gFmdwSiu98Q2mg_qOuhgnxpRc/
>>
>>
>> You may skip most technical discussions for the time being and
>> navigate right to the last section called "Data Collection". It is a
>> call to action that asks you to donate some of your valuable JATS
>> files to research. Or you can use some XSLT [3] in order to extract
>> element/attribute name lists from the JATS files yourselves so you
>> need not send potentially proprietary data to someone else.
>>
>> Please donate generously, and if possible do it by March 1st. Nina's
>> thesis needs to be completed by June.
>>
>> You are allowed to add comments and suggestions to the Google doc,
>> you may of course file issues and pull requests in the Github repo,
>> and you can contact Nina and/or me via this list or direct email
>> messages if you have questions or suggestions.
>>
>> On behalf of Nina (and myself),
>>
>> Gerrit
>>
>> [1]
>> https://www.biglist.com/lists/lists.mulberrytech.com/jats-list/archives/202009/msg00019.html
>>
>> [2]
>> https://www.biglist.com/lists/lists.mulberrytech.com/jats-list/archives/202004/msg00030.html
>>
>> [3] https://github.com/nreinhar/JATS_Customizing_Analysis/
>>
>
>
--
Pieter Lamers
John Benjamins Publishing Company
Postal Address: P.O. Box 36224, 1020 ME AMSTERDAM, The Netherlands
Visiting Address: Klaprozenweg 75G, 1033 NN AMSTERDAM, The Netherlands
Warehouse: Kelvinstraat 11-13, 1446 TK PURMEREND, The Netherlands
tel: +31 20 630 4747
web: www.benjamins.com