Subject: Re: [jats-list] Does Blue need a Lite version, to counter its creeping aquafication?
From: "Pieter Lamers pieter.lamers@xxxxxxxxxxxx" <jats-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 21 Feb 2021 07:40:16 -0000
Hi Gerrit,

That's a nice Sunday morning exercise. I wrote the following xquery to summarize the fulltext articles:

xquery version '3.1';

let $coll as item()+ := collection('/db/data/journals.benjamins.com/')[article/body]
(: requirement for 'body' is to leave out metadata-only records :)
let $article-count := count($coll)
return
  <articles count="{$article-count}">{
    for $element-group in $coll//*
    group by $namespace := $element-group/node-name() => prefix-from-QName()
    return
      <elements>{
        if( exists($namespace) ) then attribute prefix { $namespace } else (),
        for $element in $element-group
        group by $element-name := $element/local-name()
        order by $element-name
        return
          element { $element-name } {
            for $attribute in $element/@*
            group by $attribute-name := $attribute/local-name()
            order by $attribute-name
            return attribute { $attribute-name } { count($attribute) },
            count($element)
          }
      }</elements>
  }</articles>

I added counts as the attribute/element text value because it shows the extent of use for each element/attribute. Please note that we use Green (1.1) rather than Blue because of Blue's ordering restrictions in references and other things we really needed (forgot which). I abstracted away from namespace prefixes in attribute names.

It results in the following: <articles count="9171"> <elements> <abstract id="1" lang="64">5699</abstract> <ack id="292">2876</ack> <addr-line content-type="12713" lang="4">20414</addr-line> <address content-type="5" lang="109" specific-use="2">10909</address> <aff id="10796" lang="391" specific-use="3">11356</aff> <aff-alternatives id="281">281</aff-alternatives> <alt-text>4</alt-text> <alt-title alt-title-type="5532" lang="10" specific-use="93">5631</alt-title> <alternatives>4</alternatives> <app id="1940" lang="4" specific-use="36">2143</app> <app-group id="10" lang="2" specific-use="6">1367</app-group> <array content-type="33525" id="23383" lang="16059" orientation="49">40477</array> <article article-type="8919" dtd-version="182" lang="9171">9171</article> <article-categories>625</article-categories> <article-id pub-id-type="18300">18300</article-id> <article-meta>9175</article-meta> <article-title lang="3185">159101</article-title> <attrib>14340</attrib> <author-comment content-type="8">31</author-comment> <author-notes>3</author-notes> <award-group id="232" specific-use="1">287</award-group> <award-id id="43" rid="324">701</award-id> <back>8534</back> <bio id="538" lang="4">2746</bio> <body>9179</body> <bold toggle="1">121991</bold> <book-part-id book-part-id-type="3">3</book-part-id> <boxed-text content-type="286" id="206" position="7">503</boxed-text> <break>8129</break> <caption content-type="713">31905</caption> <chapter-title lang="1256">69047</chapter-title> <citation-alternatives>11</citation-alternatives> <city specific-use="20">3227</city> <code code-type="274" language="3">1143</code> <col span="1106" style="1107" width="75">1192</col> <colgroup span="2" width="2">379</colgroup> <collab collab-type="1" lang="15" type="1">8459</collab> <comment lang="16">128866</comment> <conf-date iso-8601-date="5">1171</conf-date> <conf-loc>1896</conf-loc> <conf-name>2698</conf-name> <conf-sponsor>38</conf-sponsor> <conference>5</conference> <contrib 
contrib-type="14024" corresp="2017" deceased="1" id="3">14095</contrib> <contrib-group content-type="530">9240</contrib-group> <contrib-id authenticated="1" content-type="1" contrib-id-type="24952" specific-use="1133">24954</contrib-id> <copyright-holder>287</copyright-holder> <copyright-statement>9189</copyright-statement> <copyright-year>1318</copyright-year> <country country="18715" specific-use="67">19273</country> <counts>7</counts> <custom-meta specific-use="3618">3625</custom-meta> <custom-meta-group>3605</custom-meta-group> <data-title>5</data-title> <date date-type="5495" iso-8601-date="7">5495</date> <date-in-citation content-type="7075">7075</date-in-citation> <day>24339</day> <def>11766</def> <def-head>1</def-head> <def-item>11771</def-item> <def-list id="10">567</def-list> <degrees>198</degrees> <disp-formula id="347">999</disp-formula> <disp-quote content-type="1578" id="6300" lang="438">18246</disp-quote> <edition>5248</edition> <email content-type="4">10580</email> <etal>1807</etal> <ext-link ext-link-type="213" href="197">234</ext-link> <fax>5</fax> <fig fig-type="1" id="15454" orientation="112" position="381">15533</fig> <fig-group id="660" orientation="12" position="33">673</fig-group> <fn fn-type="1" id="54540" lang="1">54561</fn> <fn-group content-type="3">6974</fn-group> <fpage id="3">204014</fpage> <front>9175</front> <funding-group specific-use="6">640</funding-group> <funding-source id="379" lang="2" rid="47">839</funding-source> <funding-statement>535</funding-statement> <given-names>650325</given-names> <glossary id="25">431</glossary> <graphic alt="1" content-type="2" href="18582" id="19" mime-subtype="222" orientation="910" position="56" specific-use="3">18582</graphic> <history>1916</history> <inline-formula>6</inline-formula> <inline-graphic href="2274" id="225" mime-subtype="69">2274</inline-graphic> <inline-supplementary-material href="89" specific-use="73" title="89">89</inline-supplementary-material> <institution 
content-type="8241" lang="61">29050</institution> <institution-id institution-id-type="9938" specific-use="32">9938</institution-id> <institution-wrap>11219</institution-wrap> <isbn publication-format="109">1389</isbn> <issn pub-type="16082" publication-format="16030">16112</issn> <issue>86781</issue> <issue-id pub-id-type="43">43</issue-id> <issue-title>23</issue-title> <italic toggle="1184">910040</italic> <journal-id journal-id-type="25060">25060</journal-id> <journal-meta>9148</journal-meta> <journal-subtitle>3410</journal-subtitle> <journal-title>9146</journal-title> <journal-title-group>9146</journal-title-group> <kwd>32653</kwd> <kwd-group kwd-group-type="6" lang="571" specific-use="1">5854</kwd-group> <label id="1" lang="1">251539</label> <license license-type="825">825</license> <license-p>825</license-p> <list continued-from="2" id="78134" lang="142" list-content="60583" list-type="95606" type="1">96236</list> <list-item id="832" lang="5">180293</list-item> <lpage>202060</lpage> <media href="70" id="70" mime-subtype="69" mimetype="70" orientation="1" position="70" specific-use="69">70</media> <meta-name>3625</meta-name> <meta-value>3625</meta-value> <mixed-citation lang="9" publication-format="72" publication-type="332425">332438</mixed-citation> <monospace>1534</monospace> <month>25375</month> <name content-type="214" lang="245" name-style="1244">16657</name> <name-alternatives>3167</name-alternatives> <named-content content-type="43532" id="18" lang="8848">43532</named-content> <note>141</note> <notes notes-type="311">312</notes> <object-id pub-id-type="60" specific-use="60">60</object-id> <overline>2</overline> <p content-type="8068" id="394943" lang="15470">787195</p> <page-count count="7">7</page-count> <page-range>817</page-range> <part-title>9</part-title> <permissions>9147</permissions> <person-group person-group-type="391649">393168</person-group> <phone>17</phone> <postal-code>2972</postal-code> <prefix>68</prefix> <preformat lang="3" 
orientation="3" position="3" space="1795" specific-use="1849">2020</preformat> <price>808</price> <principal-award-recipient>195</principal-award-recipient> <principal-investigator>11</principal-investigator> <product id="798" lang="10" product-type="1">1610</product> <pub-date date-type="15365" iso-8601-date="15010" pub-type="16434" publication-format="15333">16526</pub-date> <pub-id pub-id-type="164988" specific-use="11218">164988</pub-id> <publisher>9148</publisher> <publisher-loc lang="56">165961</publisher-loc> <publisher-name lang="67">186936</publisher-name> <rb>1040</rb> <ref content-type="11" id="332449" lang="25">332449</ref> <ref-list content-type="5" id="9" lang="1">9424</ref-list> <related>1</related> <related-article elocation-id="3" ext-link-type="4241" href="4237" issue="3" page="154" related-article-type="4241" vol="155">4241</related-article> <related-object content-type="83293" object-id="82847" specific-use="82847">83293</related-object> <role lang="3">76908</role> <roman lang="7">12</roman> <rt id="1">1040</rt> <ruby content-type="862" id="2">1040</ruby> <sc>199595</sc> <season>2</season> <sec disp-level="152" id="79856" lang="10" sec-type="16269">88627</sec> <sec-meta>25</sec-meta> <self-uri content-type="8002" href="8002">8002</self-uri> <series>5461</series> <sig>142</sig> <sig-block content-type="1">141</sig-block> <size units="620">620</size> <source content-type="1" lang="5847">317600</source> <speaker>6126</speaker> <speech id="2">6126</speech> <state>1289</state> <std>2</std> <std-organization>2</std-organization> <strike>1259</strike> <string-name content-type="1" lang="3401" name-style="4754" specific-use="36">638764</string-name> <styled-content lang="3220" specific-use="174" style="1" style-type="11317">14599</styled-content> <sub arrange="98">43367</sub> <sub-article article-type="4" lang="4">4</sub-article> <subj-group subj-group-type="669">692</subj-group> <subject content-type="6">692</subject> <subtitle lang="1">3845</subtitle> 
<suffix>688</suffix> <sup arrange="98">40502</sup> <supplement>3</supplement> <surname>652032</surname> <table border="1" cellpadding="4" content-type="36" frame="18448" id="2" rules="19032" style="1" width="8">19589</table> <table-wrap id="19590" lang="2" orientation="366" position="289" specific-use="3">19801</table-wrap> <table-wrap-foot>3266</table-wrap-foot> <table-wrap-group id="99" orientation="2" position="7">103</table-wrap-group> <target id="5083" target-type="5083">5083</target> <tbody align="4" id="7">63240</tbody> <td align="883394" char="213" colspan="16266" content-type="11483" id="573" rowspan="19067" style="207" valign="16942">1427330</td> <term id="11381">11771</term> <term-head>1</term-head> <textual-form>1</textual-form> <tfoot>76</tfoot> <th align="106660" colspan="13207" content-type="1419" id="12" rowspan="6840" style="31" valign="2027">107372</th> <thead align="2">17495</thead> <title id="206">152022</title> <title-group>9094</title-group> <tr align="3" content-type="3552" id="46" style="85" valign="16893">300606</tr> <trans-abstract base="1" lang="1167" specific-use="1">1177</trans-abstract> <trans-source content-type="1773" lang="6216">8322</trans-source> <trans-subtitle lang="3">90</trans-subtitle> <trans-title content-type="1320" lang="4864">8162</trans-title> <trans-title-group lang="1751">1916</trans-title-group> <underline underline-style="222">32511</underline> <uri content-type="2" href="27093" lang="1" type="1">27158</uri> <verse-group content-type="9" id="320" lang="157">1025</verse-group> <verse-line id="23" lang="52">5600</verse-line> <volume lang="1">145921</volume> <x lang="72" space="6477">3402649</x> <xref alt="30" ref-type="793018" rid="793533" specific-use="189">793543</xref> <year>359953</year> </elements> <elements prefix="mml"> <math display="6" overflow="6">16</math> <mfrac>3</mfrac> <mi mathvariant="6">90</mi> <mn>63</mn> <mo>101</mo> <mrow>14</mrow> <msub>31</msub> <msubsup>16</msubsup> <msup>16</msup> </elements> 
<elements prefix="ali"> <free_to_read end_date="608" start_date="608">622</free_to_read> <license_ref start_date="24">249</license_ref> </elements> </articles>

I hope this is of help.

All the best,
Pieter

On 18/02/2021 13:11, Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx wrote:
> Dear Mark,
>
> Thank you so much for taking the time to run the analysis and for
> filing the pull request.
>
> We will try to reproduce, using the cache files that you sent, under
> which circumstances the division by zero occurs. Then we'll see
> whether there is something else that we should do about it or whether
> your fix addresses the problem without distorting the results.
>
> To all others that submitted files to Nina already: Thank you, too!
>
> To everyone else who sits on tons of JATS and hasn't sent anything
> yet: There's still 10 days left to put something together.
>
> Gerrit
>
> On 18.02.2021 12:07, DUNN, Mark wrote:
>> Dear Gerrit and Nina,
>>
>> I am happy to try and help with this project and I wish you both
>> every success.
>>
>> OUP is unable to supply the JATS XML unfortunately, but I've been
>> able to run the pipeline over a representative sample (with a small
>> fix which I've put into my Git fork) to produce some statistics.
>>
>> The output report and cache for 176 articles across our subject areas
>> are attached. The articles are all from the last 2 years of publishing.
>>
>> If you would like more, please let me know. OUP publishes in all the
>> areas you are looking at (STEM, HUM, ECON) so if you need more from a
>> particular area, I'll be happy to get some.
>>
>> Kind regards,
>> Mark Dunn
>> Lead Content Architect, Oxford University Press
>>
>>
>>
>> -----Original Message-----
>> From: Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx
>> <jats-list-service@xxxxxxxxxxxxxxxxxxxxxx>
>> Sent: 16 February 2021 17:13
>> To: jats-list@xxxxxxxxxxxxxxxxxxxxxx
>> Cc: nina_linn.reinhardt@xxxxxxxxxxxxxxxxxxxx
>> Subject: [jats-list] Does Blue need a Lite version, to counter its
>> creeping aquafication?
>>
>> Dear JATS Community,
>>
>> As announced in a previous message to this list [1], Nina Reinhardt
>> is currently working on her master's thesis in which she tries to
>> find a consensus customization for the (estimated) 90% of JATS users
>> that only need about half of Blue's available elements and attributes.
>>
>> My role in this is that I am co-supervising the thesis and that I
>> came up with the idea after another discussion on this list last
>> year, in which Tommie suggested that "a dozen different people (or
>> small groups) each craft[ed] a 'JATS Lite' and we compare[d] them" [2].
>>
>> This was our first idea: To provide a form with a list of available
>> elements and attributes, and people would be able to put together
>> their favorite Lite customization interactively.
>>
>> But then we thought that we should also offer a way for people to
>> upload representative JATS content from their production or
>> repositories and treat these collections as expressions of tagging
>> preferences, or as "de-facto customizations". And then she skipped
>> the interactive form part and focused entirely on analyzing these
>> collections and which metrics are applicable to them in order to
>> identify consensus customizations.
>>
>> Nina has written a paper in which she describes her approach and what
>> is needed to find this lean consensus customization (your data!):
>> https://docs.google.com/document/d/1jYDT0TkYP9Tg31Ldd9gFmdwSiu98Q2mg_qOuhgnxpRc/
>>
>>
>> You may skip most technical discussions for the time being and
>> navigate right to the last section called "Data Collection". It is a
>> call to action that asks you to donate some of your valuable JATS
>> files to research. Or you can use some XSLT [3] in order to extract
>> element/attribute name lists from the JATS files yourselves so you
>> need not send potentially proprietary data to someone else.
>>
>> Please donate generously, and if possible do it by March 1st. Nina's
>> thesis needs to be completed by June.
>>
>> You are allowed to add comments and suggestions to the Google doc,
>> you may of course file issues and pull requests in the Github repo,
>> and you can contact Nina and/or me via this list or direct email
>> messages if you have questions or suggestions.
>>
>> On behalf of Nina (and myself),
>>
>> Gerrit
>>
>> [1]
>> https://www.biglist.com/lists/lists.mulberrytech.com/jats-list/archives/202009/msg00019.html
>>
>> [2]
>> https://www.biglist.com/lists/lists.mulberrytech.com/jats-list/archives/202004/msg00030.html
>>
>> [3] https://github.com/nreinhar/JATS_Customizing_Analysis/
>>
>
>

--
Pieter Lamers
John Benjamins Publishing Company
Postal Address: P.O. Box 36224, 1020 ME AMSTERDAM, The Netherlands
Visiting Address: Klaprozenweg 75G, 1033 NN AMSTERDAM, The Netherlands
Warehouse: Kelvinstraat 11-13, 1446 TK PURMEREND, The Netherlands
tel: +31 20 630 4747
web: www.benjamins.com
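
P.S. For readers without an XQuery-capable database at hand: the counting logic of the query at the top of this message could be approximated in plain Python. This is only a minimal sketch under stated assumptions, not the query that produced the numbers in this message. The names `SAMPLE`, `split_qname` and `summarize` are made up for illustration, the inline sample article is invented, and, unlike the XQuery, elements are grouped by namespace URI rather than by prefix, because ElementTree does not expose prefixes.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Invented sample article (assumption: not real data from any collection).
SAMPLE = """<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en">
  <front><article-meta>
    <article-id pub-id-type="doi">10.0000/example</article-id>
  </article-meta></front>
  <body>
    <p id="p1">Text <italic>one</italic></p>
    <p id="p2">Text <mml:math><mml:mi>x</mml:mi></mml:math></p>
  </body>
</article>"""


def split_qname(tag):
    """Split ElementTree's '{uri}local' notation into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def summarize(xml_texts):
    """Count elements per namespace URI and attributes per element.

    Attribute names are reduced to their local name, mirroring the
    'abstracted away from namespace prefixes' choice in the XQuery.
    """
    elements = defaultdict(Counter)    # uri -> element local name -> count
    attributes = defaultdict(Counter)  # (uri, element) -> attribute name -> count
    for text in xml_texts:
        root = ET.fromstring(text)
        # Like the [article/body] predicate: skip metadata-only records.
        if split_qname(root.tag)[1] != "article" or root.find("body") is None:
            continue
        for el in root.iter():
            if not isinstance(el.tag, str):  # defensive: skip non-element nodes
                continue
            uri, local = split_qname(el.tag)
            elements[uri][local] += 1
            for attr_name in el.attrib:
                attributes[(uri, local)][split_qname(attr_name)[1]] += 1
    return elements, attributes


if __name__ == "__main__":
    elements, attributes = summarize([SAMPLE, "<article><front/></article>"])
    for uri in sorted(elements):
        print(uri or "(no namespace)")
        for name in sorted(elements[uri]):
            print("  ", name, elements[uri][name], dict(attributes[(uri, name)]))
```

To run this over real files, one would pass the file contents in place of the sample strings; the second input above only demonstrates that a record without `body` is skipped.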