Re: [jats-list] Strange top-level elements permitted in the RNG version of Blue

Subject: Re: [jats-list] Strange top-level elements permitted in the RNG version of Blue
From: "Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx" <jats-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 27 Sep 2020 20:03:46 -0000

Hi Debbie,

Thanks for confirming it.

I stumbled across this because, with your and the community's support, we will be conducting a study about JATS customizations. It will be the subject of Nina Reinhardt's master's thesis. Nina has already obtained a B.Sc. in Book & Media Production at HTWK Leipzig, and she works part-time at our shop, checking and correcting (supposedly) automatic docx to NISO STS conversions. She was looking for a thesis topic while I still had in mind what was discussed 5 months ago: https://www.biglist.com/cgi-bin/wilma/wilma_hiliter/jats-list@xxxxxxxxxxxxxxxxxxxxxx/202004/msg00030.html
I thought this could be the topic that she examines.


There is always the tendency to aquafy Blue by adding more Green elements, and there is pushback from people who want a customization that restricts the choices offered by the publishing schema (Blue).

Our plan is to collect people's preferred customizations, maybe by means of an online configuration tool but certainly by analyzing samples of their JATS files (either submitted to us or analyzed privately by them), and to see whether one of their express or implicit customizations suits the needs of the other users better than, for example, off-the-shelf Blue.

We plan to do this by calculating the distances between each pair of customizations, also taking into account how well a given customization will serve as the basis for another: We strive to minimize the distance (elements/attributes/attribute values that need to be added to schema A in order to be able to reach schema B, plus elements/attributes/attribute values that need to be added to schema B in order to be able to reach schema A), but at the same time favor a schema that already is or almost is a superset of the other. Defining an appropriate metric not only for the "edit distance" between two schemas but also for the aptitude of one customization to serve as the customization basis of another customization (favoring supersets, that is, favoring element removals over element additions) is one of the key outcomes that we hope this thesis is going to produce. Within this metric, the "aptitude" should be better for a schema that has less fat to cut away than another candidate schema, while still serving as a superset to the derived schema.
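As a rough illustration only, the distance described above could be sketched in Python by treating each schema as a set of element/attribute names. The function names, the `add_weight`/`remove_weight` values, and the toy tag sets are all assumptions made for this example; the actual metric is exactly what the thesis is meant to work out.

```python
def additions(base: frozenset, target: frozenset) -> int:
    """Number of items that must be added to `base` to reach `target`."""
    return len(target - base)


def edit_distance(a: frozenset, b: frozenset) -> int:
    """Additions needed on A to reach B, plus additions needed on B to reach A."""
    return additions(a, b) + additions(b, a)


def basis_cost(base: frozenset, target: frozenset,
               add_weight: float = 3.0, remove_weight: float = 1.0) -> float:
    """Hypothetical 'aptitude' cost of deriving `target` from `base` (lower is better).

    Additions are weighted more heavily than removals, so supersets are
    favored; among supersets, the one with less fat to cut away wins.
    The weights are placeholders, not values from the study.
    """
    removals = len(base - target)
    return add_weight * additions(base, target) + remove_weight * removals


# Toy tag sets, invented for the example (not the real Blue/Green contents).
blue = frozenset({"article", "front", "body", "back", "sec", "p", "x"})
green = blue | {"overline", "rp"}
custom = frozenset({"article", "front", "body", "sec", "p"})

candidates = {"blue": blue, "green": green}
best_basis = min(candidates, key=lambda name: basis_cost(candidates[name], custom))
```

In this toy setup both candidates are supersets of `custom`, and `best_basis` comes out as `"blue"` because it requires fewer removals.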

In the future, maybe in 2 months' time, we will ask the JATS community to either upload samples of articles or to analyze these articles themselves, using XSLT tools that we provide and that may work in the browser (without the need to upload content to our servers), so that we can identify de-facto customizations, that is, the subsets and the extensions to the off-the-shelf JATS customizations that these users apply to their content in practice.
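To give an idea of what such a local analysis might collect, here is a minimal sketch that profiles the elements and attributes actually used in one article. It is written in Python rather than XSLT purely for illustration; the function name `usage_profile` and the sample article are assumptions, not part of the planned tooling.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def usage_profile(xml_text: str) -> Counter:
    """Count how often each element and each element/attribute pair occurs."""
    root = ET.fromstring(xml_text)
    profile = Counter()
    for elem in root.iter():
        profile[("element", elem.tag)] += 1
        for attr in elem.attrib:
            profile[("attribute", f"{elem.tag}/@{attr}")] += 1
    return profile


# Invented sample; a real run would read production JATS files from disk.
SAMPLE_ARTICLE = """
<article article-type="research-article">
  <front/>
  <body>
    <sec><p>Text <italic>x</italic></p><p>More</p></sec>
  </body>
</article>
"""

profile = usage_profile(SAMPLE_ARTICLE)
used_elements = {name for kind, name in profile if kind == "element"}
```

Only the resulting profile (names and counts, no content) would need to be shared, which is one way the analysis could stay private.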

Then we will try to identify the sweet-spot customization that is suitable for 90% of the publishing content but that uses only, say, 67% of the current Blue tag set.

Sorry for the lengthy message. Of course we'd like to collect feedback at this early stage. This survey will only be viable if many publishers submit/analyze samples of their production JATS XML. So we'll be glad if you can indicate whether your organization is interested in participating, and whether you need further endorsements or non-disclosure guarantees in order to proceed. Please also let us know whether it is important to you that the analysis can be carried out privately, without sharing the underlying JATS files.

I set up this repo for preliminary analysis of the RNG versions of the existing customizations: https://github.com/gimsieke/JATS_Customizing_Analysis
The whole study will be carried out in the open, and Nina and/or I will hopefully present meaningful results at one of the upcoming JATSCon events.


Gerrit


On 27.09.2020 20:41, Debbie Lapeyre dalapeyre@xxxxxxxxxxxxxxxx wrote:
Sorry about the odd start elements.

We make the RNG automatically, but the start elements need
to be edited by hand, and I obviously forgot.

I will try to do better next round.

--Debbie

On Sep 26, 2020, at 9:37 AM, Jacques Legare jlegare@xxxxxxxxx <jats-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

Glad I could help.

Your project sounds interesting!

On Sat, Sep 26, 2020 at 8:41 AM Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx <jats-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
Ah ok, thank you! Then maybe someone already removed them manually for
Pumpkin.

I noticed this because I wrote some XSLT that identifies the *reachable*
elements in the RNG versions of the customizations. The RNG schemas contain
all the models that the original DTD provides, and no information is
retained in the RNG about which model was in fact activated or deactivated
by means of parameter entities in the DTD.
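The reachability idea can be sketched as follows. This is a Python illustration, not the XSLT mentioned above, and it assumes a single-file grammar in the RNG XML syntax: it does not resolve `<include>` or `<externalRef>`, and it does not treat `<notAllowed/>` specially. The function name and the sample grammar are made up for the example.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"


def reachable_elements(rng_text: str) -> set:
    """Element names reachable from <start> by following <ref> to <define>."""
    grammar = ET.fromstring(rng_text)
    defines = {}
    for define in grammar.iter(f"{RNG}define"):
        defines.setdefault(define.get("name"), []).append(define)

    pending = list(grammar.iter(f"{RNG}start"))
    seen_refs = set()
    found = set()
    while pending:
        pattern = pending.pop()
        for node in pattern.iter():
            if node.tag == f"{RNG}element" and node.get("name"):
                found.add(node.get("name"))
            elif node.tag == f"{RNG}ref":
                name = node.get("name")
                if name not in seen_refs:  # visit each definition only once
                    seen_refs.add(name)
                    pending.extend(defines.get(name, []))
    return found


# Invented miniature grammar: "x" is defined but never referenced.
SAMPLE_RNG = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><choice><ref name="article"/></choice></start>
  <define name="article">
    <element name="article"><zeroOrMore><ref name="p"/></zeroOrMore></element>
  </define>
  <define name="p">
    <element name="p">
      <zeroOrMore><choice><text/><ref name="p"/></choice></zeroOrMore>
    </element>
  </define>
  <define name="x"><element name="x"><text/></element></define>
</grammar>
"""

reachable = reachable_elements(SAMPLE_RNG)
```

With a `<start>` like the one in Blue, definitions such as `x` would count as reachable simply because the start pattern references them.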

Gerrit

On 26.09.2020 14:33, Jacques Legare jlegare@xxxxxxxxx wrote:
I can't be sure in this specific case, but I know this happens if a
Relax NG schema is generated using TRANG and there are unreachable
elements in the DTD. They end up getting thrown into the start set for
the schema.

On Sat, Sep 26, 2020 at 6:14 AM Imsieke, Gerrit, le-tex
gerrit.imsieke@xxxxxxxxx <mailto:gerrit.imsieke@xxxxxxxxx>
<jats-list-service@xxxxxxxxxxxxxxxxxxxxxx
<mailto:jats-list-service@xxxxxxxxxxxxxxxxxxxxxx>> wrote:

     I discovered that the Relax NG schemas of the Publishing (Blue)
     customization permit top-level elements that are not part of Blue:

     <start>
         <choice>
           <ref name="article"/>
           <ref name="rp"/>
           <ref name="overline-start"/>
           <ref name="unstructured-kwd-group"/>
           <ref name="underline-end"/>
           <ref name="underline-start"/>
           <ref name="x"/>
           <ref name="overline-end"/>
         </choice>
     </start>

     In the Green and Pumpkin RNG schemas, there's only this:

     <start>
         <choice>
           <ref name="article"/>
         </choice>
     </start>

     I haven't checked all 4 variants of each customization; I only took two
     samples of each.

     Is there a reason for this, or did this happen by mistake?

Gerrit
