Re: [jats-list] Markup for linguistics (glossed text)

Subject: Re: [jats-list] Markup for linguistics (glossed text)
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxx>
Date: Mon, 25 Nov 2013 10:19:28 -0500
Hi again,

Yes, I know - one person's hobby is another person's dark art. I
confess I have a weakness for homemade markup languages, but that's
partly because exposing semantics with tags is the best way I know of
asking what is this data and what does one want to do with it. (I.e. I
think of element types the way CSS developers think of @class
assignments. Oh! but I get to validate. :-) That's the real question
here (IMHO), not the syntax. Whether then one goes on to codify this
using a profile of tagging that is already out there (such as JATS
p/named-content, Ruby, SVG tspan/@class, your own tags or what have
you), and exactly how you jigger your CSS once you get there (if
that's what you are doing), are to me questions of implementation --
related, and conditioning, but not the same.

So: what do you need and want to do with this data? In particular,
what do the consumers of the data want and what are they prepared to
deal with? What is the scale (are there five, five hundred, fifty
thousand instances)? Time frame? Long-term potentials for semantic
description? Are you able to run transformations before delivery, or
do you have to settle for whatever your clients already know how to
do? If the latter, your options are much more limited. There is room
in this world for SVG as a quick way to get to JPEG. There's even room
for Adobe Illustrator.

No matter how you approach it, you're going to have the same data
control/validation issues, only more or less exposed. If you are
content to validate this by eye or "in the application", and you know
you'll never see any more of these critters, then using whatever
syntax is available (however cumbersome or ornate) may be better than
working with a clean syntax. But if there are lots of these, or they
are very complex, then a schema can be worthwhile for better control.
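
To make the validation point concrete, here is a rough sketch (in Python rather than a real schema language) of the kind of check a schema or Schematron rule would automate for the named-content/@rid linking quoted further down: every annotation should point at a base segment that actually exists. The content-type values "base" and "annot" come from that quoted example; the function name, the wrapper element, and the placeholder words are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: placeholder words, and the second annotation
# deliberately points at an id ("id9") that no base segment carries.
SAMPLE = """
<sec>
  <p><named-content content-type="base" id="id1">base-1</named-content>
     <named-content content-type="base" id="id2">base-2</named-content></p>
  <p><named-content content-type="annot" rid="id1">gloss-1</named-content>
     <named-content content-type="annot" rid="id9">gloss-2</named-content></p>
</sec>
"""


def dangling_annotations(xml_text):
    """Return the rid values of annotations that match no base segment id.

    An annotation with no rid at all is reported as None.
    """
    root = ET.fromstring(xml_text)
    segments = list(root.iter("named-content"))
    base_ids = {el.get("id") for el in segments if el.get("content-type") == "base"}
    return [
        el.get("rid")
        for el in segments
        if el.get("content-type") == "annot" and el.get("rid") not in base_ids
    ]


if __name__ == "__main__":
    print(dangling_annotations(SAMPLE))  # prints ['id9']
```

With five instances you can do this by eye; with fifty thousand, some automated check of this sort (however it is expressed) is what keeps the data honest.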

My point is there isn't one way to do it. I'm certainly not against
JATS- or TEI- or HTML-based approaches, even creative ones. (See
http://www.balisage.net/Proceedings/vol7/html/Piez01/BalisageVol7-Piez01.html
.)
I just don't think of them as complete solutions in themselves.

As long as we are being creative, how about good-old-fashioned JATS
def-list/def-item/(term,def) with @specific-use? The logic to get this
into HTML/CSS (whether flowing to inline blocks or tables) would be
even easier than the conversion from Ruby (since both 'terms' and
'defs' provide structure). And it's even in JATS today.
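
A minimal sketch of how that might go, under stated assumptions: one def-item per aligned word, the base word in term, and each gloss layer in its own def. The specific-use value "gloss", the class names, and the placeholder words are all invented here, and Python stands in for whatever transformation language you would really use.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical markup: one def-item per aligned word, with the base word in
# <term> and each gloss layer in its own <def>. The specific-use value and
# the placeholder words are invented for illustration.
JATS = """
<def-list specific-use="gloss">
  <def-item>
    <term>base-1</term>
    <def><p>gloss-1</p></def>
  </def-item>
  <def-item>
    <term>base-2</term>
    <def><p>gloss-2</p></def>
  </def-item>
</def-list>
"""


def text_of(element):
    """Flatten an element to its text, dropping any inline markup."""
    return "".join(element.itertext()).strip()


def gloss_to_html(jats_xml):
    """Turn a glossing def-list into nested spans that CSS can line up."""
    root = ET.fromstring(jats_xml)
    pairs = []
    for item in root.findall("def-item"):
        cells = ['<span class="base">%s</span>' % escape(text_of(item.find("term")))]
        # Each further <def> becomes one more annotation layer under the base.
        for level, gloss in enumerate(item.findall("def"), start=1):
            cells.append(
                '<span class="annot annot-%d">%s</span>' % (level, escape(text_of(gloss)))
            )
        pairs.append('<span class="gloss-pair">%s</span>' % "".join(cells))
    return '<span class="gloss-line">%s</span>' % "".join(pairs)


if __name__ == "__main__":
    print(gloss_to_html(JATS))
```

On the CSS side, something like ".gloss-pair { display: inline-block }" with ".base, .annot { display: block }" would stack each gloss under its word while letting the pairs wrap wherever the line happens to break. This sketch flattens inline formatting such as italic and bold; a real conversion would carry it through.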

Cheers, Wendell
Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables
_____oo_________o_o___ooooo____ooooooo_^


On Fri, Nov 22, 2013 at 6:48 PM, Imsieke, Gerrit, le-tex
<gerrit.imsieke@xxxxxxxxx> wrote:
> How would you do that in TEI?
>
> Maybe there is no canonical way either, but at least there should be one or
> more recommended ways. Encoding stuff like that is TEI's core business,
> isn't it?
>
> Do you encode the aligned segments in separate paragraphs, with links
> between the corresponding segments [1]? This could be either links from the
> base segment to its annotation or the other way round, or in a linkGrp [2].
> Plus, add some semantic information about what is the base and what is the
> annotation.
>
> I'm not sure, I'm a dabbler in TEI as much as in JATS. But if the JATS
> family of markup dialects [3] used this kind of correspondence linking, how
> would it translate to its own vocabulary?
>
> Maybe something along these lines:
>
> <p><named-content content-type="base"
> id="id1"><italic>Siu-ti</italic></named-content>
>   <named-content content-type="base"
> id="id2"><italic><bold>i</bold>-najyen-…</italic></named-content> …</p>
> <p><named-content content-type="annot" rid="id1"><styled-content
> style-type="small-caps"><italic>syu</italic>-comp</styled-content></named-content>
>   <named-content content-type="annot"
> rid="id2"><bold>2O</bold>-…</named-content></p>
>
> Having read Chris Maloney's recent message on this topic, I agree that there
> probably shouldn't be anything tabular in the markup.
>
> Whether to use ruby or named-content is more a matter of taste then. Except
> when you have multiple levels of annotation. Then named-content, id, and rid
> are more versatile.
>
> Gerrit
>
> [1] http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SACS
> [2] http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-linkGrp.html
> [3] Linguists will probably dispute that NLM, BITS, JATS are *dialects*, and
> insist that they're something else within their ontology.
>
>
> On 22.11.2013 21:44, Wendell Piez wrote:
>>
>> Hi again,
>>
>> Sorry I take it back: since the line breaks in the samples appear to be
>> arbitrary, 'ruby' might be a better choice after all (although this is
>> also a "creative" use of Ruby, which has generally been for
>> phonological transcription AFAIK) than tables. Still not as fun as
>> your own markup.
>>
>> Cheers, Wendell
>>
>> Wendell Piez | http://www.wendellpiez.com
>> XML | XSLT | electronic publishing
>> Eat Your Vegetables
>> _____oo_________o_o___ooooo____ooooooo_^
>>
>>
>> On Fri, Nov 22, 2013 at 3:20 PM, Wendell Piez <wapiez@xxxxxxxxxxxxxxx>
>> wrote:
>>>
>>> Hi again,
>>>
>>> Also, I'd prefer plain-old tables (however ornate) to 'ruby' following
>>> the "Principle of Least Surprise".
>>>
>>> Cheers, Wendell
>>>
>>> Wendell Piez | http://www.wendellpiez.com
>>> XML | XSLT | electronic publishing
>>> Eat Your Vegetables
>>> _____oo_________o_o___ooooo____ooooooo_^
>>>
>>>
>>> On Fri, Nov 22, 2013 at 2:56 PM, Wendell Piez <wapiez@xxxxxxxxxxxxxxx>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> My nominations for alternatives:
>>>>
>>>> (1) If there are a lot of these, and real benefit to be gained, then
>>>> design and use a little markup language for them. Then, format as you
>>>> like, probably via tables.
>>>>
>>>> Disadvantage: time and expertise required. Dependence on specialists'
>>>> knowhow. (But that could be an advantage.)
>>>>
>>>> (2) Custom-designed tables, validated via Schematron. JATS provides
>>>> @content-type. Just as much work, and you'd be doing all the same work as
>>>> (1), but they could be made to validate as JATS without extending it.
>>>>
>>>> Advantage: relatively quick and dirty to get something started.
>>>> Disadvantage: the XML would be relatively hard to maintain compared to
>>>> (1). Also, this is schema design without a schema, so relatively
>>>> fragile and not scalable to complexity.
>>>>
>>>> (Such a table could also be used to represent (1) in JATS when
>>>> interfacing with JATS-based systems.)
>>>>
>>>> (3) SVG. Similar disadvantages, many advantages of its own. They could
>>>> be very pretty. :-)
>>>>
>>>> It sounds like graphics made from SVGs might be the preferred choice
>>>> of your vendor (and I don't blame them). But as Debbie points out,
>>>> they're not searchable. (If the SVGs were available they'd be sort of
>>>> searchable.)
>>>>
>>>> What my choice would be would depend on my goals, long-term and
>>>> short-term resources, and the frequency with which it occurs or number
>>>> of them. Having a finite number of these things (i.e. I'd never expect
>>>> to see more of these than I already have) or having them very
>>>> infrequently would argue for (2) or (3). The more of these there are
>>>> and the more interesting/important the semantics they could expose,
>>>> the more I'd do (1).
>>>>
>>>> Designing and specifying a well-controlled, clean descriptive format
>>>> (1) would also be really fun. (2) and (3) are also natural spin-offs
>>>> for (1), not exclusive of it -- although you could also skip to them
>>>> directly (and specialists in CSS and SVG might prefer to do so).
>>>>
>>>> Cheers, Wendell
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Wendell Piez | http://www.wendellpiez.com
>>>> XML | XSLT | electronic publishing
>>>> Eat Your Vegetables
>>>> _____oo_________o_o___ooooo____ooooooo_^
>>>>
>>>>
>>>> On Thu, Nov 21, 2013 at 5:01 PM, Michael Boudreau
>>>> <mboudreau@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> For what it's worth, our hosting platform informs me that the only way to
>>>>> get these images to display at a consistent size is to submit the
>>>>> <graphic> element as a child of <disp-formula>. They were not sympathetic
>>>>> to my pointing out that these are not math.
>>>>>
>>>>> --
>>>>> Michael R. Boudreau
>>>>> Electronic Publishing Technology Manager
>>>>> The University of Chicago Press
>>>>> 1427 E. 60th Street
>>>>> Chicago, IL 60637
>>>>> (773) 753-3298
>>>>> www.journals.uchicago.edu
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 11/20/13, 10:56 AM, "Michael Boudreau"
>>>>> <mboudreau@xxxxxxxxxxxxxxxxxx>
>>>>> wrote:
>>>>>
>>>>>> Thanks, everyone, for these comments. I should have mentioned that we're
>>>>>> currently using graphics, like so (highly simplified):
>>>>>>
>>>>>>    <p>Some text precedes an example:</p>
>>>>>>    <p><graphic href="example1.tiff"/></p>
>>>>>>    <p>And the text continues.</p>
>>>>>>
>>>>>> This can be converted by our host to a readable HTML presentation. The
>>>>>> down-side is that the content of the graphic is not searchable by the
>>>>>> user's browser (though the site's search engine can build its index from
>>>>>> the PDF version, so all is not lost), and the graphic's visual quality is
>>>>>> relatively low, particularly on mobile devices.
>>>>>>
>>>>>> To answer Nikos's question, I don't have a current project that requires
>>>>>> a particular type of markup for such examples, but the examples in their
>>>>>> context just don't strike me as "tabular"--but I'm not a linguist and
>>>>>> would defer to the journal editors if they deemed table markup
>>>>>> appropriate. I think <ruby> is closer to the mark; I'd have to do
>>>>>> extensive testing to see if it could handle examples with multiple layers
>>>>>> of glossing on the base text (sometimes there are 2 or 3 or more). (I
>>>>>> tremble to think what it would take to train our typesetting vendors to
>>>>>> apply either <table> or <ruby> markup to these examples.)
>>>>>>
>>>>>> I hadn't thought of <array>, which actually might help solve a processing
>>>>>> problem on our vendor's side even while still using <graphic>.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Michael R. Boudreau
>>>>>> Electronic Publishing Technology Manager
>>>>>> The University of Chicago Press
>>>>>> 1427 E. 60th Street
>>>>>> Chicago, IL 60637
>>>>>> (773) 753-3298
>>>>>> www.journals.uchicago.edu
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 11/20/13, 9:14 AM, "Alexander Schwarzman" <aschwarzman@xxxxxxxxx>
>>>>>> wrote:
>>>>>>
>>>>>>> Or, perhaps, use <array>, with either <graphic>, as Nikos suggested,
>>>>>>> or with <tbody> inside...
>>>>>>>
>>>>>>> --Sasha
>>>>>>>
>>>>>>> Alexander ('Sasha') Schwarzman, Content Technology Architect
>>>>>>> phone: +1.202.416.1979 | e-mail: aschwarzman@xxxxxxx
>>>>>>>
>>>>>>> The Optical Society (OSA)
>>>>>>> 2010 Massachusetts Ave., NW
>>>>>>> Washington, DC 20036 USA
>>>>>>> www.osa.org
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 20, 2013 at 5:01 AM, Nikos Markantonatos
>>>>>>> <nikos@xxxxxxxxxx>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Michael,
>>>>>>>>
>>>>>>>> The question that arises of course out of the "semantically reasonable"
>>>>>>>> encoding of such difficult pieces of text is why you need it. Are you
>>>>>>>> planning to draw some logic across different types of such linguistic
>>>>>>>> representations? In that case, JATS alone will hardly offer you a
>>>>>>>> solution. JATS often resorts to other known standards for the
>>>>>>>> representation of "tough" textual pieces, such as mathematical equations
>>>>>>>> (MathML) and tables (XHTML, OASIS). If there was a corresponding XML
>>>>>>>> encoding standard for linguistic representations, one could make the
>>>>>>>> case for embedding it into JATS.
>>>>>>>>
>>>>>>>> Otherwise, you are left to choose between the encoding options suggested
>>>>>>>> by Debbie, or to capture it as an image (my favorite option), or even
>>>>>>>> attempt to represent it in TeX/LaTeX or MathML.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Nikos Markantonatos
>>>>>>>> Atypon
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/19/2013 11:47 PM, Debbie Lapeyre wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dear Michael--
>>>>>>>>>
>>>>>>>>> Ouch! No you are not overlooking anything obvious. The problem
>>>>>>>>> is that, although you ask for "semantically reasonable", you
>>>>>>>>> really want presentation markup. JATS does not do presentation,
>>>>>>>>> by design or very well.
>>>>>>>>>
>>>>>>>>>    - My first thought is a table, which this certainly looks like
>>>>>>>>>      to me. But I do see your problem.
>>>>>>>>>
>>>>>>>>>    - If it has to present EXACTLY this way, another obvious
>>>>>>>>>      (but less than perfect) choice is <preformat>. That would
>>>>>>>>>       - force this into a monofont (sorry about that)
>>>>>>>>>       - preserve all your alignments and whitespace
>>>>>>>>>       - let you include the italics, bold, and stuff.
>>>>>>>>>
>>>>>>>>>    - Another possibility (not in NLM 3.0, but in the brand new
>>>>>>>>>      JATS 1.1d1) is using <ruby>, which has a base (<rb>) and a
>>>>>>>>>      ruby text annotation (<rt>) traditionally displayed atop the
>>>>>>>>>      base, or inside parentheses after the base for browsers
>>>>>>>>>      that cannot handle Ruby. Ruby is part of HTML5, as well as
>>>>>>>>>      part of JATS. Ruby markup is intended for textual annotation,
>>>>>>>>>      and might fit this case very well.
>>>>>>>>>
>>>>>>>>> But I've got to tell you, I found this example incredibly hard to
>>>>>>>>> human parse and be sure what went with what and why were these 2
>>>>>>>>> clusters parallel and that one all alone? When the top line and the
>>>>>>>>> bottom line both had values, I was fine, but sometimes... Whatever
>>>>>>>>> you decide, a few horizontal lines or just more white space between
>>>>>>>>> the lines and/or less between the line and its gloss, would help
>>>>>>>>> me to separate.
>>>>>>>>>
>>>>>>>>> --Debbie
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 19, 2013, at 4:17 PM, Michael Boudreau
>>>>>>>>> <mboudreau@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>> Greetings,
>>>>>>>>>>
>>>>>>>>>> Has anyone tackled the problem of marking up textual illustrations
>>>>>>>>>> that require multiple points of vertical alignment--the sort of thing
>>>>>>>>>> for which you'd set tab stops on a typewriter or word processor?
>>>>>>>>>>
>>>>>>>>>> I'm working on a linguistics journal that has lots of glossed text
>>>>>>>>>> illustrations that are typeset like the items labeled (3) and (4) on
>>>>>>>>>> this page image:
>>>>>>>>>>
>>>>>>>>>>     http://mss.uchicago.edu:81/mrb/linguistics.png
>>>>>>>>>>
>>>>>>>>>> We're using the NLM Journal Publishing 3.0 DTD, and I'm at a loss for
>>>>>>>>>> a markup solution that seems semantically reasonable and illustrates
>>>>>>>>>> the relationships between the chunks of text that the typesetting
>>>>>>>>>> makes obvious. I've considered table markup, but I don't want to break
>>>>>>>>>> a single sentence or other unit of meaning into multiple table cells
>>>>>>>>>> across a row. When I consider how our online host would convert XML
>>>>>>>>>> into HTML, I see only the same bad option.
>>>>>>>>>>
>>>>>>>>>> Am I overlooking something obvious?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Michael R. Boudreau
>>>>>>>>>> Electronic Publishing Technology Manager
>>>>>>>>>> The University of Chicago Press
>>>>>>>>>> 1427 E. 60th Street
>>>>>>>>>> Chicago, IL 60637
>>>>>>>>>> (773) 753-3298
>>>>>>>>>> www.journals.uchicago.edu
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ================================================================
>>>>>>>>> Deborah A Lapeyre              mailto:dalapeyre@xxxxxxxxxxxxxxxx
>>>>>>>>> Mulberry Technologies, Inc.      http://www.mulberrytech.com
>>>>>>>>> 17 West Jefferson Street         Phone: 301-315-9631 (USA)
>>>>>>>>> Suite 207                        Fax:   301-315-8385
>>>>>>>>> Rockville, MD 20850
>>>>>>>>> ----------------------------------------------------------------
>>>>>>>>> Mulberry Technologies: Consultancy for XML, XSLT, and Schematron
>>>>>>>>> ================================================================
>>
>
> --
> Gerrit Imsieke
> Geschäftsführer / Managing Director
> le-tex publishing services GmbH
> Weissenfelser Str. 84, 04229 Leipzig, Germany
> Phone +49 341 355356 110, Fax +49 341 355356 510
> gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de
>
> Registergericht / Commercial Register: Amtsgericht Leipzig
> Registernummer / Registration Number: HRB 24930
>
> Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
> Thomas Schmidt, Dr. Reinhard Vöckler
