Re: [xsl] mixed content, text-based abbreviations to xml

Subject: Re: [xsl] mixed content, text-based abbreviations to xml
From: James Cummings <james+xsl@xxxxxxxxxxxxxxxxx>
Date: Tue, 10 Mar 2009 14:14:35 +0000

Hi all,

Sorry it took me so long to get back to this...other things always
seem to get in the way of the fun of XSLT. ;-)

It took me a while to get my head around it, and more precisely how to
modify it to fit the actual data rather than the simplified example I
gave.  (TEI documents with arbitrarily deep nesting of all sorts of
elements, to which I want to add the <choice> structure in order to then
use them for teaching purposes.)

So, as a record of what I did: I changed the stylesheets to add the TEI
namespace and to create fragments around any element rather than just
<supplied>, and then in the third stylesheet I call the process template
when encountering any element in the body which has an <ex> element as
a descendant.  I still only group-adjacent on elements named
'fragment', 'supplied' and 'ex', which seems to work for my files, but
I'm not entirely sure how to extend that for a more generalised
solution.  Oh, and because I was pushing everything with an <ex> child
through the process template, I also had to remember to copy over the
attributes at that point, or they disappeared.
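
A minimal sketch of how the third stylesheet could look with those
changes (an illustration under assumptions, not the stylesheet actually
used).  It assumes the input is in the TEI namespace, that step 2 emits
<fragment> in that same namespace, and that regrouping should apply to
every element under <body> that has an <ex> descendant.  The use of
xpath-default-namespace, the combined mode lists and the catch-all
fragment template are choices made for this sketch.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0"
    version="2.0">

  <!-- Identity copy, both in the default mode and in mode "text" -->
  <xsl:template match="node() | @*" mode="#default text">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: any element below body with an ex descendant is regrouped -->
  <xsl:template match="body//*[descendant::ex]" mode="#default text">
    <xsl:call-template name="process"/>
  </xsl:template>

  <xsl:template name="process">
    <xsl:copy>
      <!-- Attributes are copied explicitly; xsl:copy alone would drop them -->
      <xsl:copy-of select="@*"/>
      <xsl:for-each-group select="node()"
          group-adjacent="exists(self::fragment | self::supplied | self::ex)">
        <xsl:choose>
          <!-- A run of fragment/supplied/ex that contains at least one ex -->
          <xsl:when test="current-grouping-key() and current-group()[self::ex]">
            <choice>
              <xsl:if test="current-group()[self::supplied]">
                <orig><xsl:apply-templates select="current-group()" mode="orig"/></orig>
              </xsl:if>
              <abbr><xsl:apply-templates select="current-group()" mode="abbr"/></abbr>
              <expan><xsl:apply-templates select="current-group()" mode="expan"/></expan>
            </choice>
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()" mode="text"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- fragment is only scaffolding from step 2: unwrap it in every mode -->
  <xsl:template match="fragment" mode="#all">
    <xsl:value-of select="."/>
  </xsl:template>

  <xsl:template match="ex" mode="orig abbr">
    <am/>
  </xsl:template>
  <xsl:template match="supplied" mode="orig">
    <damage/>
  </xsl:template>
  <xsl:template match="supplied" mode="abbr">
    <xsl:copy-of select="."/>
  </xsl:template>
  <xsl:template match="supplied | ex" mode="expan">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
```

The grouping key is still limited to fragment, supplied and ex, so a
more general solution would need some other way of deciding which
elements count as word-internal.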

Many many thanks to George for a solution that I still don't think I
would have been able to get my head around!

-James

On Fri, Mar 6, 2009 at 8:53 AM, George Cristian Bina
<george@xxxxxxxxxxxxx> wrote:
> Hi James,
>
> You can find below three transformation steps that get you to the final
> result. You could also combine them into one stylesheet using a
> micro-pipelining technique (putting the templates in different modes and
> the results in variables, then applying templates in the next mode to the
> variable from the preceding step).
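
A minimal sketch of that micro-pipelining idea, assuming XSLT 2.0
temporary trees.  The mode names (step1, step2, step3) and variable
names (pass1, pass2) are made up for illustration.  Only the step 1
templates are filled in; steps 2 and 3 are identity placeholders where
the templates from step2.xsl and step3.xsl would go.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

  <xsl:template match="/">
    <!-- Pass 1: the result of mode "step1" is held in a temporary tree -->
    <xsl:variable name="pass1">
      <xsl:apply-templates select="node()" mode="step1"/>
    </xsl:variable>
    <!-- Pass 2 runs over the temporary tree produced by pass 1 -->
    <xsl:variable name="pass2">
      <xsl:apply-templates select="$pass1/node()" mode="step2"/>
    </xsl:variable>
    <!-- Pass 3 writes the final result -->
    <xsl:apply-templates select="$pass2/node()" mode="step3"/>
  </xsl:template>

  <!-- Step 1 templates, moved into mode "step1" -->
  <xsl:template match="* | @* | comment() | processing-instruction()"
      mode="step1">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" mode="step1"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" mode="step1">
    <xsl:analyze-string select="." regex="\(.+?\)">
      <xsl:matching-substring>
        <ex><xsl:value-of select="translate(., '()', '')"/></ex>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Placeholders: steps 2 and 3 are plain identity copies in this sketch -->
  <xsl:template match="node() | @*" mode="step2 step3">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" mode="#current"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```
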
>
> The first step marks with ex the content in parentheses:
>
> step1.xsl
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>   version="2.0">
>   <xsl:template match="* | @* | comment() | processing-instruction()">
>     <xsl:copy>
>       <xsl:apply-templates select="node() | @*"/>
>     </xsl:copy>
>   </xsl:template>
>
>   <xsl:template match="text()">
>     <xsl:analyze-string select="." regex="\(.+?\)">
>       <xsl:matching-substring>
>         <ex><xsl:value-of select="translate(., '()', '')"/></ex>
>       </xsl:matching-substring>
>       <xsl:non-matching-substring>
>         <xsl:value-of select="."/>
>       </xsl:non-matching-substring>
>     </xsl:analyze-string>
>   </xsl:template>
> </xsl:stylesheet>
>
> giving as result
>
> <?xml version="1.0" encoding="UTF-8"?><p>
>     <lb n="1"/>In nomine Domini amen. Ne error obliuionis
>     <supplied>geſtis</supplied> ſub tempore
>     verſantibus pariat detrimentu<ex>m</ex>. <lb n="2"/>Conuenit, ut actus
>     h<supplied>om</supplied>inu<ex>m</ex>
>     l<ex>itte</ex>r<supplied>ar</supplied><ex>um</ex> et teſtium fidedignorum
>     <seg>annotac<ex>i</ex>on<ex>e</ex></seg> ad
>     poſteritatis noticiam <foo>deducantur <seg>aut int<ex>er</ex>dum</seg>
>         ob</foo> scripture vetustatem
>     renovent<ex>ur</ex>. Ad perpetuam proinde ...
> </p>
>
> The second step marks with fragment the text before and after ex and before
> supplied
>
> step2.xsl
>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>   version="2.0">
>   <xsl:variable name="marks" select="'&#10;&#13;,. '"/>
>
>   <xsl:template match="node() | @*">
>     <xsl:copy>
>       <xsl:apply-templates select="node() | @*"/>
>     </xsl:copy>
>   </xsl:template>
>
>   <xsl:template match="text()[following-sibling::*[1][self::ex or
>     self::supplied] and
>     not(translate(substring(., string-length(.)), $marks, '')='')]">
>     <xsl:variable name="words" select="tokenize(., '\s')"/>
>     <xsl:value-of select="substring(., 1,
>       string-length(.)-string-length($words[last()]))"/>
>     <fragment><xsl:value-of select="$words[last()]"/></fragment>
>   </xsl:template>
>
>   <xsl:template match="text()[preceding-sibling::*[1][self::ex] and
>     not(translate(substring(.,1,1), $marks, '')='')]">
>     <xsl:variable name="words" select="tokenize(., '\s')"/>
>     <fragment><xsl:value-of select="$words[1]"/></fragment>
>     <xsl:value-of select="substring(., string-length($words[1]) + 1)"/>
>   </xsl:template>
> </xsl:stylesheet>
>
> giving as result
>
> <?xml version="1.0" encoding="UTF-8"?><p>
>     <lb n="1"/>In nomine Domini amen. Ne error obliuionis
>     <supplied>geſtis</supplied> ſub tempore
>     verſantibus pariat <fragment>detrimentu</fragment><ex>m</ex>. <lb n="2"/>Conuenit, ut actus
>     <fragment>h</fragment><supplied>om</supplied><fragment>inu</fragment><ex>m</ex>
>     <fragment>l</fragment><ex>itte</ex><fragment>r</fragment><supplied>ar</supplied><ex>um</ex> et teſtium fidedignorum
>     <seg><fragment>annotac</fragment><ex>i</ex><fragment>on</fragment><ex>e</ex></seg> ad
>     poſteritatis noticiam <foo>deducantur <seg>aut <fragment>int</fragment><ex>er</ex><fragment>dum</fragment></seg>
>         ob</foo> scripture vetustatem
>     <fragment>renovent</fragment><ex>ur</ex>. Ad perpetuam proinde ...
> </p>
>
> The final step groups the adjacent fragment, supplied and ex nodes and
> outputs the choice:
>
> step3.xsl
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>   version="2.0">
>   <xsl:template match="p|seg">
>     <xsl:call-template name="process"/>
>   </xsl:template>
>
>   <xsl:template name="process">
>     <xsl:copy>
>       <xsl:for-each-group select="node()"
>         group-adjacent="name() = ('fragment','supplied','ex')">
>         <xsl:choose>
>           <xsl:when test="current-grouping-key() and
>             current-group()/name() = 'ex'">
>             <choice>
>               <xsl:if test="current-group()/name() = 'supplied'">
>                 <orig><xsl:apply-templates select="current-group()"
>                   mode="orig"/></orig>
>               </xsl:if>
>               <abbr><xsl:apply-templates select="current-group()"
>                 mode="abbr"/></abbr>
>               <expan><xsl:apply-templates select="current-group()"
>                 mode="expan"/></expan>
>             </choice>
>           </xsl:when>
>           <xsl:otherwise>
>             <xsl:apply-templates select="current-group()" mode="text"/>
>           </xsl:otherwise>
>         </xsl:choose>
>       </xsl:for-each-group>
>     </xsl:copy>
>   </xsl:template>
>
>   <xsl:template match="seg" mode="text">
>     <xsl:call-template name="process"/>
>   </xsl:template>
>   <xsl:template match="fragment" mode="text">
>     <xsl:value-of select="."/>
>   </xsl:template>
>   <xsl:template match="node() | @*" mode="text">
>     <xsl:copy>
>       <xsl:apply-templates select="node() | @*" mode="text"/>
>     </xsl:copy>
>   </xsl:template>
>
>   <xsl:template match="ex" mode="orig">
>     <am/>
>   </xsl:template>
>   <xsl:template match="fragment" mode="orig">
>     <xsl:value-of select="."/>
>   </xsl:template>
>   <xsl:template match="supplied" mode="orig">
>     <damage/>
>   </xsl:template>
>
>   <xsl:template match="ex" mode="abbr">
>     <am/>
>   </xsl:template>
>   <xsl:template match="fragment" mode="abbr">
>     <xsl:value-of select="."/>
>   </xsl:template>
>   <xsl:template match="supplied" mode="abbr">
>     <xsl:copy-of select="."/>
>   </xsl:template>
>
>   <xsl:template match="fragment" mode="expan">
>     <xsl:value-of select="."/>
>   </xsl:template>
>   <xsl:template match="supplied|ex" mode="expan">
>     <xsl:copy-of select="."/>
>   </xsl:template>
>
> </xsl:stylesheet>
>
> giving the result you expect
>
> <?xml version="1.0" encoding="UTF-8"?><p>
>     <lb n="1"/>In nomine Domini amen. Ne error obliuionis
>     <supplied>geſtis</supplied> ſub tempore
>     verſantibus pariat <choice><abbr>detrimentu<am/></abbr><expan>detrimentu<ex>m</ex></expan></choice>. <lb n="2"/>Conuenit, ut actus
>     <choice><orig>h<damage/>inu<am/></orig><abbr>h<supplied>om</supplied>inu<am/></abbr><expan>h<supplied>om</supplied>inu<ex>m</ex></expan></choice>
>     <choice><orig>l<am/>r<damage/><am/></orig><abbr>l<am/>r<supplied>ar</supplied><am/></abbr><expan>l<ex>itte</ex>r<supplied>ar</supplied><ex>um</ex></expan></choice> et teſtium fidedignorum
>     <seg><choice><abbr>annotac<am/>on<am/></abbr><expan>annotac<ex>i</ex>on<ex>e</ex></expan></choice></seg> ad
>     poſteritatis noticiam <foo>deducantur <seg>aut <choice><abbr>int<am/>dum</abbr><expan>int<ex>er</ex>dum</expan></choice></seg>
>         ob</foo> scripture vetustatem
>     <choice><abbr>renovent<am/></abbr><expan>renovent<ex>ur</ex></expan></choice>. Ad perpetuam proinde ...
> </p>
>
> Best Regards,
> George
> --
> George Cristian Bina
> <oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
> http://www.oxygenxml.com
>
> James Cummings wrote:
>>
>> [resending after bounce message...because the mailing list doesn't
>> like google app's different X-MAIL-FROM header...fingers crossed it is
>> right now.]
>>
>> Hiya,
>>
>> I have some XML that has mixed content of markup and text nodes where
>> I want to process certain words.  The words in the document are
>> not already tokenized in any way (and multiple levels of nested markup
>> ranging from the middle of words makes this difficult).  What I want
>> to do is process the individual words (some containing or embedded in
>> markup) and where there is an expansion denoted by parentheses provide
>> that and the abbreviated form, and if that works, then if there is a
>> <supplied> element beginning and ending inside the word, replace that
>> with <damage/> to provide a copy of the original.
>>
>> If the content is something like:
>>
>> =====
>> <p>
>>    <lb n="1"/>In nomine Domini amen. Ne error obliuionis
>> <supplied>geſtis</supplied> ſub tempore
>>    verſantibus pariat detrimentu(m). <lb n="2"/>Conuenit, ut actus
>> h<supplied>om</supplied>inu(m)
>>          l(itte)r<supplied>ar</supplied>(um) et teſtium fidedignorum
>> <seg>annotac(i)on(e)</seg> ad
>>    poſteritatis noticiam <foo>deducantur <seg>aut int(er)dum</seg>
>> ob</foo> scripture vetustatem
>>    renovent(ur). Ad perpetuam proinde ...
>> </p>
>> =====
>>
>> The output should change words containing ( and ) into a nested
>> structure such as:
>>
>> input: h<supplied type="damage">om</supplied>inu(m)
>> output:
>> <choice>
>>    <orig>h<damage/>inu<am/></orig>
>>    <abbr>h<supplied type="damage">om</supplied>inu<am/></abbr>
>>    <expan>h<supplied type="damage">om</supplied>inu<ex>m</ex></expan>
>> </choice>
>>
>> The <orig> is only supplied here because the original word actually
>> has a <supplied reason="damage"> element that begins/ends inside the
>> word. (For the full example I've not included the attribute to make it
>> more readable.)  Words can contain any number of elements such as
>> <lb/> and <supplied>, as well as the usual whitespace problems.
>> Abbreviations denoted by parentheses are always only part of an
>> individual word, though may occur multiple times in a word.
>>
>> Full output of the above would be something like:
>> =====
>> <p>
>>    <lb n="1"/>In nomine Domini amen. Ne error obliuionis
>> <supplied>geſtis</supplied> ſub tempore
>>    verſantibus pariat <choice>
>>          <abbr>detrimentu<am/></abbr>
>>          <expan>detrimentu<ex>m</ex></expan>
>>    </choice>. <lb n="2"/>Conuenit, ut actus <choice>
>>          <orig>h<damage/>inu<am/></orig>
>>          <abbr>h<supplied>om</supplied>inu<am/></abbr>
>>          <expan>h<supplied>om</supplied>inu<ex>m</ex></expan>
>>    </choice>
>>    <choice>
>>          <orig>l<am/>r<damage/><am/></orig>
>>          <abbr>l<am/>r<supplied>ar</supplied><am/></abbr>
>>          <expan>l<ex>itte</ex>r<supplied>ar</supplied><ex>um</ex></expan>
>>    </choice> et teſtium fidedignorum <seg>
>>          <choice>
>>                <abbr>annotac<am/>on<am/></abbr>
>>                <expan>annotac<ex>i</ex>on<ex>e</ex></expan>
>>          </choice>
>>    </seg> ad poſteritatis noticiam <foo>deducantur <seg>aut <choice>
>>                      <abbr>int<am/>dum</abbr>
>>                      <expan>int<ex>er</ex>dum</expan>
>>                </choice>
>>          </seg> ob</foo> scripture vetustatem <choice>
>>          <abbr>renovent<am/></abbr>
>>          <expan>renovent<ex>ur</ex></expan>
>>          </choice>. Ad perpetuam proinde ...
>> </p>
>> =====
>>
>> The default copying-to-output, choices between things and creating the
>> different versions of things once I have each word and its
>> abbreviations tokenized all seems straightforward.  It is getting each
>> word, without losing any other markup, and knowing where the
>> abbreviations are that I'm more fuzzy about.  I hate asking for help
>> before I've got very far, but it is straying into territory I'm not
>> very familiar with.  I'm guessing that this needs a multi-pass
>> mode-based stylesheet with xsl:analyze-string to find the parentheses,
>> but not tokenize() to find the edges of the word but, erm, maybe
>> xsl:for-each-group? While I found individual bits of this in the FAQ I
>> didn't find anything doing it all at once.
>>
>> Any suggestions ranging from pointers in the right direction to
>> fully-realized solutions gratefully received with promises of a pint
>> next time you're in Oxford. ;-)
>>
>> Many thanks,
>> -James Cummings
>> (posting from a new and silly domain name)
