Subject: Re: [xsl] mixed content, text-based abbreviations to xml
From: George Cristian Bina <george@xxxxxxxxxxxxx>
Date: Fri, 06 Mar 2009 10:53:14 +0200
<!-- Wrap each parenthesised run in <ex>, dropping the parentheses themselves. -->
<xsl:template match="text()">
  <xsl:analyze-string select="." regex="\(.+?\)">
    <xsl:matching-substring>
      <ex><xsl:value-of select="translate(., '()', '')"/></ex>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
<!-- Identity template: copy every other node and attribute unchanged. -->
<xsl:template match="node() | @*">
  <xsl:copy>
    <xsl:apply-templates select="node() | @*"/>
  </xsl:copy>
</xsl:template>
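<!-- A text node directly after an <ex> whose first character is not in $marks
     ($marks is declared in a part of the stylesheet not shown here):
     wrap its first whitespace-delimited token in <fragment>, then emit the rest. -->
<xsl:template match="text()[preceding-sibling::*[1][self::ex] and not(translate(substring(.,1,1), $marks, '')='')]">
  <xsl:variable name="words" select="tokenize(., '\s')"/>
  <fragment><xsl:value-of select="$words[1]"/></fragment>
  <xsl:value-of select="substring(., string-length($words[1]) + 1)"/>
</xsl:template>
</xsl:stylesheet>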
<!-- Mode "text": hand <seg> to the named template "process" (not shown here),
     flatten <fragment> to its text, and copy everything else. -->
<xsl:template match="seg" mode="text">
  <xsl:call-template name="process"/>
</xsl:template>
<xsl:template match="fragment" mode="text">
  <xsl:value-of select="."/>
</xsl:template>
<xsl:template match="node() | @*" mode="text">
  <xsl:copy>
    <xsl:apply-templates select="node() | @*" mode="text"/>
  </xsl:copy>
</xsl:template>
<!-- Mode "orig": expansions become <am/>, supplied text becomes <damage/>. -->
<xsl:template match="ex" mode="orig">
  <am/>
</xsl:template>
<xsl:template match="fragment" mode="orig">
  <xsl:value-of select="."/>
</xsl:template>
<xsl:template match="supplied" mode="orig">
  <damage/>
</xsl:template>
<!-- Mode "abbr": expansions become <am/>, <supplied> is kept as it is. -->
<xsl:template match="ex" mode="abbr">
  <am/>
</xsl:template>
<xsl:template match="fragment" mode="abbr">
  <xsl:value-of select="."/>
</xsl:template>
<xsl:template match="supplied" mode="abbr">
  <xsl:copy-of select="."/>
</xsl:template>
<!-- Mode "expan": both <supplied> and <ex> are kept as they are. -->
<xsl:template match="fragment" mode="expan">
  <xsl:value-of select="."/>
</xsl:template>
<xsl:template match="supplied|ex" mode="expan">
  <xsl:copy-of select="."/>
</xsl:template>
Best Regards,
George
--
George Cristian Bina
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
[resending after bounce message...because the mailing list doesn't like google app's different X-MAIL-FROM header...fingers crossed it is right now.]
Hiya,
I have some XML that has mixed content of markup and text nodes where I want to process certain words. The words in the document are not already tokenized in any way (and multiple levels of nested markup ranging from the middle of words make this difficult). What I want to do is process the individual words (some containing or embedded in markup) and, where there is an expansion denoted by parentheses, provide both that and the abbreviated form. If that works, then where there is a <supplied> element beginning and ending inside the word, I also want to replace that with <damage/> to provide a copy of the original.
If the content is something like:
=====
<p>
<lb n="1"/>In nomine Domini amen. Ne error obliuionis <supplied>geſtis</supplied> ſub tempore verſantibus pariat detrimentu(m).
<lb n="2"/>Conuenit, ut actus h<supplied>om</supplied>inu(m) l(itte)r<supplied>ar</supplied>(um) et teſtium fidedignorum <seg>annotac(i)on(e)</seg> ad poſteritatis noticiam <foo>deducantur <seg>aut int(er)dum</seg> ob</foo> scripture vetustatem renovent(ur). Ad perpetuam proinde ...
</p>
=====
The output should change words containing ( and ) into a nested structure such as:
input:
h<supplied type="damage">om</supplied>inu(m)

output:
<choice>
  <orig>h<damage/>inu<am/></orig>
  <abbr>h<supplied type="damage">om</supplied>inu<am/></abbr>
  <expan>h<supplied type="damage">om</supplied>inu<ex>m</ex></expan>
</choice>
The <orig> is only supplied here because the original word actually has a <supplied reason="damage"> element that begins/ends inside the word. (For the full example I've not included the attribute to make it more readable.) Words can contain any number of elements such as <lb/> and <supplied>, as well as the usual whitespace problems. Abbreviations denoted by parentheses are always only part of an individual word, though they may occur multiple times in a word.
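
For the simplest case, a word that is plain text with no markup inside it, the <abbr>/<expan> pair can be sketched roughly as below. This is only a minimal XSLT 2.0 sketch: the template names "demo" and "word-to-choice" and the parameter "word" are made up for illustration, and it does not handle <supplied>, <lb/> or any other markup inside the word.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="no"/>

  <!-- Hypothetical entry point: invoke as the initial named template. -->
  <xsl:template name="demo">
    <xsl:call-template name="word-to-choice">
      <xsl:with-param name="word" select="'l(itte)r(um)'"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Hypothetical helper: $word must be plain text (assumption). -->
  <xsl:template name="word-to-choice">
    <xsl:param name="word"/>
    <choice>
      <!-- Abbreviated form: every parenthesised run becomes <am/>. -->
      <abbr>
        <xsl:analyze-string select="$word" regex="\(.+?\)">
          <xsl:matching-substring><am/></xsl:matching-substring>
          <xsl:non-matching-substring>
            <xsl:value-of select="."/>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </abbr>
      <!-- Expanded form: the text between the parentheses goes into <ex>. -->
      <expan>
        <xsl:analyze-string select="$word" regex="\((.+?)\)">
          <xsl:matching-substring>
            <ex><xsl:value-of select="regex-group(1)"/></ex>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <xsl:value-of select="."/>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </expan>
    </choice>
  </xsl:template>

</xsl:stylesheet>
```

For the word l(itte)r(um) this gives an <abbr> of l<am/>r<am/> and an <expan> of l<ex>itte</ex>r<ex>um</ex>. The reluctant .+? keeps each match to a single pair of parentheses, which matters when a word has more than one abbreviation.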
Full output of the above would be something like:
=====
<p>
<lb n="1"/>In nomine Domini amen. Ne error obliuionis <supplied>geſtis</supplied> ſub tempore verſantibus pariat
<choice>
  <abbr>detrimentu<am/></abbr>
  <expan>detrimentu<ex>m</ex></expan>
</choice>.
<lb n="2"/>Conuenit, ut actus
<choice>
  <orig>h<damage/>inu<am/></orig>
  <abbr>h<supplied>om</supplied>inu<am/></abbr>
  <expan>h<supplied>om</supplied>inu<ex>m</ex></expan>
</choice>
<choice>
  <orig>l<am/>r<damage/><am/></orig>
  <abbr>l<am/>r<supplied>ar</supplied><am/></abbr>
  <expan>l<ex>itte</ex>r<supplied>ar</supplied><ex>um</ex></expan>
</choice>
et teſtium fidedignorum
<seg>
  <choice>
    <abbr>annotac<am/>on<am/></abbr>
    <expan>annotac<ex>i</ex>on<ex>e</ex></expan>
  </choice>
</seg>
ad poſteritatis noticiam
<foo>deducantur
<seg>aut
  <choice>
    <abbr>int<am/>dum</abbr>
    <expan>int<ex>er</ex>dum</expan>
  </choice>
</seg>
ob</foo> scripture vetustatem
<choice>
  <abbr>renovent<am/></abbr>
  <expan>renovent<ex>ur</ex></expan>
</choice>.
Ad perpetuam proinde ...
</p>
=====
The default copying-to-output, the choices between things, and creating the different versions once I have each word and its abbreviations tokenized all seem straightforward. It is getting each word, without losing any other markup, and knowing where the abbreviations are that I'm more fuzzy about. I hate asking for help before I've got very far, but it is straying into territory I'm not very familiar with. I'm guessing that this needs a multi-pass mode-based stylesheet with xsl:analyze-string to find the parentheses, and, rather than tokenize() to find the edges of the words, erm, maybe xsl:for-each-group? While I found individual bits of this in the FAQ I didn't find anything doing it all at once.
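
The multi-pass, mode-based part of that guess can be sketched on its own: in XSLT 2.0 the result of one pass can be held in a variable as a temporary tree and then processed again in a second mode. The mode names "mark-ex" and "to-am" are invented for this sketch, and the second pass is deliberately trivial. It only shows the chaining, not the word-boundary problem.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Driver: run pass 1 into a temporary tree, then run pass 2 over it. -->
  <xsl:template match="/">
    <xsl:variable name="pass1">
      <xsl:apply-templates mode="mark-ex"/>
    </xsl:variable>
    <xsl:apply-templates select="$pass1" mode="to-am"/>
  </xsl:template>

  <!-- Identity copy shared by both (hypothetical) modes. -->
  <xsl:template match="node() | @*" mode="mark-ex to-am">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Pass 1: wrap the content of each parenthesised run in <ex>.
       The explicit priority avoids a conflict with the identity template,
       since text() and node() have the same default priority. -->
  <xsl:template match="text()" mode="mark-ex" priority="1">
    <xsl:analyze-string select="." regex="\((.+?)\)">
      <xsl:matching-substring>
        <ex><xsl:value-of select="regex-group(1)"/></ex>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Pass 2: matches <ex> elements that exist only in the pass 1 result. -->
  <xsl:template match="ex" mode="to-am">
    <am/>
  </xsl:template>

</xsl:stylesheet>
```

The point of the second pass is that its patterns can match elements the first pass created, which a single pass over the source document cannot do.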
Any suggestions ranging from pointers in the right direction to fully-realized solutions gratefully received with promises of a pint next time you're in Oxford. ;-)
Many thanks, -James Cummings (posting from a new and silly domain name)