Re: [xsl] How to split text element to separate spans?

Subject: Re: [xsl] How to split text element to separate spans?
From: Israel Viente <israel.viente@xxxxxxxxx>
Date: Tue, 8 Jun 2010 12:42:03 +0300
Liam and Gerrit: Thank you very much for your input, ideas and explanations.
I have a lot to catch up on in XSLT before I can understand this
code, but I'll try.
Thanks again, Israel


On Tue, Jun 8, 2010 at 2:28 AM, Imsieke, Gerrit, le-tex
<gerrit.imsieke@xxxxxxxxx> wrote:
> Dear Israel,
>
> I once wrote a generic splitting routine where you can split at arbitrary
> XPath expressions, at any depth. It uses saxon:evaluate, though, and is too
> complicated to be instructive here. So I tried to simplify it, below.
>
> Let's consider this input:
>
> =========8<-------------------
>
> <?xml version="1.0" encoding="utf-8"?>
> <doc>
> <p dir="ltr"><span class="smaller">text1
>             <br />
>              text2
>             text3.
>             <br />
>             </span> <span class="smalleritalic">no</span> <span
> class="smaller">problems.
>             <br />
>
>
>             <br /></span></p>
>
> <p dir="ltr"><br/><span class="smaller">text1
>             <br />
>              <span class="reallytiny">text2 <br /></span>
>             text3.
>             <br />
>             </span> <span class="smalleritalic">no</span> <span
> class="smaller">problems.
>             <br />
>
>
>             <br /></span></p>
>
> <p dir="ltr">  <span class="regular">"What else?"</span></p>
> </doc>
>
> =========8<-------------------
>
> The first p contains your original input, the second p contains a br within
> *nested* spans (and a br immediately below p), and the third one doesn't
> contain a br.
>
> Applying the stylesheet quoted below, we'll arrive at this output:
>
> =========8<-------------------
>
> <?xml version="1.0" encoding="UTF-8"?><doc>
> <p dir="ltr"><span class="smaller">text1
>             </span><br/><span class="smaller">
>              text2
>             text3.
>             </span><br/><span class="smaller">
>             </span> <span class="smalleritalic">no</span> <span
> class="smaller">problems.
>             </span><br/><span class="smaller">
>
>
>             </span><br/></p>
>
> <p dir="ltr"><br/><span class="smaller">text1
>             </span><br/><span class="smaller">
>              <span class="reallytiny">text2 </span></span><br/><span
> class="smaller">
>             text3.
>             </span><br/><span class="smaller">
>             </span> <span class="smalleritalic">no</span> <span
> class="smaller">problems.
>             </span><br/><span class="smaller">
>
>
>             </span><br/></p>
>
> <p dir="ltr">  <span class="regular">"What else?"</span></p>
> </doc>
>
> =========8<-------------------
>
> You might find it dissatisfying that the XML code doesn't look as
> pretty-printed as your desired output. In order to arrive at an output as
> neat as specified, you will need to apply three more passes of whitespace
> extraction/normalization (left, right, middle) to the top-level spans. If
> you really have to pretty-print the XML in such a way, I will send you the
> complete stylesheet.
>
> So here's the version that does just the splitting:
>
> =========8<-------------------
>
> <?xml version="1.0" encoding="utf-8"?>
> <xsl:transform
>   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>   xmlns:my="my"
>   version="2.0"
>   exclude-result-prefixes="my">
>
>   <xsl:output method="xml" indent="no" />
>
>   <!-- Default identity transform: -->
>   <xsl:template match="@* | *">
>     <xsl:copy>
>       <xsl:apply-templates select="@* | node()"/>
>     </xsl:copy>
>   </xsl:template>
>
>   <xsl:template match="p/span">
>     <xsl:sequence select="my:split-at-br(.)"/>
>   </xsl:template>
>
>
>   <!-- split-at-br is intended for
>           <p>foo<br/>bar</p>
>        -> <p>foo</p><br/><p>bar</p> -->
>   <xsl:function name="my:split-at-br" as="element(*)+">
>     <xsl:param name="top" as="element(*)" />
>     <!-- group adjacent leaves (text nodes, empty elements) which are
>          not br: -->
>     <xsl:for-each-group
>       select="$top//node()[ count(node()) = 0 ]"
>       group-adjacent="not(self::br)">
>       <xsl:choose>
>         <xsl:when test="current-grouping-key()">
>           <!-- output the top element and its subtree, restricted to
>                all ancestors of the current leaf group and the current
>                leaf group itself: -->
>           <xsl:apply-templates select="$top" mode="split">
>             <xsl:with-param name="restricted-to"
>               select="current-group()" tunnel="yes"/>
>           </xsl:apply-templates>
>         </xsl:when>
>         <xsl:otherwise>
>           <br/>
>         </xsl:otherwise>
>       </xsl:choose>
>     </xsl:for-each-group>
>   </xsl:function>
>
>   <xsl:template match="*" mode="split">
>     <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
>     <!-- Only process this element if it's within the restriction group
>          or its members' ancestors: -->
>     <xsl:if test="generate-id(.) = (
>                     for $n in $restricted-to
>                     return (
>                       for $a in $n/ancestor-or-self::*
>                       return generate-id($a)
>                     )
>                   )">
>       <xsl:copy>
>         <xsl:copy-of select="@*"/>
>         <xsl:apply-templates mode="#current">
>           <xsl:with-param name="restricted-to" select="$restricted-to"
>             tunnel="yes"/>
>         </xsl:apply-templates>
>       </xsl:copy>
>     </xsl:if>
>   </xsl:template>
>
>   <xsl:template match="node()[count(node()) = 0]" mode="split">
>     <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
>     <xsl:if test="generate-id(.) = (for $n in $restricted-to
>                                     return generate-id($n))">
>       <xsl:copy-of select="." />
>     </xsl:if>
>   </xsl:template>
>
> </xsl:transform>
>
> =========8<-------------------
>
> (Please note that I called it xsl:transform instead of xsl:stylesheet, as a
> tribute to Roger L. Costello. But that's another thread, a dead thread.)
>
> The stylesheet (or transformation program) does the following:
>
> For each span immediately below a p, call a function that returns multiple
> spans, interspersed with br's.
>
> This function works as follows:
>
> Of all descendants of the span, it selects only the leaves. So if the
> structure is
> p
>   span(1)
>     span(2)
>       text(a)
>       br
>       text(b)
>     span(3)
>       text(c)
> it selects the sequence (text(a), br, text(b), text(c)).
> Then it groups adjacent nodes of this sequence by the key not(self::br),
> so that adjacent non-br nodes form one group (and, as a consequence,
> adjacent br nodes do, too).
> So we now have the following groups:
> (text(a)) -- grouping key is true
> (br) -- grouping key is false
> (text(b), text(c)) -- grouping key is true
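
To see these groups for an actual document, the grouping step can be run
on its own. The following is only a minimal sketch and not part of the
stylesheet quoted above: it reuses the same leaf selection and the same
group-adjacent key, and prints one line per group. The labels "non-br:"
and "br:" and the text(...) notation are made up here for illustration.
It assumes an XSLT 2.0 processor.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="//p/span">
      <!-- Same leaf selection and grouping key as in my:split-at-br -->
      <xsl:for-each-group
        select=".//node()[count(node()) = 0]"
        group-adjacent="not(self::br)">
        <!-- The grouping key is a boolean: true for non-br leaves -->
        <xsl:value-of
          select="if (current-grouping-key()) then 'non-br: ' else 'br: '"/>
        <!-- List the members of the group: text nodes by their
             normalized content, elements by name -->
        <xsl:value-of separator=", "
          select="for $n in current-group()
                  return if ($n instance of text())
                         then concat('text(', normalize-space($n), ')')
                         else name($n)"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each-group>
    </xsl:for-each>
  </xsl:template>

</xsl:transform>
```

For the input <p><span><span>a<br/>b</span><span>c</span></span></p>
this prints "non-br: text(a)", "br: br" and "non-br: text(b), text(c)",
one per line.
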
>
> For each of the non-br groups, span(1) -- the span to be split at br -- is
> processed in mode="split", with the parameter $restricted-to set to the
> current group.
>
> First, span(1) is processed in mode="split" with $restricted-to =
> (text(a)).
> Only if span(1) is among the ancestors of the $restricted-to nodes (or
> among the $restricted-to nodes themselves) will its contents be processed.
> Its contents are processed in mode="split", too, with the same
> $restricted-to parameter.
> Being an ancestor of text(a), span(2) will be processed, while nothing
> happens for span(3).
> As a result of processing span(2) in mode="split" with $restricted-to =
> (text(a)), only text(a) will be output; br and text(b) are not in the
> group and are skipped.
>
> Going back to for-each-group: the next group is br which will be reproduced
> as br, but on the same level as span(1).
>
> So far, our result tree looks like
> p
>   span(1)
>     span(2)
>       text(a)
>   br
>
> The next group is (text(b), text(c)). Again, span(1) will be processed
> in mode="split", now with $restricted-to = (text(b), text(c)).
> As an ancestor to any of the $restricted-to leaf nodes, span(1) will be
> reproduced (the element and its original attributes, not the entire
> subtree!).
> As ancestors to each of the leaf nodes, both span(2) and span(3) will be
> reproduced below span(1).
> When processing the subtree of span(2) with the restriction to (text(b),
> text(c)), only text(b) will be output. For span(3), only text(c) will be
> output.
> So finally we have
> p
>   span(1)
>     span(2)
>       text(a)
>   br
>   span(1)
>     span(2)
>       text(b)
>     span(3)
>       text(c)
>
> Although it may seem like overkill at first sight, the big advantage of this
> approach is that it works well for br within nested spans.
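
The same idea can also be written with node-set operators instead of
generate-id() comparisons. The following variant is a sketch, not the
stylesheet quoted above. It assumes an XSLT 2.0 processor and differs in
three ways: membership is tested with intersect, every br of a br group
is copied (so directly adjacent br elements are not merged into one),
and the function's return type is relaxed to element(*)* so that a span
without any descendant nodes does not raise a type error.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:my="my"
  version="2.0"
  exclude-result-prefixes="my">

  <xsl:output method="xml" indent="no"/>

  <!-- Default identity transform, as in the original -->
  <xsl:template match="@* | *">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p/span">
    <xsl:sequence select="my:split-at-br(.)"/>
  </xsl:template>

  <xsl:function name="my:split-at-br" as="element(*)*">
    <xsl:param name="top" as="element(*)"/>
    <xsl:for-each-group
      select="$top//node()[empty(node())]"
      group-adjacent="not(self::br)">
      <xsl:choose>
        <xsl:when test="current-grouping-key()">
          <xsl:apply-templates select="$top" mode="split">
            <xsl:with-param name="restricted-to"
              select="current-group()" tunnel="yes"/>
          </xsl:apply-templates>
        </xsl:when>
        <xsl:otherwise>
          <!-- Copy each br of the group, not just a single one -->
          <xsl:copy-of select="current-group()"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </xsl:function>

  <!-- Elements: keep only ancestors (or members) of the current group -->
  <xsl:template match="*" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to/ancestor-or-self::*)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <!-- The tunnel parameter is passed on automatically -->
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Leaves: keep only members of the current group -->
  <xsl:template match="node()[empty(node())]" mode="split" priority="1">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:transform>
```

For the sample input above, this should produce the same result as the
original stylesheet, because none of the br elements there are directly
adjacent.
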
>
> With the generic approach (arbitrary XPath expressions for splitting), you
> can extend analyze-string to process markup: in a preparatory pass, use
> plain analyze-string on the text nodes to replace the regex matches with
> some unique markup, then use the generic splitting function to split at
> this markup, then treat the resulting nodes as you would have treated
> matching or non-matching substrings.
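
The preparatory pass described here could look roughly like the sketch
below. It is an illustration only, assuming an XSLT 2.0 processor. The
marker element name split-here, the parameter name regex and its default
value ';' are hypothetical and do not come from the generic routine
mentioned above. The regex must not match a zero-length string.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <!-- Hypothetical parameter: the regex at which to split -->
  <xsl:param name="regex" select="';'"/>

  <!-- Identity transform for everything except text nodes -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace each regex match in a text node with an empty marker
       element; the explicit priority avoids a conflict with the
       identity template above -->
  <xsl:template match="text()" priority="1">
    <xsl:analyze-string select="." regex="{$regex}">
      <xsl:matching-substring>
        <split-here/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:transform>
```

A second pass would then split at split-here in the same way the
stylesheet above splits at br.
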
>
> -Gerrit
>
>
> On 07.06.2010 13:36, Israel Viente wrote:
>>
>> Thank you for your answer Mukul.
>> It does put the br between the spans, but it loses the spaces between
>> the spans and replaces them with br.
>>
>> The result of the code you sent gives the following output:
>>
>> <p dir="ltr"><span class="smaller">text1</span><br /><span
>> class="smaller">text2 text3.</span><br /><span
>> class="smalleritalic">no</span><br /><span
>> class="smaller">problems.</span><br /><br /></p>
>>
>> The desired one is:
>>>>
>>>> <p dir="ltr"><span class="smaller">text1</span>
>>>>             <br />
>>>>              <span class="smaller">text2 text3.</span>
>>>>             <br />
>>>>             <span class="smalleritalic">no</span>  <span
>>>> class="smaller">problems.</span>
>>>>             <br />
>>>>             <br />
>>>>             </p>
