Re: [xsl] How to split text element to separate spans?

Dear Israel,

I once wrote a generic splitting routine that splits at arbitrary XPath expressions, at any depth. It relies on saxon:evaluate, though, and is too complicated to be instructive here, so I have simplified it below.

Let's consider this input:

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<doc>
<p dir="ltr"><span class="smaller">text1
            <br />
             text2
            text3.
            <br />
            </span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
            <br />
            </span></p>

<p dir="ltr"><br/><span class="smaller">text1
            <br />
             <span class="reallytiny">text2 <br /></span>
            text3.
            <br />
            </span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
            <br />
            </span></p>

<p dir="ltr">  <span class="regular">"What else?"</span></p>
</doc>

=========8<-------------------

The first p contains your original input, the second p contains a br within *nested* spans (and a br immediately below p), and the third one doesn't contain a br.

Applying the stylesheet quoted below, we'll arrive at this output:

=========8<-------------------

<?xml version="1.0" encoding="UTF-8"?><doc> text1 text2 text3. no problems. 

 

 text1 text2 text3. no problems. 

 

<p dir="ltr">  <span class="regular">"What else?"</span></p>
</doc>

=========8<-------------------

You might find it dissatisfying that the XML code doesn't look as pretty-printed as your desired output. In order to arrive at an output as neat as specified, you will need to apply three more passes of whitespace extraction/normalization (left, right, middle) to the top-level spans. If you really have to pretty-print the XML in such a way, I will send you the complete stylesheet.

So here's the version that does just the splitting:

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:my="my"
  version="2.0"
  exclude-result-prefixes="my">

<xsl:output method="xml" indent="no" />

  <!-- Default identity transform: -->
  <xsl:template match="@* | *">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p/span">
    <xsl:sequence select="my:split-at-br(.)"/>
  </xsl:template>

  <xsl:function name="my:split-at-br" as="element(*)+">
    <xsl:param name="top" as="element(*)" />
    <!-- Group the leaves of $top into runs of non-br nodes and runs of br: -->
    <xsl:for-each-group select="$top//node()[ count(node()) = 0 ]"
                        group-adjacent="not(self::br)">
      <xsl:choose>
        <xsl:when test="current-grouping-key()">
          <!-- Non-br run: rebuild $top, restricted to these leaves. -->
          <xsl:apply-templates select="$top" mode="split">
            <xsl:with-param name="restricted-to" select="current-group()" tunnel="yes"/>
          </xsl:apply-templates>
        </xsl:when>
        <xsl:otherwise>
          <!-- br run: reproduce the br on the same level as $top. -->
          <xsl:copy-of select="current-group()"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </xsl:function>

  <xsl:template match="*" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <!-- Copy the element only if it is an ancestor-or-self of a permitted leaf: -->
    <xsl:if test="generate-id(.) = (for $n in $restricted-to
                                    return (for $a in $n/ancestor-or-self::*
                                            return generate-id($a)))">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current">
          <xsl:with-param name="restricted-to" select="$restricted-to" tunnel="yes"/>
        </xsl:apply-templates>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <xsl:template match="node()[count(node()) = 0]" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <!-- Copy a leaf only if it is one of the permitted leaves: -->
    <xsl:if test="generate-id(.) = (for $n in $restricted-to return generate-id($n))">
      <xsl:copy-of select="." />
    </xsl:if>
  </xsl:template>

</xsl:transform>

=========8<-------------------

(Please note that I called it xsl:transform instead of xsl:stylesheet, as a tribute to Roger L. Costello. But that's another thread, a dead thread.)

The stylesheet (or transformation program, if you prefer) does the following:

For each span immediately below a p, call a function that returns multiple spans, interspersed with br's.

This function works as follows:

Of all descendants of the span, only select the leaves. So if the structure is
p
  span(1)
    span(2)
      text(a)
      br
      text(b)
    span(3)
      text(c)
it selects the sequence (text(a), br, text(b), text(c)). Then it groups the sequence according to the criterion that all adjacent non-br nodes should be grouped (and all adjacent br nodes, too, as a consequence). So we now have the following groups:
(text(a)) -- matches the grouping key
(br) -- doesn't match the grouping key
(text(b), text(c)) -- matches the grouping key
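
To see which groups the function iterates over, the grouping step can be run in isolation. The following is a minimal sketch, not part of the stylesheet above: for every span directly below a p it prints one line per group, reusing the select and group-adjacent expressions from my:split-at-br. The labels 'non-br', 'br', 'text()' and 'other' are made up for this illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <xsl:output method="text"/>

  <!-- For every span directly below a p, list its groups of leaf nodes. -->
  <xsl:template match="/">
    <xsl:for-each select="//p/span">
      <xsl:for-each-group select=".//node()[count(node()) = 0]"
                          group-adjacent="not(self::br)">
        <!-- The grouping key is true() for runs of non-br leaves,
             false() for runs of br elements. -->
        <xsl:value-of select="if (current-grouping-key()) then 'non-br' else 'br'"/>
        <xsl:text>: </xsl:text>
        <!-- Illustrative labels only: element name, 'text()', or 'other'. -->
        <xsl:value-of separator=", "
          select="for $n in current-group()
                  return if ($n instance of element()) then name($n)
                         else if ($n instance of text()) then 'text()'
                         else 'other'"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each-group>
    </xsl:for-each>
  </xsl:template>

</xsl:transform>
```

Note that whitespace-only text nodes count as leaves, too, so on indented input the groups will contain more text() entries than the simplified structure suggests.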

For each of the non-br groups, span(1) -- the span to be split at br -- is processed in mode="split", with the parameter $restricted-to set to the current group.

So first, span(1) is processed in mode="split" with $restricted-to = (text(a)). Its contents are processed only if span(1) is among the ancestors of the $restricted-to nodes (or among the $restricted-to nodes themselves). They are processed in mode="split", with the same $restricted-to parameter. Being an ancestor of text(a), span(2) is processed, while nothing happens for span(3). As a result of processing span(2) in mode="split" with $restricted-to = (text(a)), text(a) is output.

Going back to for-each-group: the next group is br which will be reproduced as br, but on the same level as span(1).

So far, our result tree looks like
p
  span(1)
    span(2)
      text(a)
  br

The next group is (text(b), text(c)). Again, span(1) will be processed in mode="split", now with $restricted-to = (text(b), text(c)). As an ancestor of both $restricted-to leaf nodes, span(1) will be reproduced (the element and its original attributes, not the entire subtree!). As ancestors of one leaf node each, both span(2) and span(3) will be reproduced below span(1). When processing the subtree of span(2) with the restriction to (text(b), text(c)), only text(b) will be output. For span(3), only text(c) will be output. So finally we have
p
  span(1)
    span(2)
      text(a)
  br
  span(1)
    span(2)
      text(b)
    span(3)
      text(c)

Although it may seem like overkill at first sight, the big advantage of this approach is that it works well for br within nested spans.
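
The ancestor test in the two mode="split" templates can also be phrased in terms of node identity instead of generate-id() comparisons. This is a sketch of an alternative formulation, assuming that all nodes in $restricted-to come from the same source tree as the context node; it is meant as a replacement for those two templates, not as a complete stylesheet.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <!-- Elements: copy only if they are, or contain, one of the permitted leaves. -->
  <xsl:template match="*" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to/ancestor-or-self::*)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <!-- Tunnel parameters are passed on automatically,
             so no xsl:with-param is needed here. -->
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Leaves: copy only if they are among the permitted leaves. -->
  <xsl:template match="node()[count(node()) = 0]" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:transform>
```

The intersect operator compares nodes by identity, which is the same thing the generate-id() comparison expresses via strings.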

With the generic approach (arbitrary XPath expressions for splitting), you can extend analyze-string to process markup: in a preparatory pass, use plain analyze-string on the text nodes to replace the regex with some unique markup, then use the generic splitting function to split at this markup, then treat the resulting nodes as you would have treated matching or non-matching substrings.
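
The preparatory pass could look roughly like the following sketch. Everything specific in it is an assumption made for illustration: the mode name "mark", the marker element my:split-here, and the $split-regex parameter with its default value (a semicolon plus trailing whitespace). It only shows the first of the three steps, replacing regex matches in text nodes with unique markup.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:my="my"
  version="2.0">

  <!-- Hypothetical regex to split at; it must not match the empty string. -->
  <xsl:param name="split-regex" select="';\s*'"/>

  <xsl:template match="/">
    <xsl:apply-templates mode="mark"/>
  </xsl:template>

  <!-- Copy elements and attributes unchanged. -->
  <xsl:template match="@* | *" mode="mark">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace every regex match within a text node with an empty marker element. -->
  <xsl:template match="text()" mode="mark">
    <xsl:analyze-string select="." regex="{$split-regex}">
      <xsl:matching-substring>
        <my:split-here/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:transform>
```

A splitting function modelled on my:split-at-br would then group on not(self::my:split-here) instead of not(self::br), and the marker groups would be dropped or replaced rather than reproduced.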

-Gerrit

On 07.06.2010 13:36, Israel Viente wrote:

Thank you for your answer Mukul.
It does put the br between the spans but loses the spaces between spans
and replaces them with br.

The result of the code you sent gives the following output:

<p dir="ltr"><span class="smaller">text1</span><br /><span
class="smaller">text2 text3.</span><br /><span
class="smalleritalic">no</span><br /><span
class="smaller">problems.</span><br /><br /></p>

The desired one is:

<p dir="ltr"><span class="smaller">text1</span>
            <br />
             <span class="smaller">text2 text3.</span>
            <br />
            <span class="smalleritalic">no</span>  <span
class="smaller">problems.</span>
            <br />
            <br />
            </p>
