Dear Israel,
I once wrote a generic splitting routine where you can split at
arbitrary XPath expressions, at any depth. It uses saxon:evaluate,
though, and is too complicated to be instructive here. So I tried to
simplify it, below.
Let's consider this input:
=========8<-------------------
<?xml version="1.0" encoding="utf-8"?>
<doc>
<p dir="ltr"><span class="smaller">text1
<br />
text2
text3.
<br />
</span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
<br />
<br /></span></p>
<p dir="ltr"><br/><span class="smaller">text1
<br />
<span class="reallytiny">text2 <br /></span>
text3.
<br />
</span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
<br />
<br /></span></p>
<p dir="ltr"> <span class="regular">"What else?"</span></p>
</doc>
=========8<-------------------
The first p contains your original input, the second p contains a br
within *nested* spans (and a br immediately below p), and the third one
doesn't contain a br.
Applying the stylesheet quoted below, we'll arrive at this output:
=========8<-------------------
<?xml version="1.0" encoding="UTF-8"?><doc>
<p dir="ltr"><span class="smaller">text1
</span><br/><span class="smaller">
text2
text3.
</span><br/><span class="smaller">
</span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
</span><br/><span class="smaller">
</span><br/></p>
<p dir="ltr"><br/><span class="smaller">text1
</span><br/><span class="smaller">
<span class="reallytiny">text2 </span></span><br/><span
class="smaller">
text3.
</span><br/><span class="smaller">
</span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
</span><br/><span class="smaller">
</span><br/></p>
<p dir="ltr"> <span class="regular">"What else?"</span></p>
</doc>
=========8<-------------------
You might find it dissatisfying that the XML code doesn't look as
pretty-printed as your desired output. In order to arrive at an output
as neat as specified, you will need to apply three more passes of
whitespace extraction/normalization (left, right, middle) to the
top-level spans. If you really have to pretty-print the XML in such a
way, I will send you the complete stylesheet.
So here's the version that does just the splitting:
=========8<-------------------
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:my="my"
    version="2.0"
    exclude-result-prefixes="my">

  <xsl:output method="xml" indent="no" />

  <!-- Default identity transform: -->
  <xsl:template match="@* | *">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p/span">
    <xsl:sequence select="my:split-at-br(.)"/>
  </xsl:template>

  <!-- split-at-br is intended for
       <p>foo<br/>bar</p>
       -> <p>foo</p><br/><p>bar</p> -->
  <xsl:function name="my:split-at-br" as="element(*)+">
    <xsl:param name="top" as="element(*)" />
    <!-- group adjacent leaves (text nodes, empty elements) which are
         not br: -->
    <xsl:for-each-group
        select="$top//node()[ count(node()) = 0 ]"
        group-adjacent="not(self::br)">
      <xsl:choose>
        <xsl:when test="current-grouping-key()">
          <!-- output the top element and its subtree, restricted to
               all ancestors of the current leaf group and the current
               leaf group itself: -->
          <xsl:apply-templates select="$top" mode="split">
            <xsl:with-param name="restricted-to"
                select="current-group()" tunnel="yes"/>
          </xsl:apply-templates>
        </xsl:when>
        <xsl:otherwise>
          <!-- copy every br of the group, so that directly adjacent
               br's are not collapsed into a single one: -->
          <xsl:copy-of select="current-group()"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </xsl:function>

  <xsl:template match="*" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <!-- Only process this element if it's within the restriction group
         or its members' ancestors: -->
    <xsl:if test="generate-id(.) = (
                    for $n in $restricted-to
                    return (
                      for $a in $n/ancestor-or-self::*
                      return generate-id($a)
                    )
                  )">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current">
          <xsl:with-param name="restricted-to" select="$restricted-to"
              tunnel="yes"/>
        </xsl:apply-templates>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Leaves are only copied if they belong to the restriction group: -->
  <xsl:template match="node()[count(node()) = 0]" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="generate-id(.) = (for $n in $restricted-to return
                  generate-id($n))">
      <xsl:copy-of select="." />
    </xsl:if>
  </xsl:template>

</xsl:transform>
=========8<-------------------
(Please note that I called it xsl:transform instead of xsl:stylesheet,
as a tribute to Roger L. Costello. But that's another thread, a dead
thread.)
The stylesheet (or rather, the transformation program) does the following:
For each span immediately below a p, call a function that returns
multiple spans, interspersed with br's.
This function works as follows:
Of all descendants of the span, only select the leaves. So if the
structure is
p
  span(1)
    span(2)
      text(a)
      br
      text(b)
    span(3)
      text(c)
it selects the sequence (text(a), br, text(b), text(c)).
Then it groups adjacent nodes of this sequence by the boolean key
not(self::br): every run of consecutive non-br leaves forms one group
(key true) and, as a consequence, every run of consecutive br leaves
forms one group, too (key false).
So we now have the following groups:
(text(a)) -- grouping key is true
(br) -- grouping key is false
(text(b), text(c)) -- grouping key is true
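If you want to look at these groups for your own input, a small diagnostic stylesheet may help. This is only a sketch of mine, not part of the splitting stylesheet: it uses the same select and group-adjacent expressions and prints one line per group (group position, grouping key, number of leaves in the group) as plain text.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- same spans as in the splitting stylesheet: spans directly below p -->
    <xsl:for-each select="//p/span">
      <xsl:text>span:&#10;</xsl:text>
      <!-- leaves = nodes without children; adjacent leaves with the
           same boolean key end up in the same group -->
      <xsl:for-each-group
          select=".//node()[count(node()) = 0]"
          group-adjacent="not(self::br)">
        <xsl:text>  group </xsl:text>
        <xsl:value-of
            select="position(), current-grouping-key(),
                    count(current-group())"
            separator=" "/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each-group>
    </xsl:for-each>
  </xsl:template>

</xsl:transform>
```

For the structure sketched above, this should list three groups for span(1): a true group with one leaf, a false group with one leaf, and a true group with two leaves.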
For each of the non-br groups, span(1) -- the span to be split at br --
is processed in mode="split", with the parameter $restricted-to set to
the current group.
So firstly span(1) is being processed in mode="split" with
$restricted-to = (text(a)).
Only if span(1) is among the ancestors of $restricted-to (or among
$restricted-to itself) will its contents be processed.
Its contents will be processed in mode="split", with the same
$restricted-to parameter.
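Being an ancestor of text(a), span(2) will be processed, while nothing
happens for span(3), which is not an ancestor of text(a).
As a result of processing span(2) in mode="split" with $restricted-to =
(text(a)), only text(a) will be output; br and text(b) are leaves
outside the restriction and are skipped.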
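The two membership tests can also be written without generate-id, using the XPath 2.0 intersect operator. The following is a sketch under one assumption: that the splitting stylesheet is saved as split-at-br.xsl (a file name I made up for this example). The importing stylesheet overrides only the two mode="split" templates; everything else comes from the import.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

  <!-- Assumption: the splitting stylesheet is stored under this
       (hypothetical) file name, next to this file. -->
  <xsl:import href="split-at-br.xsl"/>

  <!-- Elements: keep the element only if it is an ancestor of (or
       identical to) one of the restricted leaves. -->
  <xsl:template match="*" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect
                         $restricted-to/ancestor-or-self::*)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <!-- tunnel parameters are passed on automatically, so no
             xsl:with-param is needed here -->
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Leaves: copy the leaf only if it is one of the restricted leaves. -->
  <xsl:template match="node()[count(node()) = 0]" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:transform>
```

Since intersect compares nodes by identity, this should select exactly the same nodes as the generate-id comparison; which of the two reads better is a matter of taste.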
Going back to for-each-group: the next group is br which will be
reproduced as br, but on the same level as span(1).
So far, our result tree looks like
p
  span(1)
    span(2)
      text(a)
  br
The next group is (text(b), text(c)). Again, span(1) will be
processed in mode="split", now with $restricted-to = (text(b), text(c)).
As an ancestor of these leaf nodes, span(1) will be
reproduced (the element and its original attributes, not the entire
subtree!).
Each being an ancestor of one of the leaf nodes, both span(2) and
span(3) will be reproduced below span(1).
When processing the subtree of span(2) with the restriction to (text(b),
text(c)), only text(b) will be output. For span(3), only text(c) will be
output.
So finally we have
p
  span(1)
    span(2)
      text(a)
  br
  span(1)
    span(2)
      text(b)
    span(3)
      text(c)
Although it may seem like overkill at first sight, the big advantage of
this approach is that it also works for br within nested spans.
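To replay this walk-through yourself, you could feed the stylesheet a minimal document with the same shape. The class values s1, s2 and s3 are made up for this example and stand for span(1), span(2) and span(3); the element content is kept free of whitespace so that the result is easy to read.

```xml
<?xml version="1.0" encoding="utf-8"?>
<doc>
<p><span class="s1"><span class="s2">a<br/>b</span><span class="s3">c</span></span></p>
</doc>
```

The p element of the result should then correspond to the final tree shown above:

```xml
<p><span class="s1"><span class="s2">a</span></span><br/><span class="s1"><span class="s2">b</span><span class="s3">c</span></span></p>
```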
With the generic approach (arbitrary XPath expressions for splitting),
you can extend analyze-string to process markup: in a preparatory pass,
use plain analyze-string on the text nodes to replace the regex matches
with some unique markup, then use the generic splitting function to
split at this markup, then treat the resulting nodes as you would have
treated matching or non-matching substrings.
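The preparatory pass could look roughly like this. It is a sketch, not my generic routine: the marker element name and the default regex are placeholders chosen for this example. Each regex match in a text node is replaced by an empty marker element that keeps the matched string in an attribute, so that the marker is a leaf just like br.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

  <!-- Assumption: the regex to split at; ';' is only an example.
       It must not match the empty string. -->
  <xsl:param name="regex" select="';'"/>

  <!-- Copy everything except text nodes unchanged: -->
  <xsl:template match="@* | * | comment() | processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace each match with an empty marker element
       ("marker" is a hypothetical name): -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$regex}">
      <xsl:matching-substring>
        <marker match="{.}"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:transform>
```

In the second pass you would then group by not(self::marker) instead of not(self::br). Note that the regex attribute is an attribute value template, so curly braces in a literal regex have to be doubled.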
-Gerrit
On 07.06.2010 13:36, Israel Viente wrote:
Thank you for your answer Mukul.
It does put the br between the spans but loses the spaces between spans
and replaces them with br.
The result of the code you sent gives the following output:
<p dir="ltr"><span class="smaller">text1</span><br /><span
class="smaller">text2 text3.</span><br /><span
class="smalleritalic">no</span><br /><span
class="smaller">problems.</span><br /><br /></p>
The desired one is:
<p dir="ltr"><span class="smaller">text1</span>
<br />
<span class="smaller">text2 text3.</span>
<br />
<span class="smalleritalic">no</span> <span
class="smaller">problems.</span>
<br />
<br />
</p>