[xsl] XML to XML change, handling mixed content

Subject: [xsl] XML to XML change, handling mixed content
From: Karlmarx R <karlmarxr@xxxxxxxxx>
Date: Wed, 19 Oct 2011 04:41:17 +0800 (SGT)
Hello,

I have 2 questions:

1) I have a specific requirement and I am a bit stuck on what would be
the best way to handle it. In a nutshell, I need to modify the source

<p>
    text text &#x2018; LINK-1 TEXT &#x2019; TEXT TEXT
    <url webUrl="XXX">XXX</url> TEXT
    <SOmething>TEXT</SOmething>
    AND again
    <INSIDE>SOME TEXT text &#x2018; LINK-2 TEXT &#x2019; TEXT
    <url webUrl="YYY">YYY</url></INSIDE>
    And can be more text with or without URL
    and TEXT like &#x2018; LINK-3 TEXT&#x2019;
</p>

to (THE REQUIREMENT)

<p>
    text text <a href="XXX"> LINK-1 TEXT </a> TEXT TEXT TEXT
    <SOmething>TEXT</SOmething>
    AND again <INSIDE>SOME TEXT text <a href="YYY"> LINK-2 TEXT </a> TEXT </INSIDE>
    And can be more text with or without URL and TEXT like &#x2018; LINK-3 TEXT&#x2019;
</p>

What is required is: for each <url>, if the text PRECEDING it contains a
string enclosed within &#x2018; and &#x2019;, then that string must be
converted to an <a href> link whose target is the URL. After narrowing
down to p[url], I am not sure what would be the best pattern to achieve
the desired result. Could you please suggest something? In the above
sample, NOTE that the last &#x2018; LINK-3 TEXT&#x2019; was left as it is
because there is no matching URL. Although I am using XSLT 1.0, if
XSLT 2.0 can solve it easily, please suggest that also.


[SAMPLE Skeleton XML and XSL]

XML:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <something>
        <blah-blah>Can have many child</blah-blah>
        <nodeGroup>
            <note id="does-not-matter-1">
                <p>
                    <something><sup>1</sup></something>
                    some text here. <bidItem id="95522-1" vol="1"> Title Name, Other details,
                        &#x2018;The arms trade and corruption&#x2019;,
                        <i>Prospect</i> Aug.2005</bidItem>.

                    <!-- NOTE: NO URL IN THIS CASE, WHICH IS FINE -->
                </p>
            </note>
            <note id="does-not-matter-2">
                <p> some text
                    &#x2018;Ex-Pentagon procurement executive gets jail time&#x2019;, text text &lt;
                    <url webUrl="http://www.aaa.xx/bbb/ddd.htm">http://www.aaa.xx/bbb/ddd.htm</url>&gt;;
                    &#x2018;Former Air Force acquisition official released from
                    jail&#x2019;, Government in 2005, &lt;
                    <url webUrl="http://www.aaa.xx/bbb/uuu.htm">SAME AS @webUrl</url>&gt;; and
                    <bidItem id="95522-2">Author name., &#x2018;Cashing in for profit? Who cost taxpayers
                        billions in biggest Pentagon scandal in years?&#x2019;, <i>60 Minutes</i>, CBS, 5 Jan. 2005
                    </bidItem>, &lt; <url webUrl="http://www.cbsnews.com/stories/2005/01/04/60II/main664652.shtml">SAME AS @webUrl</url>&gt;.

                    <!-- HERE EACH URL HAS MATCHING &#x2018;contents&#x2019; WHICH IS FINE -->
                </p>
            </note>
            <note id="does-not-matter-3">
                <p><something><sup>68</sup></something> This figure is comprised of a fine of
                    &#xa3;500&#xa0;000 ($900&#xa0;000) for &#x2018;irregular accounting practices&#x2019;
                    in a Tanzanian deal for an inappropriate and overpriced air radar system that was
                    tainted by allegations of high-level corruption, with ...($405&#xa0;000) costs..
                    &#xa3;29.275 million ($52.695 million) going to Tanzania in reparations. <bidItem
                    id="996522-31" title="BAE deal with Tanzania...">Evans, R. and Lewis, P.,
                    &#x2018;BAE deal with Tanzania:
                    military air traffic control&#x2014;for country with no airforce&#x2019;, <i>The
                    Guardian</i>, 6 Feb. 2010</bidItem>; &#x2018;Military radar probe: the key suspects &#x2026; and
                    the case against them&#x2019;, <i>This Day</i> (Dar es Salaam), 15 Feb. 2010; &lt;
                    <url webUrl="http://www.judiciary.gov.uk/Resources/JCO/Documents/Judgments/r-v-bae-sentencing-remarks.pdf">SAME AS @webUrl</url>&gt;.

                    <!--
                        ONLY ONE URL, BUT MANY &#x2018; in-between texts &#x2019;
                        So, the URL belongs only to its preceding "&#x2018; in-between texts &#x2019;"
                    -->
                </p>
            </note>
        </nodeGroup>
    </something>
</root>


XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <xsl:template match="/">
        <xsl:apply-templates select="*"/>
    </xsl:template>
    
    <xsl:template match="*"> 
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/> 
        </xsl:copy>
    </xsl:template>
    
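    <!-- Markup fragments emitted with disable-output-escaping in the commented-out attempt further down -->
    <xsl:variable name="href-start">&lt;a href="</xsl:variable>
    <xsl:variable name="href-mid">"&gt;</xsl:variable>
    <xsl:variable name="href-finish">&lt;/a&gt;</xsl:variable>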
    
    <xsl:template match="note">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates mode="url"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="p[url]" mode="url">
        <!-- HERE, FOR EACH URL, IT SHOULD FORM A HREF LINK, COVERING ANY PRECEDING TEXT THAT APPEAR
             IN-BETWEEN &#x2018; AND &#x2019;

             Ref: MAIL DESCRIPTION.
        -->
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="p[not(url)]" mode="url">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy-of select="."/>
    </xsl:template>
    
    <!-- COMMENTED... SOME TRY ALONG THIS LINE
    <xsl:template .... mode="url">
        <xsl:copy>
            <xsl:... test="contains(., '&#x2018;')">
                <!-\-<xsl:apply-templates>
                    <xsl:sort select="substring-before(., '&#x2018;')"/>
                </xsl:apply-templates>-\->
                <xsl:value-of select="substring-before(., '&#x2018;')"/>
                <xsl:value-of select="$href-start" disable-output-escaping="yes"/>[@<xsl:value-of select="following-sibling::url"/>]<xsl:value-of select="$href-mid" disable-output-escaping="yes"/>
                <xsl:value-of select="substring-after(., '&#x2018;')"/>
            </xsl:...>
            <xsl:... test="contains(., '&#x2019;')">
                <xsl:value-of select="substring-before(., '&#x2019;')"/>
                <xsl:value-of select="$href-finish" disable-output-escaping="yes"/>
                <xsl:value-of select="substring-after(., '&#x2019;')"/>
            </xsl:..>
            <xsl:apply-templates .... mode="url"/>
        </xsl:copy>
    </xsl:template>
    -->
    
</xsl:stylesheet>
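
To make the requirement concrete, here is a rough XSLT 2.0 sketch of the simplest case only. It is an assumption about one possible approach, not a tested solution. It covers a quoted string that sits in the text node immediately before a <url> sibling, and it takes the link target from @webUrl as in the skeleton XML. Cases where the quoted string is inside another element (such as <bidItem>), or where other elements sit between the quote and the <url>, are not handled, and the literal &lt; and &gt; around the URL are left in place.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <!-- Identity template: copy everything that is not handled below. -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- A text node whose very next sibling node is a url element. -->
    <xsl:template match="text()[following-sibling::node()[1][self::url]]">
        <!-- Read the target first: analyze-string changes the context item. -->
        <xsl:variable name="href" select="following-sibling::node()[1]/@webUrl"/>
        <!-- The greedy first group makes the regex pick the LAST quoted string,
             i.e. the one nearest to the url. flags="s" lets "." match newlines. -->
        <xsl:analyze-string select="." flags="s"
            regex="^(.*)&#x2018;([^&#x2018;&#x2019;]*)&#x2019;(.*)$">
            <xsl:matching-substring>
                <xsl:value-of select="regex-group(1)"/>
                <a href="{$href}">
                    <xsl:value-of select="regex-group(2)"/>
                </a>
                <xsl:value-of select="regex-group(3)"/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <!-- No quoted string in this text node: keep it unchanged. -->
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

    <!-- Drop a url only when the text node just before it was turned into a link;
         any other url is copied by the identity template. -->
    <xsl:template match="url[preceding-sibling::node()[1][self::text()]
                             [matches(., '&#x2018;[^&#x2018;&#x2019;]*&#x2019;')]]"/>

</xsl:stylesheet>
```

With this sketch, a trailing quoted string that has no <url> after it (like LINK-3 above) is left untouched, because its text node is not followed by a <url>.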

2) Additionally, when dealing with such mixed content (I mean content
containing both text and child elements), what is the best way to split
and handle elements and text separately?

Thanks and look forward to suggestions,
Karl
