[xsl] Dealing mixed content with invalid node-like text

Subject: [xsl] Dealing mixed content with invalid node-like text
From: Karlmarx R <karlmarxr@xxxxxxxxx>
Date: Mon, 5 Dec 2011 03:15:09 +0800 (SGT)
Hello,

I have a situation wherein I need to convert mixed-content text, which
also contains non-markup text within angle brackets, to XML output. For
example, texts like:

"Sometext <xx>within valid node</xx> and like <II .>
Title etc"
"Sometext like <1a .> Title etc, <xx>within <b>something</b> valid
node</xx> etc". 

Now, the output has to be like:

<nodename>Sometext
<xx>within valid node</xx> and like &lt;II .&gt; Title etc</nodename>
<nodename>Sometext like  &lt;1a .&gt; Title etc, <xx>within <b>something</b>
valid node</xx> etc</nodename>

At present I do not receive elements like <br/>, but if I did, they would
be valid and should be treated as nodes. My point is that non-node items
like <II .> and <1a .> need their angle brackets escaped (as &lt; and
&gt;) to keep the XML valid. Currently I use analyze-string with a regex
for this, but it does not work correctly (the regex has mistakes). I would
like to know whether there is a good standard solution for this sort of
text. At present each line of text is passed to this template and treated
like:

<xsl:template name="tag-text">
    <xsl:param name="unparsed" required="yes"/>
    <!-- this regex has flaws, in that it fails to treat those invalid nodes -->
    <xsl:analyze-string select="$unparsed"
            regex="^(.*?)&lt;(.+)&gt;(.*)&lt;/(.+)&gt;(.*?)$">
        <xsl:matching-substring>
            <!-- ** do process and, if necessary, recursively call this
                 template again ** -->
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

I suspect there could be a better regex that gives the result I want, but
I am not sure whether XSLT itself has a better way to deal with this.
Could you please suggest possible solutions (including a better regex, if
any of you have used one successfully)?
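
To make the goal concrete, here is a minimal sketch of one possible
direction, not a tested solution. Instead of matching a whole element with
one regex, it looks at each <...> run separately, keeps the runs that look
like simple tags, escapes everything else, and then parses the resulting
string. The f: namespace (urn:example:local) and the names f:esc,
f:escape-invalid, $f:name and $f:tag are hypothetical. It assumes that
valid tags carry no attributes, use plain ASCII names and are properly
nested, that the input is raw text with no pre-escaped entities, and that
an XSLT 3.0 processor is available for parse-xml-fragment().

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:local"
    exclude-result-prefixes="xs f">

  <!-- Assumption: a simplified, ASCII-only approximation of an XML name. -->
  <xsl:variable name="f:name" as="xs:string"
      select="'[A-Za-z_][A-Za-z0-9._\-]*'"/>

  <!-- A start tag, end tag or empty-element tag without attributes. -->
  <xsl:variable name="f:tag" as="xs:string"
      select="concat('^&lt;(/', $f:name, '\s*|', $f:name, '\s*/?)&gt;$')"/>

  <!-- Hypothetical helper: escape markup characters in plain text. -->
  <xsl:function name="f:esc" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="replace(replace(replace($s,
        '&amp;', '&amp;amp;'), '&lt;', '&amp;lt;'), '&gt;', '&amp;gt;')"/>
  </xsl:function>

  <!-- Hypothetical helper: keep tag-like runs that match $f:tag as they
       are and escape every other piece of the line. -->
  <xsl:function name="f:escape-invalid" as="xs:string">
    <xsl:param name="line" as="xs:string"/>
    <xsl:variable name="parts" as="xs:string*">
      <xsl:analyze-string select="$line" regex="&lt;[^&lt;&gt;]*&gt;">
        <xsl:matching-substring>
          <xsl:sequence
              select="if (matches(., $f:tag)) then . else f:esc(.)"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:sequence select="f:esc(.)"/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="string-join($parts, '')"/>
  </xsl:function>

  <xsl:template name="tag-text">
    <xsl:param name="unparsed" as="xs:string" required="yes"/>
    <nodename>
      <!-- Raises a dynamic error if the kept tags are not well-formed. -->
      <xsl:sequence
          select="parse-xml-fragment(f:escape-invalid($unparsed))"/>
    </nodename>
  </xsl:template>

  <!-- Demo entry point using the two sample lines from above. -->
  <xsl:template name="xsl:initial-template">
    <result>
      <xsl:for-each select="(
          'Sometext &lt;xx&gt;within valid node&lt;/xx&gt; and like &lt;II .&gt; Title etc',
          'Sometext like &lt;1a .&gt; Title etc, &lt;xx&gt;within &lt;b&gt;something&lt;/b&gt; valid node&lt;/xx&gt; etc')">
        <xsl:call-template name="tag-text">
          <xsl:with-param name="unparsed" select="."/>
        </xsl:call-template>
      </xsl:for-each>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

The escaping step uses only XSLT 2.0 features. The parsing step is where
XSLT 3.0 comes in, and unbalanced tags would still need separate handling.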

Thanks in advance,
Karl
