Re: [xsl] Dealing mixed content with invalid node-like text

Subject: Re: [xsl] Dealing mixed content with invalid node-like text
From: Syd Bauman <Syd_Bauman@xxxxxxxxx>
Date: Sun, 4 Dec 2011 15:00:36 -0500
I think when posting messages with tricky problems like this it is
really important to be very precise.

> "Sometext <xx>within valid node</xx> and like <II .> Title etc"
> "Sometext like <1a .> Title etc, <xx>within <b>something</b> valid node</xx> etc".

Your XML input *cannot* look like this, as it is not well-formed XML.
It might be something equivalent, such as one of these:

  Sometext <xx>within valid node</xx> and like &lt;II .> Title etc
  Sometext like &#x3C;1a .> Title etc, <xx>within <b>something</b> valid node</xx> etc
  Sometext <xx>within valid node</xx> and like <![CDATA[<]]>II .> Title etc
  Sometext like &#60;1a .> Title etc, <xx>within <b>something</b> valid node</xx> etc

but XML doesn't permit loose less-thans lying around. If you're
getting this string from something other than XML (e.g., passed in as
a command-line parameter), then each "<" is treated just like any
other character.

Of course, you can't type a bare "<" as text in your XSLT program,
because the stylesheet must itself be a well-formed XML file. But it
seems you've already figured this out, as you've used "&lt;" in your
regexps. In any case, to get the "<" into the output what you have to
do is ... nothing. To write a less-than in the stylesheet you will end
up using one of the above methods to escape it, and your XSLT engine
will serialize it however it chooses. Check it out (whitespace
altered for readability):

| $ cat Untitled1.xsl
| <?xml version="1.0" encoding="UTF-8"?>
| <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
|   <xsl:template match="/">
|     <xsl:message>&#x3C;&#60;&lt;<![CDATA[<]]>this is a fun message></xsl:message>
|     <root>&#x3C;&#60;&lt;<![CDATA[<]]>this is fun content></root>
|   </xsl:template>
| </xsl:stylesheet>
|
| $ saxon.bash Untitled1.xsl Untitled1.xsl
| &lt;&lt;&lt;&lt;this is a fun message&gt;
| <?xml version="1.0" encoding="UTF-8"?>
| <root>&lt;&lt;&lt;&lt;this is fun content&gt;</root>
|
| $ perl -p -i -e 's,2\.0,1.0,;' Untitled1.xsl
|
| $ xsltproc Untitled1.xsl Untitled1.xsl
| <<<<this is a fun message>
| <?xml version="1.0"?>
| <root>&lt;&lt;&lt;&lt;this is fun content&gt;</root>

> I have a situation wherein I need to deal with mixed-content text
> that also comes with text within angle brackets, converted to XML
> output. For example, texts like:
>
> "Sometext <xx>within valid node</xx> and like <II .> Title etc"
> "Sometext like <1a .> Title etc, <xx>within <b>something</b> valid node</xx> etc".
>
> Now, the output has to be like:
>
> <nodename>Sometext <xx>within valid node</xx> and like &lt;II .&gt; Title etc</nodename>
> <nodename>Sometext like  &lt;1a .&gt; Title etc, <xx>within <b>something</b> valid node</xx> etc</nodename>

I think the identity transform will do this just fine:

  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="@*|text()|processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>

Note that you don't necessarily get much control over how the XSLT
engine chooses to serialize the "<" characters. (I believe that if
you are using the payware version of Saxon, you can use serialization
options to control this, among other things.)
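
One standard knob that does exist in both XSLT 1.0 and 2.0 is the
cdata-section-elements attribute of xsl:output. It asks the serializer
to write the text-node children of the named elements as CDATA
sections, so a "<" in that text shows up unescaped inside
<![CDATA[...]]>. It does not let you choose between &lt; and &#60;.
A minimal sketch, assuming the text of interest sits directly inside
an element called nodename (the name used in the desired output quoted
above):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Assumption: "nodename" is the element whose text should be
       serialized as CDATA sections rather than with &lt; escapes. -->
  <xsl:output method="xml" cdata-section-elements="nodename"/>

  <!-- Plain identity transform: copy every node and attribute as-is. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Only the text nodes that are direct children of nodename are affected;
text inside a child such as xx is still escaped in the usual way
unless xx is listed too.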

But it occurs to me that maybe what you want to do is treat the
*string*
  Sometext <xx>within valid node</xx> and like <II .> Title etc
as 4 nodes:
  text() = "Sometext "
  element(xx):
    text() = "within valid node"
  text() = " and like &lt;II .> Title etc"

In which case, someone who knows more about such things will need to
answer, as I don't think I know how to convert a string to a sequence
of nodes or a result tree fragment. I'm not really sure why one would
want to do such a thing, or that it is possible. (Must be, eh? At
worst, wrap it in <root>...</root> and write it out to a file, and
read that in, no?)
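
For what it's worth, here is one way this might be done in a single
pass, assuming an XSLT 3.0 processor, since fn:parse-xml-fragment() is
not available in XSLT 2.0. The idea: leave alone anything that looks
like a simple tag, escape every other "<" (and "&"), then parse the
resulting string. This is only a rough sketch. The "main" entry
template, the sample string, and the regex for "looks like a tag" are
my assumptions. The regex does not handle attributes, and the parse
will fail if the tag-like pieces are not properly nested.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="3.0">

  <!-- Hypothetical entry point (e.g. run with Saxon's -it:main). -->
  <xsl:template name="main">
    <xsl:call-template name="tag-text">
      <xsl:with-param name="unparsed"
        select="'Sometext &lt;xx>within valid node&lt;/xx> and like &lt;II .> Title etc'"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="tag-text">
    <xsl:param name="unparsed" as="xs:string" required="yes"/>

    <!-- Split the string into tag-like pieces and everything else. -->
    <xsl:variable name="pieces" as="xs:string*">
      <xsl:analyze-string select="$unparsed"
          regex="&lt;/?[A-Za-z_][A-Za-z0-9_.\-]*\s*/?&gt;">
        <!-- Looks like <name>, </name> or <name/>: keep verbatim. -->
        <xsl:matching-substring>
          <xsl:sequence select="."/>
        </xsl:matching-substring>
        <!-- Anything else: escape "&" first, then any stray "<". -->
        <xsl:non-matching-substring>
          <xsl:sequence
            select="replace(replace(., '&amp;', '&amp;amp;'), '&lt;', '&amp;lt;')"/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>

    <!-- Parse the repaired string and copy the resulting nodes. -->
    <nodename>
      <xsl:copy-of select="parse-xml-fragment(string-join($pieces, ''))"/>
    </nodename>
  </xsl:template>

</xsl:stylesheet>
```

With the sample string, "<II .>" does not match the tag pattern, so
its "<" is escaped before parsing, while <xx>...</xx> should come
through as a real element.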


> At present I do not get things like <br/> but assume I get so, it
> being valid, I should treat it as node. The point I am trying to
> make is, <II .> and <1a .> like non-node things needs to be treated
> removing their angle brackets to make the XML valid. Currently I
> use analyze-string with a regex to deal this, which does not work
> correctly (due to mistakes). But I would like to know whether there
> is good standard solution to deal with these sort of text. At
> present each line of text is passed to this template and treated
> like:
>
> <xsl:template name="tag-text">
>   <xsl:param name="unparsed" required="yes"/>
>   <xsl:analyze-string select="$unparsed"
>       regex="^(.*?)&lt;(.+)&gt;(.*)&lt;/(.+)&gt;(.*?)$">
>     <!-- this regex has flaws, in that it fails to treat those invalid nodes -->
>     <xsl:matching-substring> ** do process and if necessary recursively call this template again ** </xsl:matching-substring>
>     <xsl:non-matching-substring>
>       <xsl:value-of select="."/>
>     </xsl:non-matching-substring>
>
> I suspect possibly there could be a better regex to get the
> solution I wanted, but not sure whether xslt itself has better way
> to deal this. Pls can you suggest possible solutions (incl better
> regex if any of you used it successfully).
