Re: [xsl] Dealing mixed content with invalid node-like text

Subject: Re: [xsl] Dealing mixed content with invalid node-like text
From: Brandon Ibach <brandon.ibach@xxxxxxxxxxxxxxxxxxx>
Date: Tue, 6 Dec 2011 19:22:08 -0500
If the text is "almost" XML, perhaps the easiest thing to do would be
to fix it so it really is XML, then use a character map to output it
as-is so your second pass can just parse it normally.  If all you need
to do is escape the angle-brackets in something like "<1a .>", your
"tag-text" template could be as simple as:

<xsl:value-of select="replace($unparsed, '&lt;(\S+\s+\.)&gt;',
'&amp;lt;$1&amp;gt;')"/>

And you would have declarations such as this at the top level:

<xsl:output method="xml" version="1.0" encoding="utf-8"
use-character-maps="xmlout"/>
<xsl:character-map name="xmlout">
  <xsl:output-character character="&lt;" string="&lt;"/>
  <xsl:output-character character="&gt;" string="&gt;"/>
  <xsl:output-character character="&amp;" string="&amp;"/>
</xsl:character-map>

If you have other content being produced in the first pass, whose
correct output is threatened by this mapping, you may need to do some
additional replacements in your "tag-text" template, substituting
arbitrary characters (such as characters from the Unicode Private Use
area) for less-than, greater-than and ampersand, then adjusting the
character-map to map them back to their original forms.
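
A minimal sketch of that variant, in case it helps. The code points
U+E000 to U+E002 are arbitrary picks of mine (any characters that cannot
occur in your real data would do), the map name "markup" is made up, and
the "main" template with its sample string is there only so the sketch
can be run on its own as an initial template in an XSLT 2.0 processor.
As with the simpler map above, it assumes any ampersands already in the
input are meant as markup.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" version="1.0" encoding="utf-8"
      use-character-maps="markup"/>

  <!-- Only the three stand-in characters are mapped, so a real
       less-than, greater-than or ampersand in other first-pass
       text nodes is still escaped by the serializer as usual. -->
  <xsl:character-map name="markup">
    <xsl:output-character character="&#xE000;" string="&lt;"/>
    <xsl:output-character character="&#xE001;" string="&gt;"/>
    <xsl:output-character character="&#xE002;" string="&amp;"/>
  </xsl:character-map>

  <xsl:template name="tag-text">
    <xsl:param name="unparsed" required="yes"/>
    <!-- Inner replace(): turn pseudo-tags such as "<1a .>" into an
         escaped form, writing the ampersand as the U+E002 stand-in.
         Outer translate(): swap every remaining "<", ">" and "&"
         for its stand-in so the character map emits it as markup. -->
    <xsl:value-of select="translate(
        replace($unparsed, '&lt;(\S+\s+\.)&gt;',
                '&#xE002;lt;$1&#xE002;gt;'),
        '&lt;&gt;&amp;', '&#xE000;&#xE001;&#xE002;')"/>
  </xsl:template>

  <!-- Hypothetical entry point, for trying the sketch out only. -->
  <xsl:template name="main">
    <value>
      <xsl:call-template name="tag-text">
        <xsl:with-param name="unparsed"
            select="'Line two with &lt;1a .&gt; Title, &lt;i&gt;within&lt;/i&gt; etc'"/>
      </xsl:call-template>
    </value>
  </xsl:template>

</xsl:stylesheet>
```

With that sample string, the <value> element should serialize with the
pseudo-tag escaped (&lt;1a .&gt;) and the <i>...</i> pair written out as
real markup, ready for the second pass to parse.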

This sort of markup hacking is not a road I'd recommend going down,
but if you have to do it, I can't really see a reason to do it in some
other language, if XSLT is what you're comfortable with.  Michael made
a good point about using a proper parser (which I wouldn't implement
in XSLT, as a first choice, even though it would be possible) if you
can put together a proper grammar for your input, but if a few regex
substitutions can get you safely to clean XML, the above approach may
suffice.

-Brandon :)


On Tue, Dec 6, 2011 at 5:42 PM, Karlmarx R <karlmarxr@xxxxxxxxx> wrote:
> Hello David,
>
> Yes, I do process the content in 2 stages: preprocess into one form of XML
> and then further process that into my final XML form. BUT, BOTH are done in
> XSL within one single file, and the problem that I reported is in the first
> stage conversion itself. To make things even more clear, here is a rough
> skeleton and explanation of my process. I get the entire content of the
> input into a variable $input-text, and then tokenize it to get each line of
> data into another variable, as below.
>
> <xsl:variable name="lines" select="tokenize($input-text, '\r?\n')"/>
>
> <!--then pass it to another template to process each line of data:-->
> <xsl:call-template name="process-lines">
>                 <xsl:with-param name="lines" select="$lines"/>
> </xsl:call-template>
>
> <!-- And here, I further process it to select the REQUIRED value -->
> <xsl:template name="process-lines">
>   <xsl:param name="lines" as="xs:string*"/>
>
>   <xsl:for-each select="$lines">
>     <xsl:variable name="line-components" select="tokenize(., '\t')"/>
>
>     <xsl:for-each select="$line-components[position() = last()]">
>       <value>
>         <xsl:call-template name="tag-text">
>           <xsl:with-param name="unparsed" select="."/>
>         </xsl:call-template>
>       </value>
>     </xsl:for-each>
>   </xsl:for-each>
> </xsl:template>
>
>
> <!-- AND IT IS HERE, in this "tag-text" template, that I try to achieve
> what I explained in my original posting -->
> <xsl:template name="tag-text">
>   <xsl:param name="unparsed" required="yes"/>
>   <xsl:analyze-string select="$unparsed"
>       regex="^(.*?)&lt;(.+)&gt;(.*)&lt;/(.+)&gt;(.*?)$">
>
>        etc as posted earlier.
>
> The skeleton input will be like (as I mentioned before):
>
> Line one text <b>within valid node</b> and like <II .> Title etc
> Line two with <1a .> Title etc, <i>within</i> <b>something</b> etc
> another line can be just normal text
> ....
>
> And it is vital that I get the data in the form I want here, so that our
> final output in stage two is correct. In view of this I cannot use
> <value-of select with d-o-e> here. As it seems this cannot be achieved in
> XSL (looks likely), I am trying to get my source corrected. But if there
> are solutions available, in XSL or with a better regex, I would be happy to
> use them. I hope the above clarifies your question.
>
> Thanks,
> Karl
