Re: [xsl] mystery #3: rendering embedded HTML

Subject: Re: [xsl] mystery #3: rendering embedded HTML
From: Gary Lawrence Murphy <garym@xxxxxxxxxx>
Date: 13 Apr 2002 12:26:37 -0400

>>>>> "J" == Jeni Tennison <jeni@xxxxxxxxxxxxxxxx> writes:

    J> You can use disable-output-escaping in this situation. 

Not quite.  doe works for inline literal markup chars:

    J> <envelope> <![CDATA[ <p>My mal-formed HTML.<br> ]]> </envelope>

My situation is the inverse of doe. What I have is

     <envelope>&lt;p&gt;My mal-formed HTML escaped.&lt;br&gt;</envelope>

and there is apparently no way to extract this and _evaluate_ it back into

    <p>My mal-formed HTML escaped.<br>

The reason I have this the other way around is because, when you take 

    <envelope> <![CDATA[ <p>My mal-formed HTML.<br> ]]> </envelope>

and pass it through a parser (in our case, into an XML transform from
one DTD to another via a different XSL process), the CDATA section is
just a pre-processor directive that tells the parser to treat the
markup chars inside it as plain text.  Thus, once stored, your example
is physically recorded as

     <envelope>&lt;p&gt;My mal-formed HTML escaped.&lt;br&gt;</envelope>

and there is apparently no way to extract the markup from that again using XSL.
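
To illustrate the round trip just described, here is a minimal sketch done outside XSL. It assumes Python's standard xml.etree.ElementTree stands in for the parser and serializer; our real pipeline is an XSL transform, so treat this as an analogy, not as what our system runs.

```python
import xml.etree.ElementTree as ET

# The CDATA form of the example (whitespace trimmed for clarity).
cdata_doc = "<envelope><![CDATA[<p>My mal-formed HTML.<br>]]></envelope>"

root = ET.fromstring(cdata_doc)

# After parsing, the CDATA wrapper is gone; only a plain text node remains.
text_node = root.text

# Re-serializing writes the markup characters as entity references,
# which is the "physically recorded" form.
stored = ET.tostring(root, encoding="unicode")

print(text_node)
print(stored)
```

Once the document has been through the parser, nothing in the stored form records that a CDATA section was ever there.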

    J> If this HTML makes up the majority of your page, the other
    J> option is to use the text output method rather than the XML
    J> output method:

    J> <xsl:output method="text" />

Again, this applies only if the XML data contains invalid chars; it
doesn't, it contains _escaped_ chars which need to be resolved
back into invalid chars.  It needs an entity resolver.
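
For what it's worth, the resolving step itself is easy to sketch outside XSL. The following assumes Python's standard-library parser, and extract_markup is a hypothetical helper name made up for this illustration; it is not part of any XSL processor.

```python
import xml.etree.ElementTree as ET

# The escaped form as it is stored.
escaped_doc = (
    "<envelope>&lt;p&gt;My mal-formed HTML escaped.&lt;br&gt;</envelope>"
)


def extract_markup(xml_source: str) -> str:
    """Return the root element's text with entity references resolved.

    Hypothetical helper: the parser resolves &lt; and &gt; while reading,
    so the text node already holds the raw markup characters.
    """
    return ET.fromstring(xml_source).text or ""


markup = extract_markup(escaped_doc)
print(markup)
```

The result is a plain string of mal-formed HTML, so it can be written straight to the output page but cannot be treated as a node tree.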

    J> But the best solution is nevertheless to tidy up the HTML so
    J> that it's well-formed. 

In our specific case, we don't own the source of the HTML; it comes
from thousands of journalists working for countless independent news
agencies scattered around the world.

Even in the general case, I still don't think we should impose
techno-formalities like strict XHTML-compliance on non-professionals
unless we want them to eschew our application ;) Technology should
serve the body, not enslave the mind.

   As a pure aside on usability constraints, you should have been
   there when I first tried to get journalists using _basic_ markup
   tags like <em>.  Not everyone is super-keen to learn markup
   protocols.  If I forced them into an app that would reject their
   input until all tags within the <div> were legal to the DTD, I'd
   never see another news item submitted, and as soon as their
   managers learned from some part-time teen geek that the HTML code
   my program was dutifully rejecting "works perfectly in MSIE", I'd
   likely never see another contract in that industry :)

-- 
Gary Lawrence Murphy <garym@xxxxxxxxxxx> TeleDynamics Communications Inc
Business Innovations Through Open Source Systems: http://www.teledyn.com
"Computers are useless.  They can only give you answers."(Pablo Picasso)


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

