Subject: Re: [xsl] Ingoring HTML
From: "Sam D. Chuparkoff" <sdc@xxxxxxxxxx>
Date: Fri, 17 Jun 2005 13:39:59 -0700

On the dangerous side, I'd try something like:

perl -ne '$c.=$_;eof&&($c=~s/&lt;(([^<>](?!&lt;))*?)&gt;//sg&print$c);' foo.xml

Because it will probably be fine. For extra danger points, you can put
it in a Makefile with no comment.

You should be able to do something similar with xsl, but of course this
isn't very safe, and I think it would be a lot more complicated.

s/&lt;(([^<>](?!&lt;))*?)&gt;//sg;

This is '&lt;' some text '&gt;' with no intervening '&lt;', '<', or '>'
replaced with nothing.

I thought about actually trying to turn this content into xml, but note
there's no close quote on that style attribute! Watch out!

sdc

On Fri, 2005-06-17 at 15:13 -0500, Jon Gorman wrote:
> On 6/17/05, Jay Burgess <lists@xxxxxxxxxxx> wrote:
> > I apologize if this is in the FAQ, but I've searched and can't find it. (I'm
> > kind of new to XSL, so I may just have not seen it.)
>
> This is a faq of sorts, but I had a little bit of a difficult time
> finding an answer to it in Dave Pawson's FAQ as well. Of course, I
> just did a quick glance. I'd recommend skimming the CDATA section
> as well.
>
> > I've got some XML that contains HTML-formatted text. For example:
> >
> > <title>&lt;SPAN style="font-size: 13pt; font-family: Verdana; &gt;The
> > &lt;b&gt;Text&lt;/b&gt; That I Want&lt;/SPAN&gt;</title>
>
> "HTML-formatted text" is a little bit nonsensical. HTML itself says
> that &lt; is meant as a stand-in for <, so when you have it it's not a
> tag. Since namespaces were rather slow to get off to a start, we ended
> up seeing people put so-called "HTML" in XML *cough* RSS *cough*. But
> to any XML application, this is one big chunk of text.
>
> So, some possible advice:
>
> 1) if you can change the input format so that it uses namespaces and
> actually embeds real XHTML into the documents you're creating, do so.
> Or at least have it be an option.
>
> 2) If you can't do that, I'm sure you can find a more general solution
> if you hunt through the archives.
> The essential solution will probably be along the lines of looking
> for &lt; and &gt;s and throwing any text in them out via some of the
> XPATH/XSLT string functions. Might be much easier with XSLT 2.0
>
> 3) It may be possible with a combination of d-o-e and doing multiple
> transformations, regex scripting or other techniques to replace the
> various &lt; and &gt; in certain elements but not others, then
> reprocess that document through your final stylesheet. Of course, this
> makes it slightly dangerous.
>
> Dig through the archives; there might be a more general solution
> already done or someone else will be able to give you one instead of
> just giving you some ranting. (I blame Friday afternoon and a slow
> server for my current long-winded explanation why this type of
> embedding is evil).
>
> Short answer, it's probably not difficult as long as it's relatively
> straightforward. If the "html" inside the xml is complex at all or
> you are using &lt; in other places, you might have difficulty.
>
> Extremely simple if you can just have the input source use namespaces
> and you're comfortable with how XSLT deals with namespaces.
>
> Jon Gorman
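
For anyone who would rather not use Perl, here is a minimal Python sketch of the same substitution. It assumes the markup inside the element is entity-escaped (&lt; and &gt;), as in the example quoted in the thread; SAMPLE, strip_escaped_tags and reparses_as_xml are names made up for illustration, not anything from the thread. The second function tries the unescape-and-reparse route, to show why the missing close quote on the style attribute matters.

```python
import html
import re
import xml.etree.ElementTree as ET

# The example element from the thread, assuming its inner markup is
# entity-escaped (&lt; / &gt;) and the style attribute has no closing quote.
SAMPLE = (
    '<title>&lt;SPAN style="font-size: 13pt; font-family: Verdana; &gt;The '
    "&lt;b&gt;Text&lt;/b&gt; That I Want&lt;/SPAN&gt;</title>"
)

# Same pattern as the Perl substitution: "&lt;", then lazily any characters
# other than "<" or ">" that are not followed by another "&lt;", then "&gt;".
ESCAPED_TAG = re.compile(r"&lt;(([^<>](?!&lt;))*?)&gt;", re.S)


def strip_escaped_tags(text: str) -> str:
    """Delete every escaped tag, leaving real XML markup untouched."""
    return ESCAPED_TAG.sub("", text)


def reparses_as_xml(text: str) -> bool:
    """Unescape the entities and report whether the result is well-formed."""
    try:
        ET.fromstring(html.unescape(text))
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(strip_escaped_tags(SAMPLE))
    print(reparses_as_xml(SAMPLE))
```

On the sample, the substitution leaves <title>The Text That I Want</title>, while the reparse attempt fails because the unclosed style attribute makes the unescaped text ill-formed. Like the one-liner, this is only safe as long as the escaped markup stays simple.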