Subject: RE: [xsl] broken text surrounding an entity I want to drop?
From: "Trevor Nicholls" <trevor@xxxxxxxxxxxxxxxxxx>
Date: Wed, 14 Sep 2005 18:47:52 +1200

Thanks Mike and Joris for your comments.

How much text? If I run a text-only script over all the files I end up with something of the order of 20 MB. Manual fixes are not an attractive idea (at least not yet).

On balance, it seems to me that the frequency of

    --- Text<a></a> More text ---

is relatively low (maybe 5-10%) compared with

    --- Text mo<a></a> re text ---

so (accepting that a manual pass through is going to be necessary at some point) I would rather attempt to automate the treatment of the commonest case.

We are still at a proof of concept stage, and broken words in every other sentence don't look good! If we can reduce that to a few words here and there we'll be much happier.

Thanks
Trevor

> -----Original Message-----
> From: "Michael Kay" <mike@xxxxxxxxxxxx>
> Sent: Tue, 13 Sep 2005 08:58:09
> Subject: RE: [xsl] broken text surrounding an entity I want to drop?
>
> It helps to get the terminology right (it means people are more likely to understand your question). You're using the terms "entity" and "tag" when you mean "element".
>
> You're dealing with dirty data, and data cleansing is always a rather pragmatic affair. I don't think there's enough information in your source to decide whether, in a case like
>
>     There is too much white<A></A> space in this document
>
> the author intended "whitespace" to be one word or two. The only way you're going to be able to automate the data recovery is with the help of a dictionary lookup, and even that will leave some ambiguities like the one above.
>
> How long is the text? My instinct would be to fix it by hand.
>
> Michael Kay
> http://www.saxonica.com/
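
The dictionary lookup mentioned above, combined with the idea of automating only the commonest case, might look roughly like the sketch below. It is written in Python rather than XSLT, and it is only an illustration under assumptions: the function name `repair`, the word set `words`, and the rule "join only when the joined form is a known word and the two fragments are not both known words" are hypothetical and not anything proposed in the thread. It also assumes the empty element appears literally as `<a></a>` (in any case) with no attributes.

```python
import re

# Hypothetical pattern: a word fragment, an empty <a></a> element (any case),
# optional whitespace, then the next word fragment.
EMPTY_A = r"<a\s*>\s*</a\s*>"
BROKEN = re.compile(r"([A-Za-z]+)" + EMPTY_A + r"(\s*)([A-Za-z]+)", re.IGNORECASE)


def repair(text, words):
    """Drop empty <a></a> elements and rejoin a split word when the
    dictionary supports it.

    `words` is an assumed set of lowercase dictionary words.
    Returns (repaired_text, ambiguous_pairs).
    """
    ambiguous = []

    def fix(match):
        left, gap, right = match.groups()
        joined = left + right
        joined_ok = joined.lower() in words
        parts_ok = left.lower() in words and right.lower() in words
        if joined_ok and not parts_ok:
            # e.g. "mo" + "re": only the joined form is a real word.
            return joined
        if joined_ok and parts_ok:
            # e.g. "white" + "space": leave as two words, flag for review.
            ambiguous.append((left, right))
        # Otherwise just drop the element and keep the original spacing.
        return left + gap + right

    repaired = BROKEN.sub(fix, text)
    # Remove any remaining empty elements that were not between two words.
    repaired = re.sub(EMPTY_A, "", repaired, flags=re.IGNORECASE)
    return repaired, ambiguous


words = {"text", "more", "white", "space", "whitespace"}
print(repair("Text mo<a></a> re text", words))
print(repair("Text<a></a> More text", words))
print(repair("too much white<A></A> space in this document", words))
```

Cases like "white space" are returned in the second element of the result so they can be checked by hand later. Because the matches do not overlap, a fragment sitting between two adjacent empty elements is only considered once, so a manual pass would still be needed for such runs.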