RE: [xsl] broken text surrounding an entity I want to drop?

Subject: RE: [xsl] broken text surrounding an entity I want to drop?
From: "Trevor Nicholls" <trevor@xxxxxxxxxxxxxxxxxx>
Date: Wed, 14 Sep 2005 18:47:52 +1200
Thanks Mike and Joris for your comments.

How much text? If I run a text-only script over all the files I end up with
something of the order of 20 MB. Manual fixes are not an attractive idea (at
least not yet).

On balance, it seems to me that the frequency of

---
Text<a></a>
More text
---

is relatively low (maybe 5-10%) compared with

---
Text mo<a></a>
re text
---

so (accepting that a manual pass through is going to be necessary at some
point) I would rather attempt to automate the treatment of the commonest
case. We are still at a proof of concept stage, and broken words in every
other sentence don't look good! If we can reduce that to a few words here
and there we'll be much happier.
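
A rough sketch of what automating the commonest case might look like, in
Python rather than XSLT. It assumes the stray element is always an empty
<a></a> (any case, possibly with attributes) directly followed by a line
break; the name join_broken_words and the pattern are illustrative only, not
anything supplied in this thread.

```python
import re

# Assumption: the unwanted element is an empty <a></a> immediately followed
# by a line break. \b keeps the pattern from matching <abbr>, <area>, etc.
EMPTY_ANCHOR_BREAK = re.compile(r"<a\b[^>]*>\s*</a>[ \t]*\r?\n", re.IGNORECASE)


def join_broken_words(text: str) -> str:
    """Drop each empty anchor plus the line break after it, joining the text."""
    return EMPTY_ANCHOR_BREAK.sub("", text)


if __name__ == "__main__":
    print(join_broken_words("Text mo<a></a>\nre text\n"))
```

This repairs "Text mo" / "re text", but it also runs "Text" / "More text"
together as "TextMore text", so the rarer case is still left for the manual
pass.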

Thanks
Trevor

> -----Original Message-----
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Sent: Tue, 13 Sep 2005 08:58:09
Subject: RE: [xsl] broken text surrounding an entity I want to drop?

It helps to get the terminology right (it means people are more likely to
understand your question). You're using the terms "entity" and "tag" when
you mean "element".

You're dealing with dirty data, and data cleansing is always a rather
pragmatic affair. I don't think there's enough information in your source to
decide whether, in a case like

There is too much white<A></A>
space in this document

the author intended "whitespace" to be one word or two.

The only way you're going to be able to automate the data recovery is with
the help of a dictionary lookup, and even that will leave some ambiguities
like the one above.
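
A minimal sketch of the dictionary-lookup idea, in Python, might look like
the following. The word list, the function name repair_with_dictionary and
the rule for choosing between joining and splitting are all assumptions made
for illustration; a real run would need a proper dictionary, and ambiguous
cases are left untouched and reported for review.

```python
import re

# Assumption: a word fragment, an empty <a></a>, a line break, then another
# word fragment. Group 1 is the left fragment, group 2 the right one.
BROKEN = re.compile(r"(\w+)<a\b[^>]*>\s*</a>[ \t]*\r?\n(\w+)", re.IGNORECASE)


def repair_with_dictionary(text, words):
    """Return (repaired_text, ambiguous_pairs) using a set of lowercase words."""
    ambiguous = []

    def fix(match):
        left, right = match.group(1), match.group(2)
        joined_ok = (left + right).lower() in words
        split_ok = left.lower() in words and right.lower() in words
        if joined_ok and not split_ok:
            return left + right          # one broken word: join the halves
        if split_ok and not joined_ok:
            return left + "\n" + right   # two words: drop only the element
        ambiguous.append((left, right))  # both or neither fit: leave as-is
        return match.group(0)

    return BROKEN.sub(fix, text), ambiguous


if __name__ == "__main__":
    # Hypothetical toy dictionary, just large enough for the examples here.
    words = {"text", "more", "white", "space", "whitespace"}
    print(repair_with_dictionary("Text mo<a></a>\nre text\n", words))
    print(repair_with_dictionary("too much white<A></A>\nspace in", words))
```

With this toy word list, "mo" / "re" is joined into "more", while "white" /
"space" is reported as ambiguous because both readings are in the dictionary.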

How long is the text? My instinct would be to fix it by hand.

Michael Kay
http://www.saxonica.com/
