[xsl] Xsplit, multilingual web site, and extracting text from HTML

Subject: [xsl] Xsplit, multilingual web site, and extracting text from HTML
From: skhurshid@xxxxxxxxxx
Date: Thu, 8 Mar 2001 12:09:20 -0500


Hi,
First of all, I'd like to thank everyone for their help with my multilingual
web site question and my questions about XSplit. I got the most useful responses
from this group.
I finally managed to download XSplit and have been playing around
with it. I discovered, to my disappointment, that it doesn't automatically
create the XML files for you. I took an HTML page and ran
the "Split" command, but I simply got an XSL file with all the HTML in it;
the XML file it "generated" was empty.
I read the documentation and it explained that I had to tag the content in
the HTML page first. After I did this, XSplit correctly generated the XSL file
and the XML file.
For example, in an HTML file containing
<p>Hello World</p>
I had to extract the "Hello World" string from the HTML and replace
it with a label prefixed by "psx-":
<p>psx-mytext</p>
and then add the "Hello World" string to the generated XML file, which contains
<mytext></mytext>

I was hoping XSplit would generate the XML for me by simply using the HTML
tag names and numbering them wherever it found content. For example,
I was hoping the following HTML
<p>Hello World</p>
would convert to
<p1>Hello World</p1>
in the XML file. That way I wouldn't have to tag the data unless I really
wanted to.
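
To make the naming scheme I have in mind concrete, here is a rough Python sketch. It is only an illustration of the output I was hoping for, not something XSplit does: the function name build_label_xml and the root element name "content" are made up for this example, and I am assuming the (tag, text) pairs have already been pulled out of the HTML somehow.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_label_xml(items, root_name="content"):
    """Build an XML string from (html_tag, text) pairs.

    Each element is named after its HTML tag plus a running
    counter for that tag: p1, p2, td1, ...
    The root element name is an assumption made for this sketch.
    """
    counts = Counter()
    root = ET.Element(root_name)
    for tag, text in items:
        counts[tag] += 1
        child = ET.SubElement(root, f"{tag}{counts[tag]}")
        child.text = text
    return ET.tostring(root, encoding="unicode")


print(build_label_xml([("p", "Hello World"), ("p", "Goodbye"), ("td", "Price")]))
```

One wrinkle with this scheme: tag names that already end in a digit (h1, h2, ...) give labels such as h11, which are hard to read, so a separator might be needed in practice.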

Is what I'd like to do possible in any way with XSplit? Am I missing something?
Are there tools out there that would extract all displayable text from HTML
files, replace it with labels, and then put the extracted text in a separate
file along with the labels? Basically, I'm looking for a way to automate this,
since we have thousands of HTML files. I think an XML and XSL solution is the
way to go for a multilingual site, but I'm having a hard time justifying the
initial cost of converting all our HTML files. Since the conversion is something
that could be automated, I'm hoping that there are tools out there that could
help us. I'd write a tool myself, but I'd have to create an HTML parser that
knows where to find all the "displayable text" in an HTML page, which seems
tough. I searched the Web for HTML parsers that extract text but didn't find
anything similar to what I described above (that would replace the text with
labels, etc.).
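
To show the kind of tool I mean, here is a minimal sketch built on the HTMLParser class from Python's standard library. It is an assumption about how such a tool could work, not an existing product: the names LabelExtractor and extract are invented for this example, the "psx-" prefix is simply borrowed from XSplit's convention, and I am treating "displayable text" as any non-blank text outside script and style elements. Text in attributes such as alt or title is not handled.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked as parents.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text content is not displayed to the reader.
SKIP = {"script", "style"}


class LabelExtractor(HTMLParser):
    """Copies the HTML through, swapping visible text for numbered labels."""

    def __init__(self, prefix="psx-"):
        super().__init__()
        self.prefix = prefix
        self.template = []   # pieces of the rewritten HTML
        self.labels = {}     # label -> extracted text
        self.counts = {}     # per-tag running counters
        self.open_tags = []  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        self.template.append(self.get_starttag_text())
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.template.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.template.append(f"</{tag}>")
        if tag in self.open_tags:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.open_tags.pop() != tag:
                pass

    def handle_comment(self, data):
        self.template.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.template.append(f"<!{decl}>")

    def handle_data(self, data):
        parent = self.open_tags[-1] if self.open_tags else "text"
        text = data.strip()
        if not text or parent in SKIP:
            self.template.append(data)  # whitespace, scripts, styles: keep as-is
            return
        self.counts[parent] = self.counts.get(parent, 0) + 1
        label = f"{parent}{self.counts[parent]}"
        self.labels[label] = text
        start = data.index(text)  # keep surrounding whitespace in the template
        self.template.append(
            data[:start] + self.prefix + label + data[start + len(text):])


def extract(html_text):
    """Return (rewritten_html, {label: text}) for one HTML document."""
    parser = LabelExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.template), parser.labels


template, labels = extract("<p>Hello World</p>")
print(template)
print(labels)
```

The labels dictionary could then be written out as the XML file, one element per label, and the rewritten HTML would be what gets fed to the split step.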
Any help would be greatly appreciated.
Thanks :-)
-Sher



 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

