Subject: Re: character substitution.
From: Tom Moertel <tmoertel@xxxxxxxxxx>
Date: Tue, 11 Aug 1998 11:36:40 -0400

Pawson, David wrote:
> 
> In looking at text->audio preparation using Jade.
> 
> I need to 'clean up' some of the text, such that a
> text to speech engine  will speak the element content more clearly.
> e.g. <tel>(44) 1733-378-777 </tel>
> becomes <tel>44, 1733,378,777 </tel>
> 
> I'm looking to examine an element content (CDATA)
> and map a function over  the output of the data (current-node).
> 
> However.
> 
> What I can't get my tiny mind around is the
> sosofo->character->sosofo transformation needed.

Sosofos are opaque.  You can't go from sosofo->character because you
can't inspect the sosofo's contents, and thus your above chain is
broken. However, you can still do what you want; you just need to
manipulate the datachar nodes in the *grove*, not the character flow
object specifications in the output.
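
To make the distinction concrete, here is a minimal, untested sketch.
It assumes your DSSSL engine provides the standard "data" and "literal"
procedures, as I believe Jade does:

```dsssl
(element tel
  ;; (process-children) would return a sosofo here, and no procedure
  ;; can read characters back out of one.  So ask the grove instead:
  ;; (data ...) returns the node's character data as an ordinary
  ;; string that you can inspect and rebuild.
  (let ((text (data (current-node))))
    ;; text is a string; literal turns it into a sosofo for output
    (literal text)))
```

This rule leaves the text unchanged; it only shows where an inspectable
value comes from.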

When I first started working with DSSSL, I confused nodes in the grove
with flow objects, and this confusion caused me much difficulty.  Maybe
this explanation will help you avoid some of the struggles I had.

Think of your original SGML document as a hierarchy, with each element
in the document as a node in the hierarchy and each character within
each element attached as a smaller node to the underbelly of its parent
element's node.  Loosely speaking, that hierarchy becomes the "grove"
that Jade hands you after parsing your document.  It represents
everything Jade knows about your document and everything you can ask
Jade.

When it comes time to process your document into something more useful,
you can ask Jade all about the nodes in the grove -- what kind of nodes
they are, what property values they have, and so on.  And, based on the
answers to what you ask, you can provide Jade with a recipe of sorts for
building a sequence of pages or (with some trickery) even another SGML
document.
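
As a small illustration of asking questions of the grove, here is a
sketch of a rule that reports what it finds.  The TYPE attribute is
hypothetical and used only for illustration, and I am assuming your
engine provides the standard procedures "gi", "attribute-string",
"node-list-length", and "number->string":

```dsssl
(element tel
  (let* ((nd    (current-node))
         (name  (gi nd))                       ; the element's name
         (type  (attribute-string "type" nd))  ; #f if not specified
         (count (node-list-length (children nd))))
    ;; answer with a sosofo that describes the node
    (literal (string-append
              name
              (if type (string-append " (" type ")") "")
              ": "
              (number->string count)
              " child nodes"))))
```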

This recipe takes the form of a specification of a sequence of flow
objects, a sosofo.  Flow objects are things like paragraphs, graphics,
and so on.  Basically, what you're trying to do is build up a sequence
of them that represents a decent presentation of the information
contained within the grove.  In building this sequence (or, more
precisely, a "recipe" for this sequence), you'll use make expressions
like "(make paragraph ...)" to make tiny flow-object specifications, and
these in turn you'll knit into a big sosofo representing what you hope
will be a decent presentation of your original document's entire
content.
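
For instance, knitting two small specifications into a larger one might
look like the sketch below.  The NOTE element is hypothetical, and a
real style sheet would also wrap everything in a page-level flow object
such as simple-page-sequence:

```dsssl
(element note
  (sosofo-append
   ;; a tiny flow-object specification: one fixed paragraph
   (make paragraph
     (literal "Note:"))
   ;; another: a paragraph holding whatever the children produce
   (make paragraph
     (process-children))))
```

The result of sosofo-append is itself a sosofo, so it can be included
in a still larger one in the same way.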

Now, with that picture in mind, here are a few key points:

1.  Nodes in the grove represent your input content.
2.  Flow-Object Specifications (FOSi) represent your recipe for building
    an output document.
3.  Nodes aren't FOSi:  The former are there for your inspection;
    the latter aren't.
4.  Once a sosofo is created, you can't change it; you can either
    include it in a larger sosofo or throw it away, but you can't
    modify its properties.

(I suspect that you already know most of what I've just written, and
so please accept my apologies for not coming up with a more elegant and
concise summary.)

Now, with that in mind, let's return to your problem.  (I'll assume that
what you really want is to generate a sosofo corresponding to the
content within the TEL element and not, say, use Perl to scrub the
source SGML into a richer format.)

Let's say you've got the following code in your style sheet:

(element tel
  (let ((tel-children (children (current-node))))
     ...))

When it gets called, (current-node) points to a node in the grove that
represents the TEL element that's being processed.  The children of this
node are themselves nodes, most likely representing characters.  So, to
use your example above, "<tel>(44) 1733-378-777</tel>", the picture from
the TEL element down looks like this:

[element: gi: "TEL"]
  |
 [dchar: #\(]--[dchar: #\4]--[dchar: #\4]-- ... --[dchar: #\7]

where dchar is short for "datachar".  Thus, in the DSSSL snippet I
provided above, tel-children is bound to a nodelist that contains the
children of the TEL element, which is the nodelist of datachar nodes
corresponding to "(44) 1733-378-777".  So, if you want to process the
character data, you just need to work with the datachar nodes, in
particular their "char" properties.

For example, the following DSSSL code generates a sosofo corresponding
to "44, 1733,378,777" from the markup "<tel>(44) 1733-378-777</tel>".

(element tel
  (let ((tel-children (children (current-node))))
    (tel-char-nodes-to-cleaned-up-sosofo tel-children)))

(define (tel-char-nodes-to-cleaned-up-sosofo nl)
  (let loop ((charnodes nl) (result (empty-sosofo)))
    (let* ((firstchar (node-list-first charnodes)))
      (cond

       ;; are we done?
       ((node-list-empty? firstchar) result)

       ;; is the next node really a character?  if not, process it
       ;; normally; sosofo-append accepts only sosofos, not nodes
       ((not (equal? 'data-char
                     (node-property 'classnm firstchar)))
        (loop (node-list-rest charnodes)
              (sosofo-append result
                             (process-node-list firstchar))))

       ;; it's a character, let's process it
       (#t (let* ((charval (node-property 'char firstchar))

                  ;; determine replacement:
                  ;;   - -> ,
                  ;;   ) -> ,
                  ;;   ( -> (nothing)

                  (replacement
                   (cond ((equal? charval #\-) #\,)
                         ((equal? charval #\)) #\,)
                         ((equal? charval #\() #f)
                         (#t charval))))

             (loop (node-list-rest charnodes)
                   (sosofo-append result
                                  (if (char? replacement)
                                      (make character
                                        char: replacement)
                                      (empty-sosofo))))))))))
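
If TEL contains only character data, a shorter variant is possible.
This is an untested sketch: it assumes the standard "data" procedure is
available, and tel-clean-string is a helper name I have made up here.
Note that it flattens any child elements to their text instead of
processing them.

```dsssl
;; Build a cleaned-up copy of str one character at a time:
;;   - -> ,     ) -> ,     ( -> (nothing)
(define (tel-clean-string str)
  (let loop ((i 0) (result ""))
    (if (= i (string-length str))
        result
        (let ((c (string-ref str i)))
          (loop (+ i 1)
                (cond ((equal? c #\() result)
                      ((or (equal? c #\-) (equal? c #\)))
                       (string-append result ","))
                      (#t (string-append result (string c)))))))))

(element tel
  ;; inspect the grove's character data, then emit one literal sosofo
  (literal (tel-clean-string (data (current-node)))))
```

For "(44) 1733-378-777" the helper yields the string
"44, 1733,378,777", which literal then turns into a sosofo.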

 
I hope that this (rather lengthy) explanation helps.

Cheers,
Tom


-- 
Tom Moertel <tmoertel@xxxxxxxxxx>
Agnew Moyer Smith Inc.
412.322.6333 tel
412.322.6350 fax



