Question: Tokenizing CDATA attribute values

Subject: Question: Tokenizing CDATA attribute values
From: David Megginson <dmeggins@xxxxxxxxxx>
Date: Mon, 16 Jun 1997 13:37:23 -0400

W. Eliot Kimber writes:

 > In order to do some HyTime processing, I need to tokenize CDATA
 > attribute values.  Is there a way to do this with JADE? (Or rather,
 > how would I do this with JADE?).  Given a full DSSSL
 > implementation, my first approach would be use the data tokenizer
 > grove construction facility (the "word-parse" function) to tokenize
 > the attribute value and then address that, but JADE doesn't support
 > word-parse.

To do this in a conventional Scheme/LISP way, you could start with
(string->list), which explodes the string into a list of characters,
and (list->string), which reassembles a list of characters into a
string.  Both of these are currently missing from Jade, so here are my
implementations (James: for speed, it would be _very_ useful to have
these implemented in C++):

  ;;
  ;; Convert a string into a list of characters.
  ;; (ISO/IEC 10179:1996, clause 8.5.9.9)
  ;;
  (define (string->list str)
    (let loop ((chars '())
               (k (- (string-length str) 1)))
      (if (< k 0)
          chars
          (loop (cons (string-ref str k) chars) (- k 1)))))

  ;;
  ;; Convert a list of characters into a string.
  ;; (ISO/IEC 10179:1996, clause 8.5.9.9)
  ;;
  (define (list->string chars)
    (let loop ((cl chars)
               (str ""))
      (if (null? cl)
          str
          (loop (cdr cl)
                (string-append str (string (car cl)))))))
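
As a quick sanity check, the two functions should be inverses of each
other.  The expressions below are only a sketch: they assume the two
definitions above have been loaded, and the strings are made-up test
values rather than anything from the original question.

```scheme
;; Assumes (string->list) and (list->string) as defined above.
;; Each expression is followed by the value it should evaluate to.
(string->list "abc")                 ; ==> (#\a #\b #\c)
(list->string (list #\a #\b #\c))    ; ==> "abc"
(list->string (string->list ""))     ; ==> ""
```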

Now, given these, I can construct a simple routine for splitting a
string into (string) tokens:

  ;;
  ;; Given a string containing delimited tokens, return a list
  ;; of the tokens in string form.
  ;;
  ;; The second (optional) argument is a list of characters that should
  ;; be treated as whitespace.
  ;;
  ;; Requires (string->list) and (list->string).
  ;;
  (define (split str #!optional (whitespace '(#\space)))
                                          ; Top-level recursive loop.
    (let loop ((characters (string->list str))
               (current-word '())
               (tokens '()))

                                          ; If there are no characters left,
                                          ; then we're done!
       (cond ((null? characters)
                                          ; Is there a token in progress?
              (if (null? current-word)
                  (reverse tokens)
                  (reverse (cons (list->string (reverse current-word))
                                 tokens))))

                                          ; If there are characters left,
                                          ; then keep going.
             (#t
              (let ((c (car characters))
                    (rest (cdr characters)))

                                          ; Are we reading a space?
                (cond ((member c whitespace)
                       (if (null? current-word)
                           (loop rest '() tokens)
                           (loop rest
                                 '()
                                 (cons (list->string (reverse current-word))
                                       tokens))))

                                          ; We are reading a non-space
                      (#t
                       (loop rest (cons c current-word) tokens))))))))


Actually, it's not all that simple unless you're a big LISP fan (like
I am), but it does seem to work.  The first argument is a string to
split, and the second argument is an (optional) list of characters to
be treated as token separators, which defaults to '(#\space):

  (split "  this that the other    thing")
    ==> '("this" "that" "the" "other" "thing")

  (split "alpha | beta | gamma | delta" '(#\|))
    ==> '("alpha " " beta " " gamma " " delta")

  (split "alpha | beta | gamma | delta" '(#\| #\space))
    ==> '("alpha" "beta" "gamma" "delta")
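
To tie this back to the original question, here is a minimal sketch of
applying (split) to an attribute value.  It relies on the standard DSSSL
(attribute-string) procedure, which returns #f when the attribute is
absent or implied.  The helper name attribute-tokens and the attribute
name "names" are hypothetical, chosen only for illustration.

```scheme
;; Hypothetical helper, not part of DSSSL or Jade.
;; Requires (split) as defined above.
;; Returns the tokens of attribute NAME on NODE as a list of strings,
;; or the empty list if (attribute-string) returns #f.
(define (attribute-tokens name node)
  (let ((value (attribute-string name node)))
    (if value
        (split value)
        '())))

;; Example use inside a construction rule ("names" is a made-up
;; attribute name):
;;   (attribute-tokens "names" (current-node))
```

Returning the empty list for a missing attribute is a design choice
that lets callers loop over the result without first testing for #f.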


Happy hunting,


David

-- 
David Megginson                 ak117@xxxxxxxxxxxxxxxxxxx
Microstar Software Ltd.         dmeggins@xxxxxxxxxxxxx
University of Ottawa            dmeggins@xxxxxxxxxx
        http://www.uottawa.ca/~dmeggins

 DSSSList info and archive:  http://www.mulberrytech.com/dsssl/dssslist

