Subject: Question: Tokenizing CDATA attribute values
From: David Megginson <dmeggins@xxxxxxxxxx>
Date: Mon, 16 Jun 1997 13:37:23 -0400

W. Eliot Kimber writes:
> In order to do some HyTime processing, I need to tokenize CDATA
> attribute values. Is there a way to do this with JADE? (Or rather,
> how would I do this with JADE?). Given a full DSSSL
> implementation, my first approach would be to use the data tokenizer
> grove construction facility (the "word-parse" function) to tokenize
> the attribute value and then address that, but JADE doesn't support
> word-parse.

To do this in a conventional Scheme/LISP way, you could start with
(string->list), which explodes the string into a list of characters,
and (list->string), which reassembles a list of characters into a
string. Both of these are currently missing from Jade, so here are my
implementations (James: for speed purposes, it would be _very_ useful
to have these implemented in C++):

;;
;; Convert a string into a list of characters.
;; (ISO/IEC 10179:1996, clause 8.5.9.9)
;;
(define (string->list str)
  (let loop ((chars '())
             (k (- (string-length str) 1)))
    (if (< k 0)
        chars
        (loop (cons (string-ref str k) chars) (- k 1)))))

;;
;; Convert a list of characters into a string.
;; (ISO/IEC 10179:1996, clause 8.5.9.9)
;;
(define (list->string chars)
  (let loop ((cl chars)
             (str ""))
    (if (null? cl)
        str
        (loop (cdr cl)
              (string-append str (string (car cl)))))))

Now, given these, I can construct a simple routine for splitting a
string into (string) tokens:

;;
;; Given a string containing delimited tokens, return a list
;; of the tokens in string form.
;;
;; The second (optional) argument is a list of characters that should
;; be treated as whitespace.
;;
;; Requires (string->list) and (list->string).
;;
(define (split str #!optional (whitespace '(#\space)))
  ; Top-level recursive loop.
  (let loop ((characters (string->list str))
             (current-word '())
             (tokens '()))
    ; If there are no characters left,
    ; then we're done!
    (cond ((null? characters)
           ; Is there a token in progress?
           (if (null? current-word)
               (reverse tokens)
               (reverse (cons (list->string (reverse current-word))
                              tokens))))
          ; If there are characters left,
          ; then keep going.
          (#t
           (let ((c (car characters))
                 (rest (cdr characters)))
             ; Are we reading a space?
             (cond ((member c whitespace)
                    (if (null? current-word)
                        (loop rest '() tokens)
                        (loop rest
                              '()
                              (cons (list->string (reverse current-word))
                                    tokens))))
                   ; We are reading a non-space
                   (#t
                    (loop rest (cons c current-word) tokens))))))))

Actually, it's not all that simple unless you're a big LISP fan (like
I am), but it does seem to work. The first argument is the string to
split, and the second (optional) argument is a list of characters to
be treated as token separators, which defaults to '(#\space):

  (split " this that the other thing")
  ==> '("this" "that" "the" "other" "thing")

  (split "alpha | beta | gamma | delta" '(#\|))
  ==> '("alpha " " beta " " gamma " " delta")

  (split "alpha | beta | gamma | delta" '(#\| #\space))
  ==> '("alpha" "beta" "gamma" "delta")

In the second example only the vertical bar is a separator, so the
spaces stay in the tokens; the third adds #\space to the separator
list to drop them.

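To tie this back to the original question, here is a rough sketch of
how (split) might be applied to a CDATA attribute value in a Jade
style sheet. The name (attribute-tokens) and the attribute "linkends"
are made up for illustration; the sketch assumes (split) as defined
above and relies on (attribute-string) returning #f when the
attribute has no value. I have not tried it beyond that.

```scheme
;; Hypothetical helper: return the space-separated tokens of the
;; attribute NAME on node ND as a list of strings, or the empty list
;; if the attribute is not specified.
;; Assumes (split) from this message; "attribute-tokens" is not a
;; DSSSL or Jade built-in.
(define (attribute-tokens name nd)
  (let ((value (attribute-string name nd)))
    (if value
        (split value)
        '())))

;; Example call from inside a construction rule
;; ("linkends" is only an illustrative attribute name):
;;   (attribute-tokens "linkends" (current-node))
```

The resulting list can then be walked with an ordinary named-let
loop, one token at a time.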
Happy hunting,
David
--
David Megginson ak117@xxxxxxxxxxxxxxxxxxx
Microstar Software Ltd. dmeggins@xxxxxxxxxxxxx
University of Ottawa dmeggins@xxxxxxxxxx
http://www.uottawa.ca/~dmeggins
DSSSList info and archive: http://www.mulberrytech.com/dsssl/dssslist