Subject: Question: Tokenizing CDATA attribute values
From: David Megginson <dmeggins@xxxxxxxxxx>
Date: Mon, 16 Jun 1997 13:37:23 -0400

W. Eliot Kimber writes:

 > In order to do some HyTime processing, I need to tokenize CDATA
 > attribute values.  Is there a way to do this with JADE?  (Or rather,
 > how would I do this with JADE?).  Given a full DSSSL
 > implementation, my first approach would be to use the data tokenizer
 > grove construction facility (the "word-parse" function) to tokenize
 > the attribute value and then address that, but JADE doesn't support
 > word-parse.

To do this in a conventional Scheme/LISP way, you could start with
(string->list), which explodes the string into a list of characters,
and (list->string), which reassembles a list of characters into a
string.  Both of these are currently missing from Jade, so here are my
implementations (James: for speed purposes, it would be _very_ useful
to have these implemented in C++):

  ;;
  ;; Convert a string into a list of characters.
  ;; (ISO/IEC 10179:1996, clause 8.5.9.9)
  ;;
  (define (string->list str)
    (let loop ((chars '())
               (k (- (string-length str) 1)))
      (if (< k 0)
          chars
          (loop (cons (string-ref str k) chars) (- k 1)))))

  ;;
  ;; Convert a list of characters into a string.
  ;; (ISO/IEC 10179:1996, clause 8.5.9.9)
  ;;
  (define (list->string chars)
    (let loop ((cl chars)
               (str ""))
      (if (null? cl)
          str
          (loop (cdr cl)
                (string-append str (string (car cl)))))))

Now, given these, I can construct a simple routine for splitting a
string into (string) tokens:

  ;;
  ;; Given a string containing delimited tokens, return a list
  ;; of the tokens in string form.
  ;;
  ;; The second (optional) argument is a list of characters that should
  ;; be treated as whitespace.
  ;;
  ;; Requires (string->list) and (list->string).
  ;;
  (define (split str #!optional (whitespace '(#\space)))
    ; Top-level recursive loop.
    (let loop ((characters (string->list str))
               (current-word '())
               (tokens '()))
      ; If there are no characters left,
      ; then we're done!
      (cond ((null? characters)
             ; Is there a token in progress?
             (if (null? current-word)
                 (reverse tokens)
                 (reverse (cons (list->string (reverse current-word))
                                tokens))))
            ; If there are characters left,
            ; then keep going.
            (#t
             (let ((c (car characters))
                   (rest (cdr characters)))
               ; Are we reading a space?
               (cond ((member c whitespace)
                      (if (null? current-word)
                          (loop rest '() tokens)
                          (loop rest
                                '()
                                (cons (list->string (reverse current-word))
                                      tokens))))
                     ; We are reading a non-space
                     (#t
                      (loop rest (cons c current-word) tokens))))))))

Actually, it's not all that simple unless you're a big LISP fan (like
I am), but it does seem to work.  The first argument is the string to
split, and the second argument is an (optional) list of characters to
be treated as token separators, which defaults to '(#\space).  Runs of
separators, and separators at the start or end of the string, never
produce empty tokens:

  (split " this that the other thing")
    ==> '("this" "that" "the" "other" "thing")

  (split "alpha | beta | gamma | delta" '(#\|))
    ==> '("alpha " " beta " " gamma " " delta")

  (split "alpha | beta | gamma | delta" '(#\| #\space))
    ==> '("alpha" "beta" "gamma" "delta")


Happy hunting,


David

-- 
David Megginson                 ak117@xxxxxxxxxxxxxxxxxxx
Microstar Software Ltd.         dmeggins@xxxxxxxxxxxxx
University of Ottawa            dmeggins@xxxxxxxxxx
        http://www.uottawa.ca/~dmeggins

DSSSList info and archive:  http://www.mulberrytech.com/dsssl/dssslist
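
For the original question (tokenizing a CDATA attribute value), one way
to apply (split) might look like the sketch below.  It assumes the
(split) procedure from this message is loaded, and it relies on the
standard DSSSL (attribute-string) procedure, which returns #f when the
attribute is not specified.  The name attribute-tokens and the
"linkends" attribute in the usage line are illustrative assumptions,
not part of Jade.

```scheme
;; Hypothetical helper (not part of Jade): return the tokens of the
;; attribute NAME on node ND as a list of strings, or '() when the
;; attribute is not specified.  Assumes (split) as defined in this
;; message and the standard DSSSL (attribute-string) procedure.
(define (attribute-tokens name nd)
  (let ((value (attribute-string name nd)))
    (if value
        (split value)
        '())))
```

Inside a construction rule this could be called as
(attribute-tokens "linkends" (current-node)); the tokens come back as
plain strings, so resolving them to elements is left to the caller.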