Automatic Translation of Formal Specifications

      (define (state0 c)
        (case c
          ((#\`) (consumeChar) (accept 'backquote))
          ((#\') (consumeChar) (accept 'quote))
          ((#\]) (consumeChar) (accept 'rbracket))
          ((#\[) (consumeChar) (accept 'lbracket))
          ((#\)) (consumeChar) (accept 'rparen))
          ((#\() (consumeChar) (accept 'lparen))
          ((#\tab #\newline #\vtab #\page #\return #\space)
           (consumeChar)
           (begin
             (set! string_accumulator_length 0)
             (state0 (scanChar))))
          ((#\;) (consumeChar) (state201 (scanChar)))
          ((#\#) (consumeChar) (state200 (scanChar)))
          ((#\0 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)
           (consumeChar)
           (state131 (scanChar)))
          ((#\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m
            #\n #\o #\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z
            #\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M
            #\N #\O #\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z
            #\! #\$ #\% #\& #\* #\/ #\: #\< #\= #\> #\? #\^ #\_
            #\~ #\@ #\|)
           (consumeChar)
           (state68 (scanChar)))
          ((#\) (consumeChar) (state67 (scanChar)))
          ((#\+) (consumeChar) (state61 (scanChar)))
          ((#\-) (consumeChar) (state60 (scanChar)))
          ((#\1) (consumeChar) (state57 (scanChar)))
          ((#\.) (consumeChar) (state56 (scanChar)))
          ((#\,) (consumeChar) (state5 (scanChar)))
          ((#\") (consumeChar) (state4 (scanChar)))
          (else
           (if ((lambda (c)
                  (and (char? c)
                       (> (char->integer c) 127)
                       (let ((cat (char-general-category c)))
                         (memq cat
                               '(Lu Ll Lt Lm Lo Mn Nl No
                                 Pd Pc Po Sc Sm Sk So Co)))))
                c)
               (begin (consumeChar) (state68 (scanChar)))
               (if (eof-object? c)
                   (begin (consumeChar) (accept 'eofobj))
                   (if ((lambda (c) (and (char? c) (char-whitespace? c)))
                        c)
                       (begin
                         (consumeChar)
                         (begin
                           (set! string_accumulator_length 0)
                           (state0 (scanChar))))
                       (if ((lambda (c)
                              (and (char? c) (char=? c (integer->char 133))))
                            c)
                           (begin
                             (consumeChar)
                             (begin
                               (set! string_accumulator_length 0)
                               (state0 (scanChar))))
                           (scannerError errIncompleteToken))))))))

The Scheme programming language has an unusually complicated lexical syntax. In the Larceny implementation of R5RS and R6RS Scheme, that lexical analyzer is implemented by a state machine with 231 states.

If I were maintaining that code by hand, I'd be very upset when a new Scheme standard or SRFI came along and forced me to make some small change to the lexical syntax.

Fortunately, the code you see above was written by machine, by a lexical analyzer generator.

What I maintain by hand is a list of regular expressions that describe all the tokens of Scheme. Here, for example, is a regular expression for identifiers:

       (id (! ((! ; constituent
                  %a..z %A..Z isOtherConstituent
                  ; special initial
                  #\! #\$ #\% #\& #\* #\/ #\: #\< #\= #\> #\? #\^ #\_ #\~
                  ; inline hex escape
                  (#\#\x (! %0..9 %a..f %A..F)
                           (* (! %0..9 %a..f %A..F)) #\;)
                  (#\isAscii)                               ; backslash escape
                  (#\# #\%)                                ; MzScheme randomness
                  #\. #\@ #\|                 ; leading dot or @ or vertical bar
                  (#\+ #\:)                                         ; leading +:
                  (#\- #\:)                                         ; leading -:
                  ; peculiar prefix
                  (#\- #\>))
               (* (! %a..z %A..Z isOtherConstituent
                     #\! #\$ #\% #\& #\* #\/ #\: #\< #\= #\> #\? #\^ #\_ #\~
                     (#\#\x (! %0..9 %a..f %A..F)
                              (* (! %0..9 %a..f %A..F)) #\;)
                     (#\isAscii)                            ; backslash escape
                     (#\# #\%)                             ; MzScheme randomness
                     #\|                     ; embedded or trailing vertical bar
                     %0..9 #\+ #\- #\. #\@ isNdMcMe)))
              ; peculiar identifiers
              #\+ #\- (#\. #\. #\.)
              (#\- #\-)                                                    ;  --
              (#\- #\1 #\+)                                                ; -1+
              (#\1 #\+)                                                    ;  1+
              (#\1 #\-)                                                    ;  1-
              ))

When some new standard or SRFI forces a change in the lexical syntax, I just make some easy change to the list of regular expressions. Then I run that list of regular expressions through a lexical analyzer generator that spits out executable Scheme code for the huge state machine.

The Scheme code produced by that easy process is always correct. I don't have to spend any time at all on debugging it.