We begin our systematic study of programming languages by learning how to specify them. We will learn how to specify these two things:

syntax
semantics (meaning or behavior)

Our specifications will begin with these basic concepts:

grammars
abstract syntax
names
values
environments
scope
binding
translation
interpretation

We will need to represent the following data types:

abstract syntax trees
environments
procedures

We will use a confusing mixture of concrete and abstract data types. For example, our abstract syntax trees will generally be concrete data types, but our environments will generally be abstract data types.

(Furthermore, we won't have much to say about the concrete data types used in real implementations of real programming languages. Although the concrete representations that compiler XYZ uses to represent the abstract syntax trees or environments of language ABC on microprocessor IJK under operating system UVW are sort of interesting, if that's the sort of thing that interests you, knowledge of that trivia would not tell you much about the representations that compiler WXY uses to represent the abstract syntax trees or environments of language BCD on microprocessor JKL under operating system VWX.)

Data Abstraction

A data type consists of two parts:

interface
implementation

For an abstract data type, the interface consists of a set of operations that clients are allowed to use when manipulating values of the abstract data type. With abstract data types, clients do not need to know how the data is represented. That means the representation can be changed without breaking any client code. In other words, client code is representation-independent.

Although most of our data types will be abstract, we will still need to distinguish between the idealized (and often mathematical) values we like to talk about, such as sets and procedures, and their representations in some particular implementation of the data type. We therefore need a representation-independent notation for the representation of an idealized value x.

We write ⌈x⌉ to indicate the representation of x.

Specifying Data via Interfaces

The abstract data type of natural numbers:

zero : → Nat
is-zero? : Nat → Bool
successor : Nat → Nat
predecessor : Nat → Nat

(zero) = ⌈0⌉

(is-zero? ⌈n⌉) = #t (n = 0)

(is-zero? ⌈n⌉) = #f (n > 0)

(successor ⌈n⌉) = ⌈n+1⌉ (n ≥ 0)

(predecessor ⌈n+1⌉) = ⌈n⌉ (n ≥ 0)
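One way to satisfy this specification is a unary representation, in which ⌈0⌉ is the empty list and ⌈n+1⌉ is a list one element longer than ⌈n⌉. The sketch below is only an illustration: the choice of #t as the list element is arbitrary, and plus is a hypothetical client procedure that is not part of the interface. Because plus uses nothing but the four interface operations, it should work unchanged with any other representation that satisfies the equations.

```scheme
;;; Unary representation: the number n is a list of n copies of #t.

;;; zero : -> Nat
(define zero
  (lambda () '()))

;;; is-zero? : Nat -> Bool
(define is-zero?
  (lambda (n) (null? n)))

;;; successor : Nat -> Nat
(define successor
  (lambda (n) (cons #t n)))

;;; predecessor : Nat -> Nat
;;; The specification says nothing about (predecessor (zero)),
;;; so calling it on zero is a client error.
(define predecessor
  (lambda (n) (cdr n)))

;;; plus : Nat * Nat -> Nat
;;; Hypothetical client code, written using only the interface.
(define plus
  (lambda (x y)
    (if (is-zero? x)
        y
        (successor (plus (predecessor x) y)))))
```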

The abstract data type of natural numbers is far from trivial. For example, the built-in data types of most programming languages can represent only the smallest natural numbers.

    int main (int argc, char* argv[]) {
      unsigned int n = 1;
      while (n != 0) n = n + 1;
    }

In many implementations of C, the loop above terminates in a matter of seconds. Even if unsigned int were changed to a 64-bit unsigned long int, the loop would probably terminate in a few hundred years.

In Scheme, the corresponding loop would not terminate before our sun goes nova, and that's not just because Scheme is slow. To implement arbitrarily large natural numbers with reasonable efficiency, most implementations of Scheme use at least two different representations for natural numbers. That fact does not matter to Scheme programmers, because Scheme code is representation-independent (at least with respect to natural numbers).

Representation Strategies for Data Types

We will use variations of the environment data type throughout this course.

The Environment Interface

Here is an algebraic specification of the environment ADT:

empty-env : → Env
extend-env : Var × Val × Env → Env
apply-env : Env × Var → Val

(apply-env (extend-env var val env) var) = val
(apply-env (extend-env var1 val env) var2)
= (apply-env env var2) (var1 ≠ var2)

Both empty-env and extend-env are constructors of environments. The only observer is apply-env. An algebraic specification shows only the observable behavior of an abstract data type, so the two equations above both specify how the observer behaves. The only thing that matters about the constructors is how they interact with observers and other operations of the data type.

Notice that the behavior of apply-env on an empty environment is not specified. That means it would be an error for client code to pass an empty environment to apply-env.

We could prove that it would be an error for client code to try to compute (apply-env env x) if x is not bound in env. That proof would proceed by induction on the size of the expression that constructs env.

Data Structure Representation

The environment ADT has two constructors, so every environment can be built by an expression generated by the following grammar:

Env-exp
::= (empty-env)
::= (extend-env Identifier Scheme-value Env-exp)

That grammar suggests the following representation of environments:

Env
::= (empty-env)
::= (extend-env Var SchemeVal Env)
Var
::= Sym

Putting that representation together with the algebraic specification, we get the following implementation:

;;; empty-env : -> Env

(define empty-env
  (lambda ()
    (list 'empty-env)))

;;; extend-env : Var * SchemeVal * Env -> Env

(define extend-env
  (lambda (var val env)
    (list 'extend-env var val env)))

;;; apply-env : Env * Var -> SchemeVal

(define apply-env
  (lambda (env search-var)
    (cond ((equal? (car env) 'empty-env)
           (report-no-binding-found search-var))
          ((equal? (car env) 'extend-env)
           (let ((saved-var (car (cdr env)))
                 (saved-val (car (cdr (cdr env))))
                 (saved-env (car (cdr (cdr (cdr env))))))
             (if (equal? saved-var search-var)
                 saved-val
                 (apply-env saved-env search-var))))
          (else
           (report-invalid-env env)))))
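The implementation above calls two error-reporting procedures that are not shown. A minimal sketch of them follows. It assumes the error procedure that most Scheme implementations provide, and the wording of the messages is invented for illustration; in the EOPL language one would presumably call eopl:error instead.

```scheme
;;; report-no-binding-found : Var -> (does not return)
;;; Signals an error for a variable that has no binding.
(define report-no-binding-found
  (lambda (search-var)
    (error "apply-env: no binding for" search-var)))

;;; report-invalid-env : SchemeVal -> (does not return)
;;; Signals an error for a datum that does not represent an environment.
(define report-invalid-env
  (lambda (env)
    (error "apply-env: bad environment" env)))
```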

Here are some simple tests:

> (define env0 (empty-env))
> (define env1 (extend-env 'x 11 env0))
> (define env2 (extend-env 'y 22 env1))
> (define env3 (extend-env 'x 33 env2))
> (apply-env env3 'x)
33
> (apply-env env3 'y)
22
> (apply-env env2 'x)       ; environments are immutable
11

The design you have just seen illustrates the

Interpreter Recipe

Look at a datum.

Decide what kind of data it represents.

Extract the components of the datum and do the right thing.

Procedural Representation

When an immutable ADT has only one observer, we can represent its values as procedures whose arguments are the other arguments that are passed to the observer.

;;; empty-env : -> Env

(define empty-env
  (lambda ()
    (lambda (search-var)
      (report-no-binding-found search-var))))

;;; extend-env : Var * SchemeVal * Env -> Env

(define extend-env
  (lambda (saved-var saved-val saved-env)
    (lambda (search-var)
      (if (equal? saved-var search-var)
          saved-val
          (apply-env saved-env search-var)))))

;;; apply-env : Env * Var -> SchemeVal

(define apply-env
  (lambda (env search-var)
    (env search-var)))

Do we need to write new tests?

This idea generalizes to immutable ADTs with more than one observer. Instead of representing the values of the ADT as procedures, we would represent them as objects, with one method for each of the observers.
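As a sketch of that generalization, suppose the environment interface had a second observer, has-binding?, that reports whether a variable is bound. That observer is hypothetical (it is not part of the interface specified earlier), and error is assumed to be provided by the Scheme implementation. Each environment is now a procedure that takes a message naming the observer and dispatches on it, which makes it a small object with two methods.

```scheme
;;; empty-env : -> Env
(define empty-env
  (lambda ()
    (lambda (message search-var)
      (cond ((eq? message 'apply-env)
             (error "apply-env: no binding for" search-var))
            ((eq? message 'has-binding?) #f)
            (else (error "unknown message" message))))))

;;; extend-env : Var * SchemeVal * Env -> Env
(define extend-env
  (lambda (saved-var saved-val saved-env)
    (lambda (message search-var)
      (cond ((eq? message 'apply-env)
             (if (equal? saved-var search-var)
                 saved-val
                 (apply-env saved-env search-var)))
            ((eq? message 'has-binding?)
             (or (equal? saved-var search-var)
                 (has-binding? saved-env search-var)))
            (else (error "unknown message" message))))))

;;; apply-env : Env * Var -> SchemeVal
(define apply-env
  (lambda (env search-var)
    (env 'apply-env search-var)))

;;; has-binding? : Env * Var -> Bool   (hypothetical second observer)
(define has-binding?
  (lambda (env search-var)
    (env 'has-binding? search-var)))
```

Client code still calls apply-env and has-binding? as ordinary procedures, so it cannot tell that the environments are procedures.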

Interfaces for Recursive Data Types

The simplest programming language we will study is the lambda calculus, whose syntax is specified by

Lc-exp
::= Identifier
::= (lambda (Identifier) Lc-exp)
::= (Lc-exp Lc-exp)

To manipulate programs written in the lambda calculus, we may regard the syntax of the lambda calculus as an abstract data type with three kinds of constructors (one for each production in the grammar above) and two kinds of observers: predicates and extractors.

The Lambda Calculus (as an ADT)

var-exp : Var → Lc-exp
lambda-exp : Var × Lc-exp → Lc-exp
app-exp : Lc-exp × Lc-exp → Lc-exp

var-exp? : Lc-exp → Bool
lambda-exp? : Lc-exp → Bool
app-exp? : Lc-exp → Bool

var-exp->var : Lc-exp → Var
lambda-exp->bound-var : Lc-exp → Var
lambda-exp->body : Lc-exp → Lc-exp
app-exp->rator : Lc-exp → Lc-exp
app-exp->rand : Lc-exp → Lc-exp

The design you have just seen illustrates a recipe for

Designing an Interface for a Recursive Data Type

Include one constructor for each kind of data.

Include one predicate for each kind of data.

Include one extractor for each piece of data that is passed to a constructor of the data type.
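To see such an interface at work, here is a sketch of a client procedure, occurs-free?, that decides whether a variable occurs free in a lambda calculus expression. So that the example is self-contained, it comes with one possible representation, tagged lists, which is an assumption made only for illustration. The client uses nothing but the interface, so it would not need to change if the representation did.

```scheme
;;; Constructors (tagged-list representation, chosen for illustration)
(define var-exp (lambda (var) (list 'var-exp var)))
(define lambda-exp
  (lambda (bound-var body) (list 'lambda-exp bound-var body)))
(define app-exp (lambda (rator rand) (list 'app-exp rator rand)))

;;; Predicates
(define var-exp?
  (lambda (x) (and (pair? x) (eq? (car x) 'var-exp))))
(define lambda-exp?
  (lambda (x) (and (pair? x) (eq? (car x) 'lambda-exp))))
(define app-exp?
  (lambda (x) (and (pair? x) (eq? (car x) 'app-exp))))

;;; Extractors (no error checking in this sketch)
(define var-exp->var (lambda (x) (cadr x)))
(define lambda-exp->bound-var (lambda (x) (cadr x)))
(define lambda-exp->body (lambda (x) (caddr x)))
(define app-exp->rator (lambda (x) (cadr x)))
(define app-exp->rand (lambda (x) (caddr x)))

;;; occurs-free? : Var * Lc-exp -> Bool
;;; Client code: follows the three productions of the grammar.
(define occurs-free?
  (lambda (search-var exp)
    (cond ((var-exp? exp)
           (eq? search-var (var-exp->var exp)))
          ((lambda-exp? exp)
           (and (not (eq? search-var (lambda-exp->bound-var exp)))
                (occurs-free? search-var (lambda-exp->body exp))))
          (else
           (or (occurs-free? search-var (app-exp->rator exp))
               (occurs-free? search-var (app-exp->rand exp)))))))
```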

A Tool for Defining Recursive Data Types

The define-datatype syntax provided by our EOPL language helps to automate the recipe above. Instead of defining the predicates and extractors, however, the define-datatype approach would have us use a cases expression to distinguish between the different kinds of data and to extract subcomponents of values.

(define-datatype lc-exp lc-exp?
  (var-exp
   (var identifier?))
  (lambda-exp
   (bound-var identifier?)
   (body lc-exp?))
  (app-exp
   (rator lc-exp?)
   (rand lc-exp?)))

The data types defined by define-datatype are not abstract if their interface tells clients to use the cases syntax, because that syntax works only if the data type was defined using define-datatype. If the implementation of the data type were changed so it no longer used define-datatype, then client code would stop working. In other words, clients that use the cases syntax are not representation-independent.

On the other hand, we can use define-datatype to implement abstract data types, so long as the interface does not reveal our use of define-datatype. For example, we can implement the Lc-exp ADT by combining the use of define-datatype above with

;;; var-exp : Var -> Lc-exp
;;; lambda-exp : Var * Lc-exp -> Lc-exp
;;; app-exp : Lc-exp * Lc-exp -> Lc-exp

;;; var-exp? : Lc-exp -> Bool

(define var-exp?
  (lambda (x)
    (and (lc-exp? x)
         (cases lc-exp x
          (var-exp (var) #t)
          (lambda-exp (bound-var body) #f)
          (app-exp (rator rand) #f)))))

;;; lambda-exp? : Lc-exp -> Bool

(define lambda-exp?
  (lambda (x)
    (and (lc-exp? x)
         (cases lc-exp x
          (var-exp (var) #f)
          (lambda-exp (bound-var body) #t)
          (app-exp (rator rand) #f)))))

;;; app-exp? : Lc-exp -> Bool

(define app-exp?
  (lambda (x)
    (and (lc-exp? x)
         (cases lc-exp x
          (var-exp (var) #f)
          (lambda-exp (bound-var body) #f)
          (app-exp (rator rand) #t)))))

;;; var-exp->var : Lc-exp -> Var 

(define var-exp->var
  (lambda (x)
    (cases lc-exp x
     (var-exp (var) var)
     (lambda-exp (bound-var body)
      (eopl:error 'var-exp->var "illegal argument"))
     (app-exp (rator rand)
      (eopl:error 'var-exp->var "illegal argument")))))

;;; lambda-exp->bound-var : Lc-exp -> Var 

(define lambda-exp->bound-var
  (lambda (x)
    (cases lc-exp x
     (var-exp (var)
      (eopl:error 'lambda-exp->bound-var "illegal argument"))
     (lambda-exp (bound-var body)
      bound-var)
     (app-exp (rator rand)
      (eopl:error 'lambda-exp->bound-var "illegal argument")))))

;;; lambda-exp->body : Lc-exp -> Lc-exp 

(define lambda-exp->body
  (lambda (x)
    (cases lc-exp x
     (var-exp (var)
      (eopl:error 'lambda-exp->body "illegal argument"))
     (lambda-exp (bound-var body)
      body)
     (app-exp (rator rand)
      (eopl:error 'lambda-exp->body "illegal argument")))))

;;; app-exp->rator : Lc-exp -> Lc-exp 

(define app-exp->rator
  (lambda (x)
    (cases lc-exp x
     (var-exp (var)
      (eopl:error 'app-exp->rator "illegal argument"))
     (lambda-exp (bound-var body)
      (eopl:error 'app-exp->rator "illegal argument"))
     (app-exp (rator rand)
      rator))))

;;; app-exp->rand : Lc-exp -> Lc-exp 

(define app-exp->rand
  (lambda (x)
    (cases lc-exp x
     (var-exp (var)
      (eopl:error 'app-exp->rand "illegal argument"))
     (lambda-exp (bound-var body)
      (eopl:error 'app-exp->rand "illegal argument"))
     (app-exp (rator rand)
      rand))))

The verbosity of the implementation above shows you why we will usually give up on abstract data types when we use define-datatype.

It would not be hard to design a datatype definition facility that resembles define-datatype but is far more convenient for defining abstract data types. If I were the author of our textbook, we would use a different datatype definition facility and would make far greater use of abstract data types. If we were to use abstract data types throughout this course, however, then most of the code in the book wouldn't work. Given the unfortunate choice between following the book and using its concrete data types, versus throwing the book away so we can use abstract data types, we have reluctantly concluded we should follow the book.

For a while, at least.

The data types defined by define-datatype may be mutually recursive, which is a feature we'll need for languages more complex than lambda calculus.

Abstract Syntax and Its Representation

Most computer programs are represented as plain text. For example, the following text is a computer program written in Standard ML:

    3 + 4 * 5 - 6

Plain text is a convenient representation when you're writing, reading, editing, or printing a program, but it is an extremely inconvenient (and inefficient) representation when you're executing a program. It is also an inconvenient representation when you're trying to describe the semantics (meaning) of a program, or when you're trying to prove that a program has some property, or when you're trying to study the principles of programming languages in general.

When you interpret or compile the plain text representation of a program, the interpreter or compiler starts out by translating the program's text into a more convenient representation: an abstract syntax tree. The rest of the interpreter or compiler operates on the abstract syntax tree, referring to the original plain text representation only in error messages.

For a simple language of arithmetic expressions, the abstract syntax trees might belong to the datatype defined by

  (define-datatype arithmetic-exp arithmetic-exp?
    (constant-exp
      (num integer?))
    (addition-exp
      (operand1 arithmetic-exp?)
      (operand2 arithmetic-exp?))
    (subtraction-exp
      (operand1 arithmetic-exp?)
      (operand2 arithmetic-exp?))
    (multiplication-exp
      (operand1 arithmetic-exp?)
      (operand2 arithmetic-exp?))
    (var-exp
      (id symbol?)))

For efficiency, the translation from plain text to abstract syntax trees is usually divided into two steps: scanning and parsing. The algorithms used for scanning and parsing are beyond the scope of this course, but you can learn about them by taking the compiler course, CS G262. In this course we will use semi-automatic scanner and parser generators to define a single procedure, usually named scan&parse, that takes a string containing the plain text representation of a program and returns its abstract syntax tree. For example,

    (scan&parse "3 + 4 * 5 - 6")

would return the value of the following expression:

    (addition-exp
      (constant-exp 3)
      (subtraction-exp
        (multiplication-exp
          (constant-exp 4)
          (constant-exp 5))
        (constant-exp 6)))

Unparsing is usually easier than parsing:

    (define unparse-arithmetic-exp
      (lambda (exp)
        (cases arithmetic-exp exp
         (constant-exp (n)
           (number->string n))
         (addition-exp (exp1 exp2)
           (string-append (unparse-arithmetic-exp exp1)
                          " + "
                          (unparse-arithmetic-exp exp2)))
         (subtraction-exp (exp1 exp2)
           (string-append (unparse-arithmetic-exp exp1)
                          " - "
                          (unparse-arithmetic-exp exp2)))
         (multiplication-exp (exp1 exp2)
           (string-append (unparse-arithmetic-exp exp1)
                          " * "
                          (unparse-arithmetic-exp exp2)))
         (var-exp (id)
           (symbol->string id)))))

For example:

> (define exp3456
    (addition-exp
      (constant-exp 3)
      (subtraction-exp
        (multiplication-exp
          (constant-exp 4)
          (constant-exp 5))
        (constant-exp 6))))
> (unparse-arithmetic-exp exp3456)
"3 + 4 * 5 - 6"

Expressions

The first programming languages we will study are expression languages. We will use SLLgen grammars to specify the syntax of these languages and the representations of their abstract syntax trees. We will then specify the semantics of these languages by writing interpreters for the abstract syntax trees. These interpreters take an environment as their second argument, which records the value of any variables that may appear free within the expression.

Specification and Implementation Strategy

(value-of exp ρ) = val

means the value of expression exp in environment ρ should be val.

The source language is the language we are defining, specifying, or implementing. The implementation language (usually Scheme with EOPL extensions) is the language in which we write our interpreters.

The front end of an interpreter or compiler translates the source language into abstract syntax trees. A compiler translates abstract syntax trees into some target language, such as Intel x86-32 machine code or JVM byte code. The abstract syntax trees or target language can then be executed by some interpreter. For example, an Intel Core 2 Duo contains an extremely efficient interpreter for Intel x86-32 machine code:

> (define add1
    (lambda (n) (+ n 1)))

> add1
#<PROCEDURE add1>

> (nasm-disassemble add1)
00000000  83FB04            cmp ebx,byte +0x4
00000003  7411              jz 0x16
00000005  C7452C04000000    mov dword [ebp+0x2c],0x4
0000000C  FF9500020000      call near [ebp+0x200]
00000012  90                nop
00000013  90                nop
00000014  EBEA              jmp short 0x0
00000016  F6C103            test cl,0x3
00000019  750A              jnz 0x25
0000001B  89CB              mov ebx,ecx
0000001D  83C304            add ebx,byte +0x4
00000020  710E              jno 0x30
00000022  83EB04            sub ebx,byte +0x4
00000025  89CB              mov ebx,ecx
00000027  B804000000        mov eax,0x4
0000002C  FF551C            call near [ebp+0x1c]
0000002F  90                nop
00000030  C3                ret
0

> (add1 (expt 10 70))
10000000000000000000000000000000000000000000000000000000000000000000001

Our interpreters will not be as efficient as the Intel Core 2 Duo, but they will be much simpler, much easier to build, and much easier to understand.

Scanning divides the plain text of a source program into meaningful substrings called tokens. The tokens are described by a lexical specification.

Parsing translates the sequence of tokens into an abstract syntax tree. The syntactically legal sequences of tokens are described by the source language's grammar.
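To make the scanning step concrete, here is a small hand-written scanner sketch. It is not the scanner used in this course; the procedure names scan, scan-chars, and scan-run and the token representation are all assumptions made for illustration. Numbers become Scheme numbers, identifiers and keywords become symbols, and every other non-blank character becomes a one-character string. As a simplification, an identifier may contain only letters and digits, so a keyword such as zero? would be split into two tokens.

```scheme
;;; scan : String -> List of tokens
(define scan
  (lambda (str)
    (scan-chars (string->list str))))

;;; scan-chars : List of characters -> List of tokens
(define scan-chars
  (lambda (chars)
    (cond ((null? chars) '())
          ((char-whitespace? (car chars))
           (scan-chars (cdr chars)))
          ((char-numeric? (car chars))
           (scan-run chars char-numeric? string->number))
          ((char-alphabetic? (car chars))
           (scan-run chars
                     (lambda (c)
                       (or (char-alphabetic? c) (char-numeric? c)))
                     string->symbol))
          (else
           (cons (string (car chars))
                 (scan-chars (cdr chars)))))))

;;; scan-run : collects the longest prefix of chars whose characters
;;; all satisfy in-token?, converts it to a token with make-token,
;;; and then scans the remaining characters.
(define scan-run
  (lambda (chars in-token? make-token)
    (let loop ((rest chars) (acc '()))
      (if (and (pair? rest) (in-token? (car rest)))
          (loop (cdr rest) (cons (car rest) acc))
          (cons (make-token (list->string (reverse acc)))
                (scan-chars rest))))))
```

For example, (scan "let x = 4 in -(x,-(1,x))") returns the list (let x "=" 4 in "-" "(" x "," "-" "(" 1 "," x ")" ")"). A parser would then turn that sequence of tokens into an abstract syntax tree.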

A parser generator is a program whose inputs include a lexical specification, a grammar, and a description of the abstract syntax trees to be constructed for each production of the grammar. The main outputs of the parser generator are a scanner and parser.

We will use the SLLgen parser generator for most of this course. For MP3, however, the file mp3-data-structures.scm will contain a hand-written scanner and a complete parser that was generated by a different parser generator. This is just to show you what a scanner and parser look like. In future assignments, where the scanners and parsers will be more complicated, you will see the lexical specifications and the grammars but will not see the scanners and parsers built from them. The main thing to remember is that scan&parse takes a string containing the plain text representation of a program, and returns the abstract syntax tree for that program.

LET: A Simple Language

Specifying the Syntax

Syntax for the LET language

`Program`	`::=`	`Expression`	`a-program (exp1)`
`Expression`	`::=`	`Number`	`const-exp (num)`
	`::=`	`-(Expression,Expression)`	`diff-exp (exp1 exp2)`
	`::=`	`zero? (Expression)`	`zero?-exp (exp1)`
	`::=`	`if Expression then Expression else Expression`	`if-exp (exp1 exp2 exp3)`
	`::=`	`Identifier`	`var-exp (var)`
	`::=`	`let Identifier = Expression in Expression`	`let-exp (var exp1 body)`

For example,

(scan&parse "let x = 4 in -(x,-(1,x))")

evaluates to the abstract syntax tree that is the result of

(a-program
  (let-exp 'x
           (const-exp 4)
           (diff-exp (var-exp 'x)
                     (diff-exp (const-exp 1)
                               (var-exp 'x)))))

Specification of Values

For any programming language, the expressed values are the possible values of an expression, and the denoted values are the values to which a variable can be bound in some environment.

For LET, the expressed and denoted values happen to be the same:

ExpVal = Int + Bool
DenVal = Int + Bool

The expressed and denoted values will be abstract data types with this algebraic specification:

num-val : Int → ExpVal
bool-val : Bool → ExpVal
expval->num : ExpVal → Int
expval->bool : ExpVal → Bool

(expval->num (num-val n)) = n
(expval->bool (bool-val b)) = b
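One representation that satisfies these two equations uses tagged lists. The sketch below is an assumption made for illustration, not necessarily the representation used in the course code, and it relies on the error procedure that most Scheme implementations provide. The specification says nothing about applying an extractor to the wrong kind of value, so the sketch treats that as an error.

```scheme
;;; num-val : Int -> ExpVal
(define num-val
  (lambda (n) (list 'num-val n)))

;;; bool-val : Bool -> ExpVal
(define bool-val
  (lambda (b) (list 'bool-val b)))

;;; expval->num : ExpVal -> Int
(define expval->num
  (lambda (val)
    (if (and (pair? val) (eq? (car val) 'num-val))
        (cadr val)
        (error "expval->num: not a number" val))))

;;; expval->bool : ExpVal -> Bool
(define expval->bool
  (lambda (val)
    (if (and (pair? val) (eq? (car val) 'bool-val))
        (cadr val)
        (error "expval->bool: not a boolean" val))))
```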

Environments

We use the following abbreviations:

ρ ranges over environments
[] denotes the empty environment
[var = val]ρ denotes (extend-env var val ρ)
[var = val] denotes [var = val][]

Specifying the Behavior of Expressions

Interface for expressions of LET

const-exp : Int → Exp
zero?-exp : Exp → Exp
if-exp : Exp × Exp × Exp → Exp
diff-exp : Exp × Exp → Exp
var-exp : Symbol → Exp
let-exp : Symbol × Exp × Exp → Exp

value-of : Exp × Env → ExpVal

Specification for three kinds of expressions

(value-of (const-exp n) ρ) = (num-val n)

(value-of (var-exp var) ρ) = (apply-env ρ var)

(value-of (diff-exp exp₁ exp₂) ρ)
= (num-val (- (expval->num (value-of exp₁ ρ)) (expval->num (value-of exp₂ ρ))))

Specifying the Behavior of Programs

For LET, specifying the behavior of programs amounts to specifying the initial environment. For most programming languages, the initial environment consists of a standard set of predefined libraries that every implementation of the language is supposed to provide. For LET, we'll mimic that by providing three predefined identifiers.

(value-of-program exp) = (value-of exp ρ₀)

where

ρ₀ = [i=1,v=5,x=10]

Specifying Conditionals

(value-of exp₁ ρ) = val₁
(expval->num val₁) = 0
---------------------------------------------
(value-of (zero?-exp exp₁) ρ) = (bool-val #t)

(value-of exp₁ ρ) = val₁
(expval->num val₁) = n
n ≠ 0
---------------------------------------------
(value-of (zero?-exp exp₁) ρ) = (bool-val #f)

(value-of exp₁ ρ) = val₁
(expval->bool val₁) = #t
--------------------------------------------------------
(value-of (if-exp exp₁ exp₂ exp₃) ρ) = (value-of exp₂ ρ)

(value-of exp₁ ρ) = val₁
(expval->bool val₁) = #f
--------------------------------------------------------
(value-of (if-exp exp₁ exp₂ exp₃) ρ) = (value-of exp₃ ρ)

Specifying `let`

(value-of exp₁ ρ) = val₁
------------------------------------
(value-of (let-exp var exp₁ body) ρ)
= (value-of body [var=val₁]ρ)
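The rules in this section can be transcribed almost line by line into an interpreter. The sketch below is a self-contained approximation in plain Scheme. It assumes tagged lists for both abstract syntax trees and expressed values, in place of whatever define-datatype or SLLgen would generate, uses procedural environments, and relies on the error procedure that most Scheme implementations provide. The helper init-env is a hypothetical name for a procedure that builds ρ₀. The result of a subtraction is wrapped with num-val so that value-of always returns an ExpVal.

```scheme
;;; Expressed values (tagged lists; extractors do no checking here)
(define num-val (lambda (n) (list 'num-val n)))
(define bool-val (lambda (b) (list 'bool-val b)))
(define expval->num (lambda (val) (cadr val)))
(define expval->bool (lambda (val) (cadr val)))

;;; Environments (procedural representation)
(define empty-env
  (lambda ()
    (lambda (search-var)
      (error "apply-env: no binding for" search-var))))
(define extend-env
  (lambda (var val env)
    (lambda (search-var)
      (if (eq? var search-var) val (env search-var)))))
(define apply-env (lambda (env var) (env var)))

;;; Abstract syntax (tagged lists)
(define a-program (lambda (exp1) (list 'a-program exp1)))
(define const-exp (lambda (num) (list 'const-exp num)))
(define var-exp (lambda (var) (list 'var-exp var)))
(define diff-exp (lambda (exp1 exp2) (list 'diff-exp exp1 exp2)))
(define zero?-exp (lambda (exp1) (list 'zero?-exp exp1)))
(define if-exp (lambda (exp1 exp2 exp3) (list 'if-exp exp1 exp2 exp3)))
(define let-exp (lambda (var exp1 body) (list 'let-exp var exp1 body)))

;;; value-of : Exp * Env -> ExpVal
(define value-of
  (lambda (exp env)
    (let ((kind (car exp)))
      (cond ((eq? kind 'const-exp) (num-val (cadr exp)))
            ((eq? kind 'var-exp) (apply-env env (cadr exp)))
            ((eq? kind 'diff-exp)
             (num-val (- (expval->num (value-of (cadr exp) env))
                         (expval->num (value-of (caddr exp) env)))))
            ((eq? kind 'zero?-exp)
             (bool-val (= 0 (expval->num (value-of (cadr exp) env)))))
            ((eq? kind 'if-exp)
             (if (expval->bool (value-of (cadr exp) env))
                 (value-of (caddr exp) env)
                 (value-of (cadddr exp) env)))
            ((eq? kind 'let-exp)
             (let ((val1 (value-of (caddr exp) env)))
               (value-of (cadddr exp)
                         (extend-env (cadr exp) val1 env))))
            (else (error "value-of: bad expression" exp))))))

;;; init-env : -> Env   (hypothetical helper; builds [i=1,v=5,x=10])
(define init-env
  (lambda ()
    (extend-env 'i (num-val 1)
      (extend-env 'v (num-val 5)
        (extend-env 'x (num-val 10)
          (empty-env))))))

;;; value-of-program : Program -> ExpVal
(define value-of-program
  (lambda (pgm)
    (value-of (cadr pgm) (init-env))))
```

Applied to the abstract syntax tree for "let x = 4 in -(x,-(1,x))", value-of-program returns (num-val 7), because 4 - (1 - 4) = 7.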

Last updated 30 January 2008.

Specifying Programming Languages