We begin our systematic study of programming languages by learning how to specify them. We will learn how to specify these two things:
Our specifications will begin with these basic concepts:
We will need to represent the following data types:
We will use a confusing mixture of concrete and abstract data types. For example, our abstract syntax trees will generally be concrete data types, but our environments will generally be abstract data types.
(Furthermore, we won't have much to say about the concrete data types used in real implementations of real programming languages. Although the concrete representations that compiler XYZ uses to represent the abstract syntax trees or environments of language ABC on microprocessor IJK under operating system UVW are sort of interesting, if that's the sort of thing that interests you, knowledge of that trivia would not tell you much about the representations that compiler WXY uses to represent the abstract syntax trees or environments of language BCD on microprocessor JKL under operating system VWX.)
A data type consists of two parts: an interface and an implementation.
For an abstract data type, the interface consists of a set of operations that clients are allowed to use when manipulating values of the abstract data type. With abstract data types, clients do not need to know how the data is represented. That means the representation can be changed without breaking any client code. In other words, client code is representation-independent.
Although most of our data types will be abstract, we will still need to distinguish between the idealized (and often mathematical) values we like to talk about, such as sets and procedures, and their representations in some particular implementation of the data type. We therefore need a representation-independent notation for the representation of an idealized value x.
We write ⌈x⌉ to indicate the representation of x.
The abstract data type of natural numbers:
zero : → Nat
is-zero? : Nat → Bool
successor : Nat → Nat
predecessor : Nat → Nat
(zero) = ⌈0⌉
(is-zero? ⌈n⌉) = #t   (n = 0)
(is-zero? ⌈n⌉) = #f   (n > 0)
(successor ⌈n⌉) = ⌈n+1⌉   (n ≥ 0)
(predecessor ⌈n+1⌉) = ⌈n⌉   (n ≥ 0)
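As a rough illustration of representation independence, here is one possible implementation of the four operations, using a unary representation in which ⌈0⌉ is the empty list and ⌈n+1⌉ is a pair whose cdr is ⌈n⌉. The representation is only one choice among many, and the names plus, nat->integer, and integer->nat are hypothetical helpers introduced for this sketch; they are not part of the specification above.

```scheme
;; Unary representation (an illustrative assumption):
;;   ⌈0⌉ = ()        ⌈n+1⌉ = (#t . ⌈n⌉)
(define zero (lambda () '()))
(define is-zero? (lambda (n) (null? n)))
(define successor (lambda (n) (cons #t n)))
(define predecessor (lambda (n) (cdr n)))   ; unspecified on ⌈0⌉; here cdr signals an error

;; Client code: written only in terms of the four operations,
;; so it keeps working if the representation above is replaced.
(define plus
  (lambda (x y)
    (if (is-zero? x)
        y
        (successor (plus (predecessor x) y)))))

;; Hypothetical conversion helpers, also written against the interface.
(define integer->nat
  (lambda (k)
    (if (= k 0)
        (zero)
        (successor (integer->nat (- k 1))))))

(define nat->integer
  (lambda (n)
    (if (is-zero? n)
        0
        (+ 1 (nat->integer (predecessor n))))))
```

With these definitions, (nat->integer (plus (integer->nat 2) (integer->nat 3))) evaluates to 5. Replacing the first four definitions with ones based on Scheme's built-in numbers should leave plus and the helpers unchanged.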
The abstract data type of natural numbers is far from trivial. For example, the built-in data types of most programming languages can represent only the smallest natural numbers.
int main (int argc, char* argv[]) {
    unsigned int n = 1;
    while (n != 0)
        n = n + 1;
}
In many implementations of C, the loop above terminates in a matter of seconds. Even if int were changed to long int, the loop would probably terminate in a few hundred years.
In Scheme, the corresponding loop would not terminate before our sun goes nova, and that's not just because Scheme is slow. To implement arbitrarily large natural numbers with reasonable efficiency, most implementations of Scheme use at least two different representations for natural numbers. That fact does not matter to Scheme programmers, because Scheme code is representation-independent (at least with respect to natural numbers).
We will use variations of the environment data type throughout this course.
Here is an algebraic specification of the environment ADT:
empty-env : → Env
extend-env : Var × Val × Env → Env
apply-env : Env × Var → Val
(apply-env (extend-env var val env) var) = val
(apply-env (extend-env var1 val env) var2) = (apply-env env var2)   (var1 ≠ var2)
Both empty-env and extend-env are constructors of environments. The only observer is apply-env. An algebraic specification shows only the observable behavior of an abstract data type, so the two equations above both specify how the observer behaves. The only thing that matters about the constructors is how they interact with observers and other operations of the data type.
Notice that the behavior of apply-env on an empty environment is not specified. That means it would be an error for client code to pass an empty environment to apply-env.
We could prove that it would be an error for client code to try to compute (apply-env env x) if x is not bound in env. That proof would proceed by induction on the size of the expression that constructs env.
The environment ADT has two constructors, so every environment can be built by an expression generated by the following grammar:
Env-exp ::= (empty-env)
        ::= (extend-env Identifier Scheme-value Env-exp)
That grammar suggests the following representation of environments:
Env ::= (empty-env)
    ::= (extend-env Var SchemeVal Env)
Var ::= Sym
Putting that representation together with the algebraic specification, we get the following implementation:
;;; empty-env : -> Env
(define empty-env
  (lambda ()
    (list 'empty-env)))

;;; extend-env : Var * SchemeVal * Env -> Env
(define extend-env
  (lambda (var val env)
    (list 'extend-env var val env)))

;;; apply-env : Env * Var -> SchemeVal
(define apply-env
  (lambda (env search-var)
    (cond ((equal? (car env) 'empty-env)
           (report-no-binding-found search-var))
          ((equal? (car env) 'extend-env)
           (let ((saved-var (car (cdr env)))
                 (saved-val (car (cdr (cdr env))))
                 (saved-env (car (cdr (cdr (cdr env))))))
             (if (equal? saved-var search-var)
                 saved-val
                 (apply-env saved-env search-var))))
          (else
           (report-invalid-env env)))))
Here are some simple tests:
> (define env0 (empty-env))
> (define env1 (extend-env 'x 11 env0))
> (define env2 (extend-env 'y 22 env1))
> (define env3 (extend-env 'x 33 env2))
> (apply-env env3 'x)
33
> (apply-env env3 'y)
22
> (apply-env env2 'x) ; environments are immutable
11
The design you have just seen illustrates the following recipe:
- Look at a datum.
- Decide what kind of data it represents.
- Extract the components of the datum and do the right thing.
When an immutable ADT has only one observer, we can represent its values as procedures whose arguments are the other arguments that are passed to the observer.
;;; empty-env : -> Env
(define empty-env
  (lambda ()
    (lambda (search-var)
      (report-no-binding-found search-var))))

;;; extend-env : Var * SchemeVal * Env -> Env
(define extend-env
  (lambda (saved-var saved-val saved-env)
    (lambda (search-var)
      (if (equal? saved-var search-var)
          saved-val
          (apply-env saved-env search-var)))))

;;; apply-env : Env * Var -> SchemeVal
(define apply-env
  (lambda (env search-var)
    (env search-var)))
Do we need to write new tests?
This idea generalizes to immutable ADTs with more than one observer. Instead of representing the values of the ADT as procedures, we would represent them as objects, with one method for each of the observers.
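One way to picture that generalization is sketched below, under the assumption that the environment ADT is given a second observer. The observer has-binding? and the message names 'apply and 'has-binding? are hypothetical additions made for this example; they do not appear in the specification above. Each environment is a procedure that dispatches on a message, which plays the role of an object with one method per observer.

```scheme
;; Sketch only: environments as message-dispatching procedures ("objects").
;; The second observer, has-binding?, is an assumption made for illustration.

;;; empty-env : -> Env
(define empty-env
  (lambda ()
    (lambda (message search-var)
      (cond ((eq? message 'apply)
             (error "No binding for" search-var))
            ((eq? message 'has-binding?) #f)
            (else (error "Unknown message" message))))))

;;; extend-env : Var * SchemeVal * Env -> Env
(define extend-env
  (lambda (saved-var saved-val saved-env)
    (lambda (message search-var)
      (cond ((eq? message 'apply)
             (if (equal? saved-var search-var)
                 saved-val
                 (saved-env 'apply search-var)))
            ((eq? message 'has-binding?)
             (or (equal? saved-var search-var)
                 (saved-env 'has-binding? search-var)))
            (else (error "Unknown message" message))))))

;;; apply-env : Env * Var -> SchemeVal
(define apply-env
  (lambda (env search-var)
    (env 'apply search-var)))

;;; has-binding? : Env * Var -> Bool   (hypothetical second observer)
(define has-binding?
  (lambda (env search-var)
    (env 'has-binding? search-var)))
```

Client code still calls apply-env and has-binding? as ordinary procedures, so it cannot tell that the values are closures rather than lists.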
The simplest programming language we will study is the lambda calculus, whose syntax is specified by
Lc-exp ::= Identifier
       ::= (lambda (Identifier) Lc-exp)
       ::= (Lc-exp Lc-exp)
To manipulate programs written in the lambda calculus, we may regard the syntax of the lambda calculus as an abstract data type with three kinds of constructors (one for each production in the grammar above) and two kinds of observers: predicates and extractors.
var-exp : Var → Lc-exp
lambda-exp : Var × Lc-exp → Lc-exp
app-exp : Lc-exp × Lc-exp → Lc-exp
var-exp? : Lc-exp → Bool
lambda-exp? : Lc-exp → Bool
app-exp? : Lc-exp → Bool
var-exp->var : Lc-exp → Var
lambda-exp->bound-var : Lc-exp → Var
lambda-exp->body : Lc-exp → Lc-exp
app-exp->rator : Lc-exp → Lc-exp
app-exp->rand : Lc-exp → Lc-exp
The design you have just seen illustrates a recipe for designing the interface of a recursive data type:
- Include one constructor for each kind of data.
- Include one predicate for each kind of data.
- Include one extractor for each piece of data that is passed to a constructor of the data type.
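To make the recipe concrete, here is a minimal sketch of the Lc-exp interface in which each expression is represented by the Scheme datum that mirrors its concrete syntax. This representation is an assumption chosen for brevity, and it only works if no identifier is named lambda. The client procedure occurs-free? is a hypothetical example added to show code that uses nothing but the interface.

```scheme
;; Constructors (one per production of the grammar)
(define var-exp (lambda (var) var))
(define lambda-exp
  (lambda (bound-var body) (list 'lambda (list bound-var) body)))
(define app-exp (lambda (rator rand) (list rator rand)))

;; Predicates (one per kind of data)
(define var-exp? (lambda (x) (symbol? x)))
(define lambda-exp?
  (lambda (x) (and (pair? x) (eq? (car x) 'lambda))))
(define app-exp?
  (lambda (x) (and (pair? x) (not (eq? (car x) 'lambda)))))

;; Extractors (one per piece of data passed to a constructor)
(define var-exp->var (lambda (exp) exp))
(define lambda-exp->bound-var (lambda (exp) (car (cadr exp))))
(define lambda-exp->body (lambda (exp) (caddr exp)))
(define app-exp->rator (lambda (exp) (car exp)))
(define app-exp->rand (lambda (exp) (cadr exp)))

;; Hypothetical client: does search-var occur free in exp?
;; It uses only predicates and extractors, never car or cdr.
(define occurs-free?
  (lambda (search-var exp)
    (cond ((var-exp? exp)
           (eq? search-var (var-exp->var exp)))
          ((lambda-exp? exp)
           (and (not (eq? search-var (lambda-exp->bound-var exp)))
                (occurs-free? search-var (lambda-exp->body exp))))
          (else
           (or (occurs-free? search-var (app-exp->rator exp))
               (occurs-free? search-var (app-exp->rand exp)))))))
```

For the expression built by (lambda-exp 'y (app-exp (var-exp 'x) (var-exp 'y))), occurs-free? returns #t for x and #f for y.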
The define-datatype syntax provided by our EOPL language helps to automate the recipe above. Instead of defining the predicates and extractors, however, the define-datatype approach would have us use a cases expression to distinguish between the different kinds of data and to extract subcomponents of values.
(define-datatype lc-exp lc-exp?
  (var-exp
   (var identifier?))
  (lambda-exp
   (bound-var identifier?)
   (body lc-exp?))
  (app-exp
   (rator lc-exp?)
   (rand lc-exp?)))
The data types defined by define-datatype are not abstract if their interface tells clients to use the cases syntax, because that syntax works only if the data type was defined using define-datatype. If the implementation of the data type were changed so it no longer used define-datatype, then client code would stop working. In other words, clients that use the cases syntax are not representation-independent.
On the other hand, we can use define-datatype to implement abstract data types, so long as the interface does not reveal our use of define-datatype.
For example, we can implement the Lc-exp ADT by combining the use of define-datatype above with
;;; var-exp : Var -> Lc-exp
;;; lambda-exp : Var * Lc-exp -> Lc-exp
;;; app-exp : Lc-exp * Lc-exp -> Lc-exp

;;; var-exp? : Lc-exp -> Bool
(define var-exp?
  (lambda (x)
    (and (lc-exp? x)
         (cases lc-exp x
           (var-exp (var) #t)
           (lambda-exp (bound-var body) #f)
           (app-exp (rator rand) #f)))))

;;; lambda-exp? : Lc-exp -> Bool
(define lambda-exp?
  (lambda (x)
    (and (lc-exp? x)
         (cases lc-exp x
           (var-exp (var) #f)
           (lambda-exp (bound-var body) #t)
           (app-exp (rator rand) #f)))))

;;; app-exp? : Lc-exp -> Bool
(define app-exp?
  (lambda (x)
    (and (lc-exp? x)
         (cases lc-exp x
           (var-exp (var) #f)
           (lambda-exp (bound-var body) #f)
           (app-exp (rator rand) #t)))))

;;; var-exp->var : Lc-exp -> Var
(define var-exp->var
  (lambda (x)
    (cases lc-exp x
      (var-exp (var) var)
      (lambda-exp (bound-var body)
        (eopl:error 'var-exp->var "illegal argument"))
      (app-exp (rator rand)
        (eopl:error 'var-exp->var "illegal argument")))))

;;; lambda-exp->bound-var : Lc-exp -> Var
(define lambda-exp->bound-var
  (lambda (x)
    (cases lc-exp x
      (var-exp (var)
        (eopl:error 'lambda-exp->bound-var "illegal argument"))
      (lambda-exp (bound-var body) bound-var)
      (app-exp (rator rand)
        (eopl:error 'lambda-exp->bound-var "illegal argument")))))

;;; lambda-exp->body : Lc-exp -> Lc-exp
(define lambda-exp->body
  (lambda (x)
    (cases lc-exp x
      (var-exp (var)
        (eopl:error 'lambda-exp->body "illegal argument"))
      (lambda-exp (bound-var body) body)
      (app-exp (rator rand)
        (eopl:error 'lambda-exp->body "illegal argument")))))

;;; app-exp->rator : Lc-exp -> Lc-exp
(define app-exp->rator
  (lambda (x)
    (cases lc-exp x
      (var-exp (var)
        (eopl:error 'app-exp->rator "illegal argument"))
      (lambda-exp (bound-var body)
        (eopl:error 'app-exp->rator "illegal argument"))
      (app-exp (rator rand) rator))))

;;; app-exp->rand : Lc-exp -> Lc-exp
(define app-exp->rand
  (lambda (x)
    (cases lc-exp x
      (var-exp (var)
        (eopl:error 'app-exp->rand "illegal argument"))
      (lambda-exp (bound-var body)
        (eopl:error 'app-exp->rand "illegal argument"))
      (app-exp (rator rand) rand))))
The verbosity of the implementation above shows you why we will usually give up on abstract data types when we use define-datatype.
It would not be hard to design a datatype definition facility that resembles define-datatype but is far more convenient for defining abstract data types. If I were the author of our textbook, we would use a different datatype definition facility and would make far greater use of abstract data types. If we were to use abstract data types throughout this course, however, then most of the code in the book wouldn't work. Given the unfortunate choice between following the book and using its concrete data types, versus throwing the book away so we can use abstract data types, we have reluctantly concluded we should follow the book. For a while, at least.
The data types defined by define-datatype may be mutually recursive, which is a feature we'll need for languages more complex than lambda calculus.
Most computer programs are represented as plain text. For example, the following text is a computer program written in Standard ML:
3 + 4 * 5 - 6
Plain text is a convenient representation when you're writing, reading, editing, or printing a program, but it is an extremely inconvenient (and inefficient) representation when you're executing a program. It is also an inconvenient representation when you're trying to describe the semantics (meaning) of a program, or when you're trying to prove that a program has some property, or when you're trying to study the principles of programming languages in general.
When you interpret or compile the plain text representation of a program, the interpreter or compiler starts out by translating the program's text into a more convenient representation: an abstract syntax tree. The rest of the interpreter or compiler operates on the abstract syntax tree, referring to the original plain text representation only in error messages.
For a simple language of arithmetic expressions, the abstract syntax trees might belong to the datatype defined by
(define-datatype arithmetic-exp arithmetic-exp?
  (constant-exp
   (num integer?))
  (addition-exp
   (operand1 arithmetic-exp?)
   (operand2 arithmetic-exp?))
  (subtraction-exp
   (operand1 arithmetic-exp?)
   (operand2 arithmetic-exp?))
  (multiplication-exp
   (operand1 arithmetic-exp?)
   (operand2 arithmetic-exp?))
  (var-exp
   (id symbol?)))
For efficiency, the translation from plain text to abstract syntax trees is usually divided into two steps: scanning and parsing.
The algorithms used for scanning and parsing are beyond the scope of this course, but you can learn about them by taking the compiler course, CS G262.
In this course we will use semi-automatic scanner and parser generators to define a single procedure, usually named scan&parse, that takes a string containing the plain text representation of a program and returns its abstract syntax tree. For example,
(scan&parse "3 + 4 * 5 - 6")
would return the value of the following expression:
(addition-exp
 (constant-exp 3)
 (subtraction-exp
  (multiplication-exp (constant-exp 4) (constant-exp 5))
  (constant-exp 6)))
Unparsing is usually easier than parsing:
(define unparse-arithmetic-exp
  (lambda (exp)
    (cases arithmetic-exp exp
      (constant-exp (n)
        (number->string n))
      (addition-exp (exp1 exp2)
        (string-append (unparse-arithmetic-exp exp1)
                       " + "
                       (unparse-arithmetic-exp exp2)))
      (subtraction-exp (exp1 exp2)
        (string-append (unparse-arithmetic-exp exp1)
                       " - "
                       (unparse-arithmetic-exp exp2)))
      (multiplication-exp (exp1 exp2)
        (string-append (unparse-arithmetic-exp exp1)
                       " * "
                       (unparse-arithmetic-exp exp2)))
      (var-exp (id)
        (symbol->string id)))))
For example:
> (define exp3456
    (addition-exp
     (constant-exp 3)
     (subtraction-exp
      (multiplication-exp (constant-exp 4) (constant-exp 5))
      (constant-exp 6))))
> (unparse-arithmetic-exp exp3456)
"3 + 4 * 5 - 6"
The first programming languages we will study are expression languages. We will use SLLgen grammars to specify the syntax of these languages and the representations of their abstract syntax trees. We will then specify the semantics of these languages by writing interpreters for the abstract syntax trees. These interpreters take an environment as their second argument; the environment records the values of the variables that may appear free within the expression.
(value-of exp ρ) = val
means the value of expression exp in environment ρ should be val.
The source language is the language we are defining, specifying, or implementing. The implementation language (usually Scheme with EoPL extensions) is the language in which we write our interpreters.
The front end of an interpreter or compiler translates the source language into abstract syntax trees. A compiler translates abstract syntax trees into some target language, such as Intel x86-32 machine code or JVM byte code. The abstract syntax trees or target language can then be executed by some interpreter. For example, an Intel Core 2 Duo contains an extremely efficient interpreter for Intel x86-32 machine code:
> (define add1 (lambda (n) (+ n 1)))
> add1
#<PROCEDURE add1>
> (nasm-disassemble add1)
00000000  83FB04            cmp ebx,byte +0x4
00000003  7411              jz 0x16
00000005  C7452C04000000    mov dword [ebp+0x2c],0x4
0000000C  FF9500020000      call near [ebp+0x200]
00000012  90                nop
00000013  90                nop
00000014  EBEA              jmp short 0x0
00000016  F6C103            test cl,0x3
00000019  750A              jnz 0x25
0000001B  89CB              mov ebx,ecx
0000001D  83C304            add ebx,byte +0x4
00000020  710E              jno 0x30
00000022  83EB04            sub ebx,byte +0x4
00000025  89CB              mov ebx,ecx
00000027  B804000000        mov eax,0x4
0000002C  FF551C            call near [ebp+0x1c]
0000002F  90                nop
00000030  C3                ret
0
> (add1 (expt 10 70))
10000000000000000000000000000000000000000000000000000000000000000000001
Our interpreters will not be as efficient as the Intel Core 2 Duo, but they will be much simpler, much easier to build, and much easier to understand.
Scanning divides the plain text of a source program into meaningful substrings called tokens. The tokens are described by a lexical specification.
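As a rough sketch of what a scanner does, the procedure below divides the text of an arithmetic expression into a list of tokens. It is a hand-written illustration, not the scanner produced by SLLgen. The name scan and the token representation (Scheme integers for numbers, Scheme symbols for the operators +, -, and *) are assumptions made for this example.

```scheme
;;; scan : String -> Listof(Token)     (hypothetical name and token format)
;;; Tokens are integers for numerals and symbols for the operators + - *.
(define scan
  (lambda (text)
    (let loop ((chars (string->list text)) (tokens '()))
      (cond ((null? chars)
             (reverse tokens))
            ;; Whitespace separates tokens but is not itself a token.
            ((char-whitespace? (car chars))
             (loop (cdr chars) tokens))
            ;; A numeral is a maximal run of digits.
            ((char-numeric? (car chars))
             (let digits ((rest chars) (acc '()))
               (if (and (pair? rest) (char-numeric? (car rest)))
                   (digits (cdr rest) (cons (car rest) acc))
                   (loop rest
                         (cons (string->number (list->string (reverse acc)))
                               tokens)))))
            ;; Each operator is a one-character token.
            ((memv (car chars) '(#\+ #\- #\*))
             (loop (cdr chars)
                   (cons (string->symbol (string (car chars))) tokens)))
            (else
             (error "scan: unexpected character" (car chars)))))))
```

For example, (scan "3 + 4 * 5 - 6") returns a list of seven tokens: the numbers 3, 4, 5, and 6 interleaved with the symbols +, *, and -. A parser would then consume that list instead of the raw characters.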
Parsing translates the sequence of tokens into an abstract syntax tree. The syntactically legal sequences of tokens are described by the source language's grammar.
A parser generator is a program whose inputs include a lexical specification, a grammar, and a description of the abstract syntax trees to be constructed for each production of the grammar. The main outputs of the parser generator are a scanner and parser.
We will use the SLLgen parser generator for most of this course. For MP3, however, the file mp3-data-structures.scm will contain a hand-written scanner and a complete parser that was generated by a different parser generator. This is just to show you what a scanner and parser look like. In future assignments, where the scanners and parsers will be more complicated, you will see the lexical specifications and the grammars but will not see the scanners and parsers built from them.
The main thing to remember is that scan&parse takes a string containing the plain text representation of a program, and returns the abstract syntax tree for that program.
Program    ::= Expression                                        a-program (exp1)
Expression ::= Number                                            const-exp (num)
           ::= -(Expression , Expression)                        diff-exp (exp1 exp2)
           ::= zero? (Expression)                                zero?-exp (exp1)
           ::= if Expression then Expression else Expression     if-exp (exp1 exp2 exp3)
           ::= Identifier                                        var-exp (var)
           ::= let Identifier = Expression in Expression         let-exp (var exp1 body)
For example,
(scan&parse "let x = 4 in -(x,-(1,x))")
evaluates to the abstract syntax tree that is the result of
(a-program
 (let-exp 'x
          (const-exp 4)
          (diff-exp (var-exp 'x)
                    (diff-exp (const-exp 1) (var-exp 'x)))))
For any programming language, the expressed values are the possible values of an expression, and the denoted values are the values to which a variable can be bound in some environment.
For LET, the expressed and denoted values happen to be the same:
ExpVal = Int + Bool
DenVal = Int + Bool
The expressed and denoted values will be abstract data types with this algebraic specification:
num-val : Int → ExpVal
bool-val : Bool → ExpVal
expval->num : ExpVal → Int
expval->bool : ExpVal → Bool

(expval->num (num-val n)) = n
(expval->bool (bool-val b)) = b
We use the following abbreviations:
ρ ranges over environments
[] denotes the empty environment
[var = val]ρ denotes (extend-env var val ρ)
[var = val] denotes [var = val][]
const-exp : Int → Exp
zero?-exp : Exp → Exp
if-exp : Exp × Exp × Exp → Exp
diff-exp : Exp × Exp → Exp
var-exp : Symbol → Exp
let-exp : Symbol × Exp × Exp → Exp
value-of : Exp × Env → ExpVal
(value-of (const-exp n) ρ) = (num-val n)
(value-of (var-exp var) ρ) = (apply-env ρ var)
(value-of (diff-exp exp1 exp2) ρ) = (num-val (- (expval->num (value-of exp1 ρ)) (expval->num (value-of exp2 ρ))))
For LET, specifying the behavior of programs amounts to specifying the initial environment. For most programming languages, the initial environment consists of a standard set of predefined libraries that every implementation of the language is supposed to provide. For LET, we'll mimic that by providing three predefined identifiers.
(value-of-program exp) = (value-of exp ρ0)
where ρ0 = [i=1,v=5,x=10]
(value-of exp1 ρ) = val1    (expval->num val1) = 0
------------------------------------
(value-of (zero?-exp exp1) ρ) = (bool-val #t)

(value-of exp1 ρ) = val1    (expval->num val1) = n    n ≠ 0
------------------------------------
(value-of (zero?-exp exp1) ρ) = (bool-val #f)

(value-of exp1 ρ) = val1    (expval->bool val1) = #t
----------------------------------------------------
(value-of (if-exp exp1 exp2 exp3) ρ) = (value-of exp2 ρ)

(value-of exp1 ρ) = val1    (expval->bool val1) = #f
----------------------------------------------------
(value-of (if-exp exp1 exp2 exp3) ρ) = (value-of exp3 ρ)

let
(value-of exp1 ρ) = val1
------------------------------------
(value-of (let-exp var exp1 body) ρ) = (value-of body [var=val1]ρ)
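The rules above can be transcribed almost line for line into an interpreter. The following is a minimal sketch, not the course's implementation: it assumes abstract syntax trees and expressed values are tagged lists (the course uses define-datatype instead), it represents environments as procedures, and it assumes value-of-program receives the tree built by a-program. The name init-env is hypothetical.

```scheme
;; Abstract syntax as tagged lists (an assumption made for portability).
(define a-program (lambda (exp1) (list 'a-program exp1)))
(define const-exp (lambda (num) (list 'const-exp num)))
(define diff-exp (lambda (exp1 exp2) (list 'diff-exp exp1 exp2)))
(define zero?-exp (lambda (exp1) (list 'zero?-exp exp1)))
(define if-exp (lambda (exp1 exp2 exp3) (list 'if-exp exp1 exp2 exp3)))
(define var-exp (lambda (var) (list 'var-exp var)))
(define let-exp (lambda (var exp1 body) (list 'let-exp var exp1 body)))

;; Expressed values: ExpVal = Int + Bool
(define num-val (lambda (n) (list 'num-val n)))
(define bool-val (lambda (b) (list 'bool-val b)))
(define expval->num
  (lambda (val)
    (if (eq? (car val) 'num-val)
        (cadr val)
        (error "expval->num: not a number" val))))
(define expval->bool
  (lambda (val)
    (if (eq? (car val) 'bool-val)
        (cadr val)
        (error "expval->bool: not a boolean" val))))

;; Environments, represented as procedures.
(define empty-env
  (lambda () (lambda (search-var) (error "No binding for" search-var))))
(define extend-env
  (lambda (var val env)
    (lambda (search-var)
      (if (eq? var search-var) val (env search-var)))))
(define apply-env (lambda (env var) (env var)))

;;; value-of : Exp * Env -> ExpVal   (one clause per rule)
(define value-of
  (lambda (exp env)
    (case (car exp)
      ((const-exp) (num-val (cadr exp)))
      ((var-exp) (apply-env env (cadr exp)))
      ((diff-exp)
       (num-val (- (expval->num (value-of (cadr exp) env))
                   (expval->num (value-of (caddr exp) env)))))
      ((zero?-exp)
       (bool-val (= 0 (expval->num (value-of (cadr exp) env)))))
      ((if-exp)
       (if (expval->bool (value-of (cadr exp) env))
           (value-of (caddr exp) env)
           (value-of (cadddr exp) env)))
      ((let-exp)
       (value-of (cadddr exp)
                 (extend-env (cadr exp)
                             (value-of (caddr exp) env)
                             env)))
      (else (error "value-of: bad expression" exp)))))

;; Initial environment [i=1,v=5,x=10]; init-env is a hypothetical name.
(define init-env
  (lambda ()
    (extend-env 'i (num-val 1)
      (extend-env 'v (num-val 5)
        (extend-env 'x (num-val 10) (empty-env))))))

;;; value-of-program : Program -> ExpVal
(define value-of-program
  (lambda (pgm) (value-of (cadr pgm) (init-env))))
```

Applied to the tree for "let x = 4 in -(x,-(1,x))", value-of-program returns the expressed value for 7, because 4 - (1 - 4) = 7.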
Last updated 30 January 2008.