```
def fact(x : int) : int {
  if x == 0 then 1
  else if x == 1 then 1
  else x * fact(x - 1)
}
fact(4)
```
The code above defines a Grumpy function `def fact(x : int) : int { ... }` that computes the factorial of x. The final line of the program calls fact on the integer 4, with result 24. You'll notice that function-call syntax in Grumpy follows C style rather than OCaml style.
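For comparison, here's how the same function might look in OCaml (an illustrative sketch only, not part of the assignment code); note the OCaml-style call `fact 4` versus Grumpy's C-style `fact(4)`:

```ocaml
(* OCaml analogue of the Grumpy factorial above (illustration only) *)
let rec fact (x : int) : int =
  if x = 0 then 1
  else if x = 1 then 1
  else x * fact (x - 1)

let () = print_int (fact 4)  (* prints 24 *)
```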
In general, every Grumpy program consists of a number of function definitions followed by an expression (which may call one or more of the defined functions).
```
def f(...) : ... { ... }
def g(...) : ... { ... }
...
def z(...) : ... { ... }
... some result expression here ...
```
Here's a second example that illustrates a couple of additional Grumpy features: mutable references and variable scope:
```
def f(x:int, y:bool) : int {  // -+ x and y go into scope
  let z = ref x in            //  | z goes into scope
  {                           //  |
    let w = !z in             // ---+ w goes into scope
    z := w + 1                //  |  |
  };                          // ---+ w goes out of scope
  !z + 1                      //  |
}                             // -+ x, y, and z go out of scope
f(3, false)
```
The code above defines a function f that takes an int x and a bool y as arguments and returns an integer (type int). The first line of the body (let z = ref x in) defines a let-bound mutable reference z initialized to x.
The function's second line introduces a new block scope with braces { ... }. What's the effect of this scope? Any let-bound variables we declare inside it won't be accessible outside the { ... } (for example, it would be illegal to refer to w in the expression !z + 1, by rewriting it to something like !z + w).
The overall result of the program is 5: reference z is initialized to 3; variable w equals 3, so the update z := w + 1 sets z to 3 + 1 = 4; finally, the result of the function is the last sequenced expression in its body, !z + 1 = 4 + 1 = 5.
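Since Grumpy's references mirror OCaml's, the program above can be transliterated almost directly into OCaml (a sketch for building intuition, not part of the assignment code):

```ocaml
(* OCaml transliteration of the Grumpy scope/reference example *)
let f (x : int) (_y : bool) : int =
  let z = ref x in
  (let w = !z in
   z := w + 1);   (* w goes out of scope after this parenthesized block *)
  !z + 1

let () = print_int (f 3 false)  (* prints 5 *)
```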
```
$ tar xzvf a3.tgz
```
In the resulting directory src you'll find the following file structure:
```
src/            -- compiler source files
  Makefile      -- the project Makefile
  _tags         -- the tags file for ocamlbuild
  AST.mli       -- language-independent abstract syntax stuff
  AST.ml        -- associated helper functions
  exp.mli       -- the definition of Grumpy's abstract syntax
  exp.ml        -- associated functions
  lexer.mll     -- ocamllex source file (Part 2)
  parser.mly    -- Menhir source file (Part 3)
  grumpy.ml     -- the toplevel compiler program
  tests/        -- test cases
```
To build the project, type
```
$ make
```
At this point, you may see a bunch of warnings of the form
```
...
File "parser.mly", line 13, characters 15-20:
Warning: the token WHILE is unused.
Finished, 22 targets (0 cached) in 00:00:00.
```
That's OK -- it's just Menhir telling you that the token WHILE (defined in parser.mly), and likewise all the other token kinds, are unused.
Now try running
```
$ make test
```
The tests won't pass yet, of course (you haven't yet completed the assignment), so at this point you'll see a bunch of error messages of the form:
```
$ make test
ocamlbuild -use-menhir -use-ocamlfind grumpy.native
Finished, 22 targets (22 cached) in 00:00:00.
cd tests && ./run.sh
test01-unary-negation.gpy:1:2: Unexpected char: -
*** test01-unary-negation.gpy FAILED ***
test02-boolean-negation.gpy:1:2: Unexpected char: n
*** test02-boolean-negation.gpy FAILED ***
... followed by many more ...
```
To run the tests manually, you can do ./run.sh from within the tests directory. Within that directory, you'll also find a bunch of sample Grumpy programs, for example:
```
...
test50-fractal.gpy
test50-fractal.gpy.expected
test51-loopref.gpy
test51-loopref.gpy.expected
```
Each Grumpy source program (extension .gpy) is paired with a second file (extension .expected) that gives that program's expected output. You won't use the expected output in this assignment (you're just lexing and parsing), but the output files may be useful for understanding what each program does.
Start by opening lexer.mll, then navigate to the (mostly empty) definition of rule token. You'll see that, initially, it contains only one rule:
```
rule token = parse
  | _ { raise (Syntax_err ("Unexpected char: " ^ Lexing.lexeme lexbuf)) }
```
As it stands, no matter what initial character the input file contains, token always raises a syntax exception "Unexpected char: ...". The wildcard "_" is the catch-all pattern; the code within the braces beginning raise (Syntax_err ...) defines the action to perform in this case (raise an exception).
In general, each rule in the definition of token pairs a regular expression (on the left) with a chunk of OCaml code (in braces on the right). For example, the following few rules
```
rule token = parse
    "//" { comment lexbuf }
  | ['0'-'9']+ as lxm { INTCONST(Int32.of_string lxm) }
  | ...

and comment = parse
  | ... { ... do something ... }
  | ...
```
These rules lex comments (the definition of the comment rule is elided above -- you'll have to implement it) and 32-bit integer constants. The comment rule is defined mutually recursively with token -- you're free to add as many additional mutually recursive rules as you like. Note that when comment is called, we pass it the special argument lexbuf, which gives the current state of the lexer buffer.
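To give a feel for the shape such a rule might take, here's one possible sketch, assuming // comments extend to the end of the line (the details, including end-of-file handling, are up to you):

```
and comment = parse
  | '\n' { token lexbuf }    (* end of line ends the comment; resume lexing *)
  | _    { comment lexbuf }  (* skip any other character *)
```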
In the regexp-style pattern
```
['0'-'9']+ as lxm
```
lxm is bound to the string that matches the regexp ['0'-'9']+ at lex time (that is, a nonempty string of characters 0, 1, ..., 9), and can be used within the braces on the right-hand side of the rule. For example, INTCONST(Int32.of_string lxm) returns an INTCONST token (standing for "integer constant") containing the integer interpretation of lxm (Int32.of_string lxm converts lxm to the corresponding 32-bit integer, e.g., Int32.of_string "45" = 45l).
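You can try the Int32 conversion in isolation, in plain OCaml, independent of the lexer:

```ocaml
(* Converting a lexed digit string to a 32-bit integer *)
let () =
  let lxm = "45" in
  let i = Int32.of_string lxm in    (* i : int32 = 45l *)
  print_string (Int32.to_string i)  (* prints 45 *)
```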
The definition of the tokens themselves is given in parser.mly. Here are the first few:
```
%token <int32> INTCONST
%token <float> FLOATCONST
%token <bool> BOOLCONST
%token <string> ID
%token DEF LET WHILE IF THEN ELSE REF INT FLOAT BOOL UNIT TT IN
```
The first four declarations define token types that carry values of OCaml types. For example, %token <string> ID defines a new token type ID that contains OCaml strings. The last line defines a bunch of token types that carry no OCaml data.
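Behind the scenes, Menhir turns these declarations into an OCaml variant type roughly like the following (a simplified sketch, with the describe helper added purely for illustration; the exact generated type is Menhir's business):

```ocaml
(* Simplified sketch of the token type Menhir generates *)
type token =
  | INTCONST of int32         (* carries an OCaml int32 *)
  | FLOATCONST of float
  | BOOLCONST of bool
  | ID of string              (* carries an OCaml string *)
  | DEF | LET | WHILE | IF    (* no attached data *)

(* Hypothetical helper showing how attached data can be inspected *)
let describe (t : token) : string =
  match t with
  | INTCONST n -> "INTCONST(" ^ Int32.to_string n ^ ")"
  | ID s -> "ID(" ^ s ^ ")"
  | _ -> "keyword token"
```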
```
...
and nested_comment level = parse
  | ...
```
...that takes a "nesting level" as its first argument....
```
| ['0'-'9']+ as lxm { INTCONST(Int32.of_string lxm) }
```
but as:
```
| ['0'-'9']+ as lxm
    { print_string "INT(";
      let i = Int32.of_string lxm in
      print_int (Int32.to_int i);
      print_string ")";
      INTCONST(i) }
```
and likewise for the other rules you add (the debug print statements for tokens that don't carry OCaml values will be less complicated).
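If the inline printing gets noisy, one option (purely a suggestion, not required by the assignment) is to factor it into a small polymorphic helper and call that from each rule's action:

```ocaml
(* Print a label, then return the token unchanged *)
let debug (label : string) (tok : 'a) : 'a =
  print_string label;
  tok
```

A data-free rule's action then becomes, e.g., { debug "IF " IF }.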
```
(** Is type [t] an arithmetic type? *)
val is_arith_ty : ty -> bool
```
This declaration (and all other such declarations) must be accompanied by a corresponding function definition in AST.ml.
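As an illustration of the expected shape, here's a sketch (the ty constructor names below are hypothetical; pattern-match on the actual constructors declared in AST.mli):

```ocaml
(* Hypothetical ty definition; see AST.mli for the real one *)
type ty = TInt | TFloat | TBool | TUnit

(* Is type [t] an arithmetic type? *)
let is_arith_ty (t : ty) : bool =
  match t with
  | TInt | TFloat -> true
  | TBool | TUnit -> false
```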
In the assignment code, interface files AST.mli and exp.mli define the abstract syntax of the Grumpy source language. The Grumpy lexer and parser convert concrete Grumpy programs into values of type (AST.ty, unit Exp.exp) AST.prog.
In general, your job in this file is to add a number of new nonterminal rules to the Menhir grammar, corresponding to the Grumpy syntax given in the Grumpy spec.
Each such rule will look something like the following:
```
unop:
  | MINUS { UMinus }
  | NOT   { UNot }
  | DEREF { UDeref }
```
This defines a new nonterminal called unop (unary operation) with 3 productions, one for each unary operation in the language. The rule
```
| MINUS { UMinus }
```
says that the token MINUS is an acceptable unary operator; when a MINUS is parsed, the rule returns, as defined by the code within the braces { ... }, the abstract syntax UMinus. The whole abstract syntax of unary and binary operations (and of identifiers, function definitions, and whole programs) is given in file AST.mli. That file is quite heavily documented; see it for additional details.
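On the OCaml side, the unop abstract syntax is just a variant type, so a semantic action like { UMinus } simply returns a constructor. The sketch below illustrates the idea; the pretty-printer and the concrete spellings it assumes are hypothetical (check the Grumpy spec and AST.mli for the real ones):

```ocaml
(* The unop variant, per the productions above: UMinus, UNot, UDeref *)
type unop = UMinus | UNot | UDeref

(* Hypothetical helper mapping each unop back to assumed concrete syntax *)
let string_of_unop (u : unop) : string =
  match u with
  | UMinus -> "-"
  | UNot   -> "not"   (* assumed spelling of boolean negation *)
  | UDeref -> "!"
```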
Menhir rules can be more complicated. Here's a second example:
```
exp_list:
  | l = separated_list(COMMA, exp) { l }
```
Assuming we've defined a nonterminal rule for expressions, called exp, this rule defines a new nonterminal exp_list that parses lists of expressions separated by COMMA tokens.
In the rule, the result of parsing this list (a list of abstract syntax expressions) is bound to variable l, which may then appear within the braces { ... }. In this case, our exp_list rule just returns l. However, in general, a rule might do more interesting things with intermediate expressions. For more information on Menhir's special separated_list function (and other useful functions), see Section 5.4 of the Menhir manual, which describes Menhir's "Standard Library" (a collection of useful, pre-defined parsing functions).
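For intuition, separated_list(COMMA, exp) behaves roughly like the following hand-written pair of nonterminals (a sketch only; in practice just use the standard-library version):

```
exp_list:
  | (* empty *)                           { [] }
  | l = exp_list_nonempty                 { l }

exp_list_nonempty:
  | e = exp                               { [e] }
  | e = exp COMMA l = exp_list_nonempty   { e :: l }
```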
Here's one more:
```
arg:
  | arg_id = id COLON arg_ty = mytype { mk_tid arg_id arg_ty }
```
This rule defines a nonterminal arg that parses function parameters of the form id COLON ty, e.g., x : int. It assumes we've already defined rules for the nonterminals id and mytype. Within the braces, we return the expression mk_tid arg_id arg_ty, which constructs the abstract syntax corresponding to a typed identifier (an identifier together with its type; see AST.mli for details).
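Conceptually, mk_tid just pairs an identifier with its type; something like the following (a hypothetical model with made-up field names; the real definition and its types are in AST.mli):

```ocaml
(* Hypothetical model of typed identifiers; see AST.mli for the real one *)
type id = string
type 'ty tid = { tid_id : id; tid_ty : 'ty }

let mk_tid (x : id) (t : 'ty) : 'ty tid =
  { tid_id = x; tid_ty = t }
```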
To encode precedence and associativity in Menhir, add precedence and associativity directives to the top of your parser.mly file (before the %%) as documented in Section 4 of the Menhir manual.
For example, the following pair of directives encodes that TIMES and DIV bind tighter than PLUS and MINUS:
```
%left PLUS MINUS
%left TIMES DIV
```
Precedence directives lower in the file (at higher line numbers) have higher priority (bind tighter) than those appearing earlier. The %left indicates left associativity.
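The effect on parse trees can be illustrated with a toy expression type: with TIMES and DIV declared below (tighter than) PLUS and MINUS, an input like 1 + 2 * 3 parses as an addition whose right child is the multiplication. The type and evaluator below are illustrative only, not part of the assignment:

```ocaml
(* Toy expression type for illustrating precedence *)
type exp =
  | EInt of int
  | EPlus of exp * exp
  | ETimes of exp * exp

(* With %left TIMES DIV below %left PLUS MINUS, "1 + 2 * 3" parses as: *)
let parsed = EPlus (EInt 1, ETimes (EInt 2, EInt 3))

let rec eval (e : exp) : int =
  match e with
  | EInt n -> n
  | EPlus (a, b) -> eval a + eval b
  | ETimes (a, b) -> eval a * eval b

let () = assert (eval parsed = 7)  (* not (1 + 2) * 3 = 9 *)
```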
```
$ make test
```
We won't grade additional test cases, but you're very welcome to add some if you like. If you come up with what you think are particularly nasty tests (e.g., exploiting corner cases), please email them to me (gstewart) or to Sam.