C Compiler

McGill University 2021

Created a C Compiler, in Java from scratch. This included lexer, parser, AST generator, semantic analyzer, and regalloc. This was an individual project. Some boilerplate code by Christophe Dubach.

Sadly, as this is for a class at McGill, and the course content is static, it is not open source.

Takeaways

A compiler is actually just a massive collection of very straightforward things. There weren't too many crazy algorithms or bits of Java syntax required.

Writing such a large scale program myself gave me practice with abstracting and focusing on different parts of a program at a time, and learning where to look when debugging. In contrast to a larger, group project where its probably someone else's fault if something was broken outside of the section I'm currently working in, here I had to switch gears if I discovered a bug in something previously written.

Since register allocation is really just graph building and coloring, I got lots of hands on experience with graph related algorithms. It was also nice to apply theoretical syntax, semantic, and typing rules I've encountered in other classes (although can be much tricker than they are here).

This project was a big confidence builder in my coding abilities.
Obviously, I learned a lot about how Compilers function behind the scenes.
Surprisingly, I did not learn anything new about C, since I was not actually writing the compiler in C.

What's Going On

If you're not familiar with compiler design, Wikipedia is a good place to start

Lexer

By far the most simple part of the compiler, the Lexer generates tokens from input text.

The Tokenizer was basically a long switch statement with checks: whitespace had to be ignored, and comment strings recognized.

The Token object contains additional information, like position and character type (char, int)

A simple Scanner was created, feeding data into the Tokenizer, which then created Token objects

Parser

Verify the syntax of the EBNF Grammar representing a subset of C - "Mini C"

The grammar was converted to an LL-k grammar. As a non-trivial part of the project, it is omitted.

A recursive descent parser was used to ensure operator precedence using Kleene closures

This code does all the parsing for a program. Remember that in C, all execution is done through a main() function

The parser checks syntax and stores parsed information as objects in an Abstract Syntax Tree

AST

Abstract grammar for Mini C in EBNF

All Abstract Syntax Tree (AST) nodes are represented with Java classes. A single Program object is the root of an AST. These Java classes also contain fields with semantic information.

Once constructed, the Visitor design pattern was used on the AST.

A printer was implemented to display the AST of a program
Semantic analysis was also done with a Visitor

The following analysis was done on the AST:

  • Name Analysis: Ensure that the scoping and visibility rules of the language are respected.
  • Global and Local Scopes: Identifiers declared in the global scope can be accessed anywhere in the program, A variable declared in a block is accessible in the block and all inner blocks of that block, but not accessible outside the block. In both cases, it is illegal to declare twice the same identifiers in the same current block.
  • Shadowing: An identifier declared within a given scope has the same name as an identifier declared in an outer scope.
  • Built-in functions: void print_s(char* s); void print_i(int i); void print_c(char c); char read_c(); int read_i(); void* mcmalloc(int size);
  • Type Analysis: Using formal typing rules for Mini C to perform static type-checking.
  • L-value checking: lvalues (left-values) are the only expressions that can appear on the left-hand side of an assignment or as an argument to the address-of operator (&).

Code Generation

After verifying the semantics, code needs to be generated to execute the program. The program is compiled into MIPS in order to use MARS, a cross platform MIPS simulator used for educational purposes.

We will now produce temporary MIPS code, using virtual registers. Virtual registers are representations of real registers, but there is no limit to how many can be used. They can be naively interacted with by saving/loading their contents onto the stack everytime they are used, however this is very slow.

Global variables are statically stored in memory using the MIPS .data directive.
Local variables are stored in memory on the stack, using $sp manipulation.

Data padding is an important consideration, as `sizeof(char)==1` `sizeof(int)==4` `sizeof(int*)==4`, but all fields must align at a 4 byte (32bit) boundary.

Branching is done by converting conditionals into a combination of j b and bne instructions.

Struct/Array access and assignment is done using lw and sw instructions and considering an offset.

Function calls are done using a j label instruction, and pushing pushing $s $rp onto the stack so they are not overwritten

Stdlib functions are performed using MIPS syscall

Local, static variables (that never have an adressofoperation called on them), are represented as virtual registers to allow them to be optimized in register allocation. Only the variables that need to be in memory are ever in memory, the ones that can be handled through registers are and the stack is avoided when possible.

Register Allocation

Now that we have a complete MIPS program using virtual registers, we need to allocate real registers to our program. MIPS provides 18 variable registers $s0-s9 $t0-t7, where s-registers are guaranteed to be saved after a funcall. MIPS also includes system registers $sp $ip $mul etc

In order to keep track of what is being executed when, a Control-Flow-Graph (CFG) is created.

Liveness Analysis is performed on the CFG, with each node of the CFG keeping a liveIn and liveOut set This information is then used to create an Interference Graph, where edges represent a point in the program execution where both virtual registers are alive at the same time.

Note that the CFG nodes are instructions, the Interference Graph nodes are virtual registers. Big difference!

Finally I implemented Chaitin's Algorithm for Graph Colouring, which includes a spill function, where if we were unable to properly colour the interference graph, spill is saved onto the stack.