Phases of a Compiler

Course: Compilers & Interpreters
Semester: S1 2022

0.0 - Style Guide

Blue Text is reserved for definitions

Red Text is reserved for errors

Purple Text is reserved for things that may be referred to later.

1.0 - Course Profile

  • Requires COMP3506: to represent the state of our compiler (among other things) we use an abstract syntax tree
    • COMP3506 as a prerequisite also implies that you’ve done CSSE1001 and CSSE2002

2.0 - Phases of a Compiler

1_Phases_of_a_Compiler-Ian_Hayes_2019.pdf

  • In the figure to the right, the phases of a compiler are represented by the blue boxes.
  • The yellow boxes are the inputs and outputs to the compiler.


  • Our first input to the compiler is the Source Code, which we will treat as a text file or a sequence of characters.
  1. We then perform lexical analysis, that is, we split the text file up into lexical tokens such as:
    • Keywords: if, then, else
    • Symbols: brackets and operators, plus numbers and identifiers (variable names)
  2. We then perform syntax analysis, which is effectively the parsing phase of compilation
    • Identify the structure of our program based on the lexical tokens obtained in the previous step
    • Declarations defined in the code + predefined declarations are stored in the Symbol Table.
    • In this phase, we also identify any syntax errors the programmer has made
  3. We then perform type checking (static semantic analysis)
    • Resolve all of the types in the symbol table
    • Update the abstract syntax tree with type information and any coercions that need to occur
    • In this phase, we also identify any type errors the programmer has made.
  4. We then perform code generation which generates code for the target machine / architecture.

2.1 - Lexical Analysis

🌱 Given a series of characters representing a program, generate a series of lexical tokens.

  • Input: A sequence of characters representing a program
  • Output: A sequence of lexical tokens
  • Lexical tokens: identifiers, numbers, keywords (if, while), symbols
  • Ignores whitespace: blanks, tabs, newlines, carriage returns, form feeds
  • Comments are treated as whitespace
  • Whitespace is still important as it is used to delimit lexical tokens.
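The bullet points above can be sketched as a small tokenizer. The following Python sketch is illustrative only, not the course's implementation: the rule set is abbreviated, Python's leftmost-first alternation stands in for the longest-match ordering (so ":=" must be listed before ":"), and keywords are recognised by post-checking identifiers.

```python
import re

# Illustrative token rules: (TOKEN_NAME, regex). Order matters because
# Python's "|" alternation is leftmost-first, not longest-match.
TOKEN_SPEC = [
    ("WHITESPACE", r"[ \t\r\n\f]+"),          # delimits tokens, then discarded
    ("COMMENT",    r"//[^\n]*"),              # comments are treated as whitespace
    ("NUMBER",     r"[0-9]+"),                # Digit Digit*
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),  # Alpha (Alpha | Digit)*
    ("ASSIGN",     r":="),
    ("NEQUALS",    r"!="),
    ("LEQUALS",    r"<="),
    ("LESS",       r"<"),
    ("PLUS",       r"\+"),
    ("MINUS",      r"-"),
    ("SEMICOLON",  r";"),
]
KEYWORDS = {"if", "then", "else", "begin", "end", "while", "do", "call"}

MASTER = re.compile("|".join(f"(?P<{n}>{r})" for n, r in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token_name, lexeme) pairs; ILLEGAL for junk."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:                          # no rule matched this character
            tokens.append(("ILLEGAL", text[pos]))
            pos += 1
            continue
        pos = m.end()
        if m.lastgroup in ("WHITESPACE", "COMMENT"):
            continue                           # skipped, never emitted
        name, lexeme = m.lastgroup, m.group()
        if name == "IDENTIFIER" and lexeme in KEYWORDS:
            name = "KW_" + lexeme.upper()      # "begin" -> KW_BEGIN, but
        tokens.append((name, lexeme))          # "beginx" stays IDENTIFIER
    return tokens
```

For example, `tokenize("i := i + 1")` yields IDENTIFIER, ASSIGN, IDENTIFIER, PLUS, NUMBER pairs, while an unmatched character becomes ILLEGAL.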

2.2 - Syntax Analysis

🌱 Given a series of lexical tokens, generate an abstract syntax tree and a symbol table.

  • Input: A series of lexical tokens
  • Output: An abstract syntax tree and symbol table
  • Symbol table
    • Contains important information about all identifiers that are defined within the program (plus a few predefined ones)
    • May be organised into scopes e.g. identifiers defined within a procedure (may be organised hierarchically)
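The hierarchical scope organisation can be sketched as a chain of dictionaries. The class and field names here are illustrative assumptions, not the course's implementation:

```python
# A hierarchically scoped symbol table, sketched as a chain of dictionaries:
# each scope maps identifiers to declaration info and links to its parent.
class Scope:
    def __init__(self, parent=None):
        self.parent = parent       # enclosing scope, or None at the top level
        self.entries = {}          # identifier -> declaration information

    def define(self, name, info):
        self.entries[name] = info  # a redefinition masks any outer declaration

    def lookup(self, name):
        """Search this scope first, then each enclosing scope outward."""
        scope = self
        while scope is not None:
            if name in scope.entries:
                return scope.entries[name]
            scope = scope.parent
        return None                # undeclared identifier

# Usage: a program-level `res`, then a procedure scope that masks it
# (mirroring the PL0 example later in these notes).
program = Scope()
program.define("res", {"kind": "var", "type": "int"})
sum_scope = Scope(parent=program)
sum_scope.define("res", {"kind": "var", "type": "int"})
sum_scope.define("i", {"kind": "var", "type": "S"})
```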

2.3 - Type Checking (Static Semantic Analysis)

  • Checking for type and other static semantic errors
  • Input: Symbol table and abstract syntax tree
  • Output: Updated symbol table and abstract syntax tree
  • Resolves all references to identifiers
    • Updates the symbol table entries with type information
    • Checks abstract syntax tree for type correctness
    • Updates abstract syntax tree with type coercions
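A toy version of these steps, under assumed tuple-shaped tree nodes and a single subrange type S (both illustrative, not the course's code): type correctness is checked bottom-up, and an explicit coercion node is inserted where a subrange value is used as an int.

```python
# Sketch of type checking with coercion insertion (illustrative assumptions).
SUBRANGES = {"S"}  # subrange types are widened to "int" in arithmetic

def check(node, symtab):
    """Return (typed_node, type); raise TypeError on a static semantic error."""
    kind = node[0]
    if kind == "num":
        return node, "int"
    if kind == "var":
        if node[1] not in symtab:
            raise TypeError(f"undeclared identifier {node[1]!r}")
        return node, symtab[node[1]]
    # binary arithmetic: both operands must end up as int
    left, lt = check(node[1], symtab)
    right, rt = check(node[2], symtab)
    if lt in SUBRANGES:
        left, lt = ("widen", left), "int"   # insert explicit coercion node
    if rt in SUBRANGES:
        right, rt = ("widen", right), "int"
    if lt != "int" or rt != "int":
        raise TypeError(f"operator {kind!r} applied to non-int operand")
    return (kind, left, right), "int"
```

For example, checking `x + 1` where `x` has subrange type S returns the tree with a `("widen", …)` node wrapped around `x`, while `b + 1` for a boolean `b` raises a type error.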

2.4 - Code Generation

  • Input: Symbol table and updated abstract syntax tree
  • Output: Code for the target machine
  • May include
    • Machine-independent optimisation
    • Machine-dependent optimisations
    • Instruction selection
    • Register allocation
  • At this point, we know that the syntax and static semantics of our program (source code) are correct.
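Instruction selection can be sketched against an assumed stack-machine target (the opcodes here are made up for illustration; a real target architecture differs): a postorder walk of the typed tree emits code that leaves each subexpression's value on the stack.

```python
# Sketch of instruction selection for expressions on a stack machine
# (illustrative opcodes, not a real target).
def gen(node):
    kind = node[0]
    if kind == "num":
        return [("LOAD_CON", node[1])]   # push a constant
    if kind == "var":
        return [("LOAD_VAR", node[1])]   # push a variable's value
    opcode = {"+": "ADD", "-": "SUB", "*": "MUL"}[kind]
    return gen(node[1]) + gen(node[2]) + [(opcode,)]

# res + i: push both operands, then add.
code = gen(("+", ("var", "res"), ("var", "i")))
```

Register allocation and the optimisation passes listed above would further transform this naive instruction sequence.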

3.0 - Interpreters vs Compilers

  • Interpreters are essentially the same as compilers, but they don’t perform the last step of code generation; the front end’s outputs (the symbol table and abstract syntax tree) are executed directly.

3.1 - Interpreter

  • Input: Symbol table and updated abstract syntax tree
  • Interprets the abstract syntax tree directly to execute the program
    • The program being interpreted may have inputs and outputs
  • Less time compiling (no code generation)
  • Slower to execute the program
  • More commonly used for high-level dynamically typed languages.
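Direct interpretation can be sketched as a tree walk (illustrative assumptions: tuple-shaped nodes and a dictionary environment): instead of emitting target code, each node is evaluated immediately.

```python
# Sketch of an AST interpreter: the same tree a compiler would hand to code
# generation is walked and evaluated against an environment of variable values.
def evaluate(node, env):
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "var":
        return env[node[1]]
    if kind == ":=":                 # assignment updates the environment
        env[node[1]] = evaluate(node[2], env)
        return None
    left = evaluate(node[1], env)
    right = evaluate(node[2], env)
    if kind == "+":
        return left + right
    if kind == "-":
        return left - right
    return left < right              # "<" comparison
```

Evaluating every node on every execution is exactly why interpretation is slower than running generated machine code.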

4.0 - PL0 Basic Syntax

2_PL0-CSyntax.pdf

  • By default, we have two predefined types: boolean and int

  • Suppose we re-defined an integer variable res in the procedure sum, as indicated by the greyed-out declaration in the listing below

    • This would shadow the variable res defined at the start of the program.
  • Semicolons are used as statement separators, not statement terminators

    • In the first instance below, we need the separator between consecutive statements
    • In the second instance (the last statement before end) we don’t, as there is no following statement to separate it from.
    begin
      i := 0; res := 0;
      // Other statements
      i := i + 1
    end
    
  • When defining a while loop, if there are multiple statements in its body, it gets wrapped in a begin and end clause.

    while i != x do
      begin
      res := res + i;
      i := i + 1
      end
    
const C = 42; // in PL0 we can define constants
type          // We can also define types
  S = [0..C]; // Subrange 0 to 42
var
  b: boolean; // Boolean variable
  r: int;     // Integer variable
  x: S;       // Subrange type (defined above)
  res: int;   // Another integer
procedure sum() = 
  // Procedures start out with declarations
  var
    i: S; // Local variable to sum
    res: int; // Masks the scope of other `res`
  begin // Begin summing
    i := 0; res := 0;
    while i != x do
      begin
      res := res + i;
      i := i + 1
      end
    end
begin // main
  read r;
  b := (r <= C); // b is a boolean
  if b then
    x := r
  else
    x := 0;
  call sum();
  write res
end

4.1 - PL0’s Lexical Tokens

4.1.1 - An Aside on Regular Expressions

  • Lexical tokens are also called terminal symbols (as opposed to non-terminal symbols, terminal symbols are the leaf nodes of a parse tree)
  • Lexical tokens for PL0 are defined in the table below by grammar rules which have regular expressions as their right-hand sides
  • A regular expression is either
    • A string in quotes (strictly, the concatenation of its characters)
    • A sequence of regular expressions, representing their concatenation
    • A set of alternative regular expressions separated by “|”
    • A regular expression in parentheses, “(” and “)”, indicating a grouping
    • A regular expression followed by a “*”, indicating zero or more occurrences of the construct
  • Repetition has the highest precedence, followed by concatenation and then alternation (Aho 1, Sect 3.3), (Louden 4, Section 2.2)

4.1.2 - List of PL0 Lexical Tokens

  • There are two special tokens:
    1. EOF representing the end of an input file
    2. ILLEGAL representing any illegal token: Any character that doesn’t match the tokens defined below and isn’t part of a comment or whitespace.

    ASSIGN       → “:=”
    COLON        → “:”
    SEMICOLON    → “;”
    RANGE        → “..”
    LPAREN       → “(”
    RPAREN       → “)”
    LBRACKET     → “[”
    RBRACKET     → “]”
    EQUALS       → “=”
    NEQUALS      → “!=”
    GEQUALS      → “>=”
    GREATER      → “>”
    LEQUALS      → “<=”
    LESS         → “<”
    PLUS         → “+”
    MINUS        → “-”
    TIMES        → “*”
    DIVIDE       → “/”
    KW_BEGIN     → “begin”
    KW_CALL      → “call”
    KW_CONST     → “const”
    KW_DO        → “do”
    KW_ELSE      → “else”
    KW_END       → “end”
    KW_IF        → “if”
    KW_PROCEDURE → “procedure”
    KW_READ      → “read”
    KW_THEN      → “then”
    KW_TYPE      → “type”
    KW_VAR       → “var”
    KW_WHILE     → “while”
    KW_WRITE     → “write”
    NUMBER       → Digit Digit*
    IDENTIFIER   → Alpha (Alpha | Digit)*

  • Note here that the constructs Alpha and Digit are not lexical tokens for PL0; they are abbreviations for alphabetical characters and decimal digits, respectively, used in the definition of the tokens NUMBER and IDENTIFIER
    • DIGIT → “0” | “1” | “2” | “3” | “4” | “5” | “6” | “7” | “8” | “9”
    • ALPHA → “a” | “b” | “c” | “d” | “e” | “f” | “g” | “h” | “i” | “j” | “k” | “l” | “m” | “n” | “o” | “p” | “q” | “r” | “s” | “t” | “u” | “v” | “w” | “x” | “y” | “z” | “A” | “B” | “C” | “D” | “E” | “F” | “G” | “H” | “I” | “J” | “K” | “L” | “M” | “N” | “O” | “P” | “Q” | “R” | “S” | “T” | “U” | “V” | “W” | “X” | “Y” | “Z”
    • The regex definition of PL0’s NUMBER token is Digit Digit*, which means a digit followed by zero or more digits, i.e. one or more digits.
    • The regex definition of PL0’s IDENTIFIER token is Alpha (Alpha | Digit)*, which means an alphabetical character followed by zero or more alphabetical characters or digits.
  • To resolve ambiguities in the list of lexical tokens, the lexical analyser
    1. First matches the longest string, e.g. “>=” matches as GEQUALS rather than as “>” (GREATER) followed by “=” (EQUALS), and then
    2. Second matches the earliest of the rules, e.g. “begin” matches as KW_BEGIN rather than as IDENTIFIER; note, however, that “beginx” matches as IDENTIFIER because the longest match is no longer a keyword
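These two disambiguation rules can be demonstrated with Python regular expressions. This is a sketch of the behaviour, not how a real lexer is built: Python's alternation is leftmost-first, so listing “>=” before “>” emulates longest match for these tokens, and a word boundary on the keyword rule reproduces the “beginx” case.

```python
import re

# Rule 1 (longest match): ">=" must be tried before ">".
pattern = re.compile(r"(?P<GEQUALS>>=)|(?P<GREATER>>)|(?P<EQUALS>=)")

# Rule 2 (earliest rule): the keyword rule comes first, but \b stops it
# from matching a mere prefix of a longer identifier such as "beginx".
keyword_pattern = re.compile(
    r"(?P<KW_BEGIN>begin\b)|(?P<IDENTIFIER>[A-Za-z][A-Za-z0-9]*)")
```

Here `pattern.match(">=").lastgroup` is GEQUALS, `keyword_pattern.match("begin").lastgroup` is KW_BEGIN, and `keyword_pattern.match("beginx").lastgroup` is IDENTIFIER.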

5.0 - PL0 Compiler Example

3_PL0-CompilerExample.pdf

5.1 - Lexical Analysis

  • Suppose we have the following input (sequence of characters):

    if x < 0 then
      z := -x
    else
      z := x
    
  • The output of lexical analysis would be

    KW_IF, IDENTIFIER("x"), LESS, NUMBER(0), KW_THEN, 
    IDENTIFIER("z"), ASSIGN, MINUS, IDENTIFIER("x"), 
    KW_ELSE, IDENTIFIER("z"), ASSIGN, IDENTIFIER("x")
    
  • The grammar of the language can be used to extract the structure of the program statement (from its sequence of lexical tokens).

  • The structure can be described using a concrete syntax tree
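As a sketch of that extraction step (the grammar fragment and node shapes here are assumptions for illustration, not PL0's full grammar), a recursive-descent parser can rebuild the if-statement's structure from the token sequence above:

```python
# Recursive-descent sketch over an assumed grammar fragment:
#   statement -> "if" exp "then" statement "else" statement
#              | IDENTIFIER ":=" exp
#   exp       -> term ("<" term)?
#   term      -> "-" term | IDENTIFIER | NUMBER
# Tokens are (name, lexeme) pairs like the lexical-analysis output above.

def parse_statement(tokens, pos=0):
    """Parse one statement; return (tree, next_pos). Minimal error handling."""
    name, lexeme = tokens[pos]
    if name == "KW_IF":
        cond, pos = parse_exp(tokens, pos + 1)
        assert tokens[pos][0] == "KW_THEN", "expected 'then'"
        then_part, pos = parse_statement(tokens, pos + 1)
        assert tokens[pos][0] == "KW_ELSE", "expected 'else'"
        else_part, pos = parse_statement(tokens, pos + 1)
        return ("if", cond, then_part, else_part), pos
    assert name == "IDENTIFIER" and tokens[pos + 1][0] == "ASSIGN"
    exp, pos = parse_exp(tokens, pos + 2)
    return (":=", lexeme, exp), pos

def parse_exp(tokens, pos):
    left, pos = parse_term(tokens, pos)
    if pos < len(tokens) and tokens[pos][0] == "LESS":
        right, pos = parse_term(tokens, pos + 1)
        return ("<", left, right), pos
    return left, pos

def parse_term(tokens, pos):
    name, lexeme = tokens[pos]
    if name == "MINUS":                      # unary minus
        operand, pos = parse_term(tokens, pos + 1)
        return ("neg", operand), pos
    if name == "NUMBER":
        return ("num", int(lexeme)), pos + 1
    return ("var", lexeme), pos + 1

# The token sequence from the example above:
toks = [("KW_IF", "if"), ("IDENTIFIER", "x"), ("LESS", "<"), ("NUMBER", "0"),
        ("KW_THEN", "then"), ("IDENTIFIER", "z"), ("ASSIGN", ":="),
        ("MINUS", "-"), ("IDENTIFIER", "x"), ("KW_ELSE", "else"),
        ("IDENTIFIER", "z"), ("ASSIGN", ":="), ("IDENTIFIER", "x")]
tree, _ = parse_statement(toks)
```

Strictly, this builds an abstract syntax tree; a concrete syntax tree would also keep every keyword and punctuation token as a leaf.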