Phases of a Compiler
0.0 - Style Guide
Blue Text is reserved for definitions
Red Text is reserved for errors
Purple Text is reserved for things that may be referred to later.
1.0 - Course Profile
- Requires COMP3506, as we need to represent the state of our compiler (among other things) using an abstract syntax tree
- COMP3506 as a prerequisite also implies that you've done CSSE1001 and CSSE2002
2.0 - Phases of a Compiler
1_Phases_of_a_Compiler-Ian_Hayes_2019.pdf
- In the figure to the right, the phases of a compiler are represented by the blue boxes.
- The yellow boxes are the inputs and outputs to the compiler.

- Our first input to the compiler is the Source Code, which we will treat as a text file or a sequence of characters.
- We then perform lexical analysis; that is, we split that text file up into lexical tokens such as:
  - Keywords: if, then, else
  - Symbols: brackets, numbers, identifiers and variables
- We then perform syntax analysis, which is effectively the parsing phase of compilation
  - Identifies the structure of our program based on the lexical tokens obtained in the previous step
  - Declarations defined in the code, plus predefined declarations, are stored in the symbol table
  - In this phase, we also identify any syntactical errors / syntax errors that people have made
- We then perform type checking, or static analysis
  - Resolves all of the types in the symbol table
  - Updates the abstract syntax tree with type information and any coercions that need to occur
  - In this phase, we also identify any type errors that people have made
- We then perform code generation, which generates code for the target machine / architecture.
2.1 - Lexical Analysis
Given a series of characters representing a program, generate a series of lexical tokens.
- Input: A sequence of characters representing a program
- Output: A sequence of lexical tokens
- Lexical tokens: identifiers, numbers, keywords (if, while), symbols
- Ignores whitespace: blanks, tabs, newlines, carriage returns, form feeds
- Comments are treated as whitespace
- Whitespace is still important as it is used to delimit lexical tokens.
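The token-splitting described above can be sketched as follows. This is an illustrative Python lexer, not the course's actual implementation: the rule list is abbreviated (a full PL0 lexer covers the whole token table in section 4.1.2), and it resolves keyword/identifier overlap with word boundaries rather than a true longest-match pass.

```python
import re

# Abbreviated, hypothetical rule set; names follow the PL0 token table.
# Rules are tried in order, so keywords are listed before IDENTIFIER.
TOKEN_RULES = [
    ("WHITESPACE", r"[ \t\r\n\f]+"),     # delimits tokens, then is discarded
    ("NUMBER",     r"[0-9]+"),
    ("KW_IF",      r"if\b"),             # \b stops 'ifx' matching as a keyword
    ("KW_THEN",    r"then\b"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN",     r":="),
    ("LESS",       r"<"),
]

def tokenize(source):
    tokens, pos = [], 0
    while pos < len(source):
        for name, pattern in TOKEN_RULES:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "WHITESPACE":   # whitespace (and comments) dropped
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            tokens.append(("ILLEGAL", source[pos]))  # no rule matched
            pos += 1
    return tokens
```

For example, `tokenize("if x < 10 then x := 0")` yields the token kinds KW_IF, IDENTIFIER, LESS, NUMBER, KW_THEN, IDENTIFIER, ASSIGN, NUMBER.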
2.2 - Syntax Analysis
Given a series of lexical tokens, generate an abstract syntax tree and a symbol table.
- Input: A series of lexical tokens
- Output: An abstract syntax tree and symbol table
- Symbol table:
  - Contains important information about all identifiers that are defined within the program (plus a few predefined ones)
  - May be organised into scopes, e.g. identifiers defined within a procedure (may be organised hierarchically)
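A hierarchically scoped symbol table along these lines can be sketched as below. The class and entry format are illustrative assumptions, not the PL0 compiler's actual internals:

```python
# Minimal sketch: each scope holds its own entries and a link to the
# enclosing scope; lookup searches from the innermost scope outwards.
class SymbolTable:
    def __init__(self, parent=None):
        self.parent = parent       # enclosing scope; None for the top level
        self.entries = {}          # identifier -> info (e.g. its type)

    def define(self, name, info):
        self.entries[name] = info  # a redefinition masks any outer entry

    def lookup(self, name):
        scope = self
        while scope is not None:
            if name in scope.entries:
                return scope.entries[name]
            scope = scope.parent
        return None                # undeclared identifier

# Predefined declarations live in the outermost scope.
top = SymbolTable()
top.define("int", "predefined type")
top.define("res", "int variable")

proc = SymbolTable(parent=top)     # scope for a procedure body
proc.define("res", "local int variable")  # masks the outer `res`
```

Here `proc.lookup("res")` finds the local entry, while `proc.lookup("int")` falls through to the predefined one.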
2.3 - Type Checking (Static Semantic Analysis)
- Checking for type and other static semantic errors
- Input: Symbol table and abstract syntax tree
- Output: Updated symbol table and abstract syntax tree
- Resolves all references to identifiers
- Updates the symbol table entries with type information
- Checks abstract syntax tree for type correctness
- Updates abstract syntax tree with type coercions
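One such type-checking rule can be sketched as follows. The tuple-based node shapes and the `type_of` function are hypothetical illustrations (and only the int + int rule is shown, without coercions), not the course's actual AST classes:

```python
# Identifier types come from the symbol table built during syntax analysis.
symbol_table = {"i": "int", "b": "boolean"}

def type_of(node):
    kind = node[0]
    if kind == "num":                 # integer literal
        return "int"
    if kind == "id":                  # resolve the identifier reference
        return symbol_table[node[1]]
    if kind == "plus":                # check both operands are int
        left, right = type_of(node[1]), type_of(node[2])
        if left == right == "int":
            return "int"
        raise TypeError(f"cannot add {left} and {right}")
    raise ValueError(f"unknown node kind {kind!r}")
```

So `type_of(("plus", ("id", "i"), ("num", 1)))` succeeds with "int", while adding the boolean `b` raises a type error.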
2.4 - Code Generation
- Input: Symbol table and updated abstract syntax tree
- Output: Code for the target machine
- May include:
- Machine-independent optimisation
- Machine-dependent optimisations
- Instruction selection
- Register allocation
- At this point, we know that the syntax and static semantics of our program (source code) are correct.
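Code generation for expressions can be sketched as a postorder walk of the checked abstract syntax tree. The stack-machine instruction names and tuple node shapes here are illustrative assumptions, not the actual target architecture used in the course:

```python
# Emit stack-machine code: operands are generated first (postorder),
# then the operator that consumes them.
def gen(node, code):
    kind = node[0]
    if kind == "num":
        code.append(("LOAD_CON", node[1]))   # push a constant
    elif kind == "id":
        code.append(("LOAD_VAR", node[1]))   # push a variable's value
    elif kind == "plus":
        gen(node[1], code)
        gen(node[2], code)
        code.append(("ADD",))                # pop two, push the sum
    return code
```

For example, the tree for `i + 1` generates LOAD_VAR i, LOAD_CON 1, ADD.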
3.0 - Interpreters vs Compilers
- Interpreters are essentially the same as compilers, but they don't perform the last step of code generation; the outputs are the symbol table and abstract syntax tree.
3.1 - Interpreter
- Input: Symbol table and updated abstract syntax tree
- Interprets the abstract syntax tree directly to execute the program
- The program being interpreted may have inputs and outputs
- Less time compiling (no code generation)
- Slower to execute the program
- More commonly used for high-level dynamically typed languages.
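The interpreter alternative can be sketched as an evaluator that walks the abstract syntax tree directly instead of emitting target code. The tuple node shapes and the environment dictionary are illustrative assumptions:

```python
# Evaluate an expression tree directly; `env` maps variable names to
# their current values (the interpreter's run-time state).
def evaluate(node, env):
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "id":
        return env[node[1]]
    if kind == "plus":
        return evaluate(node[1], env) + evaluate(node[2], env)
    raise ValueError(f"unknown node kind {kind!r}")
```

Evaluating the tree for `i + 2` with `i = 40` returns 42. Each node is re-dispatched on every execution, which is one reason interpreted programs run slower than compiled ones.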
4.0 - PL0 Basic Syntax
2_PL0-CSyntax.pdf
By default, we have two predefined types: boolean and int.
- Suppose we re-defined an integer variable res in the procedure sum, as indicated by the greyed-out declaration in the listing below - this would shadow the variable res defined at the start of the code.
- Semicolons are used as statement separators and not statement terminators.
  - In the first instances below, we need the statement separator
  - In the last instance we don't, as there is no following statement to separate it from

  begin i := 0; res := 0; // Other statements i := i + 1 end

- When defining a while loop, if there are multiple statements in its body, the body gets wrapped in a begin and end clause.

  while i != x do begin res := res + i; i := i + 1 end
const C = 42;        // in PL0 we can define constants
type                 // We can also define types
    S = [0..C];      // Subrange 0 to 42
var
    b: boolean;      // Boolean variable
    r: int;          // Integer variable
    x: S;            // Subrange type (defined above)
    res: int;        // Another integer
procedure sum() =
    // Procedures start out with declarations
    var
        i: S;        // Local variable to sum
        res: int;    // Masks the scope of other `res`
    begin            // Begin summing
        i := 0; res := 0;
        while i != x do
        begin
            res := res + i;
            i := i + 1
        end
    end
begin // main
    read r;
    b := (r <= C);   // b is a boolean
    if b then
        x := r
    else
        x := 0;
    call sum();
    write res
end
4.1 - PL0βs Lexical Tokens
4.1.1 - An Aside on Regular Expressions
- Lexical tokens are also called terminal symbols (as opposed to non-terminal symbols, terminal symbols are the leaf nodes of a parse tree)
- Lexical tokens for PL0 are defined in the table below by grammar rules which have regular expressions as their right-hand sides
- A regular expression is either:
- A string in quotes (strictly the concatenation of characters)
- A sequence of regular expressions representing their concatenation
- A set of alternative regular expressions separated by '|'
- A regular expression in parentheses, '(' and ')', indicating a grouping
- A regular expression followed by a '*', indicating zero or more occurrences of the construct
- Repetition has the highest precedence followed by concatenation and then alternation (Aho 1, Sect 3.3), (Louden 4, Section 2.2)
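Python's `re` operators share these precedences, so the rule can be checked directly: repetition binds tighter than concatenation, which binds tighter than alternation, so `ab*|c` parses as `(a(b*))|c`, not `(ab)*|c` or `a(b|c)*`.

```python
import re

# 'ab*|c' = (a followed by zero or more b's) OR a single c
assert re.fullmatch(r"ab*|c", "abbb")        # a then repeated b
assert re.fullmatch(r"ab*|c", "c")           # the 'c' alternative
assert not re.fullmatch(r"ab*|c", "abab")    # '(ab)*' would need grouping

# Parentheses override the default precedence:
assert re.fullmatch(r"(ab)*|c", "abab")
```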
4.1.2 - List of PL0 Lexical Tokens
- There are two special tokens:
- EOF, representing the end of an input file
- ILLEGAL, representing any illegal token: any character that doesn't match the tokens defined below and isn't part of a comment or whitespace
ASSIGN → ':='
COLON → ':'
SEMICOLON → ';'
RANGE → '..'
LPAREN → '('
RPAREN → ')'
LBRACKET → '['
RBRACKET → ']'
EQUALS → '='
NEQUALS → '!='
GEQUALS → '>='
GREATER → '>'
LEQUALS → '<='
LESS → '<'
PLUS → '+'
MINUS → '-'
TIMES → '*'
DIVIDE → '/'
KW_BEGIN → 'begin'
KW_CALL → 'call'
KW_CONST → 'const'
KW_DO → 'do'
KW_ELSE → 'else'
KW_END → 'end'
KW_IF → 'if'
KW_PROCEDURE → 'procedure'
KW_READ → 'read'
KW_THEN → 'then'
KW_TYPE → 'type'
KW_VAR → 'var'
KW_WHILE → 'while'
KW_WRITE → 'write'
NUMBER → Digit Digit*
IDENTIFIER → Alpha (Alpha | Digit)*
- Note here that the constructs Alpha and Digit are not lexical tokens for PL0; They are abbreviations for decimal digits and alphabetical characters, respectively, used in the definition of the tokens NUMBER and IDENTIFIER
- Digit → '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
- Alpha → 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
- The regex definition of PL0's NUMBER token is Digit Digit*, which means a digit followed by zero or more digits, i.e. one or more digits.
- The regex definition of PL0's IDENTIFIER token is Alpha (Alpha | Digit)*, which means an alphabetical character followed by zero or more alphabetical characters or digits.
- To resolve ambiguities in the list of lexical tokens, the lexical analyser
  - First matches the longest string, e.g. '>=' matches as GEQUALS rather than as '>' (GREATER) followed by '=' (EQUALS), and then
  - Second, matches the earliest of the rules, e.g. 'begin' matches as KW_BEGIN rather than as IDENTIFIER, but note that 'beginx' matches as IDENTIFIER as 'begin' is no longer the longest match
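The two disambiguation rules can be sketched as a single matching step over the rule list. The token names follow the PL0 table above, but the matching loop itself is an illustrative sketch, not the course's lexer:

```python
import re

# Rules in declaration order; KW_BEGIN precedes IDENTIFIER.
RULES = [
    ("KW_BEGIN",   r"begin"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("GEQUALS",    r">="),
    ("GREATER",    r">"),
    ("EQUALS",     r"="),
]

def match_one(text):
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, text)
        # A strictly longer match wins (longest-match rule); on a tie,
        # the rule seen first is kept (earliest-rule rule).
        if m and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    return best[:2] if best else None
```

So `match_one(">=")` yields GEQUALS (longest match beats GREATER then EQUALS), `match_one("begin ")` yields KW_BEGIN (tie broken by rule order), and `match_one("beginx")` yields IDENTIFIER (the identifier match is longer).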
5.0 - PL0 Compiler Example
3_PL0-CompilerExample.pdf
5.1 - Lexical Analysis
Suppose we have the following input (sequence of characters):
if x < 0 then z := -x else z := x

The output of lexical analysis would be:

KW_IF, IDENTIFIER("x"), LESS, NUMBER(0), KW_THEN, IDENTIFIER("z"), ASSIGN, MINUS, IDENTIFIER("x"), KW_ELSE, IDENTIFIER("z"), ASSIGN, IDENTIFIER("x")

The grammar of the language can be used to extract the structure of the program statement (from its sequence of lexical tokens). The structure can be described using a concrete syntax tree.