Phases of a Compiler

Course: Compilers & Interpreters
Semester: S1 2022

0.0 - Style Guide

Blue Text is reserved for definitions

Red Text is reserved for errors

Purple Text is reserved for things that may be referred to later.

1.0 - Course Profile

  • Requires COMP3506: to represent the state of our compiler (among other things) we use an abstract syntax tree
    • COMP3506 as a prerequisite also implies that you’ve done CSSE1001 and CSSE2002

2.0 - Phases of a Compiler

1_Phases_of_a_Compiler-Ian_Hayes_2019.pdf

  • In the figure to the right, the phases of a compiler are represented by the blue boxes.
  • The yellow boxes are the inputs and outputs to the compiler.


  • Our first input to the compiler is the Source Code, which we will treat as a text file or a sequence of characters.
  1. We then perform lexical analysis, that is, we split the text file up into lexical tokens such as:
    • Keywords: if, then, else
    • Symbols: brackets and operators, plus numbers and identifiers (variable names)
  2. We then perform syntax analysis, which is effectively the parsing phase of compilation
    • Identify the structure of our program based on the lexical tokens obtained in the previous step
    • Declarations defined in the code + predefined declarations are stored in the Symbol Table.
    • In this phase, we also identify any syntax errors the programmer has made
  3. We then perform type checking (static semantic analysis)
    • Resolve all of the types in the symbol table
    • Update the abstract syntax tree with type information and any coercions that need to occur
    • In this phase, we also identify any type errors the programmer has made.
  4. We then perform code generation which generates code for the target machine / architecture.

2.1 - Lexical Analysis

🌱 Given a series of characters representing a program, generate a series of lexical tokens.

  • Input: A sequence of characters representing a program
  • Output: A sequence of lexical tokens
  • Lexical tokens: identifiers, numbers, keywords (if, while), symbols
  • Ignores whitespace: blanks, tabs, newlines, carriage returns, form feeds
  • Comments are treated as whitespace
  • Whitespace is still important as it is used to delimit lexical tokens.
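The bullet points above can be sketched as a small tokenizer. The following Python sketch is illustrative only, not the course's implementation: the rule set is abbreviated, Python's leftmost-first alternation stands in for the longest-match ordering (so ":=" must be listed before ":"), and keywords are recognised by post-checking identifiers.

```python
import re

# Illustrative token rules: (TOKEN_NAME, regex). Order matters because
# Python's "|" alternation is leftmost-first, not longest-match.
TOKEN_SPEC = [
    ("WHITESPACE", r"[ \t\r\n\f]+"),          # delimits tokens, then discarded
    ("COMMENT",    r"//[^\n]*"),              # comments are treated as whitespace
    ("NUMBER",     r"[0-9]+"),                # Digit Digit*
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),  # Alpha (Alpha | Digit)*
    ("ASSIGN",     r":="),
    ("NEQUALS",    r"!="),
    ("LEQUALS",    r"<="),
    ("LESS",       r"<"),
    ("PLUS",       r"\+"),
    ("MINUS",      r"-"),
    ("SEMICOLON",  r";"),
]
KEYWORDS = {"if", "then", "else", "begin", "end", "while", "do", "call"}

MASTER = re.compile("|".join(f"(?P<{n}>{r})" for n, r in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token_name, lexeme) pairs; ILLEGAL for junk."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:                          # no rule matched this character
            tokens.append(("ILLEGAL", text[pos]))
            pos += 1
            continue
        pos = m.end()
        if m.lastgroup in ("WHITESPACE", "COMMENT"):
            continue                           # skipped, never emitted
        name, lexeme = m.lastgroup, m.group()
        if name == "IDENTIFIER" and lexeme in KEYWORDS:
            name = "KW_" + lexeme.upper()      # "begin" -> KW_BEGIN, but
        tokens.append((name, lexeme))          # "beginx" stays IDENTIFIER
    return tokens
```

For example, `tokenize("i := i + 1")` yields IDENTIFIER, ASSIGN, IDENTIFIER, PLUS, NUMBER pairs, while an unmatched character becomes ILLEGAL.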

2.2 - Syntax Analysis

🌱 Given a series of lexical tokens, generate an abstract syntax tree and a symbol table.

  • Input: A series of lexical tokens
  • Output: An abstract syntax tree and symbol table
  • Symbol table
    • Contains important information about all identifiers that are defined within the program (plus a few predefined ones)
    • May be organised into scopes e.g. identifiers defined within a procedure (may be organised hierarchically)
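The hierarchical scope organisation can be sketched as a chain of dictionaries. The class and field names here are illustrative assumptions, not the course's implementation:

```python
# A hierarchically scoped symbol table, sketched as a chain of dictionaries:
# each scope maps identifiers to declaration info and links to its parent.
class Scope:
    def __init__(self, parent=None):
        self.parent = parent       # enclosing scope, or None at the top level
        self.entries = {}          # identifier -> declaration information

    def define(self, name, info):
        self.entries[name] = info  # a redefinition masks any outer declaration

    def lookup(self, name):
        """Search this scope first, then each enclosing scope outward."""
        scope = self
        while scope is not None:
            if name in scope.entries:
                return scope.entries[name]
            scope = scope.parent
        return None                # undeclared identifier

# Usage: a program-level `res`, then a procedure scope that masks it
# (mirroring the PL0 example later in these notes).
program = Scope()
program.define("res", {"kind": "var", "type": "int"})
sum_scope = Scope(parent=program)
sum_scope.define("res", {"kind": "var", "type": "int"})
sum_scope.define("i", {"kind": "var", "type": "S"})
```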

2.3 - Type Checking (Static Semantic Analysis)

  • Checking for type and other static semantic errors
  • Input: Symbol table and abstract syntax tree
  • Output: Updated symbol table and abstract syntax tree
  • Resolves all references to identifiers
    • Updates the symbol table entries with type information
    • Checks abstract syntax tree for type correctness
    • Updates abstract syntax tree with type coercions
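A toy version of these steps, under assumed tuple-shaped tree nodes and a single subrange type S (both illustrative, not the course's code): type correctness is checked bottom-up, and an explicit coercion node is inserted where a subrange value is used as an int.

```python
# Sketch of type checking with coercion insertion (illustrative assumptions).
SUBRANGES = {"S"}  # subrange types are widened to "int" in arithmetic

def check(node, symtab):
    """Return (typed_node, type); raise TypeError on a static semantic error."""
    kind = node[0]
    if kind == "num":
        return node, "int"
    if kind == "var":
        if node[1] not in symtab:
            raise TypeError(f"undeclared identifier {node[1]!r}")
        return node, symtab[node[1]]
    # binary arithmetic: both operands must end up as int
    left, lt = check(node[1], symtab)
    right, rt = check(node[2], symtab)
    if lt in SUBRANGES:
        left, lt = ("widen", left), "int"   # insert explicit coercion node
    if rt in SUBRANGES:
        right, rt = ("widen", right), "int"
    if lt != "int" or rt != "int":
        raise TypeError(f"operator {kind!r} applied to non-int operand")
    return (kind, left, right), "int"
```

For example, checking `x + 1` where `x` has subrange type S returns the tree with a `("widen", …)` node wrapped around `x`, while `b + 1` for a boolean `b` raises a type error.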

2.4 - Code Generation

  • Input: Symbol table and updated abstract syntax tree
  • Output: Code for the target machine
  • May include
    • Machine-independent optimisation
    • Machine-dependent optimisations
    • Instruction selection
    • Register allocation
  • At this point, we know that the syntax and static semantics of our program (source code) are correct.
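Instruction selection can be sketched against an assumed stack-machine target (the opcodes here are made up for illustration; a real target architecture differs): a postorder walk of the typed tree emits code that leaves each subexpression's value on the stack.

```python
# Sketch of instruction selection for expressions on a stack machine
# (illustrative opcodes, not a real target).
def gen(node):
    kind = node[0]
    if kind == "num":
        return [("LOAD_CON", node[1])]   # push a constant
    if kind == "var":
        return [("LOAD_VAR", node[1])]   # push a variable's value
    opcode = {"+": "ADD", "-": "SUB", "*": "MUL"}[kind]
    return gen(node[1]) + gen(node[2]) + [(opcode,)]

# res + i: push both operands, then add.
code = gen(("+", ("var", "res"), ("var", "i")))
```

Register allocation and the optimisation passes listed above would further transform this naive instruction sequence.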

3.0 - Interpreters vs Compilers

  • Interpreters are essentially the same as compilers, but they don’t perform the last step of code generation; the front end’s outputs (the symbol table and abstract syntax tree) are executed directly.

3.1 - Interpreter

  • Input: Symbol table and updated abstract syntax tree
  • Interprets the abstract syntax tree directly to execute the program
    • The program being interpreted may have inputs and outputs
  • Less time compiling (no code generation)
  • Slower to execute the program
  • More commonly used for high-level dynamically typed languages.
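Direct interpretation can be sketched as a tree walk (illustrative assumptions: tuple-shaped nodes and a dictionary environment): instead of emitting target code, each node is evaluated immediately.

```python
# Sketch of an AST interpreter: the same tree a compiler would hand to code
# generation is walked and evaluated against an environment of variable values.
def evaluate(node, env):
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "var":
        return env[node[1]]
    if kind == ":=":                 # assignment updates the environment
        env[node[1]] = evaluate(node[2], env)
        return None
    left = evaluate(node[1], env)
    right = evaluate(node[2], env)
    if kind == "+":
        return left + right
    if kind == "-":
        return left - right
    return left < right              # "<" comparison
```

Evaluating every node on every execution is exactly why interpretation is slower than running generated machine code.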

4.0 - PL0 Basic Syntax

2_PL0-CSyntax.pdf

  • By default, we have two predefined types: boolean and int

  • Suppose we re-defined an integer variable res in the procedure sum, as indicated by the greyed-out declaration in the listing below

    • This would shadow the variable res defined at the start of the program.
  • Semicolons are used as statement separators, not statement terminators

    • In the first instance below, we need the separator between consecutive statements
    • In the second instance (the last statement before end) we don’t, as there is no following statement to separate it from.
    begin
      i := 0; res := 0;
      // Other statements
      i := i + 1
    end
    
  • When defining a while loop, if there are multiple statements in its body, it gets wrapped in a begin and end clause.

    while i != x do
      begin
      res := res + i;
      i := i + 1
      end
    
const C = 42; // in PL0 we can define constants
type          // We can also define types
  S = [0..C]; // Subrange 0 to 42
var
  b: boolean; // Boolean variable
  r: int;     // Integer variable
  x: S;       // Subrange type (defined above)
  res: int;   // Another integer
procedure sum() = 
  // Procedures start out with declarations
  var
    i: S; // Local variable to sum
    res: int; // Masks the scope of other `res`
  begin // Begin summing
    i := 0; res := 0;
    while i != x do
      begin
      res := res + i;
      i := i + 1
      end
    end
begin // main
  read r;
  b := (r <= C); // b is a boolean
  if b then
    x := r
  else
    x := 0;
  call sum();
  write res
end

4.1 - PL0’s Lexical Tokens

4.1.1 - An Aside on Regular Expressions

  • Lexical tokens are also called terminal symbols (as opposed to non-terminal symbols, terminal symbols are the leaf nodes of a parse tree)
  • Lexical tokens for PL0 are defined in the table below by grammar rules which have regular expressions as their right-hand sides
  • A regular expression is either
    • A string in quotes (strictly, the concatenation of its characters)
    • A sequence of regular expressions, representing their concatenation
    • A set of alternative regular expressions separated by “|”
    • A regular expression in parentheses, “(” and “)”, indicating a grouping
    • A regular expression followed by a “*”, indicating zero or more occurrences of the construct
  • Repetition has the highest precedence, followed by concatenation and then alternation (Aho 1, Sect 3.3), (Louden 4, Section 2.2)

4.1.2 - List of PL0 Lexical Tokens

  • There are two special tokens:
    1. EOF representing the end of an input file
    2. ILLEGAL representing any illegal token: Any character that doesn’t match the tokens defined below and isn’t part of a comment or whitespace.

    ASSIGN       → “:=”
    COLON        → “:”
    SEMICOLON    → “;”
    RANGE        → “..”
    LPAREN       → “(”
    RPAREN       → “)”
    LBRACKET     → “[”
    RBRACKET     → “]”
    EQUALS       → “=”
    NEQUALS      → “!=”
    GEQUALS      → “>=”
    GREATER      → “>”
    LEQUALS      → “<=”
    LESS         → “<”
    PLUS         → “+”
    MINUS        → “-”
    TIMES        → “*”
    DIVIDE       → “/”
    KW_BEGIN     → “begin”
    KW_CALL      → “call”
    KW_CONST     → “const”
    KW_DO        → “do”
    KW_ELSE      → “else”
    KW_END       → “end”
    KW_IF        → “if”
    KW_PROCEDURE → “procedure”
    KW_READ      → “read”
    KW_THEN      → “then”
    KW_TYPE      → “type”
    KW_VAR       → “var”
    KW_WHILE     → “while”
    KW_WRITE     → “write”
    NUMBER       → Digit Digit*
    IDENTIFIER   → Alpha (Alpha | Digit)*

  • Note here that the constructs Alpha and Digit are not lexical tokens for PL0; they are abbreviations for alphabetical characters and decimal digits, respectively, used in the definition of the tokens NUMBER and IDENTIFIER
    • DIGIT → “0” | “1” | “2” | “3” | “4” | “5” | “6” | “7” | “8” | “9”
    • ALPHA → “a” | “b” | “c” | “d” | “e” | “f” | “g” | “h” | “i” | “j” | “k” | “l” | “m” | “n” | “o” | “p” | “q” | “r” | “s” | “t” | “u” | “v” | “w” | “x” | “y” | “z” | “A” | “B” | “C” | “D” | “E” | “F” | “G” | “H” | “I” | “J” | “K” | “L” | “M” | “N” | “O” | “P” | “Q” | “R” | “S” | “T” | “U” | “V” | “W” | “X” | “Y” | “Z”
    • The regex definition of PL0’s NUMBER token is Digit Digit*, which means a digit followed by zero or more digits, i.e. one or more digits.
    • The regex definition of PL0’s IDENTIFIER token is Alpha (Alpha | Digit)*, which means an alphabetical character followed by zero or more alphabetical characters or digits.
  • To resolve ambiguities in the list of lexical tokens, the lexical analyser
    1. First matches the longest string, e.g. “>=” matches as GEQUALS rather than as “>” (GREATER) followed by “=” (EQUALS), and then
    2. Second matches the earliest of the rules, e.g. “begin” matches as KW_BEGIN rather than as IDENTIFIER; note, however, that “beginx” matches as IDENTIFIER because the longest match is no longer a keyword
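These two disambiguation rules can be demonstrated with Python regular expressions. This is a sketch of the behaviour, not how a real lexer is built: Python's alternation is leftmost-first, so listing “>=” before “>” emulates longest match for these tokens, and a word boundary on the keyword rule reproduces the “beginx” case.

```python
import re

# Rule 1 (longest match): ">=" must be tried before ">".
pattern = re.compile(r"(?P<GEQUALS>>=)|(?P<GREATER>>)|(?P<EQUALS>=)")

# Rule 2 (earliest rule): the keyword rule comes first, but \b stops it
# from matching a mere prefix of a longer identifier such as "beginx".
keyword_pattern = re.compile(
    r"(?P<KW_BEGIN>begin\b)|(?P<IDENTIFIER>[A-Za-z][A-Za-z0-9]*)")
```

Here `pattern.match(">=").lastgroup` is GEQUALS, `keyword_pattern.match("begin").lastgroup` is KW_BEGIN, and `keyword_pattern.match("beginx").lastgroup` is IDENTIFIER.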

5.0 - PL0 Compiler Example

3_PL0-CompilerExample.pdf

5.1 - Lexical Analysis

  • Suppose we have the following input (sequence of characters):

    if x < 0 then
      z := -x
    else
      z := x
    
  • The output of lexical analysis would be

    KW_IF, IDENTIFIER("x"), LESS, NUMBER(0), KW_THEN, 
    IDENTIFIER("z"), ASSIGN, MINUS, IDENTIFIER("x"), 
    KW_ELSE, IDENTIFIER("z"), ASSIGN, IDENTIFIER("x")
    
  • The grammar of the language can be used to extract the structure of the program statement (from its sequence of lexical tokens).

  • The structure can be described using a concrete syntax tree
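As a sketch of that extraction step (the grammar fragment and node shapes here are assumptions for illustration, not PL0's full grammar), a recursive-descent parser can rebuild the if-statement's structure from the token sequence above:

```python
# Recursive-descent sketch over an assumed grammar fragment:
#   statement -> "if" exp "then" statement "else" statement
#              | IDENTIFIER ":=" exp
#   exp       -> term ("<" term)?
#   term      -> "-" term | IDENTIFIER | NUMBER
# Tokens are (name, lexeme) pairs like the lexical-analysis output above.

def parse_statement(tokens, pos=0):
    """Parse one statement; return (tree, next_pos). Minimal error handling."""
    name, lexeme = tokens[pos]
    if name == "KW_IF":
        cond, pos = parse_exp(tokens, pos + 1)
        assert tokens[pos][0] == "KW_THEN", "expected 'then'"
        then_part, pos = parse_statement(tokens, pos + 1)
        assert tokens[pos][0] == "KW_ELSE", "expected 'else'"
        else_part, pos = parse_statement(tokens, pos + 1)
        return ("if", cond, then_part, else_part), pos
    assert name == "IDENTIFIER" and tokens[pos + 1][0] == "ASSIGN"
    exp, pos = parse_exp(tokens, pos + 2)
    return (":=", lexeme, exp), pos

def parse_exp(tokens, pos):
    left, pos = parse_term(tokens, pos)
    if pos < len(tokens) and tokens[pos][0] == "LESS":
        right, pos = parse_term(tokens, pos + 1)
        return ("<", left, right), pos
    return left, pos

def parse_term(tokens, pos):
    name, lexeme = tokens[pos]
    if name == "MINUS":                      # unary minus
        operand, pos = parse_term(tokens, pos + 1)
        return ("neg", operand), pos
    if name == "NUMBER":
        return ("num", int(lexeme)), pos + 1
    return ("var", lexeme), pos + 1

# The token sequence from the example above:
toks = [("KW_IF", "if"), ("IDENTIFIER", "x"), ("LESS", "<"), ("NUMBER", "0"),
        ("KW_THEN", "then"), ("IDENTIFIER", "z"), ("ASSIGN", ":="),
        ("MINUS", "-"), ("IDENTIFIER", "x"), ("KW_ELSE", "else"),
        ("IDENTIFIER", "z"), ("ASSIGN", ":="), ("IDENTIFIER", "x")]
tree, _ = parse_statement(toks)
```

Strictly, this builds an abstract syntax tree; a concrete syntax tree would also keep every keyword and punctuation token as a leaf.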