Week 1.1

CourseCompilers & Interpreters
SemesterS1 2022

Week 1.1

🏷️ Replace this section with some keywords that can be used for searching, tagging and the indexing of this page.

Sections

🌱 Links to all Heading1 Blocks in this page.

  1. Phases of a Compiler
  2. Compilers vs Interpreters
  3. Syntax of PL0

1.0 - Phases of a Compiler

1_Phases_of_a_Compiler-Ian_Hayes_2019.pdf

  • The figure to the right shows the phases of a compiler as indicated in the blue boxes (the yellow boxes are the inputs and outputs to each phase of the compiler).
  • The first input to the compiler is the source code which is treated as a text file or sequence of characters.

Hayes, I - Phases of a Compiler

Hayes, I - Phases of a Compiler

A Note (From the Textbook)

🌱 I’m adding this in here as it helps me conceptualise the steps performed by a compiler.

  • The steps of a compiler goes through can be broken down into analysis and synthesis

  • In the analysis phase, the compiler breaks up the source program into constituent pieces (tokens) and imposes a grammatical structure on them.

    • Uses the tokens to create an intermediate representation of the source program (in the form of a syntax tree)

    • If a syntactic error is detected in the source program, the compiler must provide informative messages to the user

      Analysis={Lexical Analysis, Syntax Analysis, Type Checking}\text{Analysis}=\{\text{Lexical Analysis, Syntax Analysis, Type Checking}\}

  • In the synthesis phase, the compiler generates the program for the target machine using the intermediate representation (syntax tree) and the information contained in the symbol table.

    Synthesis={Code Generation}\text{Synthesis}=\{\text{Code Generation}\}

1.1 - Lexical Analysis

🌱 Lexical Analysis is the step in the compiler where the source code is converted into lexical tokens

  • Input A sequence of characters representing a program
  • Output Sequence of lexical tokens
  • The lexical analyser ignores whitespace, blanks, tabs, newlines, carriage returns, form feeds
    • Whitespace is still important, as it is used to delimit lexical tokens.
    • Comments are treated as whitespace
  • More Information

1.1.1 - Lexical Analysis Example

3_PL0-CompilerExample.pdf

  • Suppose we have the following input (sequence of characters):

    if x < 0 then 
    	z := -x 
    else 
    	z := x
    
  • The output of lexical analysis would be

    KW_IF, IDENTIFIER("x"), LESS, NUMBER(0), KW_THEN, 
    IDENTIFIER("z"), ASSIGN, MINUS, IDENTIFIER("x"), 
    KW_ELSE, IDENTIFIER("z"), ASSIGN, IDENTIFIER("x")
    
  • The grammar of the language can be used to extract the structure of the program statement (from its sequence of lexical tokens).

  • The structure can be described using a concrete syntax tree

1.2 - Syntax Analysis

🌱 Given a series of lexical tokens, generate an abstract syntax tree and a symbol table.

  • Input Series of lexical tokens generated by the lexical analysis phase.
  • Output An Abstract Syntax Tree (AST) and symbol table
  • Symbol Table contains important information about all of the identifiers that are defined in the program (+ a few predefined ones)
    • May be organised into scopes (e.g. identifiers within a procedure; may be organised hierarchically)
    • more information
  • Syntax Tree An intermediate representation of the target program that depicts the grammatical structure of the text input (source code)
  • In this phase, we also identify any syntactical errors in the source code.

1.3 - Type Checking (Static Semantic Analysis / Static Analysis)

🌱 Check for type and other static semantic errors

  • Input Symbol Table and Abstract Syntax Tree (AST)
  • Output Updated Symbol Table and Abstract Syntax Tree (AST)
  • Resolves all references to identifiers
    • Updates the symbol table entries with type information for each identifier

    • Checks abstract syntax tree for type correctness; type checking is performed

      • e.g. the compiler may check if the value for an array index is an integer, and will report an error if a floating-point number is used as an index into an array.
    • Updates abstract syntax tree with type coercions

      🌱 A binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point numbers. If the operator is applied to a floating-point number and an integer, the compiler may convert / coerce the integer into a floating-point number.

1.4 - Code Generation

🌱 Given the updated Symbol Table and Abstract Syntax Tree, generate code for the target machine.

  • Input Symbol Table and Updated Abstract Syntax Tree (AST)
  • Output Code for the target machine
  • The code generation phase may include:
    • Machine-independent optimisation
    • Machine-dependent optimisations
    • Instruction selection
    • Register allocation
  • At this point, we know that the syntax and static semantics of our program (source code) is correct.

2.0 - Compilers vs Interpreters

🌱 Interpreters are essentially the same as compilers, except they don’t perform the last step of Code Generation.

  • Compilers output programs compiled for a target machine

2.1 - About Interpreters

🌱 Interprets the abstract syntax tree directly to execute the program

  • Input Symbol Table and Updated Abstract Syntax Tree (AST)
  • Less time spent on the compilation step (as compared to a Compiler) as no code generated
  • Slower to execute the program (as code generation is done statement by statement as required)
  • More commonly used for high-level dynamically typed languages
  • Can typically give better error diagnostics than a compiler because it executes the statements in the source program sequentially.

3.0 - Syntax of PL0

🌱 PL0 is an educational language, used for teaching Compilers and Interpreters. It is a subset of the Pascal language.

  • There are two default types in Pascal - boolean and integer

    var
      b: boolean; // Boolean
      r: int;     // Integer
    
  • We can also define our own types:

    const C = 42; // A constant integer
    type
      S = [0..C];
    var
      x: S; // Subrange type
    
  • ln PL0, semicolons (;) are used as statement separators not statement terminators.

    begin
      i := 0; 
      res := 0;
      // Other statements here
      i := i + 1 // This statement doesn't need a semicolon as there no statements after
    end
    
  • When defining a while loop, if there are multiple statements in its body, it gets wrapped in a begin and end clause.

    while i != x do
       res += res + i;
    end
    
    
    // For multiple statements;
    while i != x do
      begin
      res := res + i;
      i := i + 1
      end
    end
    
const C = 42; // in PL0 we can define constants
type          // We can also define our own variable types
  S = [0..C]; // Subrange 0 to 42
var
  b: boolean; // Boolean variable
  r: int;     // Integer variable
  x: S;       // Subrange type (defined above)
  res: int;   // Another integer
procedure sum() = 
  // Procedures start out with declarations
  var
    i: S; // Local variable to sum
    res: int; // Masks the scope of other `res` (if we choose to define res again here)
  begin 
    i := 0; res := 0;
    while i != x do
      begin
      res := res + i;
      i := i + 1
      end // while loop contents
    end // while loop
begin // main
  read r;
  b := ( r <= C); // b is a boolean
  if b then
    x := r
  else
    x := 0;
  call sum();
  write res
end
ASSIGN β†’ β€œ:=”GEQUALS β†’ β€œ>=”KW_BEGIN β†’ β€œbegin”KW_TYPE β†’ β€œtype”
COLON β†’ β€œ:”GREATER β†’ β€œ>”KW_CALL β†’ β€œcall”KW_VAR β†’ β€œvar”
RANGE β†’ β€œβ€¦β€LEQUALS β†’ β€œ<=”KW_CONST β†’ β€œconst”KW_WHILE β†’ β€œwhile”
LPAREN β†’ β€œ(”LESS β†’ β€œ<”KW_DO β†’ β€œdo”KW_WRITE β†’ β€œwrite”
RPAREN β†’ β€œ)”PLUS β†’ β€œ+”W_END β†’ β€œend”NUMBER β†’ Digit Digitβˆ— [1]
LBRACKET β†’ β€œ[”PLUS β†’ β€œ+”KW_IF β†’ β€œif”IDENTIFIER β†’ Alpha (Alpha
RBRACKET β†’ β€œ]”MINUS β†’ β€œβˆ’β€KW_PROCEDURE β†’ β€œprocedure”
EQUALS β†’ β€œ=”TIMES β†’ β€œ*”KW_READ β†’ β€œread”
NEQUALS β†’ β€œ!=”DIVIDE β†’ β€œ/”KW_THEN β†’ β€œthen”

Concrete Syntax Tree Example