Week 2.2

CourseCompilers & Interpreters
SemesterS1 2022

Week 2.2

0.0 - Sections

  1. Recursive Descent Parsing

1.0 - Backus-Naur Form (BNF) and EBNF

1.1 - Backus-Naur Form (BNF)

  • The Backus-Naur Form (BNF) allows a context-free grammar to be described by a finite set of productions of the form

    NαN\rightarrow\alpha

    • Where NN is a single non-terminal symbol

                $\alpha$ is a (possibly empty) sequence of terminal and non-terminal symbols.
      

1.1.1 - BNF Syntax for Alternatives

  • The set of productions

    Nα1Nα2NαmN\rightarrow\alpha_1\\ N\rightarrow\alpha_2\\ \cdots\\ N\rightarrow\alpha_m\\

  • Can be written as

    Nα1α2αmN\rightarrow \alpha_1|\alpha_2|\cdots|\alpha_m

1.2 - Extended Backus-Naur Form (EBNF)

🌱 Whilst EBNF doesn’t extend the expressivity of BNF, it adds three extra constructs to our BNF form to make it easier to specify our grammars. Since these constructs don’t add any expressivity, we can convert any grammar defined in EBNF into BNF.

  1. An optional syntactic construct, written in square brackets

    RelConditionExp[RelOp Exp]\text{RelCondition}\rightarrow \text{Exp} [\text{RelOp Exp}]

  2. A repetition construct, written in curly brackets

    • The items in the brackets can appear 0 or more times.

    StatementListStatement{SEMICOLON Statement}\text{StatementList}\rightarrow\text{Statement}\{\text{SEMICOLON Statement}\}

  3. A grouping of constructs, written in parentheses

    TermFactor{(TIMES | DIVIDE) Factor}\text{Term}\rightarrow\text{Factor} \{\text{(TIMES | DIVIDE) Factor}\}

1.2.1 - Converting EBNF Constructs to BNF

  • We can replace the optional construct [S][S] by a new non-terminal OptSOptS

    • Essentially, anywhere where we have the construct [S][S] we replace with the new non-terminal symbol OptSOptS

    OptSSϵOptS\rightarrow S|\epsilon

    • For example, we can re-write the following production:

      RelConditionExp[RelOp Exp]\text{RelCondition}\rightarrow \text{Exp} [\text{RelOp Exp}]

    • as the following productions

      RelConditionExp OptRelExpOptRelExpRelOp Exp  ϵ\text{RelCondition}\rightarrow \text{Exp} \text{ OptRelExp}\\ \text{OptRelExp}\rightarrow \text{RelOp}\ \text{Exp}\ |\ \epsilon

  • We can replace the repetition construction {S}\{S\} by a new non-terminal RepSRepS

    RepS(S) RepS  ϵRepS\rightarrow (S)\ RepS\ |\ \epsilon

    • For example, we can re-write the following production

      TermFactor{(TIMES | DIVIDE)FACTOR}\text{Term}\rightarrow\text{Factor}\{(\text{TIMES | DIVIDE}) \text{FACTOR}\}

    • as the following production

      TermFactor RepFRepF((TIMES | DIVIDE) Factor) RepF  ϵ\text{Term}\rightarrow\text{Factor RepF}\\ \text{RepF}\rightarrow((\text{TIMES | DIVIDE) Factor) RepF}\ |\ \epsilon

    • Note here that we still have the EBNF grouping construct

      • We could remove the outer grouping construct without changing the definition of the production, however the same can’t be said for the inner grouping construct.
  • We can replace the grouping construct (S)(S) by a new non-terminal symbol GrpSGrpS

    GrpSSGrpS\rightarrow S

    • For example, we can re-write

      RepF(TIMES | DIVIDE) Factor RepF  ϵ\text{RepF}\rightarrow\text{(TIMES | DIVIDE) Factor RepF}\ |\ \epsilon

    • as

      RepFTermOp Factor RepF  ϵTermOp TIMES | DIVIDERepF\rightarrow TermOp\ Factor\ RepF\ |\ \epsilon\\ TermOp\rightarrow\ \text{TIMES | DIVIDE}

2.0 - Recursive-Descent Parsing

  • How can we use the grammar of a language to write a parser for our programming language?
    • We want to implement the Syntax Analysis Phase of our compiler
  • A recursive-descent parser is a recursive program to recognise sentences in a language:
    • Input is a stream of lexical tokens represented by tokens of type TokenStream (a Java class)
    • Each non-terminal symbol NN has a method parseN that:
      • Recognises the longest string of lexical tokens in the input stream, starting from the current token, derivable from NN;
      • As it parses the input it moves the current location forward;
      • When it has finished parsing NN, the current token is the token immediately following the last token matched as part of NN

2.1 - Pattern-Matching a Terminal Symbol

  • To recognise T, the parser calls the method tokens.match on the input token stream, with the terminal symbol Token.T as a parameter:

    ENBF Construct SRecogniser code recog(S)Notes
    Ttokens.match(Token.T);Make sure the token is what we want it to be, consume it and move on
    IDENTIFIERtokens.match(Token.IDENTIFIER)

2.2 - Pattern-Matching a Non-Terminal Symbol

  • To recognise a non-terminal symbol N, simply call the corresponding method

    EBNF Construct SRecogniser code recog(S)Notes
    NparseN();Call the corresponding parse method for the non-terminal symbol.
    RelConditionparseRelCondition();When parsing a RelCondition, we’d call the parseRelCondition(); method to parse it.

2.3 - Matching a Sequence of Terms

🌱 So we have these tokens that have been recognised. How do we parse it?

  • A sequence S1 S2  SnS_1\ S_2\ \cdots\ S_n of EBNF factors is recognised by code that recognises each in a sequence
EBNF Construct SRecogniser code recog(S)
S1 S2  SnS_1\ S_2\ \cdots\ S_nrecog(S1S_1)
recog(S2S_2)
recog(S3S_3)
LPAREN RelCondition RPARENtokens.match(Token.LPAREN);
parseRelCondition();
tokens.match(Token.RPAREN);

2.4 - Matching a list of Alternatives

🌱 To implement this, we need to look at the current token + the following tokens - it’s not necessarily always possible to write a recursive descent parser that just looks at the current token.

  • When matching a list of alternatives, S1S2SnS_1|S_2|\cdots|S_n, we need to know which alternative to choose
  • Let First(SiS_i) be the set of terminal symbols that can start the sequence SiS_i
  • We can predict this based on the value of the current token if: (i.e. without looking ahead)
    • First(SiS_i) and First(SjS_j) are disjoint for i,j1,,ni,j\in1,\cdots,n and iji\ne j
    • At most one alternative is nullable
    • If one of the alternatives is nullable, the set of terminal symbols that may immediately follow is disjoint from Sii1,,nS_i \forall i \in1,\cdots,n
    • A nullable symbol is a non-terminal symbol that can produce the empty sequence of symbols ϵ\epsilon
  • Assuming that our grammar satisfies these restrictions,
EBNF Construct SRecogniser code recog(S)
$S_1\\ S_2\
recog($S_1$)

} else if (tokens.isIn(First(S2S_2))) { recog(S2S_2) } else … \vdots } else if (tokens.isIn(First(SnS_n))) { recog(SnS_n) } else { /* Omitted if Nullable*/ errors.error(”Syntax Error”); } |

2.4.1 - Matching a List of Alternatives - Example

FactorLPAREN RelCondition RPAREN | NUMBER | LValueFactor\rightarrow\text{LPAREN RelCondition RPAREN | NUMBER | LValue}

void parseFactor(){
  if (tokens.isMatch(Token.LPAREN)) {
    tokens.match(Token.LPAREN);
    parseRelCondition();
    tokens.match(Token.RPAREN);
  } else if (tokens.isMatch(Token.NUMBER)) {
    tokens.match(Token.NUMBER);
  } else if (tokens.isMatch(Token.IDENTIFIER)) {
		parseLValue();
  } else {
    errors.error("Syntax Error");
    /* Or if we have a symbol that's Nullable, assume that we've matched the 
     * Nullable symbol.
     */
  }
}

2.5 - Matching an Optional Construct

  • When matching an optional construct, [S][S], we need to choose between recognising SS and matching the empty sequence.
  • We can predict this based on the value of the current token if
    • First(S) is disjoint from the set of tokens that immediately follows
  • Assuming that our grammar satisfies this restriction,
EBNF Construct SRecogniser code recog(S)
[S][S]if (tokens.isIn(First(S1S_1)) {
recog($S_1$)

} /* Else don’t do anything. */ |

2.5.1 - Matching an Optional Construct - Example

RelConditionExp[RelOp Exp]\text{RelCondition} \rightarrow\text{Exp[RelOp Exp]}

void parseRelCondition() {
  parseExp(); // Recognise the expression
  if (tokens.isIn(REL_OPS_SET)) { // Can the current token start the RelOp
    parseRelOp();
    parseExp();
  }
}

Where REL_OPS_SET contains the set of terminal symbols representing relational operators.


IfStatementKW_IF Condition KW_THEN Statement [KW_ELSE Statement]\text{IfStatement}\rightarrow\text{KW\_IF Condition KW\_THEN Statement [KW\_ELSE Statement]}

void parseIfStatement() {
  tokens.match(Token.KW_IF);
  parseCondition();
  tokens.match(Token.KW_THEN);
  parseStatement();
  if (tokens.isMatch(Token.KW_ELSE)) {
    tokens.match(Token.KW_ELSE);
    parseStatement();
  }
}

🌱 Whilst the grammar doesn’t satisfy the rules required, we can use our parser to resolve that ambiguity (matching else to the closest IF)

2.6 - Matching a Repetition Construct

  • When matching a repetition construct, {S}\{ S\}, we need to choose when to stop recognising occurrences of SS, and match the empty sequence instead.

  • We can predict this based on the value of the current token if:

    • First(SS) is disjoint from the set of tokens that immediately follow
  • Assuming that our grammar satisfies this restriction:

    EBNF Construct SRecogniser code recog(S)
    {S}\{S\}while (tokens.isIn(First(S))) {
      recog(S)
    

    } |

2.6.1 - Matching a Repetition Construct - Example

TermFactor{(TIMES | DIVIDE) Factor}\text{Term}\rightarrow\text{Factor\{(TIMES | DIVIDE) Factor\}}

void parseTerm(){
  parseFactor();
  while(tokens.isIn(TERMS_OPS_SET)){
		// Step 1: Match either a Tokens.TIMES or Tokens.DIVIDE or throw an error
    if(tokens.isMatch(Tokens.TIMES)){
      tokens.match(Tokens.TIMES);
    } else if (tokens.isMatch(Token.DIVIDE)) {
      tokens.match(Tokens.DIVIDE);
    } else {
      fatal("Internal Error); // Theoretically unreachable
    }
    // Step 2: Match the factor
		parseFactor();
  }
}

2.7 - Summary

  • We have outlined a systematic process for writing a recursive-descent parser that recognises sentences in the language of a context-free grammar (of a restricted form) written in EBNF
  • However, we are recognising the grammatical structure with a purpose, e.g. to
    • Construct an Abstract Syntax Tree or
    • Calculate the value of an expression

3.0 - Parsing and Computing Expressions

🌱 Let’s now write a recursive descent parser that not only recognises an expression, but calculates it assuming that there are no syntacticial errors.

  • The following is a simple grammar for calculator expressions:

    Exp[PLUS | MINUS] Term{(PLUS | MINUS) TERM}TermFactor {(TIMES | DIVIDE) Factor}FactorLPAREN Exp RPAREN | NUMBER\text{Exp}\rightarrow \text{[PLUS | MINUS] Term} \{(\text{PLUS | MINUS) TERM}\}\\ \text{Term}\rightarrow\text{Factor \{(TIMES | DIVIDE) Factor\}}\\ \text{Factor}\rightarrow\text{LPAREN Exp RPAREN | NUMBER}

  1. Firstly, let’s look at parsing a factor. The production for a Factor\text{Factor} is:

    FactorLPAREN Exp RPAREN | NUMBER\text{Factor}\rightarrow\text{LPAREN Exp RPAREN | NUMBER}

    parseFactor() {
      
      if (tokens.isMatch(Token.LPAREN)) {
        tokens.match(Token.LPAREN);
                 parseExp();
        tokens.match(Token.RPAREN);
      } else if (tokens.isMatch(Token.NUMBER)) {
        
        tokens.match(Token.NUMBER);
      } else {
        errors.error("Syntax error");
      }
    
    }
    
  2. Evaluate the expression and store the result in a local variable.

    parseFactor() {
      int result;
      if (tokens.isMatch(Token.LPAREN)) {
        tokens.match(Token.LPAREN);
        result = parseExp();
        tokens.match(Token.RPAREN);
      } else if (tokens.isMatch(Token.NUMBER)) {
        result = tokens.getIntValue(); // Get the integer value of the current token,
                                       // assuming it's a number
        tokens.match(Token.NUMBER);    // Moves to the next token, so the line before 
                                       // MUST be before
      } else {
        errors.error("Syntax error");
        result = 0x8080808080;         // Set to a random garbage value 
      }
      return result;
    }
    
  3. Next, let’s look at parsing a Term\text{Term}

    TermFactor {(TIMES | DIVIDE) Factor}\text{Term}\rightarrow\text{Factor \{(TIMES | DIVIDE) Factor\}}

    int parseTerm() {
      int result;
      result = parseFactor(); // Result returned if times or divide not in expression
      while (tokens.isIn(TERMS_OPS_SET))) { // Times or divide
        boolean times;
        if (tokens.isMatch(Token.TIMES)) {
          tokens.match(Token.TIMES);
          times = true;
        } else if (tokens.isMatch(Token.DIVIDE)) {
          tokens.match(Token.DIVIDE);
          times = false;
        } else {
          fatal("Internal Error"); // Unreachable.
        }
    		int factor = parseFactor(); // The inner factor
    		if (times) {
          result = result * factor;
        } else {
          result = result / factor;
        }
      }
    	return result;
    }
    
  4. Next, let’s look at parsing a Exp[PLUS | MINUS] Term{(PLUS | MINUS) Term}\text{Exp}\rightarrow\text{[PLUS | MINUS] Term\{(PLUS | MINUS) Term\}}

    int parseExp() {
      boolean isPlus = false; 
      boolean isMinus = false;
    
      if (tokens.isMatch(Token.PLUS)) {
        tokens.match(Token.PLUS);
        isPlus = true;
      } else if (tokens.isMatch(Token.MINUS)){
        tokens.match(Token.MINUS);
        isMinus = true;
      }
    
      int term = parseTerm();
      parseExp();
    }
    

4.0 - Building an Abstract Syntax Tree (AST)

  • As well as parsing a language, we would like to create a representation of the language in the form of an Abstract Syntax Tree (AST)
  • To do this, we change the parse method for a non-terminal NN, so that it returns the abstract syntax tree representation of NN, assuming that all of the parse method it calls returns the abstract syntax tree representations of the non-terminals they recognise.

4.1 - Building an Abstract Syntax Tree (AST) - Parsing Example

  • We’re now building an AST instead of parsing the expression
    • We return to the PL0 Compiler Data Structures document and use some of the data structure defined there to create our abstract syntax tree
    • Every expression node ExpNode has a location (Location loc) and a type (discussed later)
      • Location Line number and column number
    • We can represent our operators as BinaryNodes as they have two statements on either side of them.
      • A BinaryNode has the following arguments: arguments

        BinaryNode(Location loc, Type t, Operator op, ExpNode left, ExpNode right)
        

🌱 We now want our parseTerm method to return an ExpNode (Expression Node)

ExpNode parseTerm() {
  ExpNode result = parseFactor();
  while (tokens.isIn(TERM_OPS_SET)) {
    Operator op = Operator.INVALID_OP;
    Location opLoc = tokens.getLocation(); // Location of times or divide. Used in AST.
    if (tokens.isMatch(Token.TIMES)) {
      tokens.match(Token.TIMES);
      op = Operator.MUL_OP;
    } else if (tokens.isMatch(Token.DIVIDE)) {
      tokens.match(Token.DIVIDE);
      op = Operator.DIV_OP;
    } else {
      fatal("Internal error"); // Unreachable
    }
    ExpNode right = parseFactor();
    result = new ExpNode.BinaryNode(opLoc, op, result, right);
  }
  return result;
}

4.2 - Building an AST - Visual Example

🌱 Let’s build the AST for 23/42*3/4

  1. At the start, our current token points to 22.

    • So parseFactor() will return an AST for 22

    • Note that l1l_1 denotes Location 1 in this example but should be replaced by a line number and column number.

      Untitled

  2. Our current token now points to *

    • We recognise that it is either a TIMES or DIVIDE - and so we parse the sequence {(TIMES | DIVIDE) Factor}\text{\{(TIMES | DIVIDE) Factor\}}.

    • We know that the location for the start of this sequence is Location 2

    • We also know that the operation is a multiplication operation (MUL_OP)

      Untitled

  3. Our current token now points to 33

    • We now parse the Factor\text{Factor} component of the sequence {(TIMES | DIVIDE) Factor}\text{\{(TIMES | DIVIDE) Factor\}}

    Untitled

    • We now have the LHS and RHS components of our BinaryOperator, so we can add pointers to them to create the AST for the binary operator.

    Untitled

  4. Our current token now points to // (the divide operator)

    • Our current token is a divide symbol, so we can’t terminate the loop - we have to read in another occurrence of {(TIMES | DIVIDE) Factor}\text{\{(TIMES | DIVIDE) Factor\}}
    • From the information at hand, we know that the operator is located at Location 4 (l4l_4) and that it’s a division operator DIV_OP

    Untitled

    • In reading in the BinaryOperator token, we move our current token pointer to the factor 44.
    • As before, we parse it

    Untitled

  5. Our current pointer now points to nothing / null

    • As before, we put together our Abstract Syntax Tree

    Untitled

  6. We can terminate as the pointer is no longer pointing at a MUL_OP or DIV_OP

4.3 - Recursive Descent Parser So Far…

  • We have looked at:
    • A systematic process for writing a recursive-descent parser that recognises sentences in the language of a context-free grammar (of a restricted form) written in EBNF
    • How to build an abstract syntax tree for the sentence that we are recognising
  • But programmers make errors