0.0 - Sections

Recursive Descent Parsing

1.0 - Backus-Naur Form (BNF) and EBNF

1.1 - Backus-Naur Form (BNF)

The Backus-Naur Form (BNF) allows a context-free grammar to be described by a finite set of productions of the form
$N\rightarrow\alpha $$`$
Where $N$ is a single non-terminal symbol and $\alpha$ is a (possibly empty) sequence of terminal and non-terminal symbols.

1.1.1 - BNF Syntax for Alternatives

The set of productions
$N\rightarrow\alpha_1\\ N\rightarrow\alpha_2\\ \cdots\\ N\rightarrow\alpha_m\\$
Can be written as
$N\rightarrow \alpha_1|\alpha_2|\cdots|\alpha_m$

1.2 - Extended Backus-Naur Form (EBNF)

🌱 Whilst EBNF doesn’t extend the expressivity of BNF, it adds three extra constructs to our BNF form to make it easier to specify our grammars. Since these constructs don’t add any expressivity, we can convert any grammar defined in EBNF into BNF.

An optional syntactic construct, written in square brackets
$\text{RelCondition}\rightarrow \text{Exp} [\text{RelOp Exp}]$
A repetition construct, written in curly brackets
- The items in the brackets can appear 0 or more times.
$\text{StatementList}\rightarrow\text{Statement}\{\text{SEMICOLON Statement}\}$
A grouping of constructs, written in parentheses
$\text{Term}\rightarrow\text{Factor} \{\text{(TIMES | DIVIDE) Factor}\}$

1.2.1 - Converting EBNF Constructs to BNF

We can replace the optional construct $[S]$ by a new non-terminal $OptS$
- Essentially, anywhere where we have the construct $[S]$ we replace with the new non-terminal symbol $OptS$
$OptS\rightarrow S|\epsilon$
- For example, we can re-write the following production:
  $\text{RelCondition}\rightarrow \text{Exp} [\text{RelOp Exp}]$
- as the following productions
  $\text{RelCondition}\rightarrow \text{Exp} \text{ OptRelExp}\\ \text{OptRelExp}\rightarrow \text{RelOp}\ \text{Exp}\ |\ \epsilon$
We can replace the repetition construction $\{S\}$ by a new non-terminal $RepS$
$RepS\rightarrow (S)\ RepS\ |\ \epsilon$
- For example, we can re-write the following production
  $\text{Term}\rightarrow\text{Factor}\{(\text{TIMES | DIVIDE}) \text{FACTOR}\}$
- as the following production
  $\text{Term}\rightarrow\text{Factor RepF}\\ \text{RepF}\rightarrow((\text{TIMES | DIVIDE) Factor) RepF}\ |\ \epsilon$
- Note here that we still have the EBNF grouping construct
  - We could remove the outer grouping construct without changing the definition of the production, however the same can’t be said for the inner grouping construct.
We can replace the grouping construct $(S)$ by a new non-terminal symbol $GrpS$
$GrpS\rightarrow S$
- For example, we can re-write
  $\text{RepF}\rightarrow\text{(TIMES | DIVIDE) Factor RepF}\ |\ \epsilon$
- as
  $RepF\rightarrow TermOp\ Factor\ RepF\ |\ \epsilon\\ TermOp\rightarrow\ \text{TIMES | DIVIDE}$

2.0 - Recursive-Descent Parsing

How can we use the grammar of a language to write a parser for our programming language?
- We want to implement the Syntax Analysis Phase of our compiler
A recursive-descent parser is a recursive program to recognise sentences in a language:
- Input is a stream of lexical tokens represented by tokens of type TokenStream (a Java class)
- Each non-terminal symbol $N$ $N$ has a method parseN that:
  - Recognises the longest string of lexical tokens in the input stream, starting from the current token, derivable from $N$ ;
  - As it parses the input it moves the current location forward;
  - When it has finished parsing $N$ , the current token is the token immediately following the last token matched as part of $N$

2.1 - Pattern-Matching a Terminal Symbol

To recognise T, the parser calls the method tokens.match on the input token stream, with the terminal symbol Token.T as a parameter:
ENBF Construct S Recogniser code recog(S) Notes
T tokens.match(Token.T); Make sure the token is what we want it to be, consume it and move on
IDENTIFIER tokens.match(Token.IDENTIFIER)

ENBF Construct S	Recogniser code recog(S)	Notes
T	tokens.match(Token.T);	Make sure the token is what we want it to be, consume it and move on
IDENTIFIER	tokens.match(Token.IDENTIFIER)

2.2 - Pattern-Matching a Non-Terminal Symbol

To recognise a non-terminal symbol N, simply call the corresponding method

EBNF Construct S	Recogniser code recog(S)	Notes
N	parseN();	Call the corresponding parse method for the non-terminal symbol.
RelCondition	parseRelCondition();	When parsing a RelCondition, we’d call the parseRelCondition(); method to parse it.

2.3 - Matching a Sequence of Terms

🌱 So we have these tokens that have been recognised. How do we parse it?

A sequence $S_1\ S_2\ \cdots\ S_n$ of EBNF factors is recognised by code that recognises each in a sequence

EBNF Construct S	Recogniser code recog(S)
$S_1\ S_2\ \cdots\ S_n$	recog( $S_1$ )
recog( $S_2$ )
recog( $S_3$ )
LPAREN RelCondition RPAREN	tokens.match(Token.LPAREN);
parseRelCondition();
tokens.match(Token.RPAREN);

2.4 - Matching a list of Alternatives

🌱 To implement this, we need to look at the current token + the following tokens - it’s not necessarily always possible to write a recursive descent parser that just looks at the current token.

When matching a list of alternatives, $S_1|S_2|\cdots|S_n$ , we need to know which alternative to choose
Let First( $S_i$ ) be the set of terminal symbols that can start the sequence $S_i$
We can predict this based on the value of the current token if: (i.e. without looking ahead)
- First( $S_i$ ) and First( $S_j$ ) are disjoint for $i,j\in1,\cdots,n$ and $i\ne j$
- At most one alternative is nullable
- If one of the alternatives is nullable, the set of terminal symbols that may immediately follow is disjoint from $S_i \forall i \in1,\cdots,n$
- A nullable symbol is a non-terminal symbol that can produce the empty sequence of symbols $\epsilon$
Assuming that our grammar satisfies these restrictions,

EBNF Construct S	Recogniser code recog(S)
$S_1\	\ S_2\

recog($S_1$)

} else if (tokens.isIn(First( $S_2$ ))) { recog( $S_2$ ) } else … $\vdots$ } else if (tokens.isIn(First( $S_n$ ))) { recog( $S_n$ ) } else { /* Omitted if Nullable*/ errors.error(”Syntax Error”); } |

2.4.1 - Matching a List of Alternatives - Example

$Factor\rightarrow\text{LPAREN RelCondition RPAREN | NUMBER | LValue}$

void parseFactor(){
  if (tokens.isMatch(Token.LPAREN)) {
    tokens.match(Token.LPAREN);
    parseRelCondition();
    tokens.match(Token.RPAREN);
  } else if (tokens.isMatch(Token.NUMBER)) {
    tokens.match(Token.NUMBER);
  } else if (tokens.isMatch(Token.IDENTIFIER)) {
		parseLValue();
  } else {
    errors.error("Syntax Error");
    /* Or if we have a symbol that's Nullable, assume that we've matched the 
     * Nullable symbol.
     */
  }
}

2.5 - Matching an Optional Construct

When matching an optional construct, $[S]$ , we need to choose between recognising $S$ and matching the empty sequence.
We can predict this based on the value of the current token if
- First(S) is disjoint from the set of tokens that immediately follows
Assuming that our grammar satisfies this restriction,

EBNF Construct S	Recogniser code recog(S)
$[S]$	if (tokens.isIn(First( $S_1$ )) {

recog($S_1$)

} /* Else don’t do anything. */ |

2.5.1 - Matching an Optional Construct - Example

$\text{RelCondition} \rightarrow\text{Exp[RelOp Exp]}$

void parseRelCondition() {
  parseExp(); // Recognise the expression
  if (tokens.isIn(REL_OPS_SET)) { // Can the current token start the RelOp
    parseRelOp();
    parseExp();
  }
}

Where REL_OPS_SET contains the set of terminal symbols representing relational operators.

$\text{IfStatement}\rightarrow\text{KW\_IF Condition KW\_THEN Statement [KW\_ELSE Statement]}$

void parseIfStatement() {
  tokens.match(Token.KW_IF);
  parseCondition();
  tokens.match(Token.KW_THEN);
  parseStatement();
  if (tokens.isMatch(Token.KW_ELSE)) {
    tokens.match(Token.KW_ELSE);
    parseStatement();
  }
}

🌱 Whilst the grammar doesn’t satisfy the rules required, we can use our parser to resolve that ambiguity (matching else to the closest IF)

2.6 - Matching a Repetition Construct

When matching a repetition construct, $\{ S\}$ , we need to choose when to stop recognising occurrences of $S$ , and match the empty sequence instead.
We can predict this based on the value of the current token if:
- First( $S$ ) is disjoint from the set of tokens that immediately follow
Assuming that our grammar satisfies this restriction:
EBNF Construct S Recogniser code recog(S)
$\{S\}$ while (tokens.isIn(First(S))) {
```
  recog(S)
```
} |

EBNF Construct S	Recogniser code recog(S)
$\{S\}$	while (tokens.isIn(First(S))) {

2.6.1 - Matching a Repetition Construct - Example

$\text{Term}\rightarrow\text{Factor\{(TIMES | DIVIDE) Factor\}}$

void parseTerm(){
  parseFactor();
  while(tokens.isIn(TERMS_OPS_SET)){
		// Step 1: Match either a Tokens.TIMES or Tokens.DIVIDE or throw an error
    if(tokens.isMatch(Tokens.TIMES)){
      tokens.match(Tokens.TIMES);
    } else if (tokens.isMatch(Token.DIVIDE)) {
      tokens.match(Tokens.DIVIDE);
    } else {
      fatal("Internal Error); // Theoretically unreachable
    }
    // Step 2: Match the factor
		parseFactor();
  }
}

2.7 - Summary

We have outlined a systematic process for writing a recursive-descent parser that recognises sentences in the language of a context-free grammar (of a restricted form) written in EBNF
However, we are recognising the grammatical structure with a purpose, e.g. to
- Construct an Abstract Syntax Tree or
- Calculate the value of an expression

3.0 - Parsing and Computing Expressions

🌱 Let’s now write a recursive descent parser that not only recognises an expression, but calculates it assuming that there are no syntacticial errors.

The following is a simple grammar for calculator expressions:
$\text{Exp}\rightarrow \text{[PLUS | MINUS] Term} \{(\text{PLUS | MINUS) TERM}\}\\ \text{Term}\rightarrow\text{Factor \{(TIMES | DIVIDE) Factor\}}\\ \text{Factor}\rightarrow\text{LPAREN Exp RPAREN | NUMBER}$

Firstly, let’s look at parsing a factor. The production for a $\text{Factor}$ is:

$\text{Factor}\rightarrow\text{LPAREN Exp RPAREN | NUMBER}$

parseFactor() {
  
  if (tokens.isMatch(Token.LPAREN)) {
    tokens.match(Token.LPAREN);
             parseExp();
    tokens.match(Token.RPAREN);
  } else if (tokens.isMatch(Token.NUMBER)) {
    
    tokens.match(Token.NUMBER);
  } else {
    errors.error("Syntax error");
  }

}

Evaluate the expression and store the result in a local variable.

parseFactor() {
  int result;
  if (tokens.isMatch(Token.LPAREN)) {
    tokens.match(Token.LPAREN);
    result = parseExp();
    tokens.match(Token.RPAREN);
  } else if (tokens.isMatch(Token.NUMBER)) {
    result = tokens.getIntValue(); // Get the integer value of the current token,
                                   // assuming it's a number
    tokens.match(Token.NUMBER);    // Moves to the next token, so the line before 
                                   // MUST be before
  } else {
    errors.error("Syntax error");
    result = 0x8080808080;         // Set to a random garbage value 
  }
  return result;
}

Next, let’s look at parsing a $\text{Term}$

$\text{Term}\rightarrow\text{Factor \{(TIMES | DIVIDE) Factor\}}$

int parseTerm() {
  int result;
  result = parseFactor(); // Result returned if times or divide not in expression
  while (tokens.isIn(TERMS_OPS_SET))) { // Times or divide
    boolean times;
    if (tokens.isMatch(Token.TIMES)) {
      tokens.match(Token.TIMES);
      times = true;
    } else if (tokens.isMatch(Token.DIVIDE)) {
      tokens.match(Token.DIVIDE);
      times = false;
    } else {
      fatal("Internal Error"); // Unreachable.
    }
		int factor = parseFactor(); // The inner factor
		if (times) {
      result = result * factor;
    } else {
      result = result / factor;
    }
  }
	return result;
}

Next, let’s look at parsing a $\text{Exp}\rightarrow\text{[PLUS | MINUS] Term\{(PLUS | MINUS) Term\}}$

int parseExp() {
  boolean isPlus = false; 
  boolean isMinus = false;

  if (tokens.isMatch(Token.PLUS)) {
    tokens.match(Token.PLUS);
    isPlus = true;
  } else if (tokens.isMatch(Token.MINUS)){
    tokens.match(Token.MINUS);
    isMinus = true;
  }

  int term = parseTerm();
  parseExp();
}

4.0 - Building an Abstract Syntax Tree (AST)

As well as parsing a language, we would like to create a representation of the language in the form of an Abstract Syntax Tree (AST)
To do this, we change the parse method for a non-terminal $N$ , so that it returns the abstract syntax tree representation of $N$ , assuming that all of the parse method it calls returns the abstract syntax tree representations of the non-terminals they recognise.

4.1 - Building an Abstract Syntax Tree (AST) - Parsing Example

We’re now building an AST instead of parsing the expression
- We return to the PL0 Compiler Data Structures document and use some of the data structure defined there to create our abstract syntax tree
- Every expression node ExpNode has a location (Location loc) and a type (discussed later)
  - Location Line number and column number
- We can represent our operators as BinaryNodes as they have two statements on either side of them.
  - A BinaryNode has the following arguments: arguments
```
BinaryNode(Location loc, Type t, Operator op, ExpNode left, ExpNode right)
```

🌱 We now want our parseTerm method to return an ExpNode (Expression Node)

ExpNode parseTerm() {
  ExpNode result = parseFactor();
  while (tokens.isIn(TERM_OPS_SET)) {
    Operator op = Operator.INVALID_OP;
    Location opLoc = tokens.getLocation(); // Location of times or divide. Used in AST.
    if (tokens.isMatch(Token.TIMES)) {
      tokens.match(Token.TIMES);
      op = Operator.MUL_OP;
    } else if (tokens.isMatch(Token.DIVIDE)) {
      tokens.match(Token.DIVIDE);
      op = Operator.DIV_OP;
    } else {
      fatal("Internal error"); // Unreachable
    }
    ExpNode right = parseFactor();
    result = new ExpNode.BinaryNode(opLoc, op, result, right);
  }
  return result;
}

4.2 - Building an AST - Visual Example

🌱 Let’s build the AST for $2*3/4$

At the start, our current token points to $2$ .
- So parseFactor() will return an AST for $2$
- Note that $l_1$ denotes Location 1 in this example but should be replaced by a line number and column number.
Our current token now points to $*$
- We recognise that it is either a TIMES or DIVIDE - and so we parse the sequence $\text{\{(TIMES | DIVIDE) Factor\}}$ .
- We know that the location for the start of this sequence is Location 2
- We also know that the operation is a multiplication operation (MUL_OP)
Our current token now points to $3$
- We now parse the $\text{Factor}$ component of the sequence $\text{\{(TIMES | DIVIDE) Factor\}}$
- We now have the LHS and RHS components of our BinaryOperator, so we can add pointers to them to create the AST for the binary operator.
Our current token now points to $/$ (the divide operator)
- Our current token is a divide symbol, so we can’t terminate the loop - we have to read in another occurrence of $\text{\{(TIMES | DIVIDE) Factor\}}$
- From the information at hand, we know that the operator is located at Location 4 ( $l_4$ ) and that it’s a division operator DIV_OP
- In reading in the BinaryOperator token, we move our current token pointer to the factor $4$ .
- As before, we parse it
Our current pointer now points to nothing / null
- As before, we put together our Abstract Syntax Tree
We can terminate as the pointer is no longer pointing at a MUL_OP or DIV_OP

4.3 - Recursive Descent Parser So Far…

We have looked at:
- A systematic process for writing a recursive-descent parser that recognises sentences in the language of a context-free grammar (of a restricted form) written in EBNF
- How to build an abstract syntax tree for the sentence that we are recognising
But programmers make errors