Recursive Descent Parser

CourseCompilers & Interpreters
SemesterS1 2022

Recursive Descent Parser

These notes are based on the Recursive Descent Parser notes by Ian Hayes.

  1. Extended Backus Naur Form
  2. Recursive Descent Parsing
  3. Syntax Error Recovery
  4. LL(1) Grammars

Extended Backus Naur Form

  • Backus-Naur Form allows a context-free grammar to be described by a set of productions, each of the form NαN\rightarrow\alpha, where the:
    • left side consists of a single non-terminal symbol NN, and
    • right side consists of a possibly empty sequence of terminal and non-terminal symbols, α\alpha.
  • For example, a production in the syntax for an expression follows:

ExpExp +" Term Exp \rightarrow Exp\ ``+"\ Term

  • To allow the grammar to be represented more succinctly, BNF allows the right side of a production to consist of a set of alternatives, separated by ‘|’, as follows:

Nα1  α2    αm \def\or{\ |\ } N\rightarrow\alpha_1\or\alpha_2\or\cdots\or\alpha_m

  • Note that this doesn’t add any expressive power, because it could have been written as a set of productions for N, one for each right side alternative.

  • Extended Backus-Naur Form (EBNF) is a notation invented to allow grammars to be expressed even more succinctly.

  • It extends BNF to allow three additional constructs:

    1. optional syntactic construct, contained within square brackets,
    2. repetition of constructs, contained within curly braces, and
    3. grouping of constructs, within parentheses
  • It is important to note that a grammar written in EBNF can be converted to BNF

    • Therefore, EBNF doesn’t add any expressive power, as an EBNF grammar can be converted to an equialent BNF grammar.

Converting EBNF Constructs to BNF

Converting the Optional Construct to BNF

An optional construct [ S ][\ S\ ] occurring in the right side of a grammar rule (production) is replaced by a new non-terminal, OptSOptS, which is defined as follows:

OptSS  ϵ(1)\tag{1}OptS\rightarrow S\ |\ \epsilon


Where ϵ\epsilon is the empty string. For example, consider the rule RelConditionRelCondition:

RelConditionExp[ RelOp Exp ]RelCondition\rightarrow Exp [\ RelOp\ Exp\ ]

We can re-write this as:

RelConditionExp OptRelExpOptRelExpRelOpExp  ϵ \def\or{\ |\ } \begin{aligned} RelCondition &\rightarrow Exp\ OptRelExp\\ OptRelExp &\rightarrow RelOp Exp \or \epsilon \end{aligned}

Note that in this specific case, we could shorten this to:

RelConditionExp RelOp ExpExpRelCondition\rightarrow Exp\ RelOp\ Exp | Exp

Converting the Repetition Construct to BNF

A repetition construct { S }\{ \ S\ \} occurring in the ride side of a grammar rule is replaced by a new non-terminal symbol RepSRepS, which is defined as follows.

RepS(S)RepSϵ(2)\tag{2}RepS \rightarrow (S) RepS | \epsilon

The grouping parentheses around the SS are needed int he case that SS consists of a set of alternatives, in which case, leaving out the parentheses gives:

RepSA  BRepS  ϵ(3)\tag{3} \def\or{\ |\ } RepS \rightarrow A \or B RepS \or \epsilon

This is incorrect, because the concatenation operation has higher precedence than alternation. The correct version is as follows:

RepS(A  B)RepS  ϵ(4)\tag{4} \def\or{\ |\ } RepS \rightarrow (A \or B) RepS \or \epsilon


For example, the rule for TermTerm is given as:

TermFactor{(TIMES|DIVIDE)Factor}Term\rightarrow Factor \{(\text{TIMES|DIVIDE}) Factor\}

We could re-write this as:

TermFactor RepFRepF((TIMES|DIVIDE)Factor)RepFϵ(5) Term\rightarrow Factor\ RepF\\ \tag{5} RepF \rightarrow ((\text{TIMES|DIVIDE}) Factor) RepF | \epsilon

In this case, the outer parentheses introduced by the rule are redundant, and can be omitted.

Converting the Grouping Construct to BNF A grouping construct ( S )(\ S\ ) occurring in the right side of a grammar rule is replaced by a new non-terminal GrpSGrpS, defined as follows:

GrpSS(6)\tag{6}GrpS\rightarrow S


For example, the rule for RepFRepF is defined above as:

RepF(TIMES|DIVIDE)Factor)RepFϵRepF \rightarrow (\text{TIMES|DIVIDE}) Factor) RepF | \epsilon

We can re-write this as:

RepFTermOp Factor RepF  ϵTermOpTIMES|DIVIDE \def\or{\ |\ } RepF\rightarrow TermOp\ Factor\ RepF \or \epsilon\\ TermOp\rightarrow \text{TIMES|DIVIDE}

Recursive Descent Parsing

Provided that the EBNF grammar for a language is in a restricted form, it can be used to derive a recursive-descent parser. (If a grammar is an LL(1) grammar, it can be parsed using recursive-descent parsing).

A recursive descent parser is a recursive program to recognise sentences in a language. It consists of a set of methods - one for each non-terminal symbol. The input to the recursive descent parser is assumed to be a stream of lexical (terminal) symbols, and the parse begins with the current token beign the first token in the input stream. The token stream in the parser is represented by the variable tokens, of type TokenStream.

For each non-terminal symbol, NN, there is an associated method parseN that recognises the longest stream of lexical tokens in the input stream, starting from the current token that can be derived from the non-terminal NN. As it parses the input, it moves the current location forward, and when it has finished parsing NN, the current token is the token immediately following the last token matched as part of NN.

Example Recursive Descent Parser

Consider the following grammar:

RelConditionExp [ RelOp Exp ]RelOpEQUALS | NEQUALS | LEQUALS | LESS | GREATER | GEQUALSExp[PLUS|MINUS] Term {(PLUS|MINUS) Term}TermFactor {TIMES | DIVIDE}Factor}FactorLPAREN RelCondition RPAREN  NUMBER  LValueLValueIDENTIFIER\def\rq{\text{\textquoteright\ }} \def\minus{\ \lq - \rq} \def\plus{\ \lq + \rq} \def\or{\ |\ } \begin{aligned} RelCondition &\rightarrow Exp\ [\ RelOp\ Exp\ ]\\ RelOp &\rightarrow \text{EQUALS | NEQUALS | LEQUALS | LESS | GREATER | GEQUALS}\\ Exp &\rightarrow [\text{PLUS|MINUS}]\ Term\ \{(\text{PLUS|MINUS})\ Term\}\\ Term &\rightarrow Factor\ \{\text{TIMES | DIVIDE}\} Factor\}\\ Factor &\rightarrow LPAREN\ RelCondition\ RPAREN \or NUMBER \or LValue\\ LValue &\rightarrow IDENTIFIER \end{aligned}

We create the following methods for its recursive-descent parser.

Concepts

Matching a Terminal Symbol

To recognise a terminal symbol TT, the parser calls the method tokens.match on the input token stream, with the terminal symbol Token.T as the parameter:

tokens.match(Token.T);

If the current input token is TT, the match is successful, and the location in the input stream is moved to the next token. If the current token is not TT, a syntax error message is generated.

Recognising as Sequence

A sequence, S1S2SnS_1 S_2 \cdots S_n, of EBNF factors is recognised by code that recognises each in sequence.

recog(S_1);
recog(S_2);
recog(S_3);
...
recog(S_n);

If the sequence of EBNF factors to be matched is empty, then no code is required to match it. For example, one alternative for a FactorFactor in the expression grammar defined above, is the sequence LPAREN RelCondition RPAREN which can be recognised by the following code sequence

tokens.match(Token.LPAREN);
parseRelCondition();
token.match(Token.RPAREN);

Recognising Alternatives

To recognise a set of EBNF alternatives, S1S2SnS_1|S_2|\cdots|S_n, we need to first determine which alternative to use. Our (predictive) recursive-descent parser predicts which alternative to recognised based on just the current token in the input stream. For a particlar alternative, SiS_i to be chosen, either:

  1. The current token is in the set of terminal symbols, First(Si)First(S_i) that can start SiS_i or,
  2. SiS_i is nullable, and the current token can validly follow in the context of the complete set of alternatives [1]

For this approach to work, we must restrict the language grammar, such that:

  1. Any given token is in the first set of at most one laternative, i.e. the sets of first symbols of each pair of alternatives are disjoint
  2. At most one alternative is nullable
  3. If there is a nullable alternative, the set of symbols that may immediately follow in the contexdt of the complete set of alternatives is disjoint from the first sets of all alternatives.

For more information about what restrictions we must place on a grammar for us to be able to use a recursive descent parser on it, look at the section on LL(1) Grammars [1] One of the requirements for an LL(1) grammar is that only a single non-terminal symbol may be nullable.

If none of the alternatives are nullable, the code to parse the set of alternatives is as follows, in which the isIn(S) method checks if the current token is in the set of tokens S.

  • If one of the alternatives are nullable, the code for the recogniser is the same as the above, except we leave of the last “else” clause, which gives an error message.
if (tokens.isIn(First(S_1))) {
  recog(S_1);
} else if (tokens.isIn(First(S_2))) {
  recog(S_2);
} else ...
  ...
} else if (tokens.isIn(First(S_n))) {
  recog(S_n);
} else {
  errors.error("Syntax error");
}
  • Note that if a first set First(S_i) only contains a single token TiT_i, we can replace the call tokens.isIn(First(S_i)) with tokens.isMatch(T_i).

Recognising an Optional

An EBNF optional construct is of the form [ S ][\ S\ ], where SS is an EBNF expression.It can either match SS or the empty sequence, and hence it is similar to matching alternatives. The code to match [ S ][\ S\ ] is as follows:

if (tokens.isIn(First(S))) {
  recog(S);
}

To be unambiguous, First(S) should be disjoint from the set of tokens that can immediately follow the optional construct. For example, the production for RelConditionRelCondition conatins an optiona. The code for the parser for a RelConditionRelCondition is given as:

void parseRelCondition() {
  parseExp();
  if (tokens.isIn(REL_OPS_SET)) {
      parseRelOp(); // This cannot fail, as we know that the current token can start a RelOp
      parseExp();
  }
}

Recognising a Repetition

An EBNF repetition construct is of the form { S }\{\ S\ \}, where SS is an EBNF expression. It can recognise a sequence of zero or more occurrences of SS. The code to recognise { S }\{\ S\ \} consists of a loop.

while(tokens.isIn(First(S))) {
  recog(S);
}

To be unambiguous, the first set of S should be disjoint from the set of tokens that can follow the repetition construct.

Recognising Grouping

In EBNF, parentheses are used to represent grouping, and thus override the default precedence of EBNF, such as S1(S2S3)S_1(S_2|S_3). In generating the recognition code for such EBNF expressions, we must respect the grouping. Normally, we do not have to do anything extra in the code, because the generated code already has Java grouping curly braces. The parsing method in Term contains an example of recognising a group.

Removing Redundant Code

Using the rules defined above, the code for recognising [PLUS | MINUS][\text{PLUS | MINUS}] at the beginning of parseExpparseExp should be:

if (tokens.isIn(EXP_OPS_SET)) {
  if (tokens.isMatch(Token.PLUS))) {
    tokens.match(Token.PLUS);
  } else if (tokens.isMatch(Token.MINUS)) {
    tokens.match(Token.MINUS);
  } else {
    // Unreachable due to condition in outer if.
    errors.error("Syntax error");
  }
}

However, we can optimise this by removing the outer “if” test, and the final “else” branch to give the following code:

if (tokens.isMatch(Token.PLUS))) {
  tokens.match(Token.PLUS);
} else if (tokens.isMatch(Token.MINUS)) {
  tokens.match(Token.MINUS);
}

Syntax Error Recovery

Focus of this type of error recovery is to handle simple, common errors. Syntax error recovery during recursive descent parsing can be accomplished by

  1. Local error recovery on matching a single token; and
  2. Synchronising the input stream at the start of each parse method.

Local Error Recovery

On matching a single token, we can handle:

  • A (single) token missing from the input stream
  • An additiona token erroneously inserted into the input stream, or
  • A single erroneous token replacing the expected token in the input stream.

To handle local error recovery while matching a token, TT, the match method requires a second parameter, followSet that is the set of tokens that can immediately follow TT in the context in which TT is being matched. The followSet is only used iwhen the current token doesn’t match T, in which case an error message is generated and followSet is used for syntax error recovery as follows:

  • If the current token is in followSet, the syntax error recovery is to assume that TT was missing from the input, and hence no further action is taken.
  • If the current token is not in followSet, the current input token is skipped (and thus becoming the previous token), and then:
    • If the current token is now TT the recovery is to assume that the previous token was erroneously inserted into the input stream, and the recovery action is to match the (new) current token.
    • If the current token is not TT, the recovery is to assume that the previous token in the input stream was supposed to be TT but was replaced by an erroneous token; no further error recovery action is required.

Local Error Recovery for Recogniser - While Statement

The following is an example of parsing a while statement with local error recovery (this doesn’t include input stream synchronisation!)

void parseWhileStatement(TokenSet recoverSet) {
  tokens.match(Token.KW_WHILE, CONDITION_START_SET);
  parseCondition(recoverSet.union(Token.KW_DO));
  tokens.match(Token.KW_DO, STATEMENT_START_SET);
  parseStatement(recoverSet);
}

Synchronising the Input Stream

At the start of a parsing method for a non-terminal symbol NN, the current token is valid if:

  1. It can start NN, i.e. it is in First(N)First(N), or
  2. N is nullable, and the current token may follow N in the context in which N is being recognised.

Before trying to match a non-terminal symbol NN, we synchronise the input to ensure that the current token is valid using the method parse whose header is first. The possible headers for the parse method are as follows (here, we’re specifically referring to the first header).

// Normal case, as described in the section below.
void parse(String rule, TokenSet expected, TokenSet recoverSet, ParseVoid parser);
// Used when `expected` is a single token.
void parse(String rule, Token expected, TokenSet recoverSet, ParseVoid parser);
// Used when the parser is required to return a value of type Node
Node parse(String rule, TokenSet expected, TokenSet recoverSet, ParseVoid parser);
// Used when the parser is required to return a value of type Node, and `expected` is a single token.
Node parse(String rule, Token expected, TokenSet recoverSet, ParseVoid parser);

The parameters to the parse method is:

  1. String rule String containing the name of the parsing rule, used in error (and debugging) messages.
  2. Token(Set) expected Set of tokens (or singular token) that are valid at the start of parsing N
    • if NN is not nullable, expected=First(N)\text{expected}=\text{First}(N)
    • if NN is nullable, expected=First(N)Follow(N)\text{expected}=\text{First}(N)\cup\text{Follow}(N) - this can be approximated by the recoverSet in most cases.
  3. TokenSet recoverSet the set of tokens that are expected to follow N, in the context in which N is being parsed.
    • Since recoverSet depends on the context in which NN is to be recognised, it is passed as a parameter to the parse method for each non-terminal
    • Note that recoversetFollow(N)\text{recoverset}\ne\text{Follow}(N)
  4. ParseVoid parser A lambda / anonymous function with no parameter body, and a (code) body consisting of the code to recognise the non-terminal symbol.

We modify the existing parse methods (without synchronisation error recovery) to the following form:

/**
 * Generic form of a parse method with synchronisation error recovery.
 */
void parseN(TokenSet recoverSet) {
  parse("N", N_START_SET, recoverSet,
    () -> { // Guaranteed that the current token is in N_START_SET
      recog(N);
  });
}

If on a call to parse, the current token is in the set expected, there is no syntax error and the synchronisation method parse just calls the parser passed as its fourth parameter parser and attempts to recognise NN.

If on a call to parse, the curren token is not in the set expected, a syntax error message is generated, and the recovery action is to skip tokens until either:

  1. A token is found that is valid at the start of NN, (i.e. it is in expected), in which case, parse will proceed to recognise NN from that point by calling the lambda function passed in as the fourth parameter.
  2. A token that is not in expected but is in recoverSet is found, in which case parse runs thorugh as if it had recognised NN (without really doing so), and for the case where it needs to return a result, it returns an error node / value.

The Parse Method

The method parse implements this skipping / synchronisation strategy by skipping tokens until a token is found that is either in expected or recoverSet.

  • If the token is in expected, (i.e. the first case above, when synchronisation succeeds), the parse method calls the lambda function for NN and proceeds to (try to) parse N.
  • If the token is not in expected, (and hence is in the recoverSet), parse exits as though it has already parsed NN, but returning an error message instead.

Synchronisation at the end of Parse Methods

At the end of the parse method for each non-terminal, NN< afterNN has been recognised by the call on the lambda function parser, the parse method syncvhronises the input stream so that the current token is one that is expected to follow NN (i.e. it is in the recoverSet). If at the end the current token is in the recoverSet, the parse method does nothing. Otherwise, an error message is generated, and the syntax error recovery strategy is to skip until one is foudn that is in the recoverSet, at which point, the parse method returns.

Accumulation of Tokens in Recovery Sets.

LL(1) Grammars

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …