Recursive Descent Parser
Recursive Descent Parser
These notes are based on the Recursive Descent Parser notes by Ian Hayes.
Extended Backus Naur Form
- Backus-Naur Form allows a context-free grammar to be described by a set of productions, each of the form , where the:
- left side consists of a single non-terminal symbol , and
- right side consists of a possibly empty sequence of terminal and non-terminal symbols, .
- For example, a production in the syntax for an expression follows:
- To allow the grammar to be represented more succinctly, BNF allows the right side of a production to consist of a set of alternatives, separated by ‘|’, as follows:
Note that this doesn’t add any expressive power, because it could have been written as a set of productions for N, one for each right side alternative.
Extended Backus-Naur Form (EBNF) is a notation invented to allow grammars to be expressed even more succinctly.
It extends BNF to allow three additional constructs:
- optional syntactic construct, contained within square brackets,
- repetition of constructs, contained within curly braces, and
- grouping of constructs, within parentheses
It is important to note that a grammar written in EBNF can be converted to BNF
- Therefore, EBNF doesn’t add any expressive power, as an EBNF grammar can be converted to an equialent BNF grammar.
Converting EBNF Constructs to BNF
Converting the Optional Construct to BNF
An optional construct occurring in the right side of a grammar rule (production) is replaced by a new non-terminal, , which is defined as follows:
Where is the empty string. For example, consider the rule :
We can re-write this as:
Note that in this specific case, we could shorten this to:
Converting the Repetition Construct to BNF
A repetition construct occurring in the ride side of a grammar rule is replaced by a new non-terminal symbol , which is defined as follows.
The grouping parentheses around the are needed int he case that consists of a set of alternatives, in which case, leaving out the parentheses gives:
This is incorrect, because the concatenation operation has higher precedence than alternation. The correct version is as follows:
For example, the rule for is given as:
We could re-write this as:
In this case, the outer parentheses introduced by the rule are redundant, and can be omitted.
Converting the Grouping Construct to BNF A grouping construct occurring in the right side of a grammar rule is replaced by a new non-terminal , defined as follows:
For example, the rule for is defined above as:
We can re-write this as:
Recursive Descent Parsing
Provided that the EBNF grammar for a language is in a restricted form, it can be used to derive a recursive-descent parser. (If a grammar is an LL(1) grammar, it can be parsed using recursive-descent parsing).
A recursive descent parser is a recursive program to recognise sentences in a language. It consists of a set of methods - one for each non-terminal symbol. The input to the recursive descent parser is assumed to be a stream of lexical (terminal) symbols, and the parse begins with the current token beign the first token in the input stream. The token stream in the parser is represented by the variable tokens, of type TokenStream.
For each non-terminal symbol, , there is an associated method parseN that recognises the longest stream of lexical tokens in the input stream, starting from the current token that can be derived from the non-terminal . As it parses the input, it moves the current location forward, and when it has finished parsing , the current token is the token immediately following the last token matched as part of .
Example Recursive Descent Parser
Consider the following grammar:
We create the following methods for its recursive-descent parser.
Concepts
Matching a Terminal Symbol
To recognise a terminal symbol , the parser calls the method tokens.match on the input token stream, with the terminal symbol Token.T as the parameter:
tokens.match(Token.T);
If the current input token is , the match is successful, and the location in the input stream is moved to the next token. If the current token is not , a syntax error message is generated.
Recognising as Sequence
A sequence, , of EBNF factors is recognised by code that recognises each in sequence.
recog(S_1);
recog(S_2);
recog(S_3);
...
recog(S_n);
If the sequence of EBNF factors to be matched is empty, then no code is required to match it. For example, one alternative for a in the expression grammar defined above, is the sequence LPAREN RelCondition RPAREN which can be recognised by the following code sequence
tokens.match(Token.LPAREN);
parseRelCondition();
token.match(Token.RPAREN);
Recognising Alternatives
To recognise a set of EBNF alternatives, , we need to first determine which alternative to use. Our (predictive) recursive-descent parser predicts which alternative to recognised based on just the current token in the input stream. For a particlar alternative, to be chosen, either:
- The current token is in the set of terminal symbols, that can start or,
- is nullable, and the current token can validly follow in the context of the complete set of alternatives [1]
For this approach to work, we must restrict the language grammar, such that:
- Any given token is in the first set of at most one laternative, i.e. the sets of first symbols of each pair of alternatives are disjoint
- At most one alternative is nullable
- If there is a nullable alternative, the set of symbols that may immediately follow in the contexdt of the complete set of alternatives is disjoint from the first sets of all alternatives.
For more information about what restrictions we must place on a grammar for us to be able to use a recursive descent parser on it, look at the section on LL(1) Grammars [1] One of the requirements for an LL(1) grammar is that only a single non-terminal symbol may be nullable.
If none of the alternatives are nullable, the code to parse the set of alternatives is as follows, in which the isIn(S) method checks if the current token is in the set of tokens S.
- If one of the alternatives are nullable, the code for the recogniser is the same as the above, except we leave of the last “else” clause, which gives an error message.
if (tokens.isIn(First(S_1))) {
recog(S_1);
} else if (tokens.isIn(First(S_2))) {
recog(S_2);
} else ...
...
} else if (tokens.isIn(First(S_n))) {
recog(S_n);
} else {
errors.error("Syntax error");
}
- Note that if a first set
First(S_i)only contains a single token , we can replace the calltokens.isIn(First(S_i))withtokens.isMatch(T_i).
Recognising an Optional
An EBNF optional construct is of the form , where is an EBNF expression.It can either match or the empty sequence, and hence it is similar to matching alternatives. The code to match is as follows:
if (tokens.isIn(First(S))) {
recog(S);
}
To be unambiguous, First(S) should be disjoint from the set of tokens that can immediately follow the optional construct. For example, the production for conatins an optiona. The code for the parser for a is given as:
void parseRelCondition() {
parseExp();
if (tokens.isIn(REL_OPS_SET)) {
parseRelOp(); // This cannot fail, as we know that the current token can start a RelOp
parseExp();
}
}
Recognising a Repetition
An EBNF repetition construct is of the form , where is an EBNF expression. It can recognise a sequence of zero or more occurrences of . The code to recognise consists of a loop.
while(tokens.isIn(First(S))) {
recog(S);
}
To be unambiguous, the first set of S should be disjoint from the set of tokens that can follow the repetition construct.
Recognising Grouping
In EBNF, parentheses are used to represent grouping, and thus override the default precedence of EBNF, such as . In generating the recognition code for such EBNF expressions, we must respect the grouping. Normally, we do not have to do anything extra in the code, because the generated code already has Java grouping curly braces. The parsing method in Term contains an example of recognising a group.
Removing Redundant Code
Using the rules defined above, the code for recognising at the beginning of should be:
if (tokens.isIn(EXP_OPS_SET)) {
if (tokens.isMatch(Token.PLUS))) {
tokens.match(Token.PLUS);
} else if (tokens.isMatch(Token.MINUS)) {
tokens.match(Token.MINUS);
} else {
// Unreachable due to condition in outer if.
errors.error("Syntax error");
}
}
However, we can optimise this by removing the outer “if” test, and the final “else” branch to give the following code:
if (tokens.isMatch(Token.PLUS))) {
tokens.match(Token.PLUS);
} else if (tokens.isMatch(Token.MINUS)) {
tokens.match(Token.MINUS);
}
Syntax Error Recovery
Focus of this type of error recovery is to handle simple, common errors. Syntax error recovery during recursive descent parsing can be accomplished by
- Local error recovery on matching a single token; and
- Synchronising the input stream at the start of each parse method.
Local Error Recovery
On matching a single token, we can handle:
- A (single) token missing from the input stream
- An additiona token erroneously inserted into the input stream, or
- A single erroneous token replacing the expected token in the input stream.
To handle local error recovery while matching a token, , the match method requires a second parameter, followSet that is the set of tokens that can immediately follow in the context in which is being matched. The followSet is only used iwhen the current token doesn’t match T, in which case an error message is generated and followSet is used for syntax error recovery as follows:
- If the current token is in
followSet, the syntax error recovery is to assume that was missing from the input, and hence no further action is taken. - If the current token is not in
followSet, the current input token is skipped (and thus becoming the previous token), and then:- If the current token is now the recovery is to assume that the previous token was erroneously inserted into the input stream, and the recovery action is to match the (new) current token.
- If the current token is not , the recovery is to assume that the previous token in the input stream was supposed to be but was replaced by an erroneous token; no further error recovery action is required.
Local Error Recovery for Recogniser - While Statement
The following is an example of parsing a while statement with local error recovery (this doesn’t include input stream synchronisation!)
void parseWhileStatement(TokenSet recoverSet) {
tokens.match(Token.KW_WHILE, CONDITION_START_SET);
parseCondition(recoverSet.union(Token.KW_DO));
tokens.match(Token.KW_DO, STATEMENT_START_SET);
parseStatement(recoverSet);
}
Synchronising the Input Stream
At the start of a parsing method for a non-terminal symbol , the current token is valid if:
- It can start , i.e. it is in , or
- N is nullable, and the current token may follow N in the context in which N is being recognised.
Before trying to match a non-terminal symbol , we synchronise the input to ensure that the current token is valid using the method parse whose header is first. The possible headers for the parse method are as follows (here, we’re specifically referring to the first header).
// Normal case, as described in the section below.
void parse(String rule, TokenSet expected, TokenSet recoverSet, ParseVoid parser);
// Used when `expected` is a single token.
void parse(String rule, Token expected, TokenSet recoverSet, ParseVoid parser);
// Used when the parser is required to return a value of type Node
Node parse(String rule, TokenSet expected, TokenSet recoverSet, ParseVoid parser);
// Used when the parser is required to return a value of type Node, and `expected` is a single token.
Node parse(String rule, Token expected, TokenSet recoverSet, ParseVoid parser);
The parameters to the parse method is:
String ruleString containing the name of the parsing rule, used in error (and debugging) messages.Token(Set) expectedSet of tokens (or singular token) that are valid at the start of parsing N- if is not nullable,
- if is nullable, - this can be approximated by the
recoverSetin most cases.
TokenSet recoverSetthe set of tokens that are expected to follow N, in the context in which N is being parsed.- Since
recoverSetdepends on the context in which is to be recognised, it is passed as a parameter to the parse method for each non-terminal - Note that
- Since
ParseVoid parserA lambda / anonymous function with no parameter body, and a (code) body consisting of the code to recognise the non-terminal symbol.
We modify the existing parse methods (without synchronisation error recovery) to the following form:
/**
* Generic form of a parse method with synchronisation error recovery.
*/
void parseN(TokenSet recoverSet) {
parse("N", N_START_SET, recoverSet,
() -> { // Guaranteed that the current token is in N_START_SET
recog(N);
});
}
If on a call to parse, the current token is in the set expected, there is no syntax error and the synchronisation method parse just calls the parser passed as its fourth parameter parser and attempts to recognise .
If on a call to parse, the curren token is not in the set expected, a syntax error message is generated, and the recovery action is to skip tokens until either:
- A token is found that is valid at the start of , (i.e. it is in
expected), in which case,parsewill proceed to recognise from that point by calling the lambda function passed in as the fourth parameter. - A token that is not in
expectedbut is inrecoverSetis found, in which caseparseruns thorugh as if it had recognised (without really doing so), and for the case where it needs to return a result, it returns an error node / value.
The Parse Method
The method parse implements this skipping / synchronisation strategy by skipping tokens until a token is found that is either in expected or recoverSet.
- If the token is in
expected, (i.e. the first case above, when synchronisation succeeds), the parse method calls the lambda function for and proceeds to (try to) parse N. - If the token is not in
expected, (and hence is in therecoverSet), parse exits as though it has already parsed , but returning an error message instead.
Synchronisation at the end of Parse Methods
At the end of the parse method for each non-terminal, < after has been recognised by the call on the lambda function parser, the parse method syncvhronises the input stream so that the current token is one that is expected to follow (i.e. it is in the recoverSet). If at the end the current token is in the recoverSet, the parse method does nothing. Otherwise, an error message is generated, and the syntax error recovery strategy is to skip until one is foudn that is in the recoverSet, at which point, the parse method returns.
Accumulation of Tokens in Recovery Sets.
LL(1) Grammars
… … … …
… … … …
… … … …
… … … …
… … … …
… … … …
… … … …
… … … …