Week 3.1

CourseCompilers & Interpreters
SemesterS1 2022

Week 3.1

🏷️ Replace this section with some keywords that can be used for searching, tagging and the indexing of this page.

Sections

🌱 Links to all Heading1 Blocks in this page.

  1. Section Number One

1.0 - Syntax Error Recovery

  • Programmers make errors - how do we recover from these errors?
  • There is no such thing as perfect error recovery
  • We focus on error recovery that handles simple common errors.

1.1 - Recovery Strategy for Recursive-Descent Parsing

  • Perform local error recovery on matching a single token
  • Synchronise the input stream at the start and end of each parse method (parse method → trying to parse a non-terminal symbol)

1.1.1 - Recovery Strategy for Matching a Single Token (Terminal Symbol)

  • What if the current token isn’t KW_DO?

    void parseWhileStatement() {
      tokens.match(Token.KW_WHILE);
      parseCondition();
      tokens.match(Token.KW_DO); // An error is encountered at this step.
      parseStatement();
    }
    
  • It might be missing:

    while i < 5               begin x := x + i * i; i := i + 1 end\text{while i < 5\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ begin x := x + i * i; i := i + 1 end}

    • In this case, we notice that the next token is begin\text{begin} is something that starts a statement (and a statement is the next thing to be parsed)
    • We can assume that the keyword do is missing, and continue parsing as normal.
  • A token may be erroneously inserted.

    while i < 5 end do begin x := x * i; i := i + 1 end\text{while i < 5 {\color{ff7f7f}end} do begin x := x * i; i := i + 1 end}

  • A single erroneous token replacing the expected token

    while i < 5 then begin x := x * i; i := i + 1 end\text{while i < 5 {\color{ff7f7f}then} begin x := x * i; i := i + 1 end}

  • When the next token is not what we expected it to be, we look at the next token along:

    • If after skipping a token, the next token is what we expected it to be, we know that a token was erroneously inserted
    • If after skipping a token, the next token isn’t what we expected it to be, we assume that it was a replacement for the token that we were looking for and try to parse the following tokens.
  • We can replace the original tokens.match method with one that has two arguments:

    • tokens.match(token_to_match, proceeding_set)
    • token_to_match the token that we’re trying to match to
    • proceeding set a set of all of the tokens that could start the following section / statement
    void parseWhileStatement() {
      tokens.match(Token.KW_WHILE);
      parseCondition();
      tokens.match(Token.KW_DO, STATEMENT_START_SET); 
      parseStatement();
    }
    
    • If the current token is KW_DO, match it;
    • Else if the current token is in STATEMENT_START_SET, assume KW_DO was missing and take no further action
    • Else skip the current token and:
      • if the current token is KW_DO, assume that the previous token was erroneously inserted and match the new token;
      • Else assume that the expected token was replaced by the previous token; take no further action

1.1.2 - Recovery Strategy for Parse Method (Non-Terminal Symbol)

  • We perform synchronisation at the start and end of the parse methods

    void parseN(TokenSet recoverSet) {
      // Skip tokens until the current token is in the expected set or recoverSet
      // if the new current token is not in expected
    	  // Return as though N had been recognised
    	  // Returning an error node / value if required
      // else, recognise the token N
    	  recog(N)
      // Skip tokens until the current token is in recoverSet
    }
    
    • expected is in the set of terminal symbols that are valid at the start of parsing N
    • recoverSet contains the terminal symbols that can follow N in the context in which N is being parsed
  • Essentially removing the garbage at the start and end.

1.1.3 - Implementation using Method Parse

🌱 We don’t want to hard-code the behaviour that we proposed before for every non-terminal symbol. Let’s create a pattern that does this more succinctly.

void parseN(TokenSet recoverSet) {
  parse("N", N_START_SET, recoverSet, () -> {
    recog(N)
  });
}
  • The parameters of the method parse are:

    • The name of the rule for use in error messages
    • The set of tokens expected at start of rule
    • Set of tokens to recover at on a syntax error
    • A function that parses the non-terminal.
  • The parse method performs the synchronisation at the start and end of the parseN method

  • So from this, we can modify our parseWhileStatement method from:

    void parseWhileStatement() {
      tokens.match(Token.KW_WHILE);
      parseCondition();
      tokens.match(Token.KW_DO, STATEMENT_START_SET); 
      parseStatement();
    }
    
  • To the following:

    void parseWhileStatement(TokenSet recoverSet) {
    // Name for debugging, rule start set, recover set, non-terminal parsing function
    parse("While Statement", Token.KW_WHILE, recoverSet, 
      () -> {
        // Assume that synchronisation has occurred, and 
        // The current token is KW_WHILE
    
        // Since synchronisation has occurred, the current token is KW_WHILE.
        // Cannot fail as a result of this, so don't need to use the recoverable 
        // method version of tokens.match.
        tokens.match(Token.KW_WHILE); 
        parseCondition(recoverSet.union(Token.KW_DO)); // [1]
        tokens.match(Token.KW_DO, STATEMENT_START_SET); // [2]
        parseStatement(recoverSet); // [3]
      });  
    }
    
    • [1] Note that our recoverSet is the union of the existing recoverSet (that is, the set of tokens on which the WhileStatement can recover on and KW_DO (as we are up to parsing the condition attached to the while).
      • Also note that recoverSet.union(...) doesn’t modify the original recoverSet variable, it returns a new instance
    • [2] The arguments to token.match(...) with recovery are:
      1. The token to try to match
      2. The set of tokens that can follow
    • [3] Since no token can follow (for the purpose of syntax recovery), we just use the non-recover version of token.match(...)

    1.1.4 - Recovery Strategy Example for RelCondition

    🌱 Suppose we wanted to parse a RelCondition, which is defined by the following production

    RelConditionExp[RelOp Exp]\text{RelCondition}\rightarrow\text{Exp[RelOp Exp]}

    void parseRelStatement(TokenSet recoverSet) {
      parse("RelCondition", REL_COND_START_SET, recoverSet, 
        // Anonymous parse method for parsing a RelCondition
        () -> {
            // The current token is in REL_COND_START_SET
            parseExp(recoverSet.union(REL_OPS_SET));  // [1]
            // Parse the optional [RelOp Exp] portion of the production [2]
            if (tokens.isIn(REL_OPS_SET)) {
                parseRelOp(recoverSet.union(EXP_START_SET); 
                parseExp(recoverSet);
            }
        });
    }
    
    • [1] The recoverSet for parsing the expression is the union of:
      • Anything that RelCondition could recover on (this is in recoverSet already)
      • Anything that an Expression could recover on
    • [2] When parsing the RelOp set, the recoverSet would be comprised of
      • The recoverSet for the RelCondition
      • Anything that could start an expression

1.2 - Parsing: The Big Picture

  • During the syntax analysis phase, the compiler needs to
    • Recognise the grammatical structure of a sequence of lexical tokens;
    • While performing error recovery where possible, and; recover as best as we can so we can report as much of the error to the user as possible
    • Build an Abstract Syntax Tree (AST) and symbol table

1.2.1 - Parsing and Building an AST

  • Consider parsing the following production with error recovery.

    FactorLPAREN Condition RPAREN | NUMBER | LValue\text{Factor}\rightarrow\text{LPAREN Condition RPAREN | NUMBER | LValue}

    • Key - Building AST | Error Recovery | Parsing Code
    ExpNode parseFactor(TokenSet recoverSet) {
        // Use exp.parse at it returns an expression node
        // Identical function to 
        return exp.parse("Factor", FACTOR_START_SET, recoverSet, 
            () -> {
                // Assume synchronisation has been performed, 
                // the current token is in FACTOR_START_SET
                ExpNode result = null;
    
    						// There are three symbols that could start a Factor production:
                // LPAREN, NUMBER or LVALUE. Parse each of them.
                if (token.isMatch(Token.LPAREN)) {
                    // Cannot fail so don't use recovery version
    								tokens.match(Token.LPAREN); 
                    result = parseCondition(recoverSet.union(Token.RPAREN));
                    // Note that we use the recoverset as the set of tokens that can
                    // follow the Token.RPAREN - we don't actually know what comes after
                    // but we can give it ALL of the possible tokens that could come
                    // after + some more (could reduce this set based on context).
                    tokens.match(Token.RPAREN, recoverSet); // Can fail
                } else if (tokens.isMatch(Token.NUMBER)) {
                    result = new ExpNode.ConstNode(tokens.getLocation(),
                        Predefined.INTEGER_TYPE, tokens.getIntValue());
                    tokens.match(Token.NUMBER); // Cannot fail, move pointer past token
                } else if (tokens.isMatch(Token.IDENTIFIER)) {
                    result = parseLValue(recoverSet);
                } else {
                    // Defensive programming in the case FACTOR_START_SET 
                    // has an extra unintended token.
                    fatal("parseFactor");
                }
            return result;
        });
    }
    
  • This code here integrates three key things:

    1. Recognising the tokens
    2. Building the abstract syntax tree
    3. Performing (simple) error recovery.

1.2.2 - Construction of Expression Parse Method

🌱 Notice that in the code above, we have used exp.parse. How do we construct this?

  • The code to construct the expression parser exp

    ParseMethod exp = 
        new ParseMethod<>(
            ...
            (Location loc) -> new ExpNode.ErrorNode(loc));
    
    • Notice here that the ParseMethod is configured to return an ExpNode.
    • The fourth parameter of this function (the anonymous function) must also return an ExpNode.