Aho et al. Ch2.1 - 2.2 (Grammars)

CourseCompilers & Interpreters
SemesterS1 2022

Aho et al. Ch2.1 - 2.2 (Grammars)

🌱 This document summarises the content covered in Chapter 1 (1.1 to 1.2) of Aho et al. - Compilers (2006)

0.0 - Sections

  1. Introduction to Syntax-Directed Translator
  2. Syntax Definition

1.0 - Introduction to Syntax-Directed Translator

  • The analysis phase of a compiler breaks up a source program into constituent pieces and produces an internal representation of the source program called the intermediate code

    • The analysis phase of compilation is centred around the syntax of the language to be compiled.
      • Syntax Describes the proper form of programs belonging to a programming language
      • Semantics defines what a program means, for a given programming language (i.e. what each program does when it is executed)
  • The synthesis phase translates the intermediate code into the target program

  • Syntax is specified using a notation called context-free grammars or BNF (Backus-Naur Form)

    • A context-free grammar can also be used to help guide translation of programs
      • Section 2.3 - Syntax-Directed Translation
      • Section 2.4 - Parsing / Syntax Analysis
      • Section 2.5 + Model of Compiler Front-End

2.0 - Syntax Definition

Aho et al, Chapter 2.2

  • Context Free Grammar (alt. Grammar) Used to specify the syntax of a language

  • A grammar describes the hierarchical structure of most programming language constructs.

  • Suppose we have the following if-else statement in Java syntax:

    if (expression) statement else statement
    
  • Suppose we use the variables β€˜stmt’ to denote a statement and β€˜expr’ to denote an expression.

  • Then we can create a structuring rule for an if-else statement as:

    stmtβ†’if (expr) stmt else stmt\it{stmt}\rightarrow\bold{if}\text{ (expr) } \it{stmt}\ \bold{else}\ \it{ stmt}

  • Note here that the arrow (β†’\rightarrow) is used to denote β€˜can have the form’.

  • Such a rule is called a production, and is the basis of grammars.

    • In a production, lexical elements like the keyword if\bold{if} and the parentheses are called terminals
    • Variables like β€˜expr’ and β€˜stmt’ represent sequences of terminals and are called non-terminals.

1.1 - Definition of Grammars

Aho et al, Chapter 2.2.1

  • A context-free grammar has four key components:
    1. A set of terminal symbols, sometimes referred to as β€œtokens” - The terminals are the elementary symbols of the language defined by the grammar.
    2. A set of nonterminals, sometimes called β€œsyntactic variables”. Each non-terminal represents a string of terminals.
    3. A set of productions, where each production consists of a non-terminal (LHS) and an arrow, and a sequence of terminals and/or nonterminals (RHS)
    4. A designation of one of the terminals as the start symbol (or if none specified, the LHS of the topmost production).
  • Grammars are specified by listing their productions, with productions for the start symbol listed first.
  • The following assumptions can be made about the formatting of grammar productions
    • Digits, signs such as {<,<=}\{<, <=\} and boldface strings such as while\bold{while} are terminal symbols.
    • Italicized name is non-terminal
    • Any non-italicized name or symbol is terminal.

🌱 We often want to create expressions consisting of digits, plus and minus signs, such as 9βˆ’5+2+3βˆ’1βˆ’79-5+2+3-1-7. Since a plus or a minus sign must appear between two digits, we refer to such expressions as β€œlists of digits separated by a plus or minus sign”.

  • The following grammar describes the syntax of these expressions, with the following productions:

listβ†’list + digitlistβ†’list βˆ’ digitlistβ†’digitdigitβ†’0 βˆ£ 1 βˆ£ 2 βˆ£ 3 βˆ£ 4 βˆ£ 5 βˆ£ 6 βˆ£ 7 βˆ£ 8 βˆ£ 9\textit{list}\rightarrow\textit{list}\ +\ \textit{digit}\\ \textit{list}\rightarrow\textit{list}\ -\ \textit{digit}\\ \textit{list}\rightarrow\textit{digit}\\ \textit{digit}\rightarrow 0\ |\ 1\ |\ 2\ |\ 3\ |\ 4\ |\ 5\ |\ 6\ |\ 7\ |\ 8\ |\ 9

  • The bodies of the three production with non-terminal list\textit{list} as head can equivalently be grouped as:

    listβ†’list + digit  βˆ£  digit  βˆ’  list  βˆ£  digit\textit{list}\rightarrow\textit{list}\ +\ \textit{digit}\ \ |\ \ \textit{digit} \ \ - \ \ \textit{list}\ \ | \ \ \textit{digit}

  • According to our conventions, the terminals of the grammar are the symbols

    +  βˆ’  0  1  2  3  4  5  6  7  8  9+\ \ -\ \ 0\ \ 1\ \ 2\ \ 3\ \ 4\ \ 5\ \ 6\ \ 7\ \ 8\ \ 9

  • The non-terminals are the italicized names list and digit, with list\textit{list} being the start symbol, because its productions are given first.

  • We say a production is for a non-terminal if the non-terminal is the head of the production

  • A string of terminals is a sequence of zero or more terminals.

  • The string of zero terminals, written as Ο΅\epsilon, is called the empty string.

1.2 - Grammar Derivations

Aho et al, Chapter 2.2.2

  • A grammar derives strings by beginning with the start symbol, and replacing a non-terminal by the body of a production for that non-terminal
  • The terminal strings that can be derived from the start symbol form the language defined by the grammar.

1.2.1 - List Production Example 1 - Productions & Lists

  • Consider the following productions from above

listβ†’list + digit  βˆ£  digit  βˆ’  list  βˆ£  digitdigitβ†’0 βˆ£ 1 βˆ£ 2 βˆ£ 3 βˆ£ 4 βˆ£ 5 βˆ£ 6 βˆ£ 7 βˆ£ 8 βˆ£ 9\textit{list}\rightarrow\textit{list}\ +\ \textit{digit}\ \ |\ \ \textit{digit} \ \ - \ \ \textit{list}\ \ | \ \ \textit{digit}\\ \textit{digit}\rightarrow 0\ |\ 1\ |\ 2\ |\ 3\ |\ 4\ |\ 5\ |\ 6\ |\ 7\ |\ 8\ |\ 9

  • From the last production, we know that a single digit itself is a list

  • Then from the first production in this grammar, this expands to any list followed by a plus or minus sign and then a new digit makes up a new list.

  • From this we can deduce:

    1. 99 is a list\textit{list} by the second production since 99 is a digit\textit{digit}
    2. 9βˆ’59-5 is a list\textit{list} by the production listβ†’listβˆ’digit\textit{list}\rightarrow\textit{list}-\textit{digit} since 99 is a list\textit{list} and 5 is a digit\textit{digit}
    3. 9βˆ’5+29-5+2 is a list\textit{list} by the production listβ†’list+digit\textit{list}\rightarrow\textit{list}+\textit{digit} since 9βˆ’59-5 is a list\textit{list} and 5 is a digit\textit{digit}

1.2.2 - List Production Example 2 - List of Parameters

  • Another type of list is the list of parameters in a function call.

  • In Java, the parameters are enclosed within parentheses, as in the call max(x,y) of function max with parameters x and y

  • One nuance of such a list is that an empty list of parameters may be found between the terminals ( and ).

  • We may start to develop a grammar for such sequences within the productions:

    callβ†’id(optparams)optparamsβ†’params  βˆ£  Ο΅paramsβ†’params  ,  param  βˆ£  param\textit{call}\rightarrow\bold{id}(\textit{optparams})\\ \textit{optparams}\rightarrow\textit{params}\ \ |\ \ \epsilon\\ \textit{params}\rightarrow\textit{params}\ \ ,\ \ \textit{param}\ \ |\ \ \textit{param}

    • Note that the second possible body for optparams\textit{optparams} (the optional parameter list) is Ο΅\epsilon, which stands for the empty string of symbols.
      • That is, optparams can be replaced by the empty string, so a call\textit{call} can consist of a function followed by the two-terminal string ()
      • Notice that the productions for params\textit{params} are analogous to those for list\textit{list} in the previous example with comma in place of the arithmetic operators {+,βˆ’}\{+,-\} and param\textit{param} in place of digit\textit{digit}
      • Note that the productions for param\textit{param} haven’t been shown as they’re essentially arbitrary parameters.

1.3 - Parse Trees

  • Parsing is the problem of taking a string of terminals and figuring out how to derive it from the start symbol of the grammar (and if it cannot be derived from the start symbol ofthe grammar, then reporting syntax errors within the strings

  • Parsing is one of the most fundamental problems in all of compiling

  • Will start by considering simple source programs like 9βˆ’5+29-5+2 in which each character is a terminal;

    • In general, a source program has multicharacter lexemes that are grouped by the lexical analyser into tokens, whose first components are the terminals processed by the parser
  • A parse tree pictorially shows how the start symbol of a grammar derives a string in a language.

  • If a non-terminal has a production Aβ†’XYZA\rightarrow XYZ then a parse tree may have an interior node labelled A with three children labelled X,YX, Y and ZZ from left to right.

Parse tree for

Parse tree for A→XYZA\rightarrow XYZ

  • Formally, given a context-free grammar, a parse tree according to the grammar is a tree with the following properties:
    1. The root is labelled by the start symbol
    2. Each leaf is labelled by a terminal or by Ο΅\epsilon
    3. Each interior node is labelled by a non-terminal.
    4. If AA is the non-terminal labelling some interior node and X1,X2,⋯ ,XnX_1, X_2, \cdots, X_n are the labels of the children of that node from left to right, then there must be a production Aβ†’X1X2β‹―XnA\rightarrow X_1X_2\cdots X_n.
      • Here, X1,X2,⋯ ,XnX_1, X_2,\cdots,X_n each stand for a symbol that is either a terminal or non-terminal
      • As a special case if Aβ†’Ο΅A\rightarrow\epsilon is a production, then a node labelled AA may have a single child labelled Ο΅\epsilon

1.4 - Ambiguity

  • Have to be careful in talking about the structure of a string according to a grammar
  • A grammar can have more than one parse tree generating a given string of terminals.
    • Such a grammar is said to be ambiguous
    • To show that a grammar is ambiguous, need to find a terminal string that is the yield of more than one parse tree
  • Since a string with more than one parse tree usually has more than one meaning, we need to design unambiguous for compiling applications, or to use ambiguous grammars with additional rules to resolve ambiguities

🌱 Suppose that in the example above, we used a single non-terminal, string\textit{string} and did not distinguish between digits and lists. Then, we could have written the grammar:

stringβ†’string+string  βˆ£  stringβˆ’string  βˆ£ 0 βˆ£ 1 βˆ£ 2 βˆ£ 3 βˆ£ 4 βˆ£ 5 βˆ£ 6 βˆ£ 7 βˆ£ 8 βˆ£ 9\textit{string}\rightarrow\textit{string}+\textit{string}\ \ |\ \ \textit{string}-\textit{string}\ \ |\ 0\ |\ 1\ |\ 2\ |\ 3\ |\ 4\ |\ 5\ |\ 6\ |\ 7\ |\ 8\ |\ 9

  • Merging the notion of a digit\textit{digit} and list\textit{list} into the non-terminal string\textit{string} makes superficial sense, because a single digit\textit{digit} is a special case of list\textit{list}

  • However, using this grammar an expression like 9βˆ’5+29-5+2 has more than one parse tree

    ![Untitled](/images/notes/COMP4403/Untitled 1.png)

    • We can either interpret the statement as (9-5)+2 or 9-(5+2) which have different results:
      • (9βˆ’5)+2=6(9-5)+2=6
      • 9βˆ’(5+2)=29-(5+2)=2

1.5 - Associativity

  • By convention, the statement 9+5+29+5+2 is equivalent to (9+5)+2(9+5)+2 and 9βˆ’5βˆ’29-5-2 is equivalent to (9βˆ’5)βˆ’2(9-5)-2.

  • When an operand has operators on its left and right, conventions are needed for deciding which operator applies to that operand.

  • We say that the operator + associates to the left, because an operation with a plus sign on both sides belongs to the operator to its left.

  • In most programming languages the four arithmetic operations {+,βˆ’,Γ—,Γ·}\{+,-,\times,\div\} are left-associative.

    • However, some common operators such as exponentiation are right-associative.
    • Additionally, the assignment operator == in C and its descendants are right-associative.
      • That is, the expression a=b=ca=b=c is treated in the same way as a=(b=c)a=(b=c)
  • Strings like a=b=ca=b=c with a right-associative operator are generated by the following grammar:

    rightβ†’letter=right  βˆ£  letterletterβ†’a βˆ£ b βˆ£ β‹― βˆ£ zright\rightarrow letter=right\ \ |\ \ letter\\ letter\rightarrow a\ |\ b\ |\ \cdots\ |\ z

  • The contrast between a parse tree for a left-associative operator like βˆ’- and a parse rtee for a right-associative operator like == is shown in the figure below (left - left associative, right - right associative)

    ![Figure: Parse Tree for Left-Associative Operators (LHS) and Right-Associative Operators (RHS)](/images/notes/COMP4403/Untitled 2.png)

    Figure: Parse Tree for Left-Associative Operators (LHS) and Right-Associative Operators (RHS)

1.6 - Precedence of Operators

  • Consider the expression 9+5βˆ—29+5*2. There are two possible interpretations of this expression - (9+5)βˆ—2(9+5)*2 or 9+(5βˆ—2)9+(5*2).

  • The associativity rules for ++ and βˆ—* apply to occurrences of the same operator, so they do not resolve this ambiguity.

  • Rules defining the relative precedence of operators are needed when more than one kind of operator is present

  • We say that βˆ—* has a higher precedence than ++ if βˆ—* takes its operands before ++ does.

  • In ordinary arithmetic, multiplication and division have higher precedence than addition and subtraction.

  • Therefore, 55 is taken by βˆ—* in both sentences 9+5βˆ—29+5*2 and 9βˆ—5+29*5+2.

    ![Untitled](/images/notes/COMP4403/Untitled 3.png)