Aho et al. Ch2.1 - 2.2 (Grammars)
This document summarises the content covered in Chapter 2 (2.1 to 2.2) of Aho et al. - Compilers (2006)
0.0 - Sections
1.0 - Introduction to Syntax-Directed Translator
- The analysis phase of a compiler breaks up a source program into constituent pieces and produces an internal representation of the source program called the intermediate code.
- The analysis phase of compilation is centred around the syntax of the language to be compiled.
    - Syntax describes the proper form of programs belonging to a programming language.
    - Semantics defines what a program means, for a given programming language (i.e. what each program does when it is executed).
- The synthesis phase translates the intermediate code into the target program.
- Syntax is specified using a notation called context-free grammars, or BNF (Backus-Naur Form).
    - A context-free grammar can also be used to help guide the translation of programs.
- Section 2.3 - Syntax-Directed Translation
- Section 2.4 - Parsing / Syntax Analysis
- Section 2.5 - Model of Compiler Front-End
2.0 - Syntax Definition
Aho et al, Chapter 2.2
Context-Free Grammar (alt. Grammar)
- Used to specify the syntax of a language.
- A grammar describes the hierarchical structure of most programming language constructs.

Suppose we have the following if-else statement in Java syntax:

if (expression) statement else statement

Suppose we use the variables "stmt" to denote a statement and "expr" to denote an expression. Then we can create a structuring rule for an if-else statement as:

stmt → if ( expr ) stmt else stmt

Note here that the arrow (→) is used to denote "can have the form".

Such a rule is called a production, and is the basis of grammars.
- In a production, lexical elements like the keyword if and the parentheses are called terminals.
- Variables like "expr" and "stmt" represent sequences of terminals and are called non-terminals.
1.1 - Definition of Grammars
Aho et al, Chapter 2.2.1
- A context-free grammar has four key components:
    - A set of terminal symbols, sometimes referred to as "tokens". The terminals are the elementary symbols of the language defined by the grammar.
    - A set of non-terminals, sometimes called "syntactic variables". Each non-terminal represents a set of strings of terminals.
    - A set of productions, where each production consists of a non-terminal (the head, or LHS), an arrow, and a sequence of terminals and/or non-terminals (the body, or RHS).
    - A designation of one of the non-terminals as the start symbol (or, if none is specified, the LHS of the topmost production).
- Grammars are specified by listing their productions, with productions for the start symbol listed first.
- The following assumptions can be made about the formatting of grammar productions:
    - Digits, signs such as + and -, and boldface strings such as if are terminal symbols.
    - An italicized name is a non-terminal.
    - Any non-italicized name or symbol is a terminal.
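The four components above can be captured directly in a small data structure. The sketch below uses the list-of-digits grammar discussed in this chapter; the representation (a dict of sets and tuples) is our own choice for illustration, not the book's notation.

```python
# A minimal sketch of the four components of a context-free grammar,
# using the list-of-digits expression grammar as an example.
# The dict/tuple representation is our own, not the book's.

DIGITS = tuple("0123456789")

grammar = {
    # 1. Terminal symbols (tokens)
    "terminals": {"+", "-", *DIGITS},
    # 2. Non-terminals (syntactic variables)
    "nonterminals": {"list", "digit"},
    # 3. Productions: head -> list of alternative bodies
    "productions": {
        "list": [("list", "+", "digit"), ("list", "-", "digit"), ("digit",)],
        "digit": [(d,) for d in DIGITS],
    },
    # 4. Start symbol: the head of the first-listed production
    "start": "list",
}

# Sanity check: every symbol in every body is a known terminal or non-terminal
for head, bodies in grammar["productions"].items():
    assert head in grammar["nonterminals"]
    for body in bodies:
        for sym in body:
            assert sym in grammar["terminals"] | grammar["nonterminals"]
```

Listing the productions for the start symbol first mirrors the convention described above: the start symbol can be recovered as the head of the first production.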
We often want to create expressions consisting of digits and plus and minus signs, such as 9-5+2. Since a plus or a minus sign must appear between two digits, we refer to such expressions as "lists of digits separated by plus or minus signs".
- The following grammar describes the syntax of these expressions, with the following productions:

list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The bodies of the three productions with non-terminal list as head can equivalently be grouped as:

list → list + digit | list - digit | digit

According to our conventions, the terminals of the grammar are the symbols + - 0 1 2 3 4 5 6 7 8 9.

The non-terminals are the italicized names list and digit, with list being the start symbol, because its productions are given first.

We say a production is for a non-terminal if the non-terminal is the head of the production.

A string of terminals is a sequence of zero or more terminals. The string of zero terminals, written as ε, is called the empty string.
1.2 - Grammar Derivations
Aho et al, Chapter 2.2.2
- A grammar derives strings by beginning with the start symbol, and replacing a non-terminal by the body of a production for that non-terminal
- The terminal strings that can be derived from the start symbol form the language defined by the grammar.
1.2.1 - List Production Example 1 - Productions & Lists
- Consider the following productions from above:

list → list + digit
list → list - digit
list → digit

- From the last production, we know that a single digit by itself is a list.
- From the first two productions, any list followed by a plus or minus sign and then another digit makes up a new list.
- From this we can deduce that 9-5+2 is a list:
    - 9 is a list by the third production, since 9 is a digit.
    - 9-5 is a list by the second production, since 9 is a list and 5 is a digit.
    - 9-5+2 is a list by the first production, since 9-5 is a list and 2 is a digit.
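The deduction above can be mechanised. The sketch below (our own illustration, not the book's) checks membership in the language of this grammar by working right-to-left: every + or - must split a smaller list from a trailing digit.

```python
# A sketch of membership checking for the grammar
#   list -> list + digit | list - digit | digit
# Because each operator separates a smaller list from a trailing digit,
# we can check a string right-to-left without a general parser.

def is_digit(s: str) -> bool:
    return len(s) == 1 and s in "0123456789"

def is_list(s: str) -> bool:
    if is_digit(s):                      # list -> digit
        return True
    # list -> list + digit  |  list - digit:
    # the last two characters must be an operator and a digit
    return (len(s) >= 3
            and s[-2] in "+-"
            and is_digit(s[-1])
            and is_list(s[:-2]))

# Reproduces the deduction in the text:
assert is_list("9")        # 9 is a list, since 9 is a digit
assert is_list("9-5")      # 9-5 is a list, since 9 is a list and 5 is a digit
assert is_list("9-5+2")    # 9-5+2 is a list, since 9-5 is a list and 2 is a digit
assert not is_list("9-")   # a dangling operator is not derivable
```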
1.2.2 - List Production Example 2 - List of Parameters
Another type of list is the list of parameters in a function call.

In Java, the parameters are enclosed within parentheses, as in the call max(x,y) of function max with parameters x and y.

One nuance of such a list is that an empty list of parameters may be found between the terminals ( and ).

We may start to develop a grammar for such sequences with the productions:

call → id ( optparams )
optparams → params | ε
params → params , param | param

- Note that the second possible body for optparams (the optional parameter list) is ε, which stands for the empty string of symbols.
    - That is, optparams can be replaced by the empty string, so a call can consist of a function name followed by the two-terminal string ().
    - Notice that the productions for params are analogous to those for list in the previous example, with comma in place of the arithmetic operators and param in place of digit.
    - Note that the productions for param haven't been shown, as parameters can be essentially arbitrary expressions.
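The effect of the ε-production can be shown by writing out a derivation step by step. The sketch below is our own illustration: treating "max", "(" and ")" as terminals directly (rather than going through id) is a simplification.

```python
# Sketch: deriving the call  max()  step by step, showing how the
# ε-production  optparams -> ε  erases a non-terminal.
# Treating "max", "(" and ")" as terminals is our simplification.

def apply(form, index, body):
    """Replace the non-terminal at `index` with the symbols of `body`.
    An ε-body is the empty tuple, which simply deletes the symbol."""
    return form[:index] + list(body) + form[index + 1:]

form = ["call"]
form = apply(form, 0, ("max", "(", "optparams", ")"))  # call -> id ( optparams )
form = apply(form, 2, ())                               # optparams -> ε
assert form == ["max", "(", ")"]                        # the empty parameter list
```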
1.3 - Parse Trees
Parsing is the problem of taking a string of terminals and figuring out how to derive it from the start symbol of the grammar (and, if it cannot be derived from the start symbol of the grammar, reporting syntax errors within the string).

Parsing is one of the most fundamental problems in all of compiling.

We will start by considering simple source programs like 9-5+2 in which each character is a terminal.

- In general, a source program has multicharacter lexemes that are grouped by the lexical analyser into tokens, whose first components are the terminals processed by the parser.

A parse tree pictorially shows how the start symbol of a grammar derives a string in a language.

If a non-terminal A has a production A → X Y Z, then a parse tree may have an interior node labelled A with three children labelled X, Y and Z, from left to right.

Figure: Parse tree for the production A → X Y Z
- Formally, given a context-free grammar, a parse tree according to the grammar is a tree with the following properties:
    - The root is labelled by the start symbol.
    - Each leaf is labelled by a terminal or by ε.
    - Each interior node is labelled by a non-terminal.
    - If A is the non-terminal labelling some interior node and X1, X2, …, Xn are the labels of the children of that node from left to right, then there must be a production A → X1 X2 ⋯ Xn.
        - Here, each Xi stands for a symbol that is either a terminal or a non-terminal.
        - As a special case, if A → ε is a production, then a node labelled A may have a single child labelled ε.
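A parse tree can be encoded as nested tuples and its yield (the leaves read left to right) recovered mechanically. The sketch below builds the tree for 9-5+2 under the list/digit grammar; the tuple encoding (head, children...) is our own, not the book's.

```python
# Sketch: a parse tree for 9-5+2 as nested tuples (head, children...),
# with leaves as plain strings. Reading the leaves left to right yields
# the derived string, matching the properties listed above.

tree = (
    "list",
    ("list",
        ("list", ("digit", "9")),
        "-",
        ("digit", "5")),
    "+",
    ("digit", "2"),
)

def leaves(node):
    """Collect the terminal labels of the leaves, left to right."""
    if isinstance(node, str):          # a leaf: a terminal symbol
        return [node]
    head, *children = node             # interior node: non-terminal + children
    return [sym for child in children for sym in leaves(child)]

assert "".join(leaves(tree)) == "9-5+2"
```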
1.4 - Ambiguity
- Have to be careful in talking about the structure of a string according to a grammar
- A grammar can have more than one parse tree generating a given string of terminals.
- Such a grammar is said to be ambiguous
- To show that a grammar is ambiguous, need to find a terminal string that is the yield of more than one parse tree
- Since a string with more than one parse tree usually has more than one meaning, we need to design unambiguous grammars for compiling applications, or to use ambiguous grammars with additional rules to resolve the ambiguities.
Suppose that in the example above, we used a single non-terminal string, and did not distinguish between digits and lists. Then, we could have written the grammar:

string → string + string | string - string | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Merging the notion of a digit and a list into the non-terminal string makes superficial sense, because a single digit is a special case of a list.

However, using this grammar, an expression like 9-5+2 has more than one parse tree.

- We can either interpret the expression as (9-5)+2 or 9-(5+2), which have different results.
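The two readings can be made concrete by evaluating the two parse trees. The sketch below (our own encoding: leaves as ints, interior nodes as (op, left, right) tuples) shows that the same string yields different values under the two trees.

```python
# Sketch: the two parse trees for 9-5+2 under the ambiguous grammar,
# and the different values they evaluate to. The encoding is ours.

left_tree  = ("+", ("-", 9, 5), 2)   # reading the string as (9-5)+2
right_tree = ("-", 9, ("+", 5, 2))   # reading the string as 9-(5+2)

def evaluate(node):
    if isinstance(node, int):          # a leaf: a digit
        return node
    op, lhs, rhs = node
    l, r = evaluate(lhs), evaluate(rhs)
    return l + r if op == "+" else l - r

assert evaluate(left_tree) == 6
assert evaluate(right_tree) == 2     # a different meaning for the same string
```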
1.5 - Associativity
By convention, 9+5+2 is equivalent to (9+5)+2, and 9-5-2 is equivalent to (9-5)-2.

When an operand has operators on both its left and right, conventions are needed for deciding which operator applies to that operand.

We say that the operator + associates to the left, because an operand with plus signs on both sides of it belongs to the operator to its left.

In most programming languages the four arithmetic operators are left-associative.

- However, some common operators, such as exponentiation, are right-associative.
- Additionally, the assignment operator = in C and its descendants is right-associative.
    - That is, the expression a=b=c is treated in the same way as a=(b=c).
Strings like a=b=c with a right-associative operator are generated by the following grammar:

right → letter = right | letter
letter → a | b | … | z
The contrast between a parse tree for a left-associative operator like - and a parse tree for a right-associative operator like = is shown in the figure below (left: left-associative, right: right-associative).

Figure: Parse Tree for Left-Associative Operators (LHS) and Right-Associative Operators (RHS)
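The difference in grouping can be demonstrated numerically by folding the same operand list in the two directions. The sketch below uses subtraction (where grouping changes the result) and `functools.reduce`; it is our own illustration, not taken from the book.

```python
# Sketch: how associativity changes grouping for the same operator.
# Folding 9, 5, 2 under subtraction left-associatively gives (9-5)-2,
# while folding right-associatively gives 9-(5-2).

from functools import reduce

operands = [9, 5, 2]

# Left-associative fold: ((9 - 5) - 2)
left = reduce(lambda acc, b: acc - b, operands)

# Right-associative fold: (9 - (5 - 2)), built by folding from the right
right = reduce(lambda acc, b: b - acc, reversed(operands))

assert left == 2    # (9-5)-2
assert right == 6   # 9-(5-2)
```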
1.6 - Precedence of Operators
Consider the expression 9+5*2. There are two possible interpretations of this expression: (9+5)*2 or 9+(5*2).

The associativity rules for + and * apply to occurrences of the same operator, so they do not resolve this ambiguity.

Rules defining the relative precedence of operators are needed when more than one kind of operator is present.

We say that * has a higher precedence than + if * takes its operands before + does.

In ordinary arithmetic, multiplication and division have higher precedence than addition and subtraction.

Therefore, the operand 5 is taken by * in both 9+5*2 and 9*5+2; i.e., the expressions are equivalent to 9+(5*2) and (9*5)+2, respectively.
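Precedence can be built into a grammar by using one non-terminal per precedence level, as in the classic expr/term/factor grammar. The sketch below evaluates single-digit expressions with such a grammar; replacing left recursion with iteration is a standard reformulation, and the interpreter itself is our own illustration.

```python
# Sketch: an unambiguous grammar with precedence and left-associativity
# built in, evaluated by a tiny recursive-descent interpreter:
#   expr   -> expr + term | expr - term | term
#   term   -> term * factor | term / factor | factor
#   factor -> digit
# (Left recursion is rewritten as iteration, a standard reformulation.)

def evaluate(s: str) -> float:
    pos = 0

    def factor():
        nonlocal pos
        d = int(s[pos]); pos += 1                   # factor -> digit
        return d

    def term():                                     # handles * and / first
        nonlocal pos
        value = factor()
        while pos < len(s) and s[pos] in "*/":
            op = s[pos]; pos += 1
            value = value * factor() if op == "*" else value / factor()
        return value

    def expr():                                     # then + and -
        nonlocal pos
        value = term()
        while pos < len(s) and s[pos] in "+-":
            op = s[pos]; pos += 1
            value = value + term() if op == "+" else value - term()
        return value

    return expr()

assert evaluate("9+5*2") == 19   # * binds tighter: 9+(5*2)
assert evaluate("9*5+2") == 47   # (9*5)+2
assert evaluate("9-5+2") == 6    # same level, left-associative: (9-5)+2
```

Because expr only ever combines completed terms, every * or / is grouped before any + or -, which is exactly the precedence convention described above.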
