Aho et al. Ch2.1 - 2.2 (Grammars)

🌱 This document summarises the content covered in Chapter 1 (1.1 to 1.2) of Aho et al. - Compilers (2006)

0.0 - Sections

1.0 - Introduction to Syntax-Directed Translator

The analysis phase of a compiler breaks up a source program into constituent pieces and produces an internal representation of the source program called the intermediate code
- The analysis phase of compilation is centred around the syntax of the language to be compiled.
  - Syntax Describes the proper form of programs belonging to a programming language
  - Semantics defines what a program means, for a given programming language (i.e. what each program does when it is executed)
The synthesis phase translates the intermediate code into the target program
Syntax is specified using a notation called context-free grammars or BNF (Backus-Naur Form)
- A context-free grammar can also be used to help guide translation of programs
  - Section 2.3 - Syntax-Directed Translation
  - Section 2.4 - Parsing / Syntax Analysis
  - Section 2.5 + Model of Compiler Front-End

2.0 - Syntax Definition

Aho et al, Chapter 2.2

Context Free Grammar (alt. Grammar) Used to specify the syntax of a language
A grammar describes the hierarchical structure of most programming language constructs.
Suppose we have the following if-else statement in Java syntax:
```
if (expression) statement else statement
```
Suppose we use the variables ‘stmt’ to denote a statement and ‘expr’ to denote an expression.
Then we can create a structuring rule for an if-else statement as:
$\it{stmt}\rightarrow\bold{if}\text{ (expr) } \it{stmt}\ \bold{else}\ \it{ stmt}$
Note here that the arrow ( $\rightarrow$ ) is used to denote ‘can have the form’.
Such a rule is called a production, and is the basis of grammars.
- In a production, lexical elements like the keyword $\bold{if}$ and the parentheses are called terminals
- Variables like ‘expr’ and ‘stmt’ represent sequences of terminals and are called non-terminals.

1.1 - Definition of Grammars

Aho et al, Chapter 2.2.1

A context-free grammar has four key components:
1. A set of terminal symbols, sometimes referred to as “tokens” - The terminals are the elementary symbols of the language defined by the grammar.
2. A set of nonterminals, sometimes called “syntactic variables”. Each non-terminal represents a string of terminals.
3. A set of productions, where each production consists of a non-terminal (LHS) and an arrow, and a sequence of terminals and/or nonterminals (RHS)
4. A designation of one of the terminals as the start symbol (or if none specified, the LHS of the topmost production).
Grammars are specified by listing their productions, with productions for the start symbol listed first.
The following assumptions can be made about the formatting of grammar productions
- Digits, signs such as $\{<, <=\}$ and boldface strings such as $\bold{while}$ are terminal symbols.
- Italicized name is non-terminal
- Any non-italicized name or symbol is terminal.

🌱 We often want to create expressions consisting of digits, plus and minus signs, such as $9-5+2+3-1-7$ . Since a plus or a minus sign must appear between two digits, we refer to such expressions as “lists of digits separated by a plus or minus sign”.

The following grammar describes the syntax of these expressions, with the following productions:

\textit{list}\rightarrow\textit{list}\ +\ \textit{digit}\\ \textit{list}\rightarrow\textit{list}\ -\ \textit{digit}\\ \textit{list}\rightarrow\textit{digit}\\ \textit{digit}\rightarrow 0\ |\ 1\ |\ 2\ |\ 3\ |\ 4\ |\ 5\ |\ 6\ |\ 7\ |\ 8\ |\ 9

The bodies of the three production with non-terminal $\textit{list}$ as head can equivalently be grouped as:
$\textit{list}\rightarrow\textit{list}\ +\ \textit{digit}\ \ |\ \ \textit{digit} \ \ - \ \ \textit{list}\ \ | \ \ \textit{digit}$
According to our conventions, the terminals of the grammar are the symbols
$+\ \ -\ \ 0\ \ 1\ \ 2\ \ 3\ \ 4\ \ 5\ \ 6\ \ 7\ \ 8\ \ 9$
The non-terminals are the italicized names list and digit, with $\textit{list}$ being the start symbol, because its productions are given first.
We say a production is for a non-terminal if the non-terminal is the head of the production
A string of terminals is a sequence of zero or more terminals.
The string of zero terminals, written as $\epsilon$ , is called the empty string.

1.2 - Grammar Derivations

Aho et al, Chapter 2.2.2

A grammar derives strings by beginning with the start symbol, and replacing a non-terminal by the body of a production for that non-terminal
The terminal strings that can be derived from the start symbol form the language defined by the grammar.

1.2.1 - List Production Example 1 - Productions & Lists

Consider the following productions from above

\textit{list}\rightarrow\textit{list}\ +\ \textit{digit}\ \ |\ \ \textit{digit} \ \ - \ \ \textit{list}\ \ | \ \ \textit{digit}\\ \textit{digit}\rightarrow 0\ |\ 1\ |\ 2\ |\ 3\ |\ 4\ |\ 5\ |\ 6\ |\ 7\ |\ 8\ |\ 9

From the last production, we know that a single digit itself is a list
Then from the first production in this grammar, this expands to any list followed by a plus or minus sign and then a new digit makes up a new list.
From this we can deduce:
1. $9$ is a $\textit{list}$ by the second production since $9$ is a $\textit{digit}$
2. $9-5$ is a $\textit{list}$ by the production $\textit{list}\rightarrow\textit{list}-\textit{digit}$ since $9$ is a $\textit{list}$ and 5 is a $\textit{digit}$
3. $9-5+2$ is a $\textit{list}$ by the production $\textit{list}\rightarrow\textit{list}+\textit{digit}$ since $9-5$ is a $\textit{list}$ and 5 is a $\textit{digit}$

1.2.2 - List Production Example 2 - List of Parameters

Another type of list is the list of parameters in a function call.
In Java, the parameters are enclosed within parentheses, as in the call max(x,y) of function max with parameters x and y
One nuance of such a list is that an empty list of parameters may be found between the terminals ( and ).
We may start to develop a grammar for such sequences within the productions:
$\textit{call}\rightarrow\bold{id}(\textit{optparams})\\ \textit{optparams}\rightarrow\textit{params}\ \ |\ \ \epsilon\\ \textit{params}\rightarrow\textit{params}\ \ ,\ \ \textit{param}\ \ |\ \ \textit{param}$
- Note that the second possible body for $\textit{optparams}$ $optparams$ (the optional parameter list) is $\epsilon$ $ϵ$ , which stands for the empty string of symbols.
  - That is, optparams can be replaced by the empty string, so a $\textit{call}$ can consist of a function followed by the two-terminal string ()
  - Notice that the productions for $\textit{params}$ are analogous to those for $\textit{list}$ in the previous example with comma in place of the arithmetic operators $\{+,-\}$ and $\textit{param}$ in place of $\textit{digit}$
  - Note that the productions for $\textit{param}$ haven’t been shown as they’re essentially arbitrary parameters.

1.3 - Parse Trees

Parsing is the problem of taking a string of terminals and figuring out how to derive it from the start symbol of the grammar (and if it cannot be derived from the start symbol ofthe grammar, then reporting syntax errors within the strings
Parsing is one of the most fundamental problems in all of compiling
Will start by considering simple source programs like $9-5+2$ in which each character is a terminal;
- In general, a source program has multicharacter lexemes that are grouped by the lexical analyser into tokens, whose first components are the terminals processed by the parser
A parse tree pictorially shows how the start symbol of a grammar derives a string in a language.
If a non-terminal has a production $A\rightarrow XYZ$ then a parse tree may have an interior node labelled A with three children labelled $X, Y$ and $Z$ from left to right.

Parse tree for

Parse tree for $A\rightarrow XYZ$

Formally, given a context-free grammar, a parse tree according to the grammar is a tree with the following properties:
1. The root is labelled by the start symbol
2. Each leaf is labelled by a terminal or by $\epsilon$
3. Each interior node is labelled by a non-terminal.
4. If $A$ $A$ is the non-terminal labelling some interior node and $X_1, X_2, \cdots, X_n$ $X_{1}, X_{2}, \dots, X_{n}$ are the labels of the children of that node from left to right, then there must be a production $A\rightarrow X_1X_2\cdots X_n$ $A \to X_{1} X_{2} \dots X_{n}$ .
  - Here, $X_1, X_2,\cdots,X_n$ each stand for a symbol that is either a terminal or non-terminal
  - As a special case if $A\rightarrow\epsilon$ is a production, then a node labelled $A$ may have a single child labelled $\epsilon$

1.4 - Ambiguity

Have to be careful in talking about the structure of a string according to a grammar
A grammar can have more than one parse tree generating a given string of terminals.
- Such a grammar is said to be ambiguous
- To show that a grammar is ambiguous, need to find a terminal string that is the yield of more than one parse tree
Since a string with more than one parse tree usually has more than one meaning, we need to design unambiguous for compiling applications, or to use ambiguous grammars with additional rules to resolve ambiguities

🌱 Suppose that in the example above, we used a single non-terminal, $\textit{string}$ and did not distinguish between digits and lists. Then, we could have written the grammar:

\textit{string}\rightarrow\textit{string}+\textit{string}\ \ |\ \ \textit{string}-\textit{string}\ \ |\ 0\ |\ 1\ |\ 2\ |\ 3\ |\ 4\ |\ 5\ |\ 6\ |\ 7\ |\ 8\ |\ 9

Merging the notion of a $\textit{digit}$ and $\textit{list}$ into the non-terminal $\textit{string}$ makes superficial sense, because a single $\textit{digit}$ is a special case of $\textit{list}$
However, using this grammar an expression like $9-5+2$ has more than one parse tree
![Untitled](/images/notes/COMP4403/Untitled 1.png)
- We can either interpret the statement as (9-5)+2 or 9-(5+2) which have different results:
  - $(9-5)+2=6$
  - $9-(5+2)=2$

1.5 - Associativity

By convention, the statement $9+5+2$ is equivalent to $(9+5)+2$ and $9-5-2$ is equivalent to $(9-5)-2$ .
When an operand has operators on its left and right, conventions are needed for deciding which operator applies to that operand.
We say that the operator + associates to the left, because an operation with a plus sign on both sides belongs to the operator to its left.
In most programming languages the four arithmetic operations $\{+,-,\times,\div\}$ are left-associative.
- However, some common operators such as exponentiation are right-associative.
- Additionally, the assignment operator $=$ $=$ in C and its descendants are right-associative.
  - That is, the expression $a=b=c$ is treated in the same way as $a=(b=c)$
Strings like $a=b=c$ with a right-associative operator are generated by the following grammar:
$right\rightarrow letter=right\ \ |\ \ letter\\ letter\rightarrow a\ |\ b\ |\ \cdots\ |\ z$
The contrast between a parse tree for a left-associative operator like $-$ and a parse rtee for a right-associative operator like $=$ is shown in the figure below (left - left associative, right - right associative)
![Figure: Parse Tree for Left-Associative Operators (LHS) and Right-Associative Operators (RHS)](/images/notes/COMP4403/Untitled 2.png)
Figure: Parse Tree for Left-Associative Operators (LHS) and Right-Associative Operators (RHS)

1.6 - Precedence of Operators

Consider the expression $9+5*2$ . There are two possible interpretations of this expression - $(9+5)*2$ or $9+(5*2)$ .
The associativity rules for $+$ and $*$ apply to occurrences of the same operator, so they do not resolve this ambiguity.
Rules defining the relative precedence of operators are needed when more than one kind of operator is present
We say that $*$ has a higher precedence than $+$ if $*$ takes its operands before $+$ does.
In ordinary arithmetic, multiplication and division have higher precedence than addition and subtraction.
Therefore, $5$ is taken by $*$ in both sentences $9+5*2$ and $9*5+2$ .
![Untitled](/images/notes/COMP4403/Untitled 3.png)