Week 2.1

🏷️ Replace this section with some keywords that can be used for searching, tagging and the indexing of this page.

Sections

🌱 Links to all Heading1 Blocks in this page.

Context Free Grammars
Chomsky’s Hierarchy of Grammars

1.0 - Context Free Grammars

1.1 - Recap of Context-Free Grammars

The following context-free grammar has start symbol E, non-terminals $\{E, Op\}$ and terminals $\{"(",")", number, "+", "-", "*"\}$
$\begin{aligned} E &\rightarrow E\ Op\ E\\ E &\rightarrow "("\ E\ ")"\\ E &\rightarrow \text{number}\\ Op &\rightarrow "+"\\ Op &\rightarrow "-"\\ Op &\rightarrow "*" \end{aligned}$
- Our grammar is defined using a set of symbols - there are two types of symbols
  - Non-Terminal Symbols $\{E, Op\}$
  - Terminal Symbols $\{"(",")", number, "+", "-", "*"\}$ ${" (", ") ", n u mb er, " + ", " - ", " * "}$
    - The characters are tokens, and not syntax in BNF form
  - If not explicitly stated, the start symbol is the left symbol in the very top production
This context-free grammar is written in EBNF (Extended Backus-Naur Form)
The start symbol of a grammar is the left symbol in the topmost production (if not defined)
There must be a finite number of productions in the context-free grammar definition
The symbols on the left hand side of the productions are non-terminal symbols.
A grammar describes a language, in which a language is a set of sentences.
- Each sentence is a sequence of possibly-empty sequences in our grammar.
- Every sentence in the language can be directly derived from the start symbol of the grammar (through a series of direct derivation steps)

1.2 - Operator Precedence Example (Ambiguous Grammar)

E\rightarrow E\ "+"\ E\\ E\rightarrow E\ "*"\ E\\ E\rightarrow N

The above grammar is ambiguous - in which order do we parse the following sentence?
$E+E+E+E+E+E$
Do we want these operators to be left associative or right associative?
What do we want the relative precedence of the operators to be?

🌱 Can you come up with two parse trees for the expression

Figure 1: Parse tree for 1 + 2 * 3, expanding the + operator first

Figure 2: Parse tree for 1 + 2 * 3, expanding the * operator first

Note that in this example, the operator precedence is crucial for obtaining the right answer:

$\text{Left Parse Tree}\rightarrow1+(2*3)=7$ The “*” operator has higher precedence

$\text{Right Parse Tree}\rightarrow(1+2)*3=9$ The “+” operator has higher precedence

Observe here that the result of evaluating the two parse trees yield different numerical answers.

1.2.3 - Removing Ambiguity

🌱 Nodes further down in the parse tree (further away from the root node) get evaluated first. Therefore, operators with higher precedence should get chained further down in the productions.

Suppose that we want our operators to be left-associative (which we solved in the previous lecture) we need to solve the issue of operator precedence.
Consider the following grammar:
$E\rightarrow (E\ "+"\ T) | T\\ T\rightarrow (T\ "*"\ F) | F\\ F\rightarrow N$
- We introduce a non-terminal symbol, T
  - The production that defines T doesn’t have any addition in it - so we always do our rightmost addition operation first.
  - We want the multiplication operation to be done first, so we add a new production for it.
  - Notice that in the definition of the addition operation, we use the non-terminal symbol T
  - This means that at the point in which we perform an addition operation,
- We have also introduced a new non-terminal symbol F, which enforces the left-associative nature of the multiplication operation.
  - It is important that the production of F (and the subsequent productions) don’t have any occurrences of addition or multiplication in it.

1.2.4 - Parse Trees and Grammar for 1*2+3

🌱 Using the grammar defined above, create a parse tree for the sentence $1*2+3$

Untitled

🌱 Using the grammar defined above, create a parse tree for the sentence $1+2*3$

Untitled

1.3 - Operator Precedence

With a grammar like:

E\rightarrow E\ "+"\ E\\ E\rightarrow E\ "*"\ E\\ E\rightarrow N

The precedence of the operators is not specified, so that a sentence like $\color{lightblue}1+2*3$ is ambiguous; it can be interpreted as either $1+(2*3)$ or $(1+2)*3$ where the brackets just show the grouping.
To ensure that “+” has a lower precedence than “*” and to remove that ambiguity, we can rewrite the grammar to:
$E\rightarrow E\ "+"\ T | T\\ T\rightarrow T\ "*"\ F | F\\ F\rightarrow N$
- This grammar here also treats both operators as left associative.
- Note here that $N$ represents Number

1.3 - Overriding Operator Precedence Example

💡 Suppose we wanted to have the sentence $(1+2)*3$ in our language. How do we construct the parse tree for such a sentence? Note that by doing this, we’re essentially overwriting precedence using the parentheses.

We need to update our grammar to include the open and close parentheses
- Note here that $N$ represents Number
$\color{gray}E\rightarrow E\ "+"\ T | T\\ T\rightarrow T\ "*"\ F | F\\ F\rightarrow N |\color{white}"("\ E\ ")"$
The parse tree should look something like this

Untitled

1.4 - Ambiguous Grammar for Lists

The following ins an ambiguous grammar for lists with elements corresponding to the nonterminal $X$ (not detailed here)
- Assume here that $X$ is some type of generic non-terminal symbol that we want to be able to create lists of.
$L\rightarrow\epsilon\\ L\rightarrow X \\ L\rightarrow LL$
For example, a sentential form corresponding to a single $X$ can be derived via the following three derivation sequences (as well as many others)
$L\Rightarrow X$
$L\Rightarrow L\ L \Rightarrow L\ \epsilon=L \Rightarrow X$
$L\Rightarrow L\ L\Rightarrow L\ L\ L\Rightarrow L\ L\ \epsilon = L\ L \Rightarrow \epsilon\ L=L\Rightarrow X$
And here are the parse trees for the three different derivations.

💡 This grammar that we’ve defined is ambiguous - how do we modify it such that the definition is not ambiguous

Suppose we modified our grammar from before from:
$L\rightarrow\epsilon\\ L\rightarrow X \\ L\rightarrow LL$
to:
$L\rightarrow\epsilon\\ L\rightarrow X \\ L\rightarrow LX$
Though this grammar is still ambiguous, we can only derive the sentence $X$ using two parse trees
(An improvement from before)

Untitled

Now suppose that we remove the middle production from our grammar:

L\rightarrow\epsilon\\ L\rightarrow LX

By this definition, there is only one possible parse tree for the production of the sentence $X$
Removing the production $L\rightarrow X$ works as we already have a way of producing it through $L\rightarrow LX$ , where in the RHS of the equation, $L=\epsilon$

Untitled

In the above grammar, the production $L\rightarrow LL$ generates ambiguity because, if it is used in a derivation step, it generates containing $LL$ , and then, for example either $L$ symbol (the left $L$ or the right $L$ ) may be expanded to get the same sequence $LLL$ .
The ambiguity can be resolved by replacing the production, with say:
$L\rightarrow LX$
In addition, the definition of our grammar as $L\rightarrow\epsilon$ and $L\rightarrow LX$ together generate any sequence of zero or more occurrences of $X$ , including a single occurrence, meaning that the second alternative $L\rightarrow X$ is redundant and introduces ambiguity.
An unambiguous grammar that generates the same language (set of sentences) is:
$L\rightarrow\epsilon\\ L\rightarrow LX$

1.5 - Statement Sentence

Context Free Grammars for Programming Language Constructs - Statement Sequence

🌱 Define a grammar for a statement sequence, where each statement is separated by a semicolon (;)

\text{sentential forms}(SS)=\{{\color{lightblue}S},{\color{lightgreen}S;S},{\color{lightpink}S;S;S},\cdots\}

That is, our sentential forms are made out of the sentences:
1. $S$
2. $S;S$
3. $S;S;S$
4. …
These sentences can be generated using the following grammar:

SS \rightarrow S\\ SS \rightarrow SS\ \ ";"\ \ S

🌱 Define a grammar for a statement sentence, where each statement is terminated by a semicolon (;)

\text{sentential forms}(SS)=\{{\color{lightblue}S;},{\color{lightgreen}S;S;},{\color{lightpink}S;S;S;},\cdots\}

These sentences can be generated using the following grammar:
$SS\rightarrow S\ \ ":"\\ SS\rightarrow SS\ S\ ";"$
- Essentially when each statement is generated inside the production, a complimentary delimiter is added.

1.6 - Other Common Idioms

🌱 Suppose you want to want to have a single symbol or sequence $\beta$ , followed by zero or more symbols or sequences of $\alpha$

\text{sentential forms}(A)=\{\beta, \beta\alpha, \beta\alpha\alpha, \beta\alpha\alpha, \beta\alpha\alpha\alpha,\cdots\}\\ \bold{A}\rightarrow\bold{A}\alpha|\beta

This means that you can generate as follows:
- You could have $\bold{A}\alpha\Rightarrow\bold{A}\alpha\alpha\Rightarrow\cdots\Rightarrow\bold{A}\alpha\cdots\Rightarrow\beta\alpha\cdots$

🌱 Suppose you wanted zero or more occurrences of $\alpha$ followed by a $\beta$ symbol

\text{sentential forms}(B)=\{\beta, \alpha\beta, \alpha\alpha\beta,\cdots\}\\ \bold{B}\rightarrow\alpha\bold{B}|\beta

Observe that all we’ve essentially done is swapped the position of $\bold{A}$ $A$ or $\bold{B}$ $B$ with $\alpha$ $α$
- i.e. from $\bold{A}\alpha$ to $\alpha\bold{B}$

🌱 Programming Language Constructs - If Statement

In PL0, If Statements are currently specified as follows:
$S\rightarrow IfS\ |\ \cdots\\ IfS\rightarrow "if"\ C\ "then"\ S\ "else"\ S$
We can then add the option for an optional else portion:
$S\rightarrow IfS\ |\ \cdots\\ IfS\rightarrow "if"\ C\ "then"\ S\ "else"\ S\\ IfS\rightarrow"if"\ C\ "then"\ S$
- However, if we do this our grammar is ambiguous - consider the following sentence:
  $if\ C_1\ then\ if\ C_2\ then\ S_1\ else\ S_2$
- We have two possible parse trees generated for this statement.
- Ambiguity like this is possible in Java and C, and the ambiguity is resolved in the parser.
  - Mistake made in the definition of ALGOL60
  - In both languages, the else statements are always matched with its closest if.
  - This would then generate the parse tree shown on the RHS
How do we implement the optional else statement without having any ambiguity
1. Have brackets around the contents of the statements for the then and else statements.
  - E.g. $if\ \{C_1\}\ then\ \{if\ C_2\ then\{S_1\}\ else\ \{S_2\}\}$
  - $if\ \{C_1\}\ then\ \{if\ C_2\ then\{S_1\}\}\ else\ \{S_2\}$
  - Note that we don’t have to specifically have brackets - we could have begin and end keywords that indicate where sequences of statements start and end.
2. Modify the grammars t o denote when each of the conditional statements have ended.
$S\rightarrow IfS\ |\ \cdots\\ IfS\rightarrow "if"\ C\ "then"\ S\ "else"\ S\ {\color{lightgreen}fi}\\ IfS\rightarrow"if"\ C\ "then"\ S\ {\color{lightgreen}fi}$

2.0 - Chomsky’s Hierarchy of Grammars

A significant amount of research has been put into characterising and describing the grammars.
Chomsky’s Hierarchy of Grammars lists grammars in order of their expressivity.
- Least Expressive
- Type 3 Left (or or right) Linear Grammars
  - Able to have productions of the following forms:
  - Same expressivity as Regular Expressions (REGEX) and Finite Automatons (State Machines)
  - Since we use REGEX to describe the formatting of our lexical tokens, and we use Context-Free Grammars to describe the grammars of our programming languages, this ultimately means that the language that we use to express our lexical tokens is less expressive than the language that we use to describe our programming language.
  $\bold{A}\rightarrow\epsilon\\ \bold{A}\rightarrow a\bold{B}\\ \bold{A}\rightarrow \bold{a}$
- Type 2 Context-Free Grammars
  - Have the same expressivity as Pushdown Automatons (automaton with a stack)
  $\bold{A}\rightarrow\alpha$
- Type 1 Context-Sensitive Grammars
  $\beta\bold{A}\gamma\rightarrow\beta\alpha\gamma$
  - You’re able to say “You’re able to replace $\bold{A}$ with $\alpha$ if and only if it is preceded by the sequence $\beta$ and proceeded by the sequence $\gamma$ ”
    - Note here that $\beta$ and $\gamma$ being $\epsilon$ (sequence of empty symbols) is allowed.
- Type 0 Unrestricted Grammars
  - In an unrestricted grammar, the only restriction is that $\alpha\ne\epsilon$
  - Have the same expressivity as a Turing Machine - we could describe any program as an Unrestricted Grammar.
  $\alpha\rightarrow\beta$