Week 2.1

CourseCompilers & Interpreters
SemesterS1 2022

Week 2.1

🏷️ Replace this section with some keywords that can be used for searching, tagging and the indexing of this page.

Sections

🌱 Links to all Heading1 Blocks in this page.

  1. Context Free Grammars
  2. Chomsky’s Hierarchy of Grammars

1.0 - Context Free Grammars

1.1 - Recap of Context-Free Grammars

  • The following context-free grammar has start symbol E, non-terminals {E,Op}\{E, Op\} and terminals {"(",")",number,"+","βˆ’","βˆ—"}\{"(",")", number, "+", "-", "*"\}

    Eβ†’E Op EEβ†’"(" E ")"Eβ†’numberOpβ†’"+"Opβ†’"βˆ’"Opβ†’"βˆ—"\begin{aligned} E &\rightarrow E\ Op\ E\\ E &\rightarrow "("\ E\ ")"\\ E &\rightarrow \text{number}\\ Op &\rightarrow "+"\\ Op &\rightarrow "-"\\ Op &\rightarrow "*" \end{aligned}

    • Our grammar is defined using a set of symbols - there are two types of symbols
      • Non-Terminal Symbols {E,Op}\{E, Op\}
      • Terminal Symbols {"(",")",number,"+","βˆ’","βˆ—"}\{"(",")", number, "+", "-", "*"\}
        • The characters are tokens, and not syntax in BNF form
      • If not explicitly stated, the start symbol is the left symbol in the very top production
  • This context-free grammar is written in EBNF (Extended Backus-Naur Form)

  • The start symbol of a grammar is the left symbol in the topmost production (if not defined)

  • There must be a finite number of productions in the context-free grammar definition

  • The symbols on the left hand side of the productions are non-terminal symbols.

  • A grammar describes a language, in which a language is a set of sentences.

    • Each sentence is a sequence of possibly-empty sequences in our grammar.
    • Every sentence in the language can be directly derived from the start symbol of the grammar (through a series of direct derivation steps)

1.2 - Operator Precedence Example (Ambiguous Grammar)

Eβ†’E "+" EEβ†’E "βˆ—" EEβ†’NE\rightarrow E\ "+"\ E\\ E\rightarrow E\ "*"\ E\\ E\rightarrow N

  • The above grammar is ambiguous - in which order do we parse the following sentence?

    E+E+E+E+E+EE+E+E+E+E+E

  • Do we want these operators to be left associative or right associative?

  • What do we want the relative precedence of the operators to be?

🌱 Can you come up with two parse trees for the expression

Figure 1: Parse tree for 1 + 2 * 3, expanding the + operator first

Figure 1: Parse tree for 1 + 2 * 3, expanding the + operator first

Figure 2: Parse tree for 1 + 2 * 3, expanding the * operator first

Figure 2: Parse tree for 1 + 2 * 3, expanding the * operator first

Note that in this example, the operator precedence is crucial for obtaining the right answer:

\text{Left Parse Tree}\rightarrow1+(2*3)=7$ The β€œ+” operator has higher precedence $\text{Right Parse Tree}\rightarrow(1+2)*3=9$ The β€œ*” operator has higher precedence Observe here that the result of evaluating the two parse trees yield different numerical answers. ### 1.2.3 - Removing Ambiguity 🌱 Nodes further down in the parse tree (further away from the root node) get evaluated first. Therefore, operators with higher precedence should get chained further down in the productions. - Suppose that we want our operators to be left-associative (which we solved in the previous lecture) we need to solve the issue of operator precedence. - Consider the following grammar:

E\rightarrow (E\ "+"\ T) | T\\
T\rightarrow (T\ "*"\ F) | F\\
F\rightarrow N
$$

- We introduce a non-terminal symbol, T
    - The production that defines T doesn’t have any addition in it - so we always do our rightmost addition operation first.
    - We want the multiplication operation to be done first, so we add a new production for it.
    - Notice that in the definition of the addition operation, we use the non-terminal symbol T
    - This means that at the point in which we perform an addition operation,
- We have also introduced a new non-terminal symbol F, which enforces the left-associative nature of the multiplication operation.
    - It is important that the production of F (and the subsequent productions) don’t have any occurrences of addition or multiplication in it.

1.2.4 - Parse Trees and Grammar for 1*2+3

🌱 Using the grammar defined above, create a parse tree for the sentence 1βˆ—2+31*2+3$ Untitled

🌱 Using the grammar defined above, create a parse tree for the sentence 1+2βˆ—31+2*3

Untitled

1.3 - Operator Precedence

  • With a grammar like:

Eβ†’E "+" EEβ†’E "βˆ—" EEβ†’NE\rightarrow E\ "+"\ E\\ E\rightarrow E\ "*"\ E\\ E\rightarrow N

  • The precedence of the operators is not specified, so that a sentence like 1+2βˆ—3\color{lightblue}1+2*3 is ambiguous; it can be interpreted as either 1+(2βˆ—3)1+(2*3) or (1+2)βˆ—3(1+2)*3 where the brackets just show the grouping.

  • To ensure that β€œ+” has a lower precedence than β€œ*” and to remove that ambiguity, we can rewrite the grammar to:

    Eβ†’E "+" T∣TTβ†’T "βˆ—" F∣FFβ†’NE\rightarrow E\ "+"\ T | T\\ T\rightarrow T\ "*"\ F | F\\ F\rightarrow N

    • This grammar here also treats both operators as right associative.
    • Note here that NN represents Number

1.3 - Overriding Operator Precedence Example

πŸ’‘ Suppose we wanted to have the sentence (1+2)βˆ—3(1+2)*3 in our language. How do we construct the parse tree for such a sentence? Note that by doing this, we’re essentially overwriting precedence using the parentheses.

  • We need to update our grammar to include the open and close parentheses

    • Note here that NN represents Number

    Eβ†’E "+" T∣TTβ†’T "βˆ—" F∣FFβ†’N∣"(" E ")"\color{gray}E\rightarrow E\ "+"\ T | T\\ T\rightarrow T\ "*"\ F | F\\ F\rightarrow N |\color{white}"("\ E\ ")"

  • The parse tree should look something like this

Untitled

1.4 - Ambiguous Grammar for Lists

  • The following ins an ambiguous grammar for lists with elements corresponding to the nonterminal XX (not detailed here)

    • Assume here that XX is some type of generic non-terminal symbol that we want to be able to create lists of.

    L→ϡL→XL→LLL\rightarrow\epsilon\\ L\rightarrow X \\ L\rightarrow LL

  • For example, a sentential form corresponding to a single XX can be derived via the following three derivation sequences (as well as many others)

    L⇒XL\Rightarrow X

    Lβ‡’L Lβ‡’L Ο΅=Lβ‡’XL\Rightarrow L\ L \Rightarrow L\ \epsilon=L \Rightarrow X

    Lβ‡’L Lβ‡’L L Lβ‡’L L Ο΅=L Lβ‡’Ο΅ L=Lβ‡’XL\Rightarrow L\ L\Rightarrow L\ L\ L\Rightarrow L\ L\ \epsilon = L\ L \Rightarrow \epsilon\ L=L\Rightarrow X

  • And here are the parse trees for the three different derivations.

    Untitled


πŸ’‘ This grammar that we’ve defined is ambiguous - how do we modify it such that the definition is not ambiguous

  • Suppose we modified our grammar from before from:

    L→ϡL→XL→LLL\rightarrow\epsilon\\ L\rightarrow X \\ L\rightarrow LL

  • to:

    L→ϡL→XL→LXL\rightarrow\epsilon\\ L\rightarrow X \\ L\rightarrow LX

  • Though this grammar is still ambiguous, we can only derive the sentence XX using two parse trees

  • (An improvement from before)

Untitled


  • Now suppose that we remove the middle production from our grammar:

L→ϡL→LXL\rightarrow\epsilon\\ L\rightarrow LX

  • By this definition, there is only one possible parse tree for the production of the sentence XX
  • Removing the production Lβ†’XL\rightarrow X works as we already have a way of producing it through Lβ†’LXL\rightarrow LX, where in the RHS of the equation, L=Ο΅L=\epsilon

Untitled

  • In the above grammar, the production Lβ†’LLL\rightarrow LL generates ambiguity because, if it is used in a derivation step, it generates containing LLLL, and then, for example either LL symbol (the left LL or the right LL) may be expanded to get the same sequence LLLLLL.

  • The ambiguity can be resolved by replacing the production, with say:

    L→LXL\rightarrow LX

  • In addition, the definition of our grammar as Lβ†’Ο΅L\rightarrow\epsilon and Lβ†’LXL\rightarrow LX together generate any sequence of zero or more occurrences of XX, including a single occurrence, meaning that the second alternative Lβ†’XL\rightarrow X is redundant and introduces ambiguity.

  • An unambiguous grammar that generates the same language (set of sentences) is:

    L→ϡL→LXL\rightarrow\epsilon\\ L\rightarrow LX

1.5 - Statement Sentence

Context Free Grammars for Programming Language Constructs - Statement Sequence

🌱 Define a grammar for a statement sequence, where each statement is separated by a semicolon (;)

sentential forms(SS)={S,S;S,S;S;S,⋯ }\text{sentential forms}(SS)=\{{\color{lightblue}S},{\color{lightgreen}S;S},{\color{lightpink}S;S;S},\cdots\}

  • That is, our sentential forms are made out of the sentences:
    1. SS
    2. S;SS;S
    3. S;S;SS;S;S
    4. …
  • These sentences can be generated using the following grammar:

SSβ†’SSSβ†’SS  ";"  SSS \rightarrow S\\ SS \rightarrow SS\ \ ";"\ \ S


🌱 Define a grammar for a statement sentence, where each statement is terminated by a semicolon (;)

sentential forms(SS)={S;,S;S;,S;S;S;,⋯ }\text{sentential forms}(SS)=\{{\color{lightblue}S;},{\color{lightgreen}S;S;},{\color{lightpink}S;S;S;},\cdots\}

  • These sentences can be generated using the following grammar:

    SSβ†’S  ":"SSβ†’SS S ";"SS\rightarrow S\ \ ":"\\ SS\rightarrow SS\ S\ ";"

    • Essentially when each statement is generated inside the production, a complimentary delimiter is added.

1.6 - Other Common Idioms

🌱 Suppose you want to want to have a single symbol or sequence β\beta, followed by zero or more symbols or sequences of α\alpha

sentential forms(A)={Ξ²,Ξ²Ξ±,Ξ²Ξ±Ξ±,Ξ²Ξ±Ξ±,Ξ²Ξ±Ξ±Ξ±,⋯ }Aβ†’Aα∣β\text{sentential forms}(A)=\{\beta, \beta\alpha, \beta\alpha\alpha, \beta\alpha\alpha, \beta\alpha\alpha\alpha,\cdots\}\\ \bold{A}\rightarrow\bold{A}\alpha|\beta

  • This means that you can generate as follows:
    • You could have AΞ±β‡’AΞ±Ξ±β‡’β‹―β‡’AΞ±β‹―β‡’Ξ²Ξ±β‹―\bold{A}\alpha\Rightarrow\bold{A}\alpha\alpha\Rightarrow\cdots\Rightarrow\bold{A}\alpha\cdots\Rightarrow\beta\alpha\cdots

🌱 Suppose you wanted zero or more occurrences of α\alpha followed by a β\beta symbol

sentential forms(B)={Ξ²,Ξ±Ξ²,Ξ±Ξ±Ξ²,⋯ }Bβ†’Ξ±B∣β\text{sentential forms}(B)=\{\beta, \alpha\beta, \alpha\alpha\beta,\cdots\}\\ \bold{B}\rightarrow\alpha\bold{B}|\beta

  • Observe that all we’ve essentially done is swapped the position of A\bold{A} or B\bold{B} with Ξ±\alpha
    • i.e. from AΞ±\bold{A}\alpha to Ξ±B\alpha\bold{B}

🌱 Programming Language Constructs - If Statement

  • In PL0, If Statements are currently specified as follows:

    Sβ†’IfS βˆ£ β‹―IfSβ†’"if" C "then" S "else" SS\rightarrow IfS\ |\ \cdots\\ IfS\rightarrow "if"\ C\ "then"\ S\ "else"\ S

  • We can then add the option for an optional else portion:

    Sβ†’IfS βˆ£ β‹―IfSβ†’"if" C "then" S "else" SIfSβ†’"if" C "then" SS\rightarrow IfS\ |\ \cdots\\ IfS\rightarrow "if"\ C\ "then"\ S\ "else"\ S\\ IfS\rightarrow"if"\ C\ "then"\ S

    • However, if we do this our grammar is ambiguous - consider the following sentence:

      if C1 then if C2 then S1 else S2if\ C_1\ then\ if\ C_2\ then\ S_1\ else\ S_2

    • We have two possible parse trees generated for this statement.

      Untitled

    • Ambiguity like this is possible in Java and C, and the ambiguity is resolved in the parser.

      • Mistake made in the definition of ALGOL60
      • In both languages, the else statements are always matched with its closest if.
      • This would then generate the parse tree shown on the RHS
  • How do we implement the optional else statement without having any ambiguity

    1. Have brackets around the contents of the statements for the then and else statements.
      • E.g. if {C1} then {if C2 then{S1} else {S2}}if\ \{C_1\}\ then\ \{if\ C_2\ then\{S_1\}\ else\ \{S_2\}\}
      • if {C1} then {if C2 then{S1}} else {S2}if\ \{C_1\}\ then\ \{if\ C_2\ then\{S_1\}\}\ else\ \{S_2\}
      • Note that we don’t have to specifically have brackets - we could have begin and end keywords that indicate where sequences of statements start and end.
    2. Modify the grammars t o denote when each of the conditional statements have ended.

    Sβ†’IfS βˆ£ β‹―IfSβ†’"if" C "then" S "else" S fiIfSβ†’"if" C "then" S fiS\rightarrow IfS\ |\ \cdots\\ IfS\rightarrow "if"\ C\ "then"\ S\ "else"\ S\ {\color{lightgreen}fi}\\ IfS\rightarrow"if"\ C\ "then"\ S\ {\color{lightgreen}fi}

2.0 - Chomsky’s Hierarchy of Grammars

  • A significant amount of research has been put into characterising and describing the grammars.
  • Chomsky’s Hierarchy of Grammars lists grammars in order of their expressivity.
    • Least Expressive

    • Type 3 Left (or or right) Linear Grammars

      • Able to have productions of the following forms:
      • Same expressivity as Regular Expressions (REGEX) and Finite Automatons (State Machines)
      • Since we use REGEX to describe the formatting of our lexical tokens, and we use Context-Free Grammars to describe the grammars of our programming languages, this ultimately means that the language that we use to express our lexical tokens is less expressive than the language that we use to describe our programming language.

      A→ϡA→aBA→a\bold{A}\rightarrow\epsilon\\ \bold{A}\rightarrow a\bold{B}\\ \bold{A}\rightarrow \bold{a}

    • Type 2 Context-Free Grammars

      • Have the same expressivity as Pushdown Automatons (automaton with a stack)

      A→α\bold{A}\rightarrow\alpha

    • Type 1 Context-Sensitive Grammars

      Ξ²AΞ³β†’Ξ²Ξ±Ξ³\beta\bold{A}\gamma\rightarrow\beta\alpha\gamma

      • You’re able to say β€œYou’re able to replace A\bold{A} with Ξ±\alpha if and only if it is preceded by the sequence Ξ²\beta and proceeded by the sequence Ξ³\gamma”
        • Note here that Ξ²\beta and Ξ³\gamma being Ο΅\epsilon (sequence of empty symbols) is allowed.
    • Type 0 Unrestricted Grammars

      • In an unrestricted grammar, the only restriction is that Ξ±β‰ Ο΅\alpha\ne\epsilon
      • Have the same expressivity as a Turing Machine - we could describe any program as an Unrestricted Grammar.

      Ξ±β†’Ξ²\alpha\rightarrow\beta