Week 1.2

CourseCompilers & Interpreters
SemesterS1 2022

Week 1.2

🏷️ Replace this section with some keywords that can be used for searching, tagging and the indexing of this page.

Sections

🌱 Links to all Heading1 Blocks in this page.

  1. PL0’s Lexical Tokens
  2. Concrete Syntax Trees
  3. Abstract Syntax Trees
  4. Syntax Checking and Type Checking

1.0 - PL0’s Lexical Tokens

1.1 - An Aside on Regular Expressions

  • Lexical tokens are also called terminal symbols (as opposed to non-terminal symbols, terminal symbols are the leaf nodes of a parse tree)
  • Lexical tokens for PL0 are defined in this table by grammar rules which have regular expressions as their right sides
  • A regular expression is either
    • A string in quotes (strictly the concatenation of characters)
    • A sequence of regular expressions representing their concatenation
    • A set of alternative regular expressions separated by β€œ|”
    • A regular expression in parenthesis, β€œ(” and β€œ)” indicating a grouping
    • A regular expression followed by a β€œ*” indicating zero or more occurrences of the construction
  • Repetition has the highest precedence followed by concatenation and then alternation (Aho 1, Sect 3.3), (Louden 4, Section 2.2)

1.2 - List of PL0 Lexical Tokens

  • There are two special tokens:
    1. EOF representing the end of an input file
    2. ILLEGAL representing any illegal token: Any character that doesn’t match the tokens defined below and isn’t part of a comment or whitespace.
  • Note here that the constructs Alpha and Digit are not lexical tokens for PL0; They are abbreviations for decimal digits and alphabetical characters, respectively, used in the definition of the tokens NUMBER and IDENTIFIER
    • DIGIT β†’ β€œ0” | β€œ1” | β€œ2” | β€œ3” | β€œ4”| β€œ5” | β€œ6” | β€œ7” | β€œ8” | β€œ9”
    • ALPHA β†’ β€œa” | β€œb” | β€œc” | β€œd” | β€œe”| β€œf” | β€œg” | β€œh” | β€œi” | β€œj”| β€œk” | β€œl” | β€œm” | β€œn” | β€œo”| β€œp” | β€œq” | β€œr” | β€œs” | β€œt” | β€œu” | β€œv” | β€œw” | β€œx” | β€œy” | β€œz” | β€œA” | β€œB” | β€œC” | β€œD” | β€œE”| β€œF” | β€œG” | β€œH” | β€œI” | β€œJ”| β€œK” | β€œL” | β€œM” | β€œN” | β€œO” | β€œP” | β€œQ” | β€œR” | β€œS” | β€œT” | β€œU” | β€œV” | β€œW” | β€œX” | β€œY” | β€œZ”
    • The REGEX definition of PL0’s NUMBER token is Digit Digit* which essentially means a digit, followed by one or more digits.
    • The REGEX definition of PL0’s IDENTIFIER token is Alpha (Alpha | Digit)* which essentially means an alphanumeric character followed by one or more alphanumeric character or digit.
  • To solve ambiguities in the list of lexical tokens, the lexical analyser
    1. First matches the longest string e.g. β€œ>=” matches as GEQUALS\text{GEQUALS} rather than as β€œ>” (GREATER\text{GREATER}) followed by β€œ=” (EQUALS\text{EQUALS}) and then
    2. Second, matches the earliest of the rules e.g. β€œbegin” matches as KW_BEGIN\text{KW\_BEGIN} rather than as IDENTIFIER\text{IDENTIFIER}, but note that β€œbeginx” matches as identifier as it is no longer a matches the keyword KW_BEGIN\text{KW\_BEGIN}

2.0 - Concrete Syntax Trees

🌱 From the β€œPL0 Compiler Example” PowerPoint.

  • The grammar of the language can be used to extract the structure of the program statement (from its sequence of lexical tokens)
  • The structure of that can be described using a concrete syntax tree.
  • From before, we know that our code can be parsed as the following lexical tokens.
KW_IF, IDENTIFIER("x"), LESS NUMBER(0), KW_THEN, IDENTIFIER("z"), ASSIGN, MINUS, IDENTIFIER("x"), KW_ELSE, IDENTIFIER("z") ASSIGN, IDENTIFIER("x")

2.1 - Parsing the Concrete Syntax Tree

if x<0 then z:= -x else z:=x\text{if x<0 then z:= -x else z:=x}

(1) We start with the Statement node as our root node as we’re defining a PL0 statement.

(2) We know that it’s a conditional statement (an IfStatement) so we add that node to the tree.

  • By the grammar previously defined, an IfStatement is defined as follows

Statement β†’ Assignment | CallStatement | ReadStatement | WriteStatement | WhileStatement | IfStatement | CompoundStatement

Untitled

Untitled

(3) We then expand the IfStatement node using the definitions in our Grammar.

  • Some of these blocks are terminal (as denoted by the boxes with sharp corners) which are the leaf nodes of our Concrete Syntax Tree.

  • These are our lexical tokens.

    IfStatement β†’ KW_IF Condition KW_THEN Statement KW_ELSE Statement

    Untitled

(4) We then expand the Condition node using the definitions in our grammar

Condition β†’ RelCondition

RelCondition β†’ Exp [RelOp Exp]

Untitled

(5) We then expand the Expression nodes and RelOp node based on our grammar definitions

Expression β†’ [PLUS | MINUS] Term {(PLUS | MINUS) Term)

Term β†’ Factor {(TIMES | DIVIDE) Factor}

Factor β†’ LPAREN Condition RPAREN | NUMBER | LValue

LValue β†’ IDENTIFIER

RelOp β†’ EQUALS | NEQUALS | LESS | GREATER | LEQUALS | GEQUALS

Untitled

(6) We then expand the Statement nodes based on our grammar definitions

Statement β†’ Assignment | CallAssignment | ReadStatement | WriteStatement | WhileStatement | IfStatement | CompoundStatment

Assignment β†’ LValue ASSIGN Condition

Untitled

2.1 - Lessons from The Concrete Syntax Tree

  • We can see that when performing an in-order traversal of the leaf nodes, we get our initial tokens in exactly the same order.
  • The Concrete Syntax Tree contains all of the information about the structure of the program
  • This tree contains many nodes with single child nodes - this information is very sparsely packed and we can compact it down a lot.
    • We don’t need all of this information to type-check, interpret or generate code for the program.

3.0 - Abstract Syntax Trees

🌱 We can use Abstract Syntax Trees to represent the same information as in a Concrete Syntax Tree in a denser format.

4_PL0_Compiler_Data_Structures.pdf

  • The Abstract Syntax Tree is represented in the compiler by three main Java classes (and their various subtypes)

    1. DeclNode
    2. StatementNode
    3. ExpNodes
  • The DeclListNode abstract class groups the Abstract Syntax Tree nodes for the procedures into a list - each entry in the list represents a single procedure and includes the procedure’s symbol table entry, and a BlockNode representing it’s body

    DeclNode() with subclasses
        DeclListNode(List decls)
        ProcedureNode(SymEntry.ProcedureEntry procEntry, BlockNode body)
    
  • The StatementNode abstract class represents a statement in the abstract syntax tree: it has a subclass for each kind of statement.

    • ErrorNode caters for erroneous statements.
    • All statement nodes contain a Location which is the location in the source code corresponding to the node.
    • BlockNode represents the body of a procedure (or the main program)
    StatementNode(Location loc) with subclasses
        ErrorNode() # Store the errors in the AST to provide the programmer with 
                    # as much information as possible :)
        BlockNode(DeclListNode procedures, StatementNode body, Scope blockLocals)
        AssignmentNode(ExpNode lvalue, ExpNode e)
        WriteNode(ExpNode e)
        CallNode(String id)
        ListNode(List sl)
        IfNode(ExpNode cond, StatementNode s1, StatementNode s2) # Conditional node
        WhileNode(ExpNode cond, StatementNode s)
        
    
  • Suppose we want to construct the Abstract Syntax Tree of the following expression:

    if x < 0 then z := -x else z := x
    
  • We know that our node is going to be a binary comparison node.

    ExpNode(Location loc, Type t) with subclasses
        ConstNode(int value)
        IdentifierNode(String id)
        BinaryNode(Operator op, ExpNode left, ExpNode right)
    

Untitled

  • We have now populated the BinaryNode with it’s children
  • By constructing the AST we have essentially constructed the CST and summarised the information in a much less verbose manner.

Untitled

Untitled

4.0 - Syntax Checking and Type Checking

4.1 - Syntax Checking

if x < 0, then z := -x else z := x
  • Suppose that we required the above code to be syntactically correct, what are our requirements
    • The condition x < 0 should be a Boolean
    • The statements z := -x and z := x are type correct
  • For the statement z := -x to be type correct, the variable z must be a reference type (must have an allocated spot in memory)
    • We need a variable to have a memory address for it to be assigned a new value
    • The expression -x must evaluate to an expression which is type compatible with type(z)
  • As we perform type-checking, we update our abstract syntax tree, and separately our symbol table with type information
    • For example, to resolve the expression x < 0, we will need to resolve the type of IDENTIFIER(”x”)
    • Therefore, in the type checking phase, we update the abstract syntax tree with type information
    • To resolve the expression z := x, we may need to perform some coercions

4.2 - Type Checking the Abstract Syntax Tree (AST)

  • The IdentifierNode is only used during the parsing phase to represent a reference to either a symbolic constant or a variable.
  • As part of the static semantics (type checking) phase, it will be transformed to either a ConstNode or VariableNode

4.2.1 - Coercions

  • A number of other node types are only introduced in the static semantics phase:

    • A DereferenceNode represents a dereference of a variable address (left value) to access its (right) value
    • NarrowSubrangeNode An expression of a type, T, can be narrowed to a subrange of T (for the purpose of coercing a larger subrange into a smaller subrange)
    • WidenSubrangeNode An expression of a subrange type can be widened to the base type of the subrange
    ExpNode(Location loc, Type t) with subclasses
        DereferenceNode(ExpNode leftVaue)
        NarrowSubrangeNode(ExpNode e)
        WidenSubrangeNode(ExpNode e)
    

  • In this example, we are type checking the abstract syntax tree for the following PL0 code

    if x < 0 then z := -x else z := x
    

Untitled

  1. We first evaluate the expression x < 0 - we need to determine whether the expression has a Boolean value.

    • We’re essentially checking the BinaryNode and its children nodes (as indicated by the orange box)

    • We notice that the BinaryNode specifies the use of LESS_OP which is the less than operator.

      • The less than operator takes two integers and returns a Boolean value, i.e.

        LESS_OP:(intΓ—int)β†’bool\text{LESS\_OP}:(\text{int}\times\text{int})\rightarrow\text{bool}

    • We need to check that both the Identifier β€œx” and the Constant 0 are both integers, or coerce them into Integers if they are not.

      • We resolve the type of the identifier β€œx” by looking it up in our symbol table.
      • By looking up the type of the identifier β€œx” we may see that its type is ref(int) (that is, a reference to an integer)
      • We need to dereference the identifier β€œx” - so we’re actually performing the LESS_OP operation on the dereferenced (reference to) an integer and a constant integer.
    • Therefore, the statement is type correct for its usage within this specific conditional statement.

    • After type checking the expression x < 0 our symbol table looks like this:

    Untitled

  2. We then evaluate the expression z := -x to check whether it’s well formed.

    • The AssignmentNode has a left value of IdentifierNode(”z”) and we need to resolve the identifier node’s reference just as we did for IdentifierNode(”x”) in the step above.

    • By looking up the type of Identifier(”z”) in our symbol table, we may discover that it has type ref(int)

    • We now want to check the right child of the AssignmentNode to see whether it is an expression that evaluates to be the same type - the expression is the minus operator (MINUS_OP) being applied to the identifier β€œx”

      • The MINUS_OP takes an integer and returns an integer (MINUS_OP:intβ†’int\text{MINUS\_OP}:\text{int}\rightarrow\text{int}
      • Therefore, this expression should be well typed as long as the input value we’re passing it is an integer.
      • We’ve already looked up IDENTIFIER(”x”) before and we know it’s an integer variable.
      • It’s not exactly an integer, so we have to coerce it into an integer (this is done by dereferencing the integer).

      Untitled

  3. Likewise, we check the other assignment.

Example Summary

  • Lexical Analysis of the sequence of program characters produced a sequence of lexical tokens
  • Syntax Analysis of the sequence of lexical tokens was used to produced
    • A concrete syntax tree (not actually produced by the compiler)
    • An abstract syntax tree (AST)
  • Static Analysis was used to:
    • Resolve references to identifiers;
    • Type check the abstract syntax tree
    • Update the Abstract Syntax Tree with type coercions

5.0 - Context Free Grammars

5.1 - Writing Context-Free Grammars

  • The following context-free grammar has start symbol E, non-terminals {E,Op}\{E, Op\} and terminals {"(",")",number,"+","βˆ’","βˆ—"}\{"(",")", number, "+", "-", "*"\}

    Eβ†’E Op EEβ†’"(" E ")"Eβ†’numberOpβ†’"+"Opβ†’"βˆ’"Opβ†’"βˆ—"\begin{aligned} E &\rightarrow E\ Op\ E\\ E &\rightarrow "("\ E\ ")"\\ E &\rightarrow \text{number}\\ Op &\rightarrow "+"\\ Op &\rightarrow "-"\\ Op &\rightarrow "*" \end{aligned}

    • Our grammar is defined using a set of symbols - there are two types of symbols
      • Non-Terminal Symbols {E,Op}\{E, Op\}
      • Terminal Symbols {"(",")",number,"+","βˆ’","βˆ—"}\{"(",")", number, "+", "-", "*"\}
        • The characters are tokens, and not syntax in BNF form
      • If not explicitly stated, the start symbol is the left symbol in the very top production
  • This context-free grammar is written in EBNF (Extended Backus-Naur Form)

  • The start symbol of a grammar is the left symbol in the topmost production (if not defined)

5.2 - Definition of Context-Free Grammars

  • A context-free grammar consists of

    • A finite set, βˆ‘\sum, of terminal symbols
    • A finite nonempty set of non-terminal symbols (disjoint from the terminal symbols)
      • Disjoint, not part of another set (i.e. a symbol cannot be both terminal and non-terminal)
    • A finite nonempty set of productions of the form Aβ†’Ξ±A\rightarrow \alpha, where A is a nonempty terminal symbol and Ξ±\alpha is a possibly empty sequence of symbols, each of which is either a terminal or non-terminal symbol

    Non-Terminalβ†’Terminal & Non-Terminal Symbols\color{lightblue}\text{Non-Terminal}\rightarrow\text{Terminal \& Non-Terminal Symbols}

    • A start symbol that must be a non-terminal symbol
  • A grammar defines a language; A language is a set of sentences.

    • Each sentence is a finite sequence of terminal symbols in the grammar
    • From before, our grammar can be (more concisely defined as)

    Eβ†’E Op E βˆ£ "(" E ")" βˆ£ numberOpβ†’"+"∣"βˆ’"∣"βˆ—"E \rightarrow E\ Op\ E\ |\ "("\ E\ ")"\ |\ \text{number}\\ Op\rightarrow"+"|"-"|"*"

🌱 This is a leftmost derivation, as at every step we choose to expand the leftmost non-terminal symbol

  • A leftmost derivation sequence for 3 + 4 is given by:

    • Each sentence in the grammar can be derived from the start symbol, using a series of direct derivation steps.
      • Direct Derivation - replace a non-terminal symbol by the right hand side of one of its productions. Use a double arrow to signify a direct derivation.

    E→E \rightarrow

    • Recognise that E can expand to E Op E, and 3 + 4 is of the form E Op E

      Eβ‡’E Op EE \Rightarrow {\color{lightgreen}E}\ Op\ E

    • Recognise that we can replace the left hand E with its value

      E Op Eβ‡’3 Op EE\ Op\ E\Rightarrow3\ {Op}\ E

    • We now recognise that we can replace the Op with its respective symbol

      E Op Eβ‡’3 Op Eβ‡’3 + EE\ Op\ E\Rightarrow3\ {\color{lightgreen}Op}\ E\Rightarrow3\ {\color{lightgreen}+}\ E

    • And like before, we recognise that we can replace the symbol E with its respective number value

      3+E⇒3+43+E\Rightarrow3+4

  • A leftmost derivation sequence for 3 - 4 - 2 is given by:

    EE

    • We recognise that we could express the sequence as E Op E

    Eβ†’E Op EE\rightarrow E\ Op\ E

    • At this point, notice that we have two options:

      • Expand the leftmost E to (3-4) and the rightmost E to be 2
      • Expand the leftmost E to 3 and the rightmost E to be (4 - 2)
      • Let’s expand the leftmost E to be E Op E to have three productions

      Eβ‡’E Op EEβ‡’E Op E Op EE\Rightarrow {\color{lightgreen}E}\ Op\ E\\ E\Rightarrow E\ Op\ E\ Op\ E\\

    • Recognise that the leftmost E can be expanded to a number (3)

      E Op E Op Eβ‡’3 Op E Op EE\ Op\ E\ Op\ E\Rightarrow3\ {\color{lightgreen}Op}\ E\ Op\ E

    • Recognise that the next leftmost symbol is op - Let’s expand that to the symbol -

      3 Op E Op Eβ‡’3βˆ’E Op E3\ {\color{lightgreen}Op}\ E\ Op\ E\Rightarrow3-E\ Op\ E

    • Recognise that the next leftmost non-terminal symbol is E, expand it to a number, 4

      3βˆ’E Op Eβ‡’3βˆ’4 Op E3-E\ Op\ E\Rightarrow3-4\ Op\ E

    • Recognise that the next leftmost non-terminal symbol is Op, expand it to -

      3βˆ’4 Op Eβ‡’3βˆ’4βˆ’E3-4\ Op\ E\Rightarrow3-4-E

    • Recognise that the next leftmost non-terminal symbol is E, expand it to 2

      3βˆ’4βˆ’Eβ‡’3βˆ’4βˆ’23-4-E\Rightarrow3-4-2

5.3 - Directly Derives (Formal Definition)

  • Let both Ξ±\alpha and Ξ²\beta be possibly empty sequences of terminal and non-terminal symbols, and N a nonterminal symbol

  • If there is a production of the form Nβ†’Ξ³N\rightarrow\gamma, then a derivation step can be applied to a sequence of symbols of the form Ξ±NΞ²\alpha N\beta to replace the N by the sequence Ξ³\gamma to give the sequence Ξ±Ξ³Ξ²\alpha\gamma\beta.

  • We say Ξ±NΞ²\alpha N\beta directly derives Ξ±Ξ³Ξ²\alpha\gamma\beta

    Ξ±NΞ²β‡’Ξ±Ξ³Ξ²\alpha N\beta\Rightarrow\alpha\gamma\beta

5.4 - Sequences of Derivations

🌱 We say that Ξ±\alpha derives Ξ²\beta (i.e. aβ‡’βˆ—Ξ²a \overset{*}{\Rightarrow}\beta) if there is a sequence of direct derivations from Ξ±\alpha to Ξ²\beta (this sequence being {Ξ³0,Ξ³1,⋯ ,Ξ³n}\{\gamma_0, \gamma_1, \cdots,\gamma_n\} where Ξ±=Ξ³0\alpha=\gamma_0 and Ξ³n=Ξ²\gamma_n=\beta)

  • We say a sequence of terminal and nonterminal symbols Ξ±\alpha derives a sequence Ξ²\beta, written

    Ξ±β‡’βˆ—Ξ²\alpha\overset{*}{\Rightarrow}\beta

  • If there is a finite sequence of zero or more direct derivation steps starting from Ξ±\alpha and finishing with Ξ²\beta.

  • That is, there must exist one or more sequences, Ξ³0,Ξ³1,Ξ³2,...,Ξ³n\gamma_0, \gamma_1, \gamma_2, ..., \gamma_n such that,

    Ξ±=Ξ³0β‡’Ξ³1β‡’β‹―β‡’Ξ³n=Ξ²\alpha=\gamma_0\Rightarrow\gamma_1\Rightarrow\cdots\Rightarrow\gamma_n=\beta

  • Note that because zero steps are allowed, Ξ±β‡’βˆ—Ξ±\alpha\overset{*}{\Rightarrow}\alpha for any Ξ±\alpha.

    • For the example above, we have Eβ‡’βˆ—3βˆ’4βˆ’2E\overset{*}{\Rightarrow}3-4-2

5.5 - Nullable

  • A possibly empty sequence of symbols, Ξ±\alpha is nullable if it can derive the empty sequence of symbols, i.e., written as one of the following

    Ξ±β‡’βˆ—Ο΅Ξ±β‡’βˆ—  \alpha\overset{*}{\Rightarrow}\epsilon\\ \alpha\overset{*}{\Rightarrow}\ \

    • Here, we use Ο΅\epsilon to denote the empty sequence of symbols
  • Typically, the derivation Ξ±β‡’βˆ—Ο΅\color{lightblue}\alpha\overset{*}{\Rightarrow}\epsilon is preferred to Ξ±β‡’βˆ—\color{lightblue}\alpha\overset{*}{\Rightarrow} as it may look like we’ve forgotten something.


  • Rules for nullable
    • Ο΅\epsilon is nullable (symbol can derive itself)
    • Any terminal symbol is not nullable (no further derivations for a terminal symbol)
    • A sequence of the form S1S2S3β‹―Sn\color{lightblue}S_1 S_2 S_3\cdots S_n is nullable if all of the constructs S1,⋯ ,Sn\color{lightblue}S_1,\cdots,S_n is nullable.
    • A set of alternatives S1βˆ£β‹―βˆ£Sn\color{lightblue}S_1|\cdots|S_n is nullable if at least one of the alternatives is nullable.
    • EBNF constructs for optionals and repetitions are nullable
    • A nonterminal N is nullable if there is a production for N with a nullable right side.

5.6 - Language Corresponding to a Grammar

The (formal) language L(G)\mathcal{L}(G) corresponding to a grammar, G, is the set of all finite sequences of terminal symbols that can be derived from the start symbol of the grammar using its productions

L(G)={t∈seqβˆ‘βˆ£Sβ‡’βˆ—t}\mathcal{L}(G)=\{t \in\bold{seq}\sum |S\overset{*}{\Rightarrow}t\}

where SS is the start symbol of GG and βˆ‘\sum is its set of terminal symbols.

5.7 - Sentences and Sentential Forms

  • Sentence A sequence of terminal symbols tt such that Sβ‡’βˆ—tS\overset{*}{\Rightarrow}t is called a sentence of the language
  • Sentential Form A sequence of terminal and non-terminal symbols Ξ±\alpha, such that Sβ‡’βˆ—Ξ±S\overset{*}{\Rightarrow}\alpha, is called a sentential form of the language
  • Hence, all sentences are also sentential forms

  • Notice in the derivation shown on the right all of the derivation steps shown in green are Sentential Forms but are not Sentences (as they contain non-terminal symbols.
  • In the derivation, also note that the derivation step in blue is both a Sentence and Sentential Form

Eβ‡’E Op Eβ‡’E Op E Op Eβ‡’3 Op E Op Eβ‡’3βˆ’E Op Eβ‡’3βˆ’4 Op Eβ‡’3βˆ’4βˆ’Eβ‡’3βˆ’4βˆ’2\color{lightgreen}E \Rightarrow E\ Op\ E\\ \Rightarrow E\ Op\ E\ Op\ E\\ \Rightarrow 3\ Op\ E\ Op\ E\\ \Rightarrow 3 - E\ Op\ E\\ \Rightarrow 3-4\ Op\ E\\ \Rightarrow3-4-E\\ \color{lightblue}\Rightarrow3-4-2

6.0 - Language Examples

Eβ†’"("E")"∣aL(E)={a,(a),((a)),(((a))),((((a)))),⋯ }E\rightarrow"(" E")"|a\\ \color{gray}\mathcal{L}(E)=\{a,(a),((a)),(((a))),((((a)))),\cdots\}

  • From the production, we know that the starting symbol is EE

    • The shortest sentence in our language is aa
    • Our language contains an infinite number of sentences, each with finite length

    Aβ†’A a∣ aL(A)={a,aa,aaa,aaaa,⋯ }A\rightarrow A\ a |\ a\\ \mathcal{L}(A)=\{a, aa, aaa, aaaa, \cdots\}

  • We have another infinite language, with an infinite number of sentences (each with finite length)

    Bβ†’B a∣ϡL(B)={Ο΅,a,aa,aaa,aaaa,⋯ }B\rightarrow B\ a|\epsilon\\ \mathcal{L}(B)=\{\epsilon,a,aa,aaa,aaaa,\cdots\}

  • Once again, we have another infinite language, but this time, we can have the empty set.

    C→ϡL(C)={ϡ}C\rightarrow\epsilon\\ \mathcal{L}(C)=\{\epsilon\}

  • This language has no symbols.

    • The language is empty - there are no sentences as there are no derivations with terminal symbols only.
    • Note the distinction between the example above, where the language contains the empty set and this example, where the language doesn’t contain any sentences

    D→"("D")"L(D)={}D\rightarrow"("D")"\\ \mathcal{L}(D)=\{\}

6.1 - Parse Trees

  • A derivation sequence for a string in the language (a sentence) defines a corresponding parse tree
    • A direct derivation step corresponding to the production Nβ†’Ξ±N\rightarrow\alpha generates a parse tree with NN as its root node and the symbols in Ξ±\alpha as the children
    • A derivation sequence generates a parse tree by applying each derivation step to progressively build the parse tree
  • Note that two different derivation sequences may give rise to the same parse tree due to non-terminals being expanded in different orders.

Parse Tree Example

🌱 This is the parse tree for the previous example, where we try to parse 3βˆ’4βˆ’23-4-2

Eβ†’E Op EE\rightarrow E\ Op\ E

Opβ†’ "+"∣"βˆ’"∣"βˆ—"Op\rightarrow\ "+"|"-"|"*"

  • When performing an in-order traversal of all of the leaf nodes, notice that we get 3βˆ’4βˆ’23-4-2
  • As mentioned before, we could expand the - before the first E but that would still yield the same parse tree.
  • We could also expand out the non-terminal terms in a different order, resulting in

Untitled

6.2 - Ambiguous Grammars

🌱 In general, we don’t want our grammars to be ambiguous

  • Ambiguous Grammar for Sequence A grammar, GG, is ambiguous for a sentence, tt, in L(G)\mathcal{L}(G), if there is more than one parse tree for tt
  • Ambiguous Grammar A grammar, GG is ambiguous if it is ambiguous for any sentence in L(G)\mathcal{L}(G)
  • The following grammar is ambiguous:
    • We always start by expanding the starting symbol E to E Op E
    • From there, we could either:
    1. Expand the leftmost E to E Op E (Left associative)
    2. Expand the rightmost E to E Op E (Right associative)

Untitled

  • The following is an ambiguous grammar for expressions with the binary operator β€œ-”, in which N represents a number.

    Eβ†’E "βˆ’" EEβ†’NE\rightarrow E\ "-"\ E\\ E\rightarrow N

πŸ’‘ Key Idea is to add a new non-terminal symbol to break the symmetry of the production

  • To remove the ambiguity and treat β€œ-” as a left-associative operator as usual, we can re-write the grammar to

    • Add a new non-terminal symbol (T) - in this particular configuration, it forces the evaluation of any minuses from left to right as the symbol TT on the right has another layer of expansion to be performed before reaching the terminal symbol.
    • This is as we have to evaluate E before evaluating T

    Eβ†’E "βˆ’" TEβ†’TTβ†’NE\rightarrow E\ "-"\ T\\ E\rightarrow T\\ T\rightarrow N

  • And likewise to treat β€œ-” as a right-associative (not the usual interpretation) we use

    • This works as we have to evaluate E before evaluating T

    Eβ†’T "βˆ’" EEβ†’TTβ†’NE\rightarrow T\ "-"\ E\\ E\rightarrow T\\ T\rightarrow N

Parse Tree Using Left-Associative Unambiguous Grammar

🌱 Creating a parse tree for the 3βˆ’4βˆ’23-4-2 example using left-associative unambiguous grammar

Eβ†’E "βˆ’" TEβ†’TTβ†’NE\rightarrow E\ "-"\ T\\ E\rightarrow T\\ T\rightarrow N

  • Note that the E on the second level (indicated here in orange) must expand to Eβˆ’TE-T as this is the only way that we can have two minuses.
  • Therefore, we expand from the left-hand side first.

Untitled