Week 11.1

🏷️ Replace this section with some keywords that can be used for searching, tagging and the indexing of this page.

Sections

🌱 Links to all Heading1 Blocks in this page.

Regular Expressions
Finite Automata (State Machines)

1.0 - Regular Expressions

1.1 - Defining Lexical Tokens using Regular Expressions

In JFlex, we have:
- Lexical tokens are specified via a regular expression
- e.g. an IDENTIFIER token is specified as:
```
{Letter} ({Letter} | {Digit})*
```
- A regular expression is translated into a deterministic finite automaton (DFA) to recognise the tokens
- DFA recognisers are very efficient (O(n) in the length of the input length n)
- JFlex doesn’t directly produce a DFA - it first produces a non-deterministic finite automaton (NFA) which is then translated to a DFA to minimise size.
- The DFA initially produced isn’t as small as it could be, and can be minimised to reduce the size further.
For example, the following NFA (Left) and DFA (Right) are for matching the expression $a\ b\ |\ c$

Untitled

Non-Deteterministic Finite Automata are able to have transitions out of a state on the same symbol (e.g. transitioning out of state 1 to state 2 and 6 on the symbol $\epsilon$ )

1.2 - Regular Expressions Syntax

Regular Expressions are defined with respect to an alphabet (set of symbols) $\Sigma$
🌱 The syntax of regular expressions in BNF is as follows

\ \ \ \ \ \ \ \ \ \ \ \ \ \ \def\rq{\text{'}} \def\or{\ |\ } e::=a\or\varepsilon\or\empty\or e\lq|\rq e\or e\ e\or e^{\lq*\rq} \or \lq(\rq\ e\ \lq)\rq

where a stands for a symbol from $\Sigma$. Repetition has higher precedence than concatenation, which has higher precedence than alternation. $a:$ Symbol from alphabet ($a\in\Sigma)$ $\varepsilon:$ Empty sequence of symbols $\empty:$ Represent matching of no input strings at all.

\def\rq{\text{'}} e\ \lq|\rq\ e:

Choice between two regular expressions $e\ e:$ Concatenation of regular expressions

\def\rq{\text{'}} e^{\lq * \rq}:

Repetition of regular expressions

\def\rq{\text{'}} \lq(\rq\ e\ \lq)\rq:

Grouping of regular expressions - Note that regular expressions use ‘|’ to express alternatives, but is also used to represent alternatives in BNF. Hence the use of quotes around the symbol in the REGEX pattern matching string shown above. ## 1.3 - Language of Regular Expressions - Given an alphabet $\Sigma$, a regular expression defines a language (i.e. a set of strings of symbols from the alphabet). - Let “a” be a symbol, and $e, f$ be regular expressions

\def\rq{\text{'}}
\def\L{\mathcal{L}}
\begin{aligned}
\text{Regular Expression}\ &\ \ \ \ \ \  \text{Language}\\
\L(a)&=\{\lq a\rq\}\\
\L(\varepsilon)&=\{\lq\ \rq\}\\
\L(\empty)&=\{\}\\
\L(e|f)&=\L(e)\cup\L(f)\\
\L(e\ f)&=\L(e)^\frown\L(f)\\
\L(e^*)&=\cup_{n\in\mathbb{N}}(\L(e))^n\\
\L((e))&=\L(e)
\end{aligned}
$$

Note here that the symbol $\def\rq{\text{'}}\lq^\frown\rq$ indicates set concatenation. In this context, it means that we can take any string in language $\mathcal{L}(e)$ and concatenate it with any string in $\mathcal{L}(f)$ to form the new language.
Note here in the second-last definition, the set is essentially formed by concatenating $\mathcal{L}(e)$ with itself $n$ times.
In the last definition, the regular expression of a grouping is just the regular expression of the grouping itself.

1.3.1 - Concatenation of Languages

The concatenation of two languages (sets of strings), $L_1$ and $L_2$ is defined as the set of all strings formed by taking a string $s_1\in L_1$ and $s_2\in L_2$ and concatenating them.
For example:
$\def\rq{\text{'}}\{\lq a\rq,\lq b\rq\}^\frown\{\lq c\rq, \lq d\rq\}=\{\lq ac\rq, \lq ad\rq, \lq bc\rq, \lq bd\rq\}$

1.3.2 - Iteration of Languages

The iteration $L^n$ of a language, $L$ to the power $n$ consists of $L$ concatenated with itself $n$ times.
$\def\rq{\text{'}} \begin{aligned} L^0&=\{\lq\rq\}\\ L^1&=L\\ L^2&=L^\frown L\\ L^3&=L^\frown L^\frown L\\ &\cdots \end{aligned}$
For example, $\def\rq{\text{'}}L(a\ |\ b)=\{\lq a\rq , \lq b\rq\}$ and

\def\rq{\text{'}} {\lq a\rq, \lq b\rq}^0 = {\lq\rq}\ {\lq a\rq, \lq b\rq}^1 = {\lq a\rq, \lq b\rq}\ {\lq a\rq, \lq b\rq}^2={\lq aa\rq, \lq ab\rq, \lq ba\rq, \lq bb\rq}\ {\lq a\rq, \lq b\rq}^3={\lq aaa\rq, \lq aab\rq , \lq aba\rq, \lq abb\rq, \lq baa\rq,\lq bab\rq,\lq bba\rq,\lq bbb\rq}

- Hence:

\def\rq{\text{'}} \begin{aligned} \mathcal{L}((a|b)^*)\ =\ {&\lq\rq, \ &\lq a\rq, \lq b\rq,\ &\lq aa\rq, \lq ab\rq, \lq ba\rq, \lq bb\rq,\ &\lq aaa\rq, \lq aab\rq, \lq aba\rq, \lq abb\rq, \lq baa\rq, \lq bab\rq, \lq bba\rq,\lq bbb\rq\ \ \ \ \ \ \ \ \ \ \ &\cdots} \end{aligned}

\begin{aligned}
&\color{gray}\text{iteration 0 times}\\
&\color{gray}\text{iteration 1 time}\\
&\color{gray}\text{iteration 2 times}\\
&\color{gray}\text{iteration 3 times}\\\\
\end{aligned}

- Note that this language does not include any infinite strings of symbols, because the iterations only generates finite concatenations # 2.0 - Finite Automata (State Machines) - A finite automata is a finite state machine consisting of a finite set of states, with labelled transitions between states - Each transition is labelled with either a symbol $(a\in\Sigma)$ or empty $(\varepsilon)$. - We distinguish between `deterministic` and `non-deterministic` finite automata - For a `deterministic finite automaton` (DFA), empty transitions are not allowed, and for each state and symbol, the next state (if there is one) is uniquely determined. - For a `nondeterministic finite automata` (NFA), empty transitions are allowed and there may be multiple transitions from a state to different next states on the same input symbol. - To create a DFA, we first create a NFA ## 2.1 - Deterministic Finite Automata (Formal) - A DFA, $D$ consists of - A finite alphabet of symbols, $\Sigma$ - A finite set of states, $S$ - A transition function, $T: \Sigma\times S⇸S,S$ which maps an (input) symbol and a (current) state to the (next) state; - The function T may not be defined for all pairs of symbol and state - Since the mapping may not exist for every symbol, state pair, we use the partial mapping symbol $⇸$ - A start state $s_0$ - A set of final (or accepting) states, $F$ --- 🌱 The language of a DFA is the set of strings that it can recognise. - The language $\mathcal{L}(D)$ of a DFA $D$ is the set of finite strings of symbols from $\Sigma$ such that $c\in\mathcal{L}(D)$ if and only if there exists a sequence of states $s_1, s_2, \cdots, s_n$ where $n$ is the length of $c,$ such that:

T(c_1,s_0)=s_1\\
T(c_2,s_1)=s_2\\
\vdots\\
T(c_n, s_{n-1})=s(n)\\
$$

and $s_0$ is the start state and $s_n$ is in the set of final (or accepting) states, $F$ ’
Through the concatenation of the labels $c_1, c_2, \cdots, c_n$ , we should get the string that we’re trying to match.

2.1.1 - DFA Example

The following is a DFA as:
- $\Sigma=\{a,b,c\}$
- $S=\{1, 2, 3, 4\}$

T(a,1)=2\ T(c,1)=4\ T(b,2)=3

- $s_0=1$ as denoted by the arrow pointing into the state (this is convention) - $F=\{3,4\}$ as denoted by the red double circles. ![Untitled](/images/notes/COMP4403/Untitled%202.png) --- 🌱 What language (set of strings) does this DFA match?

\def\rq{\text{'}} \mathcal{L}(DFA)={\lq ab\rq, \lq c\rq}

- In traversing from a start state to a final / accepting state, we can generate the strings “ab” and “c’. - Therefore, the language of the DFA is

\def\rq{\text{'}} {\lq ab\rq, \lq c\rq}

## 2.2 - Nondeterministic Finite Automata (NFA) 🌱 The differences between the formal definition of a DFA and NFA are very similar - the differences are highlighted here in green. - A NFA, N, consists of: - A finite alphabet of symbols, $\Sigma$ - A finite set of states, $S$ - A transition function, $T:(\Sigma\cup\{\varepsilon\})\times S⇸\mathbb{P}S$ which maps an input symbol or the empty string $(\varepsilon)$ and a current state to a set of possible (next) states; The function T may not be defined for all pairs of symbol and state - A start state, $s_0$ - A set of final (or accepting) states --- 🌱 The language of a NFA is the set of strings that it can recognise. This is similar to the definition of a language of a DFA. The differences are highlighted here in green. - The language $\mathcal{L}(D)$ of a NFA $N$ is the set of finite strings of symbols from $\Sigma$ such that $c\in\mathcal{L}(D)$ if and only if there exists a sequence $c'\in(\Sigma\cup\{\varepsilon\})^*$, such that removing all the $\varepsilon$ elements from $c'$ gives $c$ where $n$ is the length of $c',$ such that:

T(c_1,s_0)=s_1\\
T(c_2,s_1)=s_2\\
\vdots\\
T(c_n, s_{n-1})=s(n)\\
$$

and $s_0$ is the start state and $s_n$ is in the set of final (or accepting) states, $F$ ’
Through the concatenation of the labels $c_1, c_2, \cdots, c_n$ , we should get the string that we’re trying to match.
Empty transitions are used to generate the strings, but then omitted from the strings.

2.2.1 - NDA Example

The following is a NDA as:
- $\Sigma=\{a,b,c\}$
- $S=\{1,2,3,4\}$

\begin{aligned} T(\varepsilon,1)&={2,6}\ \ \ \ \ \ \ \ \ \ T(b,3)&={4}\ T(c,6)&={7} \end{aligned} \begin{aligned} T(a,2)={3}\ T(\varepsilon,4}={5}\ T(\varepsilon,7}={5} \end{aligned}

- $s_0=1$ - $F=\{5\}$ ![Untitled](/images/notes/COMP4403/Untitled%203.png) ## 2.3 - Converting a Regular Expression to NFA - The translation of a regular expression to a non-deterministic finite automaton is based on the structure of the regular expression - For each of the syntactic forms of regular expressions given in Definition 1, an NFA is given for that form. - If the syntactic form contains sub-expressions, (e.g. $e\ |\ f$ contains sub-expressions $e$ and $f$) the NFA for that form is constructed from the NFAs for the sub-expressions $e$ and $f$. 1. Single Symbol / Empty Transition 2. Concatenation 3. Alternatives 4. Repetition ### 2.3.1 - NFA for Single Symbol or Empty Transition 🌱 This is the NFA for matching a single symbol $a$ or empty transition $\varepsilon$ - Here, the final (accepting) states are indicated by a double circle. - Here, the topmost NFA matches the string $\def\rq{\text{'}}\lq a\rq$ - Additionally, the bottommost NFA matches the empty string, $\varepsilon$ ![Untitled](/images/notes/COMP4403/Untitled%204.png) ### 2.3.2 - NFA for Concatenation 🌱 This is the NFA template for matching the concatenation of the expressions $e\ f$. 1. Firstly, we transform the expression $e$ into its NFA ![Untitled](/images/notes/COMP4403/Untitled%205.png) 2. We next obtain the NFA for the expression $f$ ![Untitled](/images/notes/COMP4403/Untitled%206.png) 3. Joining the final state of expression $e$’s NFA to the start state of the NFA for expression $f$ gives the NFA for the concatenation of the expressions $e\ f$. - Note here that the final state of the NFA for expression $e$ is no longer a final state in the concatenated NFA. ![Untitled](/images/notes/COMP4403/Untitled%207.png) ### 2.3.3 - NFA for Alternatives 🌱 This is the NFA template for matching the alternative expressions $e\ |\ f$ 1. Firstly, we obtain the NFA for the expression $e$ ![Untitled](/images/notes/COMP4403/Untitled%205.png) 2. We similarly obtain the NFA for the expression $f$ ![Untitled](/images/notes/COMP4403/Untitled%206.png) 3. Joining the two NFAs in the following configuration gives the following larger NFA for alternation ![Untitled](/images/notes/COMP4403/Untitled%208.png) - Note that the final states of the NFAs for both $e$ and $f$are no longer final states in the larger NFA for $e\ |\ f$ ### 2.3.4 - Repetition 🌱 This is the NFA template for matching the repeated expression $e^*$ 1. Firstly, we obtain the NFA for the expression $e$ ![Untitled](/images/notes/COMP4403/Untitled%205.png) 2. We the introduce a new start and end state: - The start state has the option to match the expression $e$ again, or jump to the start state - The old end state may loop back to the old start state, or jump to the new end state: ![Untitled](/images/notes/COMP4403/Untitled%209.png) - We do this so that the repetition construct can match zero or more occurrences. ### 2.3.5 - NFA Example - Repetition and Concatenation 🌱 Construct the NFA for the Regular Expression $a\ b\ |\ c$ We want to construct the NFA for the regular expression $a\ b\ |\ c$. We recognise that the outermost construct is the alternative construct. To construct this NFA, we first need to create the NFAs for the sub-expressions $a\ b$ and $c$ 1. We first construct the NFA for the expression $a\ b$ - we recognise that this is a concatenation, and we first need to construct the NFAs for each of the sub-expressions 1. We first construct the NFA for the expression $a$ ![Untitled](/images/notes/COMP4403/Untitled%2010.png) 2. We the construct the NFA for the expression ![Untitled](/images/notes/COMP4403/Untitled%2011.png) 3. Using the NFA concatenation rules, we join the two sub-NFAs into the NFA for matching the concatenation of expressions $a\ b$ ![Untitled](/images/notes/COMP4403/Untitled%2012.png) 2. We the construct the NFA for the expression $c$ - this is straightforward as we are only creating a NFA that matches a single character. ![Untitled](/images/notes/COMP4403/Untitled%2013.png) Joining these sub-expression NFAs using the alternation rule, we get: ![Untitled](/images/notes/COMP4403/Untitled%2014.png) And, finally numbering the states: ![Untitled](/images/notes/COMP4403/Untitled%2015.png) ### 2.3.6 - Example for Repetition of Alternatives 🌱 Construct the NFA for the Regular Expression $(a\ |\ b)^*$ 1. We first construct the NFA for the expression $a$ ![Untitled](/images/notes/COMP4403/Untitled%2010.png) 2. We the construct the NFA for the expression $b$ ![Untitled](/images/notes/COMP4403/Untitled%2011.png) 3. We then place the NFAs for expression $a$ and $b$ into the NFA for alternatives ![Untitled](/images/notes/COMP4403/Untitled%2016.png) 4. We then place the NFA for the expression $a\ |\ b$ into the NFA for ![Untitled](/images/notes/COMP4403/Untitled%2017.png) ## 2.4 - From Regular Expression to NFA In the translation from a regular expression, $r$ to an NFA, the generated NFA has a few properties that do not necessarily hold for an arbitrary NFA (i.e. one not generated from a regular expression) - The NFA has a single final (accepting) state - The initial state of the NFA only has outgoing transitions - The final state only has incoming transitions The translation rules preserve these properties. ## 2.5 - From NFA to DFA A DFA cannot have - More than one transition leaving a state on the same symbol - Any empty transitions A NFA N can be translated to an equivalent DFA $D$ - Equivalent in the sense that they accept the same languages, i.e. $\mathcal{L}(N)=\mathcal{L}(D)$ - That is, they can match the same sets of strings How is this done? - An NFA is transformed into a DFA in which the labels on the states of the DFA are sets of states from the NFA - The sets of states that label a DFA state are formed by collecting all the states that can be reached from NFA states by empty transitions. ### 2.5.1 - Empty Closure **Empty Closure of a State** - The empty closure of a state $x$ in an NFA $N$ $\varepsilon$-closure$(x,N)$, is the set of states in $N$ that are reachable from $x$ via any number of empty transitions **Empty Closure of a Set of States** - The empty closure of a set of sates $X$ in a NFA $N$, $\varepsilon$-closure$(x,N)$, is the set of states in $N$ that are reachable from any of the states in $X$ via any number of empty transitions:

\varepsilon-\text{closure}(X,N)=\bigcup_{x\in X} \varepsilon-\text{closure}(x,N)
$$

Empty Closure Example

🌱 What is the empty closure of the NFA shown?

$\text{State}\ X$	$\varepsilon-\text{closure}(x, NFA)$
1	$\{1, 2, 6\}$
2	$\{2\}$
3	$\{3\}$
4	$\{4,5\}$
5	$\{5\}$
6	$\{6\}$
7	$\{7, 5\}$

Untitled

From state 1, we can go to state 2 or 6 only transitioning on $\varepsilon$ . Therefore, $\varepsilon-\text{closure}(x,NFA)=\{1, 2, 6\}$
From state 2, we can not go to any other states only transitioning on $\varepsilon$ . Therefore, $\varepsilon-\text{closure}(x,NFA)=\{2\}$
From state 3, we can not go to any other states only transitioning on $\varepsilon$ . Therefore, $\varepsilon-\text{closure}(x,NFA)=\{3\}$
From state 4, we can go to state 5 only transitioning on $\varepsilon$ . Therefore, $\varepsilon-\text{closure}(x,NFA)=\{4,5\}$
From state 5, we can not go to any other states only transitioning on $\varepsilon$ . Therefore, $\varepsilon-\text{closure}(x,NFA)=\{5\}$
From state 6, we can not go to any other states only transitioning on $\varepsilon$ . Therefore, $\varepsilon-\text{closure}(x,NFA)=\{6\}$
From state 7, we can go to state $5$ only transitioning on $\varepsilon$ . Therefore, $\varepsilon-\text{closure}(x,NFA)=\{7, 5\}$

2.6 - Constructing the DFA from the NFA

🌱 The first step in constructing the DFA from a NFA is determining the label of the start state.

The label of the start state of the DFA consists of the set of states containing the start state $s_0$ of the NFA plus all the states in the NFA that are reachable from its start state by one more more empty transitions:
$S_0=\varepsilon-\text{closure}(s_0,N)$
The process for forming a DFA works with a set of unmarked DFA states by selecting an unmarked DFA state and considering all transitions from it on symbols
The initial set of unmarked states contains just $S_0$
- Starting from the initial state, we go through every state that is reachable from the initial state.

The following process is repeated until there are no unmarked DFA states left.

An unmarked DFA state $S$ is selected (the first one is $S_0$ )
For each symbol a (every symbol in the alphabet)
- We consider the set of states that can be reached from any state in S by a transition on a call a;
- call this set of states X
- if X is nonempty, we add a new state to the DFA labelled with $X'=\varepsilon-\text{closure}(X, N)$ , unless a state with that label already exists (in which case it is re-used)
- A transition from $S$ to $X'$ is added to the DFA
The state $S$ is marked as having been processed.

2.6.1 - NFA to DFA Process - Simple Example

🌱 Consider the following NFA and follow the (above) process to turn it into a DFA

Untitled

The start state of the DFA is going to consist of the initial start state, and all of the states that can be reached by state 1 by empty transitions. That is, our new initial stages merges the old states $\{1, 2, 6\}$
We now need to determine what states can be reached from this new state:
- We can go from 2→3 on “a”
- We can go from 6→7 on “c”
On that transition to state 3, we transition to a new state that is state 3 + any state that we can reach from state 3 + empty transitions.
- Since there are no such states, we just leave it as is
- We can transition out of this state on “b”
On that transition to state 7, we transition to a new state that is state 7 + any empty transition.
- We can transition to state 5, therefore, the new state is $\{7, 5\}$
Revisiting the transition out of state 3 on “b”, we can get to state 4
- This state also has a null transition to state 5: $\{4, 5\}$
The states that include $5$ are final or accepting states as 5 was the original accepting state.

2.6.2 - NFA to DFA Process - Example 2

🌱 Consider the following NFA and follow the above process to turn it into a DFA for the REGEX statement for $(b|a)^* a^*$

Untitled

We start from the initial state of the DFA.
- This state contains the NFA’s start state $\{1\}$ , and any state that we can get to via empty transitions $\{2, 3, 5, 8, 9, 11\}$
- Therefore, the empty closure of the start state is $\{1, 2, 3, 5, 8, 9, 11\}$ ${1, 2, 3, 5, 8, 9, 11}$
  - This means that the start state of the DFA merges these states from the NFA
- We now determine the transitions out of this state
  - We have a transition out on $a$ as $9\overset{a}{\rightarrow}10$
  - We have a transition out on $b$ as $3\overset{b}{\rightarrow}4$
  - We have a transition out on $c$ as $5\overset{c}{\rightarrow} 6$
- We now mark the initial state as seen or visited, and move to other states.
We now consider the transition out of the initial state on $a$ $a$ :
- From NFA state 9, we can transition to state 10 on $a$
- From state 10, we can get to state 11 and back to state 9 on empty transitions
- Therefore, the empty closure of this state is $\{10,9,11\}$
We now consider the transition out of the initial state on $b$ $b$ :
- From NFA state 4, we can transition to the following states on empty transitions: $\{4, 7, 2, 3, 5, 8, 9, 11\}$
- From this state, we can transition out on $a$ , from $9\overset a \rightarrow 10$ which takes us to a state containing the empty closure of 10 (which is the state above)
- From this state, we can transition out on $b$ , from $3\overset b \rightarrow 4$ which takes us to a state containing the empty closure of 4 (which is this state)
- From this state, we can transition out on $c$ , from $5\overset c \rightarrow 6$ which takes us to a state containing the empty closure of 6 (which is the state below)
We now consider the transition out of the initial state on $c$ $c$ :
- From NFA state 5, we can transition to state 6 on $c$
- From state 6, we can get to the following states on empty transitions: (this is state 6’s empty closure) $\{6, 7, 2, 3, 5, 8, 9, 11\}$
- From this state, we can transition out on $a$ , from $9\overset a \rightarrow 10$ which takes us to a state containing the empty closure of 10
- From this state, we can transition out on $b$ , from $3\overset b \rightarrow 4$ which takes us to a state containing the empty closure of 4
- From this state, we can transition out on $c$ , from $5\overset c \rightarrow 6$ which takes us to a state containing the empty closure of 6 (which is this state)

Untitled

2.7 - Minimising a DFA

The DFA generated in the above example can be simplified to one with only two states.
To the right is a simplified version of the DFA with the states re-labelled to A, B, C and D
To minimise the DFA, we merge states that have the same transition to equivalent states.
For example, the states A, B and C are equivalent as they all:
- Transition out on $a$ onto $D$
- Transition out on $b$ onto $B$
- Transition out on $c$ onto $C$

Untitled

Therefore, the DFA can be minimised:

Untitled

2.7.1 - Process for Minimising a DFA

Start by partitioning the states into a group of final states, and a group of non-final states.
For example, in the example above, all of the states are final states so we just get the one group
$\{A, B, C, D\}$
For each symbol $x,$ we require all sates in a group $G_1$ to transition to the same group $G_2$ . For the example a transition on $b$ takes states $A, B, C$ back to the same group $\{A, B, C\}$ but there is no transition on $b$ from state $D$ , and hence we must split the group into groups $\{A, B, C\}$ and $\{D\}$
Once that is done for every symbol $x$ , the transitions from all states in each group on $x$ are to the same group, and hence we have finished.

3.0 - Regular Expressions for Lexical Analyser

The lexical analyser generator JFlex makes use of translating regular expressions into DFAs in order to build the scanner
The input to the lexical analyser generator is a list of regular expressions, along with an action to be taken when that token is matched.
The generated lexical analyser matches:
- The longest prefix of the input from the current position that matches any of the regular expressions in the list and
- If the prefix of the input matches more than one regular expression, the action associated with the first matching regular expression in the list is executed.
For PL0, the string “end” followed by a blank matches both:
- The regular expression for the keyword “end” and
- The regular expression for an identifier
The action selected for “end” is that for the keyword “end” because the regular expression for the keyword appears before the regular expression for an identifier
The string “ending” followed by a blank matches only the regular expression for an identifier; it will not be split into the keyword “end” followed by the identifier “ing”
But the string “end ing” will be matched as the keyword “end” followed by the identifier “ing”