Unit 8 of 10 · 15 min read

Theory of Computation & Compilers

Regular and context-free languages, Turing machines, undecidability, parsing, semantic analysis, code generation.


Why this unit matters

TOC and Compilers is theory-heavy but extremely predictable at NET level. Pumping lemma proofs, grammar type identification, FIRST/FOLLOW set calculations for parsers, and cyclomatic complexity appear almost every session. The unit has a natural layered structure: automata theory feeds directly into compiler front-end design, so understanding DFAs makes LL(1) parsing intuitive rather than mechanical.


Syllabus map

Foundations: formal languages, diagonal argument, Russell's paradox, non-computable problems
Regular languages: DFA, NFA, epsilon-NFA, equivalence, regular grammars, regex, pumping lemma, lexical analysis
Context-free languages: CFG, PDA, NPDA, CNF, GNF, parse trees, ambiguity, pumping lemma for CFLs
Turing machines: standard TM, variants, UTM, Church-Turing thesis, recursive / r.e. languages
Chomsky hierarchy: Types 0-3 (unrestricted, context-sensitive, context-free, regular)
Unsolvability: halting problem, PCP, decidable vs undecidable classification, tractable vs intractable
Syntax analysis: top-down, LL(1), recursive descent, bottom-up, SLR, LALR(1), LR(1)
Semantic analysis: attribute grammars, SDDs, dependency graphs, S/L-attributed, type checking
Run-time: activation records, symbol tables, parameter passing, stack allocation
Code generation: three-address code, DAGs, basic blocks, code optimisation

Formal language foundations

Alphabets, strings, languages

An alphabet (sigma) is a finite set of symbols. A string is a finite sequence of symbols. A language is a (possibly infinite) set of strings over sigma.

Operations on languages: union, concatenation, Kleene star (L* = union of L^i for i >= 0).
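A minimal sketch of these operations in Python. Kleene star is an infinite union, so the sketch can only enumerate L^i up to a chosen bound; the function names are illustrative:

```python
# Sketch of language operations over a finite alphabet.
# Kleene star is infinite, so we enumerate L^i only for i <= bound.

def concat(L1, L2):
    """Concatenation: every string of L1 followed by every string of L2."""
    return {x + y for x in L1 for y in L2}

def kleene_star(L, bound):
    """Union of L^i for 0 <= i <= bound, with L^0 = {epsilon} = {""}."""
    result = {""}              # L^0
    power = {""}
    for _ in range(bound):
        power = concat(power, L)   # L^(i+1)
        result |= power
    return result

L = {"a", "b"}
print(sorted(kleene_star(L, 2)))  # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```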

Diagonal argument and undecidability

Cantor's diagonal argument shows that the set of real numbers is uncountable. Applied to computation: the set of all Turing machines is countable (they can be encoded as strings), but the set of all languages is uncountable -> some languages have no TM recognising them.

Russell's paradox (the set of all sets that do not contain themselves) leads to a contradiction; it motivates formal axioms for set theory. In a TOC context, it illustrates self-referential undecidability.


Regular languages

DFA

A DFA is a 5-tuple (Q, sigma, delta, q0, F): states Q, alphabet sigma, transition function delta: Q x sigma -> Q, start state q0, accepting states F.

A string w is accepted if the DFA, starting in q0, ends in an accepting state after reading all of w.
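As an illustrative sketch, the 5-tuple maps directly onto a table-driven simulator. The DFA below (accepting strings over {a, b} with an even number of a's) is a made-up example:

```python
# Illustrative DFA: accepts strings over {a, b} with an even number of a's.
# Q = {q0, q1}, start state q0, F = {q0}; delta as a dict on (state, symbol).
delta = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q0", ("q1", "b"): "q1",
}

def accepts(w, start="q0", accepting=("q0",)):
    state = start
    for symbol in w:
        state = delta[(state, symbol)]   # delta is total: exactly one move
    return state in accepting

print(accepts("abab"), accepts("a"))  # True False
```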

NFA and epsilon-NFA

An NFA allows multiple transitions on the same symbol and epsilon-transitions. Every NFA has an equivalent DFA (subset construction). The DFA may have up to 2^n states for an n-state NFA, but the number of reachable states is often much less.
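A sketch of the subset construction, using an illustrative NFA for strings ending in "ab" (it has no epsilon-moves, so the epsilon-closure step is omitted here):

```python
from collections import deque

# Illustrative NFA over {a, b} accepting strings that end in "ab".
# nfa[state][symbol] = set of successor states; state 2 is accepting.
nfa = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"a": set(), "b": {2}},
    2: {"a": set(), "b": set()},
}
nfa_accept = {2}

def subset_construction(nfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = frozenset({start})
    dfa, queue = {}, deque([start_set])
    while queue:
        S = queue.popleft()
        if S in dfa:
            continue                       # already expanded
        dfa[S] = {}
        for sym in alphabet:
            T = frozenset().union(*(nfa[q][sym] for q in S))
            dfa[S][sym] = T
            queue.append(T)
    return dfa, start_set

dfa, start = subset_construction(nfa, 0, "ab")
print(len(dfa))   # 3 reachable DFA states, far fewer than 2^3 = 8
```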

Regular expressions

Constructed from: empty set, epsilon, single symbols, union (|), concatenation, Kleene star (*). Every regex defines a regular language and vice versa.

Equivalence results

DFA = NFA = epsilon-NFA = Regular Grammar = Regular Expression. All define exactly the class of regular languages.

A regular grammar is either right-linear (A -> aB | a | epsilon) or left-linear. A right-linear grammar corresponds directly to an NFA.

Pumping lemma for regular languages

If L is regular, there exists a pumping length p such that any string w in L with |w| >= p can be split as w = xyz where:

  1. |xy| <= p
  2. |y| >= 1
  3. For all i >= 0, xy^i z is in L

Use it to prove languages are not regular (proof by contradiction). You cannot use the pumping lemma to prove a language IS regular.

Classic example: L = {a^n b^n | n >= 0} is not regular. Choose w = a^p b^p. Then xy must be entirely within the a's, so y = a^k (k >= 1). Pumping gives a^(p+k) b^p which is not in L. Contradiction.

Lexical analysis

The lexer (scanner) is the first phase of a compiler. It reads the source character by character and produces tokens. Tokens are specified by regular expressions; the lexer is implemented as a DFA.

Token types: keywords, identifiers, literals, operators, punctuation.

A lexer generator (like lex/flex) takes regex rules and generates a DFA automatically.
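A minimal sketch of the same idea, using Python's re module in place of a generated scanner; the token names and rules below are illustrative, not from any particular language specification:

```python
import re

# Illustrative token rules, in priority order: keywords before identifiers.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:if|else|while|return)\b"),
    ("NUMBER",  r"\d+"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("OP",      r"[+\-*/=<>]"),
    ("PUNCT",   r"[(){};]"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(source):
    # Note: finditer silently skips characters no rule matches;
    # a real lexer would report them as lexical errors.
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":          # drop whitespace
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("while (a < b) a = a + 1;"))
```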


Context-free languages

Context-free grammar (CFG)

A CFG is (V, T, P, S): non-terminals V, terminals T, productions P (of the form A -> alpha where A is a non-terminal and alpha is a string of terminals and non-terminals), start symbol S.

Pushdown automaton (PDA)

A PDA is an NFA augmented with a stack. PDAs recognise exactly the context-free languages. Deterministic PDAs (DPDA) recognise a proper subset: deterministic CFLs.

Chomsky Normal Form (CNF)

Every CFG can be converted to CNF where all productions are: A -> BC (two non-terminals) or A -> a (one terminal). Used in CYK parsing algorithm (O(n^3) for string recognition).

Conversion steps: eliminate epsilon-productions, eliminate unit productions, eliminate useless symbols, then convert remaining productions.
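The CYK algorithm on a CNF grammar can be sketched as follows; the grammar used is an illustrative CNF grammar for {a^n b^n | n >= 1}, with productions S -> AB | AX, X -> SB, A -> a, B -> b:

```python
from itertools import product

# Illustrative CNF grammar for {a^n b^n | n >= 1}:
#   S -> A B | A X,  X -> S B,  A -> a,  B -> b
binary = {("A", "B"): {"S"}, ("A", "X"): {"S"}, ("S", "B"): {"X"}}
terminal = {"a": {"A"}, "b": {"B"}}

def cyk(w, start="S"):
    """CYK membership test: O(n^3) time, O(n^2) table."""
    n = len(w)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals deriving the substring w[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        table[i][0] = set(terminal.get(ch, set()))
    for length in range(2, n + 1):          # substring length
        for i in range(n - length + 1):     # start position
            for k in range(1, length):      # split point
                left = table[i][k - 1]
                right = table[i + k][length - k - 1]
                for B, C in product(left, right):
                    table[i][length - 1] |= binary.get((B, C), set())
    return start in table[0][n - 1]

print(cyk("aabb"), cyk("aab"))  # True False
```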

Greibach Normal Form (GNF)

Every production starts with a terminal: A -> a alpha. Useful for constructing PDA from grammar.

Ambiguity

A grammar is ambiguous if some string has two or more distinct parse trees (equivalently, two distinct leftmost or rightmost derivations). Ambiguity is sometimes inherent: no unambiguous grammar exists for the language at all, e.g., {a^i b^j c^k | i = j or j = k}.

Classic ambiguous grammar: E -> E + E | E * E | id. Ambiguous because "id + id * id" has two parse trees. Fix: introduce precedence and associativity via grammar restructuring.

Pumping lemma for CFLs

If L is CFL, there exists p such that any w in L with |w| >= p can be split as w = uvxyz where:

  1. |vxy| <= p
  2. |vy| >= 1
  3. For all i >= 0, uv^i x y^i z is in L

Used to prove languages are not context-free. Example: L = {a^n b^n c^n} is not CFL.


Turing machines

Standard TM

A TM is (Q, sigma, Gamma, delta, q0, B, F): states, input alphabet, tape alphabet (includes blank B), transition function delta: Q x Gamma -> Q x Gamma x {L, R}, start state, blank symbol, accepting states.

A TM can simulate any algorithm; this is the Church-Turing thesis (not a theorem but a hypothesis about computability).
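The mechanics of the 7-tuple can be sketched with a small simulator; the machine below is an illustrative TM deciding {a^n b^n | n >= 0} by repeatedly marking the leftmost a and a matching b:

```python
# Single-tape TM deciding {a^n b^n | n >= 0}, classic "mark a, mark b" style.
# delta[(state, symbol)] = (new_state, written_symbol, head_move);
# a missing entry means the machine rejects.
BLANK = "_"
delta = {
    ("q0", "a"): ("q1", "X", +1),    # mark leftmost a
    ("q0", "Y"): ("q3", "Y", +1),    # all a's marked: check only Y's remain
    ("q0", BLANK): ("acc", BLANK, +1),
    ("q1", "a"): ("q1", "a", +1),    # scan right past a's and Y's
    ("q1", "Y"): ("q1", "Y", +1),
    ("q1", "b"): ("q2", "Y", -1),    # mark the matching b
    ("q2", "a"): ("q2", "a", -1),    # scan back left to the last X
    ("q2", "Y"): ("q2", "Y", -1),
    ("q2", "X"): ("q0", "X", +1),
    ("q3", "Y"): ("q3", "Y", +1),
    ("q3", BLANK): ("acc", BLANK, +1),
}

def run_tm(w, max_steps=10_000):
    tape = list(w) + [BLANK]
    state, head = "q0", 0
    for _ in range(max_steps):       # guard: a real TM may loop forever
        if state == "acc":
            return True
        key = (state, tape[head])
        if key not in delta:
            return False             # no move defined: reject
        state, tape[head], move = delta[key]
        head += move
        if head == len(tape):
            tape.append(BLANK)       # tape is unbounded to the right
        if head < 0:
            return False
    return False

print(run_tm("aabb"), run_tm("aab"))  # True False
```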

Variants

  • Multi-tape TM: multiple tapes and read/write heads. Every k-tape TM can be simulated by a single-tape TM (with quadratic slowdown).
  • Nondeterministic TM: can branch into multiple configurations. Every NTM can be simulated by a deterministic TM.
  • Universal TM (UTM): a TM that takes as input the encoding of another TM M and an input w, and simulates M on w.

Recursive and recursively enumerable languages

  • Recursive (decidable): some TM halts and accepts every string in L, and halts and rejects every string not in L.
  • Recursively enumerable (r.e.): some TM accepts every string in L; on strings not in L the TM may reject or loop forever.
  • Co-r.e.: the complement of an r.e. language. A language is recursive iff both it and its complement are r.e.

Chomsky hierarchy

Type 0 (unrestricted): alpha -> beta (any); Turing machine; all r.e. languages
Type 1 (context-sensitive): alpha A beta -> alpha gamma beta; linear bounded automaton; {a^n b^n c^n}
Type 2 (context-free): A -> alpha; pushdown automaton; {a^n b^n}
Type 3 (regular): A -> aB or A -> a; finite automaton; {a^n}

Each type is a proper subset of the one above it.


Unsolvability

Halting problem

Does TM M halt on input w? This is undecidable (not recursive). Proof sketch: assume a decider H exists; construct a TM D that, on input <M>, runs H on <M, <M>>: if H says M halts on <M>, D loops forever; otherwise D halts. Running D on its own encoding <D> yields a contradiction either way.

Post Correspondence Problem (PCP)

Given a finite set of domino pairs (top string, bottom string), is there a sequence of dominoes (with repetition) where the concatenation of tops equals concatenation of bottoms? PCP is undecidable.

PCP is used to prove undecidability of CFG properties (e.g., whether two CFGs generate the same language).

Classification

  • Decidable (recursive): membership in regular / CFL, DFA minimisation, CFL emptiness.
  • Undecidable: halting problem, PCP, ambiguity of a CFG, equivalence of two CFGs, emptiness of the intersection of two CFGs.

Syntax analysis (parsing)

Top-down parsing

Recursive descent, one procedure per non-terminal; may require backtracking. For predictive parsing (no backtracking), the grammar must be LL(1).

LL(1) parsing, Left-to-right scan, Leftmost derivation, 1 lookahead token. Uses a parsing table M[A, a] where A is a non-terminal and a is a terminal (or $).

Computing FIRST and FOLLOW sets:

FIRST(alpha):
  If alpha = epsilon, FIRST = {epsilon}
  If alpha starts with terminal a, FIRST = {a}
  If alpha = A beta, FIRST = FIRST(A) - {epsilon}, plus FIRST(beta) if epsilon in FIRST(A)

FOLLOW(A):
  FOLLOW(S) includes $
  If B -> alpha A beta, add FIRST(beta) - {epsilon} to FOLLOW(A)
  If B -> alpha A, or B -> alpha A beta with epsilon in FIRST(beta), add FOLLOW(B) to FOLLOW(A)

A grammar has no LL(1) conflicts if: for every A -> alpha | beta, FIRST(alpha) and FIRST(beta) are disjoint; and if epsilon in FIRST(alpha), FIRST(beta) and FOLLOW(A) are disjoint.

Grammar transformations for LL(1): eliminate left recursion and apply left factoring.

Left recursion elimination: A -> A alpha | beta becomes A -> beta A', A' -> alpha A' | epsilon.
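The FIRST/FOLLOW rules above can be sketched as a fixed-point computation; the grammar used is the standard LL(1) expression grammar (the same one as in the worked examples further down):

```python
# Fixed-point computation of FIRST and FOLLOW for:
#   E -> T E',  E' -> + T E' | eps,  T -> F T',  T' -> * F T' | eps,
#   F -> ( E ) | id.   The empty production is written as [].
EPS = "eps"
grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_of_seq(seq, FIRST):
    """FIRST of a sentential form: scan left to right while nullable."""
    out = set()
    for X in seq:
        f = FIRST[X] if X in grammar else {X}   # terminal: FIRST is itself
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                                # the whole sequence can vanish
    return out

def compute_first_follow(start="E"):
    FIRST = {A: set() for A in grammar}
    FOLLOW = {A: set() for A in grammar}
    FOLLOW[start].add("$")
    changed = True
    while changed:                              # iterate to a fixed point
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                f = first_of_seq(prod, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
                for i, X in enumerate(prod):
                    if X not in grammar:        # FOLLOW only for non-terminals
                        continue
                    tail = first_of_seq(prod[i + 1:], FIRST)
                    add = (tail - {EPS}) | (FOLLOW[A] if EPS in tail else set())
                    if not add <= FOLLOW[X]:
                        FOLLOW[X] |= add
                        changed = True
    return FIRST, FOLLOW

FIRST, FOLLOW = compute_first_follow()
print(sorted(FIRST["E"]), sorted(FOLLOW["T"]))  # ['(', 'id'] ['$', ')', '+']
```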

Bottom-up parsing

Shift-reduce parsing, maintain a stack; either shift the next token onto the stack, or reduce the top of the stack using a production.

LR parsing, Left-to-right scan, Rightmost derivation (in reverse). Uses a table with action (shift/reduce/accept/error) and goto entries.

  • SLR(1), uses FOLLOW sets to resolve conflicts; weaker than full LR.
  • LALR(1), merges states with the same core in LR(1) automaton; same power as LR(1) for most practical grammars; used by yacc/bison.
  • LR(1), most powerful deterministic parser; fewer conflicts than LALR(1).

Parser power: LR(1) > LALR(1) > SLR(1); LL(1) is the weakest in practice (every LL(1) grammar is also LR(1)).


Semantic analysis

Attribute grammar

Attributes associated with grammar symbols. Two kinds:

  • Synthesised attributes, computed from children; passed up the parse tree.
  • Inherited attributes, passed down from parent or siblings.

S-attributed grammar, only synthesised attributes. Can be evaluated bottom-up in a single pass (ideal for LR parsing).

L-attributed grammar, inherited attributes depend only on left siblings and parent. Can be evaluated in a top-down left-to-right pass (compatible with LL parsing).

Syntax-directed definitions (SDD)

An SDD attaches semantic rules to grammar productions. A dependency graph of attributes determines evaluation order (topological sort). If the dependency graph has a cycle, the SDD is not well-defined.
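A sketch of synthesised-attribute evaluation: children are evaluated first and values are passed up, as in an S-attributed SDD (the tuple encoding of the parse tree is illustrative):

```python
# Illustrative parse tree for 2 + 3 * 4 with a synthesised attribute "val":
# leaves carry token values; interior nodes compute val from their children.
def val(node):
    if node[0] == "num":                 # leaf: val comes from the token
        return node[1]
    op, left, right = node               # interior node: E -> E op E
    l, r = val(left), val(right)         # evaluate children first (bottom-up)
    return l + r if op == "+" else l * r

tree = ("+", ("num", 2), ("*", ("num", 3), ("num", 4)))  # 2 + 3 * 4
print(val(tree))  # 14
```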

Type checking

The type checker verifies that operations are applied to compatible types. Type systems can be:

  • Static, checked at compile time (C, Java).
  • Dynamic, checked at runtime (Python).
  • Strong, no implicit type coercions that lose information.

Run-time environment

Activation record (stack frame)

Each function call creates an activation record containing: return value, actual parameters, control link (pointer to caller's activation record), access link (pointer to non-local data), saved machine state, local variables, temporaries.

The activation record stack grows on each call and shrinks on return.

// C function call, activation record layout (typical)
// [return address | old frame pointer | local vars | temporaries]
void f(int x) {
    int y = x + 1;   // y is a local variable on the stack
    // ...
}

Parameter passing

  • Call by value, copy of argument is passed; changes do not affect caller.
  • Call by reference, address of argument is passed; changes affect caller.
  • Call by value-result (copy-restore), value is copied in, changes are copied back on return.
  • Call by name, expression is re-evaluated each time the parameter is used (Jensen's device in ALGOL 60).

Symbol table

A data structure mapping identifiers to their attributes (type, scope, memory location). Common implementation: hash table. Scope management: a stack of hash tables, or a single hash table with scope numbers.
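A sketch of the stack-of-hash-tables scheme (class and method names are illustrative):

```python
# Scoped symbol table as a stack of hash tables: push on block entry,
# pop on block exit; lookup searches from the innermost scope outward.
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                 # global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, attrs):
        self.scopes[-1][name] = attrs      # shadows any outer declaration

    def lookup(self, name):
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None                        # undeclared identifier

st = SymbolTable()
st.declare("x", {"type": "int"})
st.enter_scope()
st.declare("x", {"type": "float"})         # inner x shadows outer x
print(st.lookup("x")["type"])              # float
st.exit_scope()
print(st.lookup("x")["type"])              # int
```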


Intermediate code generation and optimisation

Three-address code (TAC)

Instructions of the form: result = operand1 op operand2. Representations: quadruples (op, arg1, arg2, result); triples (op, arg1, arg2), with the result implied by the triple's position; indirect triples (a list of pointers into a triple table).

Translation examples:

// Source: a = b + c * d
// TAC:
t1 = c * d
t2 = b + t1
a  = t2

Basic blocks and control flow graph

A basic block is a maximal sequence of consecutive statements with one entry (first statement), one exit (last statement), and no branches in the middle. Build a CFG by making each basic block a node and adding edges for possible control flow.
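The leader-based partitioning can be sketched as follows (the TAC string format is illustrative): the first instruction is a leader, every branch target is a leader, and every instruction following a branch is a leader; each block runs from one leader up to the next.

```python
# Partition a TAC listing into basic blocks using the leader rules.
def basic_blocks(code):
    # Collect branch targets (labels appearing after "goto").
    targets = set()
    for instr in code:
        if "goto" in instr:
            targets.add(instr.split("goto")[1].strip())
    leaders = {0}                              # rule 1: first instruction
    for i, instr in enumerate(code):
        label = instr.split(":")[0].strip() if ":" in instr else None
        if label in targets:
            leaders.add(i)                     # rule 2: branch target
        if "goto" in instr and i + 1 < len(code):
            leaders.add(i + 1)                 # rule 3: after a branch
    cuts = sorted(leaders) + [len(code)]
    return [code[cuts[j]:cuts[j + 1]] for j in range(len(cuts) - 1)]

tac = [
    "L1: if a >= b goto L2",   # leader (first instruction, target of goto L1)
    "t1 = a + 1",              # leader (follows a branch)
    "a = t1",
    "goto L1",
    "L2: c = a",               # leader (target of the conditional branch)
]
blocks = basic_blocks(tac)
print(len(blocks))  # 3
```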

Optimisation techniques

Local optimisation (within a basic block):

  • Constant folding: evaluate constant expressions at compile time (3 + 4 -> 7).
  • Constant propagation: replace a variable with its known constant value.
  • Dead code elimination: remove code whose results are never used.
  • Common subexpression elimination: avoid recomputing the same expression.

Global optimisation (across basic blocks):

  • Data-flow analysis: compute information about what values flow into/out of each basic block.
  • Reaching definitions: which definitions of a variable may reach a given point.
  • Live variable analysis: which variables are live (may be used later) at a given point.

Loop optimisation:

  • Loop-invariant code motion: move computations that do not change inside the loop to before the loop.
  • Strength reduction: replace expensive operations with cheaper ones (e.g., multiplication by addition inside a loop).
  • Induction variable elimination: remove redundant induction variables.

Peephole optimisation, examine a small window of instructions and replace with shorter/faster equivalents (e.g., remove a store followed immediately by a load of the same variable).
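A sketch of a single peephole rule, removing a load that immediately follows a store of the same variable to the same register (the instruction tuple format is illustrative):

```python
# Peephole pass over a 2-instruction window: a store of register r to
# variable x, immediately followed by a load of x back into r, is redundant.
def peephole(code):
    out, i = [], 0
    while i < len(code):
        if (i + 1 < len(code)
                and code[i][0] == "store" and code[i + 1][0] == "load"
                and code[i][1] == code[i + 1][1]     # same variable
                and code[i][2] == code[i + 1][2]):   # same register
            out.append(code[i])       # keep the store, drop the load
            i += 2
        else:
            out.append(code[i])
            i += 1
    return out

code = [("store", "x", "r1"), ("load", "x", "r1"), ("add", "r1", "r2")]
print(peephole(code))  # [('store', 'x', 'r1'), ('add', 'r1', 'r2')]
```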


Worked examples

Example 1: Pumping lemma, proving non-regularity

Claim: L = {ww | w in {a, b}*} is not regular.

Choose pumping length p. Let s = a^p b^p a^p b^p (which is in L with w = a^p b^p).

|s| = 4p >= p. By the lemma, s = xyz with |xy| <= p and |y| >= 1.

Since |xy| <= p, both x and y fall entirely within the first block of a's. So y = a^k (1 <= k <= p).

Pump with i = 2: xy^2z = a^(p+k) b^p a^p b^p, of length 4p + k. For this to be in L it must equal vv for some v. If k is odd, the length is odd and no such split exists. If k is even, each half has length 2p + k/2: the first half is a^(p+k) b^(p-k/2), while the second half is b^(k/2) a^p b^p, which begins with b. The halves differ, so the pumped string is not in L. Contradiction; L is not regular.

Example 2: First and Follow sets

Grammar: E -> T E', E' -> + T E' | epsilon, T -> F T', T' -> * F T' | epsilon, F -> ( E ) | id.

FIRST(F) = {(, id}
FIRST(T') = {*, epsilon}
FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, epsilon}
FIRST(E) = FIRST(T) = {(, id}
FOLLOW(E) = {$, )}
FOLLOW(E') = FOLLOW(E) = {$, )}
FOLLOW(T) = FIRST(E') - {epsilon} union FOLLOW(E') = {+, $, )}

Example 3: LR(0) item sets

Grammar: S -> A A, A -> a A | b. Construct canonical LR(0) items to build the SLR(1) parsing table.

Item set I0: S' -> . S, S -> . A A, A -> . a A, A -> . b. On 'a': goto I1 = {A -> a . A, A -> . a A, A -> . b}. On 'b': goto I2 = {A -> b .} (a reduce state). On A: goto I3 = {S -> A . A, A -> . a A, A -> . b}. Continue until every item set is closed.

Example 4: Three-address code for a while loop

// Source
while (a < b) {
    a = a + 1;
}

// TAC
L1: if a >= b goto L2
    t1 = a + 1
    a  = t1
    goto L1
L2: ...

Example 5: Loop-invariant code motion

// Before optimisation
for (int i = 0; i < n; i++) {
    int x = a * b;    // a and b do not change in the loop
    arr[i] = x + i;
}

// After hoisting loop-invariant computation
int x = a * b;        // moved outside the loop
for (int i = 0; i < n; i++) {
    arr[i] = x + i;
}

Quick-revision summary

  • Regular: DFA = NFA = regex = right-linear grammar. Pumping lemma: |xy| <= p, |y| >= 1.
  • CFL: CFG = PDA. Pumping: |vxy| <= p, |vy| >= 1. CNF: A -> BC | a. GNF: A -> a alpha.
  • Chomsky hierarchy: Type 3 < Type 2 < Type 1 < Type 0.
  • Halting problem is undecidable. PCP is undecidable.
  • LL(1): predictive top-down, uses FIRST/FOLLOW, no left recursion, no ambiguity.
  • Parser power: LR(1) > LALR(1) > SLR(1) > LL(1).
  • Synthesised attributes: computed from children (S-attributed, bottom-up evaluation).
  • Activation record: return address, parameters, locals, saved registers.
  • Peephole: small-window replacements; loop-invariant motion: hoist unchanging code before the loop; strength reduction: multiply -> add.

How to study this unit

  1. Draw the Chomsky hierarchy on a card and be able to give an example language and automaton for each type.
  2. Practice pumping lemma proofs for at least 3 non-regular and 2 non-CFL languages. The structure of the proof is always the same once you learn it.
  3. For syntax analysis, build the FIRST and FOLLOW sets for two or three grammars by hand. Then construct the LL(1) parsing table and verify it has no conflicts.
  4. Understand LR parsing conceptually: items, states, shift vs reduce. You do not need to build full LR(1) tables manually for NET, but SLR(1) is fair game.
  5. For optimisation questions, classify the technique (local, global, loop, peephole) before answering.
  6. Memorise the halting problem proof structure (diagonalisation / self-reference); it appears in various forms.
