Unit 5 of 10 · 10 min read

Compiler Design

Lexical analysis, parsing, syntax-directed translation, intermediate code, optimisation.


Why this unit matters

Compiler Design in ISRO CS papers leans heavily on parsing theory. Expect questions that ask you to compute FIRST and FOLLOW sets, build an LL(1) parsing table, identify shift-reduce conflicts in an SLR(1) parser, or classify a grammar. Semantic analysis and code generation are asked at a conceptual level. Optimisation questions (constant folding, common-subexpression elimination) appear occasionally.

The unit is self-contained but benefits from understanding formal languages (TOC) for the automata basis of lexers and parsers.

Syllabus map

  • Phases of a compiler

    • Lexical analysis, syntax analysis, semantic analysis, intermediate code generation, optimisation, code generation
    • Symbol table: structure, scope management
  • Lexical analysis

    • Tokens, lexemes, patterns
    • Regular expressions to NFA to DFA (Thompson construction, subset construction)
    • Lex/flex tool overview
  • Syntax analysis (parsing)

    • Context-free grammars, derivations, parse trees
    • Ambiguity
    • Top-down parsing: recursive descent, predictive parsing, LL(1) grammars
    • FIRST and FOLLOW set computation
    • LL(1) parsing table construction and conflict detection
    • Bottom-up parsing: shift-reduce parsing
    • LR(0) items and automaton
    • SLR(1), canonical LR(1) (CLR), LALR(1) parsers
    • Shift-reduce and reduce-reduce conflicts
  • Semantic analysis

    • Attribute grammars: synthesised and inherited attributes
    • S-attributed grammars: only synthesised attributes (suitable for bottom-up parsing)
    • L-attributed grammars: synthesised and inherited attributes from left (suitable for top-down)
  • Intermediate code generation

    • Three-address code: forms (x = y op z, x = op y, x = y, goto L, if x goto L)
    • Quadruples: (op, arg1, arg2, result)
    • Triples: (op, arg1, arg2) with implicit numbering
    • Static single assignment (SSA): brief concept
  • Runtime environments

    • Activation records: return address, old frame pointer, local variables, temporaries, actual parameters
    • Stack frame layout
    • Static and dynamic scope
  • Code optimisation

    • Machine-independent optimisations: constant folding, constant propagation, dead code elimination, common-subexpression elimination (CSE), strength reduction, loop-invariant code motion
    • Control flow graph (CFG): basic blocks, edges
    • Data flow analysis: reaching definitions, available expressions, live variables
  • Code generation

    • Target architecture basics
    • Register allocation: graph colouring concept
    • Instruction selection

Lexical analysis

The lexer converts the character stream into a stream of tokens. Each token has a type (IDENTIFIER, INTEGER_LITERAL, KEYWORD, OPERATOR, etc.) and a lexeme (the actual substring).

Regular expressions describe token patterns. These are converted to NFAs (Thompson construction) and then to DFAs (subset construction) for efficient scanning.

ISRO trap: the lexer cannot check context. It recognises "int" as a keyword and "integer" as an identifier, but it cannot tell if an identifier is declared before use (that is a semantic check).
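As a sketch, the token-stream idea can be demonstrated with Python's re module standing in for a lex/flex-generated scanner. The token names and patterns here are illustrative, not from any particular compiler; note how listing keywords before the identifier pattern makes "int" a KEYWORD while "integer" stays an IDENTIFIER:

```python
import re

# Token patterns, tried in order; the keyword alternatives come before the
# general identifier pattern, so "int" matches KEYWORD, not IDENTIFIER.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|if|else|while|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("INTEGER",    r"\d+"),
    ("OPERATOR",   r"[+\-*/=<>]"),
    ("PUNCT",      r"[();{},]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (token_type, lexeme) pairs; whitespace is skipped."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

For example, tokenize("int integer = 42;") classifies "int" as a keyword and "integer" as an identifier, exactly the distinction the lexer can make; whether "integer" was declared is beyond its reach.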

Syntax analysis

FIRST and FOLLOW sets

FIRST(alpha) is the set of terminals that can begin strings derived from alpha. Include epsilon in FIRST(A) if A can derive epsilon.

Rules for FIRST:

  • FIRST(a) = {a} for terminal a.
  • FIRST(epsilon) = {epsilon}.
  • For A -> X1 X2 ... Xk: add FIRST(X1) - {epsilon}. If epsilon in FIRST(X1), add FIRST(X2) - {epsilon}, etc. If all Xi can derive epsilon, add epsilon.

FOLLOW(A) is the set of terminals that can appear immediately to the right of A in some sentential form. $ (end-of-input) is in FOLLOW(start symbol).

Rules for FOLLOW:

  • Add $ to FOLLOW(S) for start symbol S.
  • If A -> alpha B beta: add FIRST(beta) - {epsilon} to FOLLOW(B). If epsilon in FIRST(beta), add FOLLOW(A) to FOLLOW(B).
  • If A -> alpha B: add FOLLOW(A) to FOLLOW(B).
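Both rule sets above are fixed-point computations: keep applying the rules until no set grows. A Python sketch of both, assuming a grammar representation of my own choosing (a dict mapping each non-terminal to a list of right-hand-side tuples, with the string "eps" marking epsilon):

```python
EPS = "eps"

def compute_first(grammar, terminals):
    """Iterate the FIRST rules to a fixed point."""
    first = {t: {t} for t in terminals}
    first[EPS] = {EPS}
    for nt in grammar:
        first[nt] = set()
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                before = len(first[nt])
                for sym in prod:                  # scan RHS left to right
                    first[nt] |= first[sym] - {EPS}
                    if EPS not in first[sym]:
                        break                     # sym not nullable: stop
                else:
                    first[nt].add(EPS)            # whole RHS nullable
                if len(first[nt]) != before:
                    changed = True
    return first

def compute_follow(grammar, first, start):
    """Iterate the FOLLOW rules to a fixed point."""
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in grammar:
                        continue                  # terminals have no FOLLOW
                    trailer, nullable = set(), True
                    for nxt in prod[i + 1:]:      # FIRST of what follows sym
                        trailer |= first[nxt] - {EPS}
                        if EPS not in first[nxt]:
                            nullable = False
                            break
                    add = trailer | (follow[nt] if nullable else set())
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return follow
```

Running this on the grammar of Example 1 below (S -> A B, A -> a A | epsilon, B -> b B | b) reproduces FIRST(A) = {a, epsilon}, FIRST(S) = {a, b}, FOLLOW(A) = {b}, FOLLOW(B) = {$}.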

LL(1) grammar

A grammar is LL(1) if the LL(1) parsing table has no cell with more than one entry. Conflicts arise when:

  • Two or more productions for the same non-terminal A have overlapping FIRST sets, OR
  • A production A -> alpha has epsilon in FIRST(alpha) and FIRST(alpha) overlaps with FOLLOW(A).

An LL(1) grammar cannot be left-recursive or ambiguous. Eliminate left recursion and apply left factoring before computing FIRST/FOLLOW.

Eliminating left recursion for A -> A alpha | beta: convert to A -> beta A', A' -> alpha A' | epsilon.
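That transformation for immediate left recursion can be sketched mechanically; the tuple-based production format here is an assumption for illustration:

```python
def eliminate_immediate_left_recursion(nt, productions):
    """Split A's productions into left-recursive ones (A -> A alpha) and
    the rest (A -> beta), then rebuild as:
        A  -> beta A'
        A' -> alpha A' | epsilon
    """
    new_nt = nt + "'"
    alphas = [p[1:] for p in productions if p and p[0] == nt]  # the alpha parts
    betas = [p for p in productions if not p or p[0] != nt]    # the beta parts
    if not alphas:
        return {nt: productions}            # no left recursion: nothing to do
    return {
        nt: [tuple(beta) + (new_nt,) for beta in betas],
        new_nt: [tuple(a) + (new_nt,) for a in alphas] + [("eps",)],
    }
```

For the classic E -> E + T | T, this yields E -> T E' and E' -> + T E' | epsilon.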

LR parsing

LR parsers are bottom-up. They shift tokens onto a stack and reduce using grammar rules.

LR(0) items: a production with a dot indicating how far parsing has progressed. A -> alpha . beta means we have seen alpha and expect beta.

The LR(0) automaton has states (sets of items). Transitions occur on grammar symbols.

SLR(1): uses FOLLOW sets to decide when to reduce. Reduce A -> alpha when the next token is in FOLLOW(A). More powerful than LR(0) but less powerful than CLR(1).

CLR(1) (canonical LR(1)): items carry a 1-token lookahead. Most powerful, but the number of states is large.

LALR(1): merges CLR(1) states with the same LR(0) core (ignoring lookaheads). May introduce reduce-reduce conflicts that CLR(1) does not have, but far fewer states than CLR(1). Most practical parsers (yacc, bison) use LALR(1).

Power order: LR(0) subset SLR(1) subset LALR(1) subset CLR(1).

ISRO trap: merging CLR(1) states into LALR(1) states can introduce reduce-reduce conflicts that CLR(1) does not have, but never new shift-reduce conflicts. So if a grammar is CLR(1) but not LALR(1), the failure is always a reduce-reduce conflict.

Semantic analysis

Attribute grammars

An attribute grammar attaches attributes to grammar symbols. Synthesised attributes: computed from children. Inherited attributes: passed down from the parent or left siblings.

S-attributed grammar: only synthesised attributes. Can be evaluated during bottom-up parsing (as each reduction happens, compute the attribute from child attributes).

L-attributed grammar: each inherited attribute of a symbol in the right-hand side of a production depends only on attributes of symbols to its left in the production, or inherited attributes of the head. Can be evaluated during top-down (or even a single left-to-right scan).
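A small illustration of synthesised-attribute evaluation: here a postfix token stream stands in for the order in which a bottom-up parser performs reductions, and a plain value stack plays the role of the attribute stack, with each operator application computing the parent's val from its children's vals (a sketch, not a real parser):

```python
def evaluate_postfix(tokens):
    """Emulate S-attributed evaluation: each 'reduction' (applying an
    operator) computes the parent's val attribute from the child vals
    sitting on top of the stack."""
    stack = []
    for tok in tokens:
        if tok in ("+", "*"):
            right = stack.pop()
            left = stack.pop()
            # E -> E op E : E.val = f(E1.val, E2.val)
            stack.append(left + right if tok == "+" else left * right)
        else:
            # E -> num : E.val = num.lexval
            stack.append(int(tok))
    return stack.pop()
```

Evaluating "3 4 * 2 +" this way mirrors the reductions for 3 * 4 + 2 and yields 14.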

Type checking

The semantic analyser uses the symbol table to check: that variables are declared before use, that operand types match operator expectations, and that function calls match signatures.

Intermediate code

Three-address code uses at most three operands per instruction. Each instruction corresponds roughly to one machine operation. For the expression a + b * c:

t1 = b * c
t2 = a + t1

Quadruple representation: (op, arg1, arg2, result). For t1 = b * c: (*, b, c, t1).

Triple representation: (op, arg1, arg2) with implicit numbering. To reference a result, use its triple number. Triples cannot be rearranged freely (indices change).
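Both representations can be derived mechanically from a list of three-address instructions. This sketch assumes a (result, op, arg1, arg2) tuple format of my own choosing; when building triples, any operand computed by an earlier instruction is replaced by that instruction's index, which is exactly the "implicit numbering":

```python
def to_quadruples(tac):
    """Quadruples store the result name explicitly: (op, arg1, arg2, result)."""
    return [(op, a1, a2, res) for res, op, a1, a2 in tac]

def to_triples(tac):
    """Triples omit the result; a value is referenced by the index of the
    triple that computed it."""
    index_of = {}          # name -> index of the triple that produced it
    triples = []
    for res, op, a1, a2 in tac:
        a1 = index_of.get(a1, a1)          # rewrite references to earlier results
        a2 = index_of.get(a2, a2)
        index_of[res] = len(triples)       # result is implicit: this position
        triples.append((op, a1, a2))
    return triples
```

For t1 = b * c followed by t2 = a + t1, the quadruples are (*, b, c, t1) and (+, a, t1, t2); the triples are (*, b, c) at index 0 and (+, a, 0) at index 1. The index-based references are why triples cannot be reordered freely.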

Optimisation

Basic block and control flow graph

A basic block is a maximal sequence of instructions with no branches in (except at the start) and no branches out (except at the end). A control flow graph (CFG) has basic blocks as nodes and edges for possible flow of control.

Local optimisations within a basic block

Constant folding: evaluate constant expressions at compile time. 3 + 4 becomes 7.

Constant propagation: replace variable uses with known constant values. If x = 5 and later y = x + 2, replace with y = 7.
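The two transformations compose naturally in a single forward pass over a basic block. A minimal sketch, assuming an instruction format of my own devising, (dest, arg1, op, arg2) with op = None for a plain copy dest = arg1:

```python
OPS = {"+": lambda x, y: x + y, "*": lambda x, y: x * y}

def fold_and_propagate(block):
    """One forward pass of constant propagation + folding over a basic block."""
    consts, out = {}, []                 # consts: variable -> known constant
    for dest, a1, op, a2 in block:
        a1 = consts.get(a1, a1)          # constant propagation on operands
        a2 = consts.get(a2, a2)
        if op is None:
            val = a1                     # plain copy
        elif isinstance(a1, int) and isinstance(a2, int):
            val = OPS[op](a1, a2)        # constant folding
        else:
            val = None                   # not fully constant
        if isinstance(val, int):
            consts[dest] = val
            out.append((dest, val, None, None))
        else:
            consts.pop(dest, None)       # dest is no longer a known constant
            out.append((dest, a1, op, a2))
    return out
```

On the example from the text, x = 5 followed by y = x + 2 becomes x = 5 and y = 7.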

Common-subexpression elimination (CSE): if an expression is computed multiple times with the same operands and no intervening modification, reuse the first computed result.

t1 = a + b
t2 = a + b  // CSE: replace with t2 = t1
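Local CSE can be sketched with an available-expressions map over a basic block; the (dest, op, arg1, arg2) format and the "copy" pseudo-op here are illustrative, and the kill step captures the "no intervening modification" condition:

```python
def local_cse(block):
    """Local common-subexpression elimination within one basic block."""
    available = {}   # (op, arg1, arg2) -> variable currently holding that value
    out = []
    for dest, op, a1, a2 in block:
        key = (op, a1, a2)
        if key in available:
            out.append((dest, "copy", available[key], None))  # reuse earlier result
            new_entry = None
        else:
            out.append((dest, op, a1, a2))
            new_entry = key
        # Assigning dest kills any available expression that read dest's old
        # value, or whose result was held in dest.
        available = {k: v for k, v in available.items()
                     if dest not in (k[1], k[2]) and v != dest}
        if new_entry:
            available[new_entry] = dest
    return out
```

On the snippet above, t1 = a + b followed by t2 = a + b rewrites the second instruction to a copy of t1; had an assignment to a or b intervened, the kill step would have prevented the reuse.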

Dead code elimination: remove code whose results are never used.

Strength reduction: replace expensive operations with cheaper equivalents. x * 2 becomes x + x or x << 1. Loop induction variables: i * constant can be replaced by an additive update.

Loop-invariant code motion: move computations that produce the same result on every iteration outside the loop.

Worked examples

Example 1: FIRST and FOLLOW

Grammar:

S -> A B
A -> a A | epsilon
B -> b B | b

  • FIRST(A): A -> a A puts 'a' in FIRST(A); A -> epsilon puts epsilon in FIRST(A). FIRST(A) = {a, epsilon}.
  • FIRST(B): both productions begin with 'b'. FIRST(B) = {b}.
  • FIRST(S): add FIRST(A) - {epsilon} = {a}; since epsilon is in FIRST(A), also add FIRST(B) = {b}. FIRST(S) = {a, b}.

  • FOLLOW(S) = {$} (start symbol).
  • FOLLOW(A): from S -> A B, add FIRST(B) - {epsilon} = {b}; epsilon is not in FIRST(B), so nothing more. FOLLOW(A) = {b}.
  • FOLLOW(B): B is rightmost in S -> A B, so add FOLLOW(S) = {$}; B -> b B adds FOLLOW(B) to itself (no change). FOLLOW(B) = {$}.

Example 2: LL(1) table entry

For A -> epsilon: add A -> epsilon to M[A][f] for each f in FOLLOW(A) = {b}.

For A -> a A: add to M[A][a].

So M[A][a] = {A -> a A} and M[A][b] = {A -> epsilon}: the cells are disjoint, so the A-productions cause no LL(1) conflict. (The full grammar is still not LL(1): B -> b B and B -> b both have 'b' in FIRST, so M[B][b] gets two entries.)

Example 3: LR(0) items

Grammar (augmented): S' -> S, S -> (S) | x.

Start item: S' -> .S. Closure adds S -> .(S) and S -> .x.

State 0: {S' -> .S, S -> .(S), S -> .x}. Transitions: on S go to state 1 (S' -> S.), on '(' go to state 2 (S -> (.S)), on 'x' go to state 3 (S -> x.).

State 3 has only the complete item S -> x., so it is a reduce state.
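The closure and goto computations from this example can be sketched directly; representing each item as a (head, rhs, dot) triple is an assumption for illustration:

```python
def closure(items, grammar):
    """Closure of a set of LR(0) items: for every item with the dot before a
    non-terminal B, add B -> .gamma for each production of B."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:
                for prod in grammar[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return items

def goto(items, sym, grammar):
    """Move the dot over sym in every item that expects sym, then close."""
    moved = {(h, r, d + 1) for h, r, d in items if d < len(r) and r[d] == sym}
    return closure(moved, grammar)
```

For the augmented grammar S' -> S, S -> (S) | x, the closure of {S' -> .S} is exactly state 0 with its three items, and goto on 'x' gives the single complete item S -> x., the reduce state.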

Example 4: Three-address code for expression

Expression: a = b * c + b * c (with CSE opportunity).

Without CSE:

t1 = b * c
t2 = b * c
t3 = t1 + t2
a = t3

With CSE (both computations of b*c yield the same result, b and c are not modified between them):

t1 = b * c
t2 = t1 + t1
a = t2

One multiply saved.

Example 5: Activation record layout

Function: int f(int x) { int y; int z; ... }

Activation record (stack frame) layout (typical):

  • Old frame pointer (saved FP)
  • Return address
  • Actual parameter x
  • Local variable y
  • Local variable z
  • Temporaries

The frame pointer (FP) points to a fixed location in the record. Local variables are accessed at negative offsets from FP, parameters at positive offsets (or above the return address, depending on calling convention).

Quick-revision summary

  • Lexer uses regular expressions (DFA). Parser uses CFG (PDA).
  • FIRST(A): terminals that can begin a string derived from A; epsilon is included if A is nullable.
  • FOLLOW(A): terminals that can follow A in any sentential form; $ for start symbol.
  • LL(1): no conflicts in parsing table. Requires elimination of left recursion and left-factoring.
  • LR power: LR(0) < SLR(1) < LALR(1) < CLR(1). LALR(1) is used in practice (yacc, bison).
  • S-attributed: synthesised only, suitable for bottom-up. L-attributed: also inherited from left, suitable for top-down.
  • Three-address code: at most 3 operands. Quadruples store result explicitly, triples use implicit numbering.
  • CSE, constant folding, dead code elimination: local optimisations within a basic block.
  • Loop-invariant code motion and strength reduction: loop-level optimisations.
  • Activation record: return address, old FP, parameters, locals, temporaries.

How to study this unit

  1. Practise computing FIRST and FOLLOW for 5-6 different grammars. Include grammars with nullable productions (A -> epsilon) and left recursion (after eliminating it).
  2. Build at least two full LL(1) parsing tables from scratch. Verify by tracing the parse of a sample input string.
  3. Work through LR(0) closure and transition manually for a small grammar (3-4 productions). Label each state and identify shift/reduce/accept actions.
  4. Understand LALR(1) vs CLR(1) at the level of "LALR merges states with same core, which can create reduce-reduce conflicts but not shift-reduce conflicts."
  5. For optimisations, trace a small 3-address code sequence and apply CSE, constant folding, and dead code elimination step by step.
  6. Know the activation record layout well enough to answer "what is at FP-8 in this function" type questions.
