Compiler Design
Lexical analysis, parsing, syntax-directed translation, runtime, code generation, optimization.
Why this unit matters
Compiler Design contributes 4-7 marks in most GATE CS papers. The questions are drawn almost entirely from a predictable set: parsing table conflicts, FIRST/FOLLOW computation, SDD evaluation, three-address code generation, and data-flow analyses. Once you learn the mechanical procedures, the marks are straightforward to score. Unlike some theoretical units, Compiler Design questions are procedural: you either know the algorithm or you do not.
Syllabus map
- Lexical analysis: tokens, regular expressions, DFA-based scanners, LEX
- Parsing: LL(1), LR(0), SLR(1), LALR(1), CLR(1)
- Syntax-directed translation (SDT): inherited and synthesised attributes, S-attributed and L-attributed grammars
- Runtime environments: activation records, stack frames, static/dynamic scoping
- Intermediate code generation: three-address code, quadruples, triples, DAGs
- Local optimisation: constant folding, copy propagation
- Data-flow analyses: reaching definitions, liveness analysis, constant propagation, available expressions (common sub-expression elimination)
Lexical analysis
The lexical analyser (scanner) reads the source character stream and produces a sequence of tokens. Each token is described by a regular expression (or a DFA derived from it).
Typical token classes: keyword, identifier, integer literal, operator, delimiter.
The standard tool LEX/FLEX takes a set of rules (pattern, action) and generates a DFA. The priority rule: the longest match wins (maximal munch); among equal-length matches, the rule listed first wins.
GATE trap: Lexical analysis cannot detect errors like mismatched parentheses; that is the parser's job. The scanner deals only with tokens, not structure.
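The longest-match and rule-priority behaviour can be sketched with a toy scanner. This is a minimal illustration, not LEX itself; the token set and rule names are my own:

```python
import re

# Hypothetical rule set: tried in listed order; the longest match wins,
# and among equal-length matches the earlier rule wins.
RULES = [
    ("KEYWORD", r"\b(?:if|else|while)\b"),
    ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("INT",     r"[0-9]+"),
    ("OP",      r"[+\-*/=<>]"),
    ("DELIM",   r"[();{}]"),
    ("WS",      r"\s+"),
]

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        best = None  # (length, kind, lexeme); strict > keeps the earlier rule on ties
        for kind, pat in RULES:
            m = re.match(pat, src[pos:])
            if m and (best is None or len(m.group()) > best[0]):
                best = (len(m.group()), kind, m.group())
        if best is None:
            raise ValueError(f"lexical error at position {pos}")
        length, kind, lexeme = best
        if kind != "WS":
            tokens.append((kind, lexeme))
        pos += length
    return tokens
```

Note that `ifx` scans as a single identifier (longest match), not as the keyword `if` followed by `x`, while `if` alone goes to the earlier KEYWORD rule.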
Parsing
FIRST and FOLLOW
FIRST(alpha): the set of terminals that begin strings derivable from alpha. If alpha can derive epsilon, epsilon is in FIRST(alpha).
FOLLOW(A): the set of terminals that can appear immediately to the right of A in some sentential form. FOLLOW always contains $ (end of input) for the start symbol.
Rules:
- If A -> alpha B beta, then FIRST(beta) \ {epsilon} is in FOLLOW(B).
- If A -> alpha B, or A -> alpha B beta where epsilon is in FIRST(beta), then FOLLOW(A) is in FOLLOW(B).
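Both computations are fixed-point iterations over these rules. A minimal sketch, using the expression grammar from Example 1 (the `grammar` dict and `EPS` marker are my own encoding):

```python
EPS = "eps"
grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
nonterms = set(grammar)

def first_of_seq(seq, first):
    """FIRST of a symbol sequence; includes EPS only if every symbol can vanish."""
    out = set()
    for sym in seq:
        s = first[sym] if sym in nonterms else {sym}
        out |= s - {EPS}
        if EPS not in s:
            return out
    out.add(EPS)
    return out

def compute_first():
    first = {A: set() for A in nonterms}
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                f = first_of_seq(prod, first)
                if not f <= first[A]:
                    first[A] |= f
                    changed = True
    return first

def compute_follow(first, start="E"):
    follow = {A: set() for A in nonterms}
    follow[start].add("$")               # $ always follows the start symbol
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                for i, B in enumerate(prod):
                    if B not in nonterms:
                        continue
                    f = first_of_seq(prod[i + 1:], first)
                    add = (f - {EPS}) | (follow[A] if EPS in f else set())
                    if not add <= follow[B]:
                        follow[B] |= add
                        changed = True
    return follow

first = compute_first()
follow = compute_follow(first)
```

The resulting sets match the hand computation in Example 1 below.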
LL(1) parsing
A grammar is LL(1) if for every production A -> alpha | beta:
- FIRST(alpha) and FIRST(beta) are disjoint.
- If epsilon is in FIRST(alpha), then FIRST(beta) and FOLLOW(A) are disjoint (and vice versa).
The predictive parsing table has at most one entry per (non-terminal, terminal) cell.
GATE trap: Left-recursive and ambiguous grammars are never LL(1). Left recursion must be eliminated; left factoring removes common prefixes.
LR parsing family
Items (dotted rules) drive LR parsers.
- LR(0): items only, no lookahead. Frequently has shift-reduce or reduce-reduce conflicts even for unambiguous grammars.
- SLR(1): uses FOLLOW sets to resolve conflicts. More powerful than LR(0).
- LALR(1): merges LR(1) states with identical cores. Same number of states as SLR(1); fewer conflicts than SLR(1) in practice. Used by YACC/Bison.
- CLR(1) / canonical LR(1): full LR(1) items. Most powerful; fewest conflicts; most states.
Power order: LR(0) < SLR(1) < LALR(1) < CLR(1); each class properly contains the previous one.
GATE trap: A shift-reduce conflict in the action table means the grammar is not in the class for that parser. Ambiguous grammars are never LR(k) for any k.
Syntax-directed translation
Attributes
Synthesised attribute: defined in terms of children's attributes (computed bottom-up). A grammar where all attributes are synthesised is S-attributed.
Inherited attribute: defined in terms of parent's or sibling's attributes (computed top-down). An L-attributed grammar allows both, but inherited attributes of a symbol on the right-hand side can only depend on attributes of symbols to its left.
S-attributed SDDs can be evaluated during bottom-up (LR) parsing. L-attributed SDDs can be evaluated during top-down (LL) parsing.
Translation schemes
A translation scheme embeds semantic actions into production bodies. For L-attributed grammars, actions can appear anywhere in the body. For S-attributed, actions appear at the right end.
GATE trap: A grammar with only synthesised attributes is S-attributed and also L-attributed. S-attributed is a subset of L-attributed.
Runtime environments
An activation record (stack frame) for a function call contains:
- Actual parameters
- Return address
- Saved registers
- Local variables
- Temporaries
- Access link (for static scoping with nested functions)
- Control link (dynamic link, pointer to caller's frame)
Static scoping (lexical scoping): the binding of a name is determined by the textual structure of the program. Access links implement this at runtime.
Dynamic scoping: the binding is determined by the call chain. No access links needed; the stack is searched upward for the most recent binding.
Intermediate code generation
Three-address code
Each instruction has at most three addresses (operands + result):
t1 = a + b
t2 = t1 * c
x = t2 - d
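This translation is a mechanical post-order walk of the expression tree. A sketch, with the nested-tuple AST encoding being my own illustration (a node is either a name or an `(op, left, right)` tuple); it emits one extra copy (`x = t3`) compared with the hand-written code, which copy propagation would remove:

```python
from itertools import count

def gen_tac(expr, code, temps):
    if isinstance(expr, str):          # leaf: a variable name, no code needed
        return expr
    op, left, right = expr
    l = gen_tac(left, code, temps)     # evaluate operands first (post-order)
    r = gen_tac(right, code, temps)
    t = f"t{next(temps)}"              # fresh temporary for the result
    code.append(f"{t} = {l} {op} {r}")
    return t

code = []
res = gen_tac(("-", ("*", ("+", "a", "b"), "c"), "d"), code, count(1))
code.append(f"x = {res}")
```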
Representations:
- Quadruples: (op, arg1, arg2, result), four fields, easy to reorder.
- Triples: (op, arg1, arg2), result is the instruction number; compact but hard to reorder.
- Indirect triples: triples plus an index array for order; allows reordering.
DAG for basic blocks
A directed acyclic graph (DAG) for a basic block identifies common sub-expressions. Each unique computation is one node; repeated computations share the node.
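One standard way to realise DAG-style sharing is local value numbering. A minimal sketch over a made-up `(dest, op, a, b)` tuple IR, assuming pure operators and no commutativity normalisation:

```python
def value_number(block):
    table = {}   # (op, arg, arg) -> temp already holding that value (DAG node)
    canon = {}   # name -> canonical name, so reused temps propagate forward
    out = []
    for dest, op, a, b in block:
        key = (op, canon.get(a, a), canon.get(b, b))
        if key in table:
            canon[dest] = table[key]   # repeated computation: share the node
        else:
            out.append((dest,) + key)
            table[key] = dest
            canon[dest] = dest
    return out

optimised = value_number([
    ("t1", "+", "a", "b"),
    ("t2", "+", "a", "b"),   # common sub-expression
    ("x",  "*", "t1", "t2"),
])
```

The second `a + b` is dropped and its uses are redirected to `t1`, exactly the saving shown in Example 4 below.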
Local optimisation and data-flow analyses
Local optimisations (within a basic block)
Constant folding: evaluate constant expressions at compile time. x = 3 + 4 becomes x = 7.
Copy propagation: if x = y, replace subsequent uses of x with y (until x is reassigned).
Dead code elimination: remove assignments whose results are never used.
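Constant folding and propagation interact within a block: a folded result feeds later instructions. A tiny sketch over the same hypothetical `(dest, op, a, b)` tuple IR (constants are Python ints, variables are strings):

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(block):
    consts, out = {}, []
    for dest, op, a, b in block:
        a = consts.get(a, a)                 # substitute known constant values
        b = consts.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            consts[dest] = OPS[op](a, b)     # fold: evaluate at compile time
        else:
            out.append((dest, op, a, b))
            consts.pop(dest, None)           # dest is no longer a known constant
    return out, consts
```

So `x = 3 + 4` disappears into the constant table as `x = 7`, and that 7 flows into any later use of `x` in the block.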
Data-flow analyses (across basic blocks)
Data-flow analysis works on a control-flow graph (CFG) where nodes are basic blocks and edges represent possible control transfers.
Reaching definitions. A definition d of variable x reaches point p if there is a path from d to p along which x is not redefined. Used for constant propagation and copy propagation.
- gen[B]: definitions generated in block B.
- kill[B]: definitions killed (overwritten) in block B.
- IN[B] = union of OUT[P] for all predecessors P.
- OUT[B] = gen[B] union (IN[B] \ kill[B]).
- Forward analysis, union meet operator, initialise OUT to the empty set.
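The equations are solved by iterating to a fixed point. A sketch of the forward solver on the two-block CFG of Example 5 below (the dict encoding is my own):

```python
gen   = {"B1": {"d1", "d2"}, "B2": {"d3", "d4"}}
kill  = {"B1": {"d3"},       "B2": {"d1"}}
preds = {"B1": [], "B2": ["B1"]}
blocks = ["B1", "B2"]

IN  = {b: set() for b in blocks}
OUT = {b: set() for b in blocks}   # forward analysis: initialise OUT to empty
changed = True
while changed:
    changed = False
    for b in blocks:
        # IN[B] = union of OUT[P] over predecessors P
        IN[b] = set.union(set(), *(OUT[p] for p in preds[b]))
        new_out = gen[b] | (IN[b] - kill[b])
        if new_out != OUT[b]:
            OUT[b], changed = new_out, True
```

On this acyclic CFG the solver converges after one useful pass; loops in the CFG are what make the iteration necessary.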
Liveness analysis. A variable x is live at point p if there is a path from p to a use of x along which x is not redefined. Used for register allocation.
- use[B]: variables used before being defined in B.
- def[B]: variables defined in B before any use.
- OUT[B] = union of IN[S] for all successors S.
- IN[B] = use[B] union (OUT[B] \ def[B]).
- Backward analysis, union meet operator, initialise IN to the empty set.
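Within a single straight-line block, liveness reduces to one backward pass applying IN = use | (OUT - def) per instruction. A sketch on the code of Example 3 below, with a made-up `(dest, uses)` encoding:

```python
block = [
    ("a", []),          # a = 1
    ("b", []),          # b = 2
    ("c", ["a", "b"]),  # c = a + b
    ("d", ["c"]),       # d = c * 2
    (None, ["d"]),      # return d
]
live = set()
before = []
for dest, uses in reversed(block):
    live = (live - {dest}) | set(uses)   # IN = use | (OUT - def)
    before.append(set(live))
before.reverse()  # before[i] = variables live just before instruction i
```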
Available expressions. An expression e is available at point p if every path from the entry to p evaluates e and does not subsequently modify any operand of e. Used for common sub-expression elimination.
- Forward analysis, intersection meet operator, initialise IN to universal set (all expressions).
Constant propagation is a forward analysis that tracks whether variables hold known constant values. It uses a lattice with top (UNDEF, not yet seen) above the individual constant values, which in turn sit above bottom (NAC, not a constant).
GATE trap: Liveness is a backward analysis; reaching definitions is forward. Confusing direction is a common mistake.
Worked examples
Example 1. Compute FIRST and FOLLOW for the grammar:
E -> T E'
E' -> + T E' | epsilon
T -> F T'
T' -> * F T' | epsilon
F -> ( E ) | id
FIRST(F) = {(, id}
FIRST(T') = {*, epsilon}
FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, epsilon}
FIRST(E) = FIRST(T) = {(, id}
FOLLOW(E) = {$, )} ($ from the start symbol; ) from F -> (E))
FOLLOW(E') = FOLLOW(E) = {$, )}
FOLLOW(T) = (FIRST(E') \ {epsilon}) union FOLLOW(E') = {+, $, )}
FOLLOW(T') = FOLLOW(T) = {+, $, )}
FOLLOW(F) = (FIRST(T') \ {epsilon}) union FOLLOW(T') = {*, +, $, )}
Example 2. How many states does the canonical LR(1) parser have for a grammar with 3 LR(0) states that each split into 2 LR(1) states?
6 states. (LALR(1) merges these back to 3 states.)
Example 3. Liveness analysis on a basic block.
1: a = 1
2: b = 2
3: c = a + b
4: d = c * 2
5: return d
Working backward from line 5: d is live before line 5. Before line 4: c is live (d is defined here). Before line 3: a and b are live. Before line 2: only a is live (b is defined here, so it is not live earlier). Before line 1: nothing is live (a is defined here).
Dead code: none. All definitions are used.
Example 4. Three-address code for x = (a + b) * (a + b).
Without CSE:
t1 = a + b
t2 = a + b
x = t1 * t2
With CSE (DAG identifies a+b as common):
t1 = a + b
x = t1 * t1
One instruction saved.
Example 5. Reaching definitions for a two-block CFG.
Block B1: x = 1 (definition d1), y = 2 (d2). Block B2: x = 3 (d3), z = x + y (d4). B2 follows B1.
gen[B1] = {d1, d2}, kill[B1] = {d3} (d3 defines x, same variable as d1), OUT[B1] = {d1, d2}. IN[B2] = OUT[B1] = {d1, d2}. gen[B2] = {d3, d4}, kill[B2] = {d1}. OUT[B2] = {d3, d4} union ({d1, d2} \ {d1}) = {d2, d3, d4}.
At the use of x in d4, reaching definitions of x = {d3}. So x is known to be 3 (constant propagation works).
Quick-revision summary
- LL(1): top-down, predictive, table-driven, no left recursion.
- LR power: LR(0) < SLR(1) < LALR(1) < CLR(1). LALR is the practical sweet spot (YACC).
- S-attributed: synthesised only, evaluate bottom-up. L-attributed: inherited attributes allowed from left siblings, evaluate top-down.
- Activation record contents: parameters, return address, locals, temporaries, access link (static), control link (dynamic).
- Three-address code: quadruples easiest to reorder; triples most compact.
- Liveness: backward, union. Reaching definitions: forward, union. Available expressions: forward, intersection.
- CSE uses available expressions; register allocation uses liveness.
How to study this unit
- Practise FIRST/FOLLOW on at least 10 grammars from scratch. This is a purely mechanical skill that must be fast and error-free.
- Build LR(0) and SLR(1) automata by hand for small grammars (4-5 productions). Check for conflicts explicitly.
- For SDT, practise converting attribute grammars into translation schemes, paying attention to action placement.
- For data-flow, always write out gen/kill sets and the equations before computing. Never skip the formalisation.
- Do all GATE previous-year compiler questions from 2005 to 2024; the question patterns repeat closely.