Unit 7 of 10 · 13 min read

Data Structures & Algorithms

Trees, hashing, complexity, design techniques, graph algorithms, P/NP, advanced algorithms.


Why this unit matters

This is one of the highest-scoring units in UGC NET CS. Questions range from "what is the worst-case time of quicksort?" to "which data structure supports O(log n) insert and delete?". Mastery here also supports Unit 8 (compilers use symbol tables and parse trees) and any systems programming question. The unit rewards practice: read the theory once, then solve problems.


Syllabus map

Area Key topics
Linear structures Arrays, sparse matrix, stacks, queues, priority queues, linked lists
Trees Binary tree, threaded BST, AVL, B-tree, B+-tree, B*-tree
Others Sets, graphs, hashing
Sorting/Searching Bubble, selection, insertion, merge, quick, heap, radix, binary search
Performance Time/space complexity, asymptotic notation, recurrence relations
Design techniques Divide-and-conquer, dynamic programming, greedy, backtracking, branch-and-bound
Lower bounds Comparison trees, reduction-based lower bounds
Graph algorithms BFS, DFS, Dijkstra, Bellman-Ford, Floyd-Warshall, max flow, Prim, Kruskal
Complexity P, NP, NP-complete, NP-hard, reducibility
Advanced Number-theoretic, FFT, string matching, parallel, approximation, randomised

Linear data structures

Arrays and sparse matrix

An array stores elements of the same type in contiguous memory. Access is O(1) by index. A sparse matrix has most elements zero. Storing every element wastes memory; instead, use:

  • Triplet representation, store (row, col, value) for each non-zero.
  • Linked list representation, each row is a linked list of non-zero nodes.

For a matrix with n rows, n columns, and t non-zero elements, triplet representation needs 3t cells vs n*n for dense storage.

Stacks

LIFO order. Operations: push, pop, peek, all O(1). Applications: expression evaluation, backtracking, function call management (activation records), undo operations.

Infix to postfix: use the shunting-yard algorithm. Operands go directly to output. Operators are pushed to a stack; before pushing a new operator, pop and output every operator on the stack with greater or equal precedence (for left-associative operators).

// Pseudocode: infix "A + B * C" -> postfix "A B C * +"
// Read token:
//   if operand -> output
//   if operator -> pop operators of >= precedence, then push
//   if ')' -> pop until '('

Queues and priority queues

Queue: FIFO. Enqueue at rear, dequeue at front, both O(1) with a circular array or linked list.

Priority queue, element with highest (or lowest) priority is dequeued first. Implemented with a binary heap.

A min-heap is a complete binary tree where parent <= children. Insert: add at end, sift up, O(log n). Delete-min: remove root, place last element at root, sift down, O(log n). Build-heap from n elements: O(n) (not O(n log n), a common exam trap).

Linked lists

Singly, doubly, circular. Insertion/deletion at a known position: O(1). Search: O(n). No random access.

Doubly linked list supports O(1) deletion given a pointer to the node (no need to traverse to find the previous node).


Trees

Binary search tree (BST)

Search, insert, delete: O(h) where h = height. For a balanced tree h = O(log n); for a skewed tree h = O(n).

Inorder traversal of BST gives sorted order, a key exam fact.

AVL tree

A self-balancing BST where the balance factor BF = height(left subtree) - height(right subtree) must be in {-1, 0, +1} at every node.

On insertion or deletion, rebalance using rotations:

  • LL case -> single right rotation
  • RR case -> single left rotation
  • LR case -> left rotation on left child, then right rotation on root
  • RL case -> right rotation on right child, then left rotation on root

All operations: O(log n).

Height of an AVL tree with n nodes: at most 1.44 log2(n+2) - 0.328.

Threaded binary tree

In a standard binary tree, about half the pointers are NULL. In a threaded tree, NULL left pointers point to the inorder predecessor and NULL right pointers point to the inorder successor. This enables inorder traversal without a stack.

B-tree (order m)

Used in databases and file systems. Properties:

  • Every node has at most m children.
  • Every non-leaf, non-root node has at least ceil(m/2) children.
  • All leaves are at the same level.
  • A node with k children has k-1 keys.

B+ tree, all data records are at leaf level; internal nodes store only keys for routing. Leaves are linked (supports range queries efficiently). Used in most RDBMS indexes.

B*-tree, a B+ tree variant where non-root nodes must be at least 2/3 full (vs 1/2 full in B+).


Hashing

Hash function maps a key to an index in range [0, m-1].

Collision resolution

Open addressing (closed hashing), on collision, probe for the next empty slot.

  • Linear probing: h(k, i) = (h'(k) + i) mod m; suffers primary clustering.
  • Quadratic probing: h(k, i) = (h'(k) + c1*i + c2*i^2) mod m; suffers secondary clustering.
  • Double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m; best distribution.

Separate chaining (open hashing), each slot holds a linked list of colliding elements. Load factor alpha = n/m. Expected search time = O(1 + alpha).

For open addressing, successful search expected time ≈ (1/alpha) * ln(1/(1-alpha)).


Sorting and searching

Algorithm Best Average Worst Space Stable?
Bubble O(n) O(n^2) O(n^2) O(1) Yes
Selection O(n^2) O(n^2) O(n^2) O(1) No
Insertion O(n) O(n^2) O(n^2) O(1) Yes
Merge O(n log n) O(n log n) O(n log n) O(n) Yes
Quick O(n log n) O(n log n) O(n^2) O(log n) No
Heap O(n log n) O(n log n) O(n log n) O(1) No
Radix O(nk) O(nk) O(nk) O(n+k) Yes

Quicksort worst case occurs when the pivot is always the smallest or largest element (already sorted input with naive pivot). Randomised quicksort avoids this in expectation.

Binary search on a sorted array: O(log n). The comparison tree for binary search has height ceil(log2(n+1)).


Performance analysis

Asymptotic notation

  • O (Big-Oh), upper bound: f(n) = O(g(n)) means there exist c, n0 such that f(n) <= c*g(n) for all n >= n0.
  • Omega, lower bound: f(n) = Omega(g(n)) means f(n) >= c*g(n).
  • Theta, tight bound: f(n) = Theta(g(n)) means both O and Omega hold.
  • o (little-oh), strict upper bound (the ratio f(n)/g(n) -> 0).

Solving recurrences

Master theorem for T(n) = a*T(n/b) + f(n):

  • If f(n) = O(n^(log_b a - epsilon)) for some epsilon > 0 -> T(n) = Theta(n^(log_b a))
  • If f(n) = Theta(n^(log_b a)) -> T(n) = Theta(n^(log_b a) * log n)
  • If f(n) = Omega(n^(log_b a + epsilon)) and the regularity condition holds -> T(n) = Theta(f(n))

Merge sort: T(n) = 2T(n/2) + n. Here a=2, b=2, log_b a = 1, f(n) = n = Theta(n^1). Case 2 -> T(n) = Theta(n log n).

Binary search: T(n) = T(n/2) + 1. a=1, b=2, log_b a=0, f(n)=1=Theta(n^0). Case 2 -> T(n) = Theta(log n).


Algorithm design techniques

Divide and conquer

Split problem into sub-problems, solve recursively, combine results.

Examples: Merge sort, Quick sort, Binary search, Strassen's matrix multiplication (T(n) = 7T(n/2) + 18n^2 -> O(n^2.81) vs O(n^3) for naive).

// Merge sort in C
void merge_sort(int a[], int l, int r) {
    if (l < r) {
        int m = l + (r - l) / 2;  /* avoids int overflow of (l + r) / 2 */
        merge_sort(a, l, m);
        merge_sort(a, m+1, r);
        merge(a, l, m, r);  // O(n) combine step
    }
}

Dynamic programming

Solve overlapping sub-problems; store results in a table (memoisation = top-down, tabulation = bottom-up).

Principle of optimality: an optimal solution contains optimal solutions to sub-problems.

Examples:

  • 0/1 Knapsack: dp[i][w] = max value using first i items with capacity w. O(nW) time.
  • Longest Common Subsequence: dp[i][j] = LCS length of first i chars of X and first j chars of Y.
  • Floyd-Warshall all-pairs shortest paths: O(V^3).
  • Matrix chain multiplication: find the optimal parenthesisation. O(n^3).

// 0/1 Knapsack (bottom-up)
for (int i = 1; i <= n; i++)
    for (int w = 0; w <= W; w++) {
        dp[i][w] = dp[i-1][w];
        if (wt[i] <= w)
            dp[i][w] = max(dp[i][w], val[i] + dp[i-1][w - wt[i]]);
    }

Greedy

Make the locally optimal choice at each step. Works when greedy choice property and optimal substructure hold.

Examples: Activity selection (sort by finish time, pick non-overlapping), Fractional knapsack (sort by value/weight ratio), Huffman encoding (always merge two lowest-frequency nodes), Dijkstra, Prim, Kruskal.

Greedy does not work for 0/1 knapsack; use DP there.

Backtracking

Try a choice; if it leads to failure, undo (backtrack) and try another. State space is explored as a tree.

Examples: N-queens, Sudoku solver, Hamiltonian cycle, subset sum.

Efficiency: prune branches early using bounding functions.

Branch and bound

Like backtracking but uses cost bounds to prune. Maintains a priority queue of live nodes ordered by bound.

Used for: Travelling Salesman Problem (TSP), job scheduling, 0/1 knapsack optimally.


Lower bound theory

The lower bound for comparison-based sorting is Omega(n log n). Proof via comparison tree: a decision tree for n elements has n! leaves; a binary tree with n! leaves has height at least log2(n!) = Theta(n log n) by Stirling's approximation.

Element uniqueness (given n numbers, are any two equal?): Omega(n log n) by reduction from sorting.


Graph algorithms

BFS and DFS

BFS, uses a queue; visits vertices level by level. Finds shortest paths in unweighted graphs. Time: O(V+E).

DFS, uses a stack (or recursion). Classifies edges as: tree, back, forward, cross. Back edges indicate cycles. Time: O(V+E).

Applications of DFS: topological sort (DAG), strongly connected components (Kosaraju's or Tarjan's algorithm), articulation points, bridges.

Shortest paths

Dijkstra, single source, non-negative weights. Uses a min-priority queue. Time: O((V+E) log V) with a binary heap. Greedy: always relax the unvisited vertex with the smallest tentative distance.

Bellman-Ford, single source, handles negative weights. Detects negative weight cycles. Time: O(VE).

Floyd-Warshall, all-pairs shortest paths. DP; O(V^3).

Minimum spanning tree

Kruskal, sort edges by weight; add edge if it doesn't form a cycle (use Union-Find). O(E log E).

Prim, grow the MST from a starting vertex; always add the cheapest edge connecting a tree vertex to a non-tree vertex. O((V+E) log V) with a binary heap.

Both are greedy; both produce a correct MST by the cut property.

Maximum flow

Ford-Fulkerson, find augmenting paths in the residual graph and add flow along them. Time: O(E * max_flow). Using BFS to find augmenting paths (Edmonds-Karp): O(VE^2).

Max-flow min-cut theorem: the maximum flow equals the minimum cut capacity.


Complexity classes

P, problems solvable in polynomial time by a deterministic Turing machine.

NP, problems whose solutions can be verified in polynomial time. Every P problem is in NP.

NP-complete, problems that are in NP and every NP problem reduces to them in polynomial time. Examples: SAT (Cook-Levin theorem, first NP-complete problem), 3-SAT, Clique, Vertex Cover, Hamiltonian Cycle, TSP (decision), Subset Sum.

NP-hard, at least as hard as every problem in NP, but not necessarily in NP itself. TSP (optimisation) is NP-hard.

Reducibility, problem A reduces to B (A <=_p B) means: if B is solvable in poly time, so is A. If A is NP-hard and A <=_p B, then B is NP-hard.

The question P = NP? is unsolved. If P = NP, every NP-complete problem is in P.


Advanced topics

String matching, Naive: O((n-m+1)m). KMP: O(n+m) using a failure function. Rabin-Karp: hashing-based, O(nm) worst case but O(n+m) expected.

FFT, evaluates a degree-(n-1) polynomial at n points in O(n log n). Used for fast polynomial multiplication.

Randomised algorithms, Las Vegas (always correct, random runtime; e.g., randomised quicksort) vs Monte Carlo (may err with bounded probability, fixed runtime; e.g., Miller-Rabin primality test).

Approximation algorithms, for NP-hard optimisation problems, produce solutions within a guaranteed ratio of optimal. Example: 2-approximation for vertex cover (take both endpoints of every edge in a maximal matching).


Worked examples

Example 1: AVL insertion

Insert 10, 20, 30 into an empty AVL tree.

After inserting 10: tree is just {10}, balanced. After inserting 20: right child of 10. BF(10) = -1. OK. After inserting 30: right child of 20. BF(10) = -2, BF(20) = -1. RR case -> left rotate at 10. Result: 20 is root, left child 10, right child 30. All balanced.

Example 2: Applying the Master theorem

T(n) = 4T(n/2) + n. a=4, b=2, log_b a = log_2 4 = 2. f(n) = n = O(n^(2-1)) so case 1. T(n) = Theta(n^2).

Example 3: Dijkstra trace

Graph: A-B (weight 4), A-C (weight 2), C-B (weight 1), B-D (weight 5), C-D (weight 8). Source = A.

Initial: dist = {A:0, B:inf, C:inf, D:inf}. Process A: update B=4, C=2. Process C (dist 2): update B=min(4, 2+1)=3, D=min(inf, 2+8)=10. Process B (dist 3): update D=min(10, 3+5)=8. Process D (dist 8). Final: A=0, C=2, B=3, D=8.

Example 4: 0/1 Knapsack

Items: (value=60, weight=10), (value=100, weight=20), (value=120, weight=30). Capacity W=50.

dp table (items x capacity): fill row by row. dp[3][50] = max(dp[2][50], 120 + dp[2][20]) = max(100+60, 120+100) = max(160, 220) = 220. Optimal: items 2 and 3.

Example 5: NP-completeness reduction

To show Hamiltonian Cycle is NP-complete: (1) it is in NP (given a cycle, verify in poly time). (2) Reduce 3-SAT to Hamiltonian Cycle in poly time. Since 3-SAT is NP-complete, Hamiltonian Cycle is NP-hard; combined with NP membership -> NP-complete.


Quick-revision summary

  • AVL balance factor in {-1, 0, +1}. Four rotations: LL, RR, LR, RL.
  • B-tree order m: max m children, min ceil(m/2) children (except root). All leaves at same level.
  • Build-heap is O(n), not O(n log n).
  • Merge sort: O(n log n) in all cases, stable. Quick sort: O(n^2) worst case, not stable.
  • DP requires optimal substructure + overlapping sub-problems; Greedy requires the greedy choice property.
  • Lower bound for comparison sorting: Omega(n log n).
  • BFS: shortest path (unweighted); Dijkstra: shortest path (non-negative weights); Bellman-Ford: negative weights; Floyd-Warshall: all-pairs.
  • P subset of NP. NP-complete = NP intersection NP-hard. SAT is the first NP-complete problem.
  • KMP string matching: O(n+m). Naive: O(nm).

How to study this unit

  1. Implement AVL insertion with rotations at least once in C/C++. Understanding the code removes any ambiguity about which rotation applies.
  2. Memorise the sorting table (average/worst case, stability, in-place). Expect one or two direct questions on this.
  3. For DP, practice identifying sub-problems. Write the recurrence before writing code.
  4. Master the Master theorem; it solves 80% of recurrence questions in the exam.
  5. For graph algorithms, trace Dijkstra and BFS on a 5- or 6-node example graph until you can do it from memory.
  6. On complexity, know the definitions of P, NP, NP-complete by heart, and be able to name 5 NP-complete problems.
