Databases

Why this unit matters

Databases contributes 6-9 marks in GATE CS, one of the highest counts in the entire syllabus. The questions split roughly evenly between theory (normalisation, closure, ER-to-relational mapping) and computation (SQL query output, B+ tree operations, schedule serializability). A student who masters SQL, functional dependencies, and two-phase locking has a near-certain path to 6+ marks. This unit rewards precise, algorithmic thinking far more than rote memorisation.

Syllabus map

ER model: entities, attributes, relationships, weak entities, mapping to relational schema, Relational model: keys, relational algebra (select, project, join, set operations), Tuple relational calculus, SQL: DDL, DML, joins, subqueries, aggregation, views, Integrity constraints: primary key, foreign key, domain, check, Functional dependencies and normal forms (1NF, 2NF, 3NF, BCNF), File organisation: heap, sequential, hash, Indexing: B trees, B+ trees (search, insert, delete), Transactions: ACID properties, schedules, conflict serializability, two-phase locking, timestamp ordering

ER model

Basic concepts

Entity: a real-world object distinguishable from others. Represented as a rectangle.

Attribute types:

Simple vs. composite (address has city, street, zip), Single-valued vs. multi-valued (a person may have multiple phone numbers), Stored vs. derived (age derived from date of birth), Key attribute: uniquely identifies an entity in its entity set

Relationship: an association among entities. A relationship can have attributes (e.g., a WORKS_IN relationship may store the start_date).

Weak entity: has no key of its own; identified by a partial key combined with the key of its owner (strong) entity. Represented with a double rectangle; the identifying relationship uses a double diamond.

Mapping ER to relational schema

Each strong entity set -> one relation; primary key = entity key., Each weak entity set -> one relation; primary key = (owner key + partial key)., Multi-valued attribute -> separate relation with (entity key, attribute value)., Many-to-many relationship -> separate relation with the keys of both entity sets plus any relationship attributes., One-to-many relationship -> can be incorporated by adding the "one" side's key as a foreign key in the "many" relation, or as a separate relation., One-to-one -> can be merged into either entity's relation.

GATE trap: A weak entity's relation always includes the owner's primary key as a foreign key and as part of its own primary key.

Relational algebra

Core operations:

Operation	Symbol	Description
Select	sigma_{condition}(R)	Filters rows
Project	pi_{attrs}(R)	Keeps columns
Union	R union S	All tuples in R or S (schemas must match)
Intersection	R ∩ S	Tuples in both
Difference	R, S	In R but not S
Cartesian product	R x S	All pairs
Natural join	R	><
Theta join	R	><
Division	R / S	Tuples in R related to all tuples in S

GATE trap: Natural join automatically joins on all attributes with the same name. Theta join is more general. Equijoin is a theta join where all conditions are equalities.

Tuple relational calculus

{t | P(t)}: the set of tuples t satisfying predicate P. Uses existential (there exists) and universal (for all) quantifiers.

A safe expression is one that does not produce an infinite result.

SQL

Core SQL syntax

-- Create table with constraints
CREATE TABLE Employee (
    emp_id   INT PRIMARY KEY,
    name     VARCHAR(50) NOT NULL,
    dept_id  INT REFERENCES Department(dept_id),
    salary   DECIMAL(10,2) CHECK (salary > 0)
);

-- Basic query with join
SELECT e.name, d.dept_name
FROM   Employee e
JOIN   Department d ON e.dept_id = d.dept_id
WHERE  e.salary > 50000;

-- Aggregation
SELECT dept_id, AVG(salary) AS avg_sal, COUNT(*) AS num_emp
FROM   Employee
GROUP  BY dept_id
HAVING AVG(salary) > 40000;

-- Subquery: find employees earning more than the average
SELECT name
FROM   Employee
WHERE  salary > (SELECT AVG(salary) FROM Employee);

-- Correlated subquery: departments with at least one employee
SELECT dept_name
FROM   Department d
WHERE  EXISTS (
    SELECT 1 FROM Employee e WHERE e.dept_id = d.dept_id
);

Joins

INNER JOIN: only matching rows., LEFT OUTER JOIN: all rows from left; NULLs for non-matching right., RIGHT OUTER JOIN: symmetric., FULL OUTER JOIN: all rows from both sides., NATURAL JOIN: equijoin on all common column names, eliminates duplicate columns.

GATE trap: NULL is not equal to anything, including NULL. Use IS NULL or IS NOT NULL. Aggregates like COUNT, SUM ignore NULLs.

Functional dependencies and normal forms

Closure of attributes

Given a set of FDs F and an attribute set X, X+ (closure) = the set of attributes functionally determined by X.

Algorithm: start with X+ = X. Repeatedly apply: if A -> B and A is a subset of X+, add B to X+.

A key K is a superkey if K+ = all attributes. A key is a candidate key if no proper subset of K is also a superkey.

Normal forms

1NF: All attributes are atomic (no repeating groups, no multi-valued attributes in a single cell).

2NF: In 1NF, and every non-prime attribute is fully functionally dependent on every candidate key. (No partial dependency, non-key attribute depends on part of a composite key.)

3NF: In 2NF, and no non-prime attribute is transitively dependent on any candidate key. (No A -> B -> C where A is a key and B is not.)

BCNF: For every non-trivial FD X -> Y, X is a superkey. Stricter than 3NF.

3NF decomposition always preserves both lossless join and dependency preservation. BCNF may not preserve FDs.

GATE trap: BCNF is not always dependency-preserving. When GATE asks to decompose into BCNF and also preserve FDs, check carefully, it may be impossible, and 3NF is the answer.

Finding candidate keys

Compute the closure of every subset of attributes. The minimal subsets whose closure is the entire schema are candidate keys.

Attributes that appear only on the right-hand side of FDs are never part of any candidate key (they are always determined, never determining uniquely). Attributes that never appear on any RHS must be in every candidate key.

File organisation and indexing

File organisation

Heap (unordered): fast insertion, slow search O(n)., Sequential (ordered): supports binary search, efficient for range queries, but insertion is expensive., Hash: O(1) search for equality queries; inefficient for range queries.

B trees vs B+ trees

A B tree of order m: each node has at most m children; all nodes (including internal nodes) store records. A B+ tree stores records only in leaf nodes; internal nodes are purely for routing. Leaf nodes are linked in a doubly-linked list.

B+ tree advantages over B tree:

Range queries are efficient (traverse leaf linked list)., Internal nodes can hold more keys (no record pointers), so the tree is shorter., GATE almost exclusively tests B+ trees.

Order of B+ tree: if order = m, each internal node holds at most m-1 keys and m pointers. Each leaf holds at most m-1 key-value pairs.

Minimum fill: internal nodes (except root) have at least ceil(m/2) children. The root has at least 2 children.

GATE trap: After deletion, check if underfilling occurs and whether redistribution (borrow from sibling) or merging (with sibling) is needed. Merging reduces the parent's key count, which can propagate upward.

Transactions and concurrency control

ACID properties

Atomicity: a transaction executes fully or not at all., Consistency: a transaction takes the database from one consistent state to another., Isolation: concurrent transactions do not interfere., Durability: committed changes survive crashes.

Schedules and serializability

A schedule is conflict-serializable if it is equivalent (via swapping non-conflicting operations) to some serial schedule.

Two operations conflict if they access the same data item and at least one is a write.

Test: build the precedence graph. Nodes = transactions; add edge Ti -> Tj if Ti has a conflicting operation before Tj. The schedule is conflict-serializable if and only if the precedence graph is acyclic (no cycle).

Two-phase locking (2PL)

Each transaction has a growing phase (acquires locks, releases none) and a shrinking phase (releases locks, acquires none). The point of transition is the lock point.

Variants:

Basic 2PL: conflict-serializable but may allow cascading rollbacks., Strict 2PL: holds all exclusive (write) locks until commit/abort. Prevents cascading rollbacks; most common in practice., Rigorous 2PL: holds all locks (shared and exclusive) until commit/abort.

GATE trap: 2PL does not prevent deadlock. Deadlock detection (wait-for graph) or deadlock prevention (timestamp-based wound-wait or wait-die) must be added separately.

Timestamp ordering

Each transaction Ti gets a timestamp TS(Ti). For each data item Q, the system maintains read_TS(Q) and write_TS(Q).

If Ti wants to read Q: if TS(Ti) < write_TS(Q), Ti is too old, abort and restart. Else, update read_TS(Q). If Ti wants to write Q: if TS(Ti) < read_TS(Q) or TS(Ti) < write_TS(Q), abort. Else, write and update write_TS(Q).

Produces conflict-serializable schedules equivalent to the serial order by timestamp.

Worked examples

Example 1. Functional dependency closure.

Relation R(A, B, C, D, E). FDs: A -> BC, CD -> E, B -> D, E -> A.

Find candidate keys. Start with A: A+ = {A, B, C, D, E}. A is a superkey. No proper subset: {A} only has A, already minimal. A is a candidate key. Try B: B+ = {B, D}. Not a superkey. Try AB: already know A is a key. Try E: E+ = {E, A, B, C, D} = all attributes. E is a candidate key. Try CD: CD+ = {C, D, E, A, B} = all attributes. CD is a candidate key. Try BC: BC+ = {B, C, D, E, A} = all attributes. BC is a candidate key.

Candidate keys: A, E, CD, BC.

Example 2. SQL query result.

SELECT dept_id, COUNT(*) 
FROM Employee 
WHERE salary > 30000 
GROUP BY dept_id 
HAVING COUNT(*) >= 2;

This returns departments where at least 2 employees earn more than 30,000. The WHERE filters rows before grouping; HAVING filters groups after.

Example 3. Conflict serializability check.

Schedule S: R1(A), R2(B), W1(A), R2(A), W2(B).

Conflict pairs (same item, at least one write):

W1(A) and R2(A): T1 -> T2 (W1 before R2 on A).

Precedence graph: T1 -> T2. No cycle. Conflict-serializable. Equivalent serial schedule: T1, T2.

Example 4. Normalisation to BCNF.

R(A, B, C, D). FDs: A -> B, B -> C, C -> D, D -> A.

Candidate keys: A, B, C, D (all are keys since closure of each reaches all attributes).

All prime attributes. No non-prime attributes exist, R is in BCNF vacuously? No: check each FD. A -> B: A is a superkey (A+ = ABCD). B -> C: B is a superkey. All FDs have superkeys on the left. R is already in BCNF.

Example 5. B+ tree insertion.

B+ tree of order 3 (max 2 keys per node). Insert 1, 4, 7, 10, 17, 21, 31, 25, 19, 20.

After inserting 1, 4, 7: root [1, 4, 7] is overfull (3 keys, max is 2). Split: left leaf [1, 4], right leaf [7], root key [7]. Continue inserting 10 -> [7, 10], 17 -> split: [7, 10], [17], push 17 up. And so on. The final tree structure is left as an exercise, but the key rule is: when a leaf overflows, the middle key is pushed up to the parent.

Quick-revision summary

ER: weak entity identified by (owner key + partial key). Many-to-many -> separate relation., Candidate key: minimal superkey. Closure test is the standard tool., 3NF preserves FDs; BCNF may not. BCNF is stricter (every FD LHS is a superkey)., SQL GROUP BY filters with HAVING; WHERE filters before grouping. NULL != NULL., B+ tree: records only in leaves; leaves linked. Order m -> max m-1 keys per node., Conflict-serializable iff precedence graph is acyclic., 2PL: growing then shrinking. Strict 2PL: hold X-locks till commit. Does NOT prevent deadlock.

How to study this unit

Practice computing attribute closures and finding candidate keys on 15+ examples. This is tested almost every year.
Write SQL queries from English descriptions, not just read them. Practice on a live database if possible.
Decompose relations into BCNF step by step, checking lossless join (using the intersection-in-FD test) at each split.
Simulate B+ tree insertions and deletions on paper using small trees (order 3 or 4).
Build precedence graphs for 10 schedules and test serializability. Practice until it takes under 2 minutes per schedule.
Revise ACID and 2PL variants with a single comparison table before the exam.