Databases

Why this unit matters

Databases in ISRO CS is heavily weighted toward SQL queries, normalisation, and transaction management. ISRO setters frequently ask you to write or evaluate a relational algebra expression, identify the normal form of a given relation, find the candidate keys from a set of functional dependencies, or determine whether a schedule is conflict-serialisable. Indexing calculations (B-tree height, number of I/Os) also appear as numerical questions.

The unit connects to Algorithms (B-tree, hashing used for indexes) and Operating Systems (lock management, concurrency parallels with synchronisation).

Syllabus map

Entity-Relationship model
- Entities, attributes (simple, composite, multi-valued, derived), entity sets
- Relationships, relationship sets, cardinality ratios (1:1, 1:N, M:N), participation (total, partial)
- Weak entities and identifying relationships
- ER to relational schema conversion
Relational model
- Relation, tuple, attribute, domain, schema
- Keys: super key, candidate key, primary key, foreign key
- Relational algebra: select (sigma), project (pi), union, difference, Cartesian product, rename, join (natural, theta, equi, outer joins)
- Tuple relational calculus (TRC) and domain relational calculus (DRC)
SQL
- DDL: CREATE, ALTER, DROP; constraints (PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, CHECK)
- DML: SELECT, INSERT, UPDATE, DELETE
- Clauses: WHERE, GROUP BY, HAVING, ORDER BY, DISTINCT
- Joins: INNER, LEFT, RIGHT, FULL OUTER, CROSS, SELF
- Subqueries: correlated and non-correlated; EXISTS, IN, ANY, ALL
- Aggregate functions: COUNT, SUM, AVG, MIN, MAX
- Views, indexes in SQL
Functional dependencies and normalisation
- Functional dependency (FD), trivial FD
- Armstrong's axioms: reflexivity, augmentation, transitivity
- Closure of a set of FDs, closure of an attribute set
- Candidate keys: finding all candidate keys from FDs
- Normal forms: 1NF, 2NF, 3NF, BCNF
- Decomposition: lossless join property, dependency preservation
- Higher normal forms: 4NF (multi-valued dependencies), 5NF (briefly)
Indexing
- Dense and sparse indexes
- Primary and secondary indexes
- B-tree and B+ tree: structure, height, search, insert, delete
- Hashing: static and dynamic (extendible hashing)
Transactions and concurrency control
- ACID properties: atomicity, consistency, isolation, durability
- Schedules: serial, serialisable
- Conflict serializability: conflict-equivalent schedules, precedence graph (serialisability graph)
- View serializability
- Recoverable schedules, cascadeless schedules, strict schedules
- Two-phase locking (2PL): basic, strict, rigorous
- Timestamp-based concurrency control
- Deadlock in databases
Recovery
- Log-based recovery: undo logging, redo logging, undo-redo logging
- Checkpointing
- ARIES protocol (brief)

Relational algebra

Key operators

Select (sigma): sigma_{condition}(R). Filters rows. Unary. Project (pi): pi_{A1, A2, ...}(R). Selects columns, removes duplicates. Unary. Union: R union S. Both must be union-compatible (same schema). No duplicates in result. Difference: R - S. Tuples in R but not in S. Cartesian product: R x S. All combinations. If R has r tuples, S has s tuples, result has r*s tuples. Rename: rho_{new_name}(R) or rho_{new_name(A1, A2...)}(R).

Natural join: R |><| S. Equijoin on all common attributes, with duplicate columns removed. Theta join: R |><|{condition} S = sigma{condition}(R x S). Outer joins: left outer, right outer, full outer (preserve non-matching tuples with NULL).

Division: R / S. Tuples in R that are paired with ALL tuples in S. Used for "find students who have taken ALL courses".

Relational algebra to SQL mapping

sigma_{A=5}(R) --> SELECT * FROM R WHERE A = 5. pi_{A,B}(R) --> SELECT DISTINCT A, B FROM R. R |><| S (natural join) --> SELECT * FROM R NATURAL JOIN S (or explicit equijoin).

SQL

Joins

-- Inner join: only matching rows
SELECT e.name, d.dept_name
FROM Employee e INNER JOIN Department d ON e.dept_id = d.dept_id;

-- Left outer join: all employees, even those with no department
SELECT e.name, d.dept_name
FROM Employee e LEFT OUTER JOIN Department d ON e.dept_id = d.dept_id;

Subqueries and EXISTS

-- Correlated subquery: find students who scored above average
SELECT name FROM Student s
WHERE grade > (SELECT AVG(grade) FROM Student WHERE dept = s.dept);

-- EXISTS: find departments with at least one student
SELECT dept_name FROM Department d
WHERE EXISTS (SELECT 1 FROM Student WHERE dept_id = d.dept_id);

GROUP BY and HAVING

SELECT dept_id, COUNT(*) as cnt, AVG(salary) as avg_sal
FROM Employee
GROUP BY dept_id
HAVING AVG(salary) > 50000;

HAVING filters groups (after GROUP BY). WHERE filters rows (before GROUP BY).

ISRO trap: you cannot use aggregate functions in a WHERE clause. They go in HAVING.

Functional dependencies and normalisation

Armstrong's axioms and closure

From a set F of FDs, derive all FDs implied by F using:

Reflexivity: if Y subset X, then X -> Y.
Augmentation: if X -> Y, then XZ -> YZ.
Transitivity: if X -> Y and Y -> Z, then X -> Z.

The closure F+ is the set of all FDs derivable from F.

The closure of attribute set X under F, written X+, is the set of all attributes determined by X.

Algorithm for X+: start with X+ = X. For each FD A -> B in F, if A subset X+, add B to X+. Repeat until no change.

Finding candidate keys

A candidate key is a minimal superkey. To find all candidate keys, find all attribute sets X such that X+ = R (all attributes) and no proper subset of X has this property.

Example: R(A, B, C, D), FDs: A -> B, B -> C, C -> D, D -> A.

A+ = {A, B, C, D} = R. So A is a superkey. Is A minimal? Only one attribute, so yes. A is a candidate key. Similarly, B, C, D are each candidate keys (since B -> C -> D -> A -> B, they all determine everything).

Normal forms

1NF: all attribute values are atomic (no repeating groups, no multi-valued attributes in a single cell).

2NF: in 1NF and no partial dependency of any non-prime attribute on any candidate key. (A partial dependency exists when a non-prime attribute depends on a proper subset of a candidate key.)

3NF: in 2NF and no transitive dependency of any non-prime attribute on a candidate key. (A -> B and B -> C where B is not a key means C transitively depends on A through B.)

BCNF: for every FD X -> Y in F+ (non-trivial), X is a superkey. BCNF is stricter than 3NF. A relation in BCNF is always in 3NF; the converse is not true.

ISRO trap: a relation can be in 3NF but not BCNF. This happens when a non-prime attribute determines a prime attribute (part of a key). BCNF decomposition may not preserve all dependencies.

Lossless join decomposition

A decomposition of R into R1 and R2 is lossless if R1 intersect R2 -> R1 or R1 intersect R2 -> R2 (the common attributes form a superkey of at least one of the two parts).

Indexing

B+ tree

A B+ tree of order p (maximum p pointers per internal node) has:

Internal nodes: at most p pointers, at least ceil(p/2) pointers.
Leaf nodes: store actual data or pointers to data. Leaves are linked together.

Height for n records: at most ceil(log_{ceil(p/2)}(n)).

B+ trees are preferred over B-trees in databases because all data records are in leaves, so range queries only traverse leaf-level links without climbing the tree.

Hashing vs B+ tree

B+ tree: good for range queries and ordered access. O(log n) per operation. Hashing: O(1) average for point queries. Not suitable for range queries.

Transactions and concurrency control

Schedules and conflict serializability

Two operations conflict if: they are from different transactions, they access the same data item, and at least one is a write.

A schedule S is conflict-serialisable if it is conflict-equivalent to some serial schedule. Test using a precedence graph (serialisability graph): create a directed edge T_i -> T_j if T_i has a conflicting operation before T_j's conflicting operation. S is conflict-serialisable if and only if the precedence graph is acyclic.

ISRO trap: conflict serializability implies view serializability, but not vice versa. View-serialisable schedules that are not conflict-serialisable are uncommon but exist.

Two-phase locking (2PL)

Growing phase: transaction acquires locks, does not release any. Shrinking phase: transaction releases locks, does not acquire any.

2PL guarantees conflict serializability. Strict 2PL: release all exclusive locks only after commit/abort (prevents cascading rollbacks). Rigorous 2PL: release all locks (shared and exclusive) only after commit/abort.

2PL can cause deadlock. Deadlock detection uses a wait-for graph.

Recoverable and cascadeless schedules

Recoverable: if T_i reads from T_j, T_j must commit before T_i commits.

Cascadeless (avoids cascading rollbacks): if T_i reads from T_j, T_j must commit before T_i reads.

Strict: if T_i reads or writes a data item written by T_j, T_j must commit or abort before T_i's operation.

Hierarchy (strictest to least): strict subset cascadeless subset recoverable. A serial schedule is strict.

Worked examples

Example 1: Candidate key identification

R(A, B, C, D, E), FDs: AB -> C, C -> D, D -> E.

Find A+: A+ = {A}. AB+ = {A, B, C, D, E} = R. So AB is a superkey. Is AB minimal? A+ = {A} (not R), B+ = {B} (not R). Neither A nor B alone is a superkey. So AB is a candidate key.

Are there other candidate keys? B does not determine A (no FD has B on the left without A), so any key must include A. And with A, we need something to start the chain. Only AB gives us full closure. One candidate key: AB.

Example 2: Checking normal form

R(A, B, C, D), FDs: A -> B, A -> C, C -> D. Primary key: A.

1NF: assume all atomic. Yes. 2NF: no partial dependency (key is A, a single attribute, no proper subset). Yes. 3NF: check for transitive dependencies. A -> C -> D. D is non-prime (not part of any key), and it transitively depends on A via C (C is not a superkey). So R is NOT in 3NF.

Decompose for 3NF: R1(A, B, C), R2(C, D). R1 with FDs A->B, A->C is in 3NF. R2 with C->D is in 3NF (C is the key of R2).

Example 3: SQL query writing

Find names of employees who earn more than all employees in department 10.

SELECT name FROM Employee
WHERE salary > ALL (SELECT salary FROM Employee WHERE dept_id = 10);

Or equivalently:

SELECT name FROM Employee
WHERE salary > (SELECT MAX(salary) FROM Employee WHERE dept_id = 10);

Example 4: Conflict serialisability check

Schedule S: R1(A), R2(B), W1(B), R2(A), W2(A), W1(A).

Conflicts:

W1(B) and R2(B): R2(B) is before W1(B)? No, R2(B) step 2, W1(B) step 3. R2 reads B before T1 writes B. Edge T2 -> T1.
W1(A) (step 6) and W2(A) (step 5): W2 before W1. Edge T2 -> T1.
R2(A) (step 4) and W1(A) (step 6): R2 before W1. Edge T2 -> T1.
W2(A) (step 5) and W1(A) (step 6): already noted.
R1(A) (step 1) and W2(A) (step 5): T1 reads before T2 writes. Edge T1 -> T2.

Precedence graph: T1 -> T2 and T2 -> T1 (cycle). Not conflict-serialisable.

Example 5: B+ tree height calculation

B+ tree with order p = 50 (each internal node has up to 50 pointers). Leaf nodes hold up to 49 key-pointer pairs.

Number of records: n = 1,000,000.

At height h, maximum leaf nodes = (50/2)^(h-1) * (leaves per path). Minimum leaf nodes at height h: each internal node has ceil(50/2) = 25 children (except root with at least 2). Minimum leaves = 2 * 25^(h-2) for h >= 2.

With 1,000,000 records and 49 per leaf, we need at least 1,000,000/49 approx 20,409 leaves. For height 3: minimum leaves = 2 * 25 = 50. Maximum = 50^2 = 2500. Not enough. Height 4: minimum leaves = 2 * 25^2 = 1250, max = 50^3 = 125,000. Not enough for 20,409. Height 4 max 125,000 > 20,409. So height 4 or 5 suffices. Ceiling(log_25(20409)) + 1 = ceiling(log_25(20409)) + 1. log_25(20409) = log(20409)/log(25) = 4.31/1.40 = 3.08. So height 4 is needed.

Quick-revision summary

Relational algebra: sigma (filter rows), pi (select columns), |><| (natural join), / (division for "all" queries).
HAVING filters groups; WHERE filters rows. Aggregate functions not allowed in WHERE.
Candidate key: minimal superkey. Find by computing attribute closure.
2NF: no partial dependencies. 3NF: no transitive dependencies. BCNF: every determinant is a superkey.
BCNF decomposition may lose some FDs. 3NF synthesis preserves all FDs.
Conflict serializability: precedence graph acyclic. View serializability is weaker.
2PL guarantees conflict serializability but can deadlock.
Strict schedules: prevent cascading rollbacks. Strict 2PL ensures strict schedules.
B+ tree preferred over B-tree for databases due to linked leaves (efficient range queries).
Lossless join: common attributes in decomposition form a superkey of at least one part.

How to study this unit

Practise computing attribute closures and finding candidate keys for 5-6 sets of FDs. Time yourself to do it in under 2 minutes per question.
Work through normalisation decomposition: given a relation and its FDs, decompose to 3NF (synthesis algorithm) and BCNF (decomposition algorithm). Verify lossless join and dependency preservation.
Write SQL for "find X who satisfies condition involving Y from another table" type questions. Practise correlated subqueries and EXISTS.
Trace conflict serializability using precedence graphs for 3-4 schedules. Mix serialisable and non-serialisable.
Practise B+ tree height calculation with different orders and record counts. Know the formula for minimum and maximum leaves at each height.
Revise ACID properties and the recoverable/cascadeless/strict hierarchy until you can classify any schedule by inspection.