Databases
ER, relational model, SQL, normalization, transactions, indexing, concurrency control.
Why this unit matters
Databases in ISRO CS is heavily weighted toward SQL queries, normalisation, and transaction management. ISRO setters frequently ask you to write or evaluate a relational algebra expression, identify the normal form of a given relation, find the candidate keys from a set of functional dependencies, or determine whether a schedule is conflict-serialisable. Indexing calculations (B-tree height, number of I/Os) also appear as numerical questions.
The unit connects to Algorithms (B-tree, hashing used for indexes) and Operating Systems (lock management, concurrency parallels with synchronisation).
Syllabus map
-
Entity-Relationship model
- Entities, attributes (simple, composite, multi-valued, derived), entity sets
- Relationships, relationship sets, cardinality ratios (1:1, 1:N, M:N), participation (total, partial)
- Weak entities and identifying relationships
- ER to relational schema conversion
-
Relational model
- Relation, tuple, attribute, domain, schema
- Keys: super key, candidate key, primary key, foreign key
- Relational algebra: select (sigma), project (pi), union, difference, Cartesian product, rename, join (natural, theta, equi, outer joins)
- Tuple relational calculus (TRC) and domain relational calculus (DRC)
-
SQL
- DDL: CREATE, ALTER, DROP; constraints (PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, CHECK)
- DML: SELECT, INSERT, UPDATE, DELETE
- Clauses: WHERE, GROUP BY, HAVING, ORDER BY, DISTINCT
- Joins: INNER, LEFT, RIGHT, FULL OUTER, CROSS, SELF
- Subqueries: correlated and non-correlated; EXISTS, IN, ANY, ALL
- Aggregate functions: COUNT, SUM, AVG, MIN, MAX
- Views, indexes in SQL
-
Functional dependencies and normalisation
- Functional dependency (FD), trivial FD
- Armstrong's axioms: reflexivity, augmentation, transitivity
- Closure of a set of FDs, closure of an attribute set
- Candidate keys: finding all candidate keys from FDs
- Normal forms: 1NF, 2NF, 3NF, BCNF
- Decomposition: lossless join property, dependency preservation
- Higher normal forms: 4NF (multi-valued dependencies), 5NF (briefly)
-
Indexing
- Dense and sparse indexes
- Primary and secondary indexes
- B-tree and B+ tree: structure, height, search, insert, delete
- Hashing: static and dynamic (extendible hashing)
-
Transactions and concurrency control
- ACID properties: atomicity, consistency, isolation, durability
- Schedules: serial, serialisable
- Conflict serializability: conflict-equivalent schedules, precedence graph (serialisability graph)
- View serializability
- Recoverable schedules, cascadeless schedules, strict schedules
- Two-phase locking (2PL): basic, strict, rigorous
- Timestamp-based concurrency control
- Deadlock in databases
-
Recovery
- Log-based recovery: undo logging, redo logging, undo-redo logging
- Checkpointing
- ARIES protocol (brief)
Relational algebra
Key operators
Select (sigma): sigma_{condition}(R). Filters rows. Unary. Project (pi): pi_{A1, A2, ...}(R). Selects columns, removes duplicates. Unary. Union: R union S. Both must be union-compatible (same schema). No duplicates in result. Difference: R - S. Tuples in R but not in S. Cartesian product: R x S. All combinations. If R has r tuples, S has s tuples, result has r*s tuples. Rename: rho_{new_name}(R) or rho_{new_name(A1, A2...)}(R).
Natural join: R |><| S. Equijoin on all common attributes, with duplicate columns removed. Theta join: R |><|{condition} S = sigma{condition}(R x S). Outer joins: left outer, right outer, full outer (preserve non-matching tuples with NULL).
Division: R / S. Tuples in R that are paired with ALL tuples in S. Used for "find students who have taken ALL courses".
Relational algebra to SQL mapping
sigma_{A=5}(R) --> SELECT * FROM R WHERE A = 5. pi_{A,B}(R) --> SELECT DISTINCT A, B FROM R. R |><| S (natural join) --> SELECT * FROM R NATURAL JOIN S (or explicit equijoin).
SQL
Joins
-- Inner join: only matching rows
SELECT e.name, d.dept_name
FROM Employee e INNER JOIN Department d ON e.dept_id = d.dept_id;
-- Left outer join: all employees, even those with no department
SELECT e.name, d.dept_name
FROM Employee e LEFT OUTER JOIN Department d ON e.dept_id = d.dept_id;
Subqueries and EXISTS
-- Correlated subquery: find students who scored above average
SELECT name FROM Student s
WHERE grade > (SELECT AVG(grade) FROM Student WHERE dept = s.dept);
-- EXISTS: find departments with at least one student
SELECT dept_name FROM Department d
WHERE EXISTS (SELECT 1 FROM Student WHERE dept_id = d.dept_id);
GROUP BY and HAVING
SELECT dept_id, COUNT(*) as cnt, AVG(salary) as avg_sal
FROM Employee
GROUP BY dept_id
HAVING AVG(salary) > 50000;
HAVING filters groups (after GROUP BY). WHERE filters rows (before GROUP BY).
ISRO trap: you cannot use aggregate functions in a WHERE clause. They go in HAVING.
Functional dependencies and normalisation
Armstrong's axioms and closure
From a set F of FDs, derive all FDs implied by F using:
- Reflexivity: if Y subset X, then X -> Y.
- Augmentation: if X -> Y, then XZ -> YZ.
- Transitivity: if X -> Y and Y -> Z, then X -> Z.
The closure F+ is the set of all FDs derivable from F.
The closure of attribute set X under F, written X+, is the set of all attributes determined by X.
Algorithm for X+: start with X+ = X. For each FD A -> B in F, if A subset X+, add B to X+. Repeat until no change.
Finding candidate keys
A candidate key is a minimal superkey. To find all candidate keys, find all attribute sets X such that X+ = R (all attributes) and no proper subset of X has this property.
Example: R(A, B, C, D), FDs: A -> B, B -> C, C -> D, D -> A.
A+ = {A, B, C, D} = R. So A is a superkey. Is A minimal? Only one attribute, so yes. A is a candidate key. Similarly, B, C, D are each candidate keys (since B -> C -> D -> A -> B, they all determine everything).
Normal forms
1NF: all attribute values are atomic (no repeating groups, no multi-valued attributes in a single cell).
2NF: in 1NF and no partial dependency of any non-prime attribute on any candidate key. (A partial dependency exists when a non-prime attribute depends on a proper subset of a candidate key.)
3NF: in 2NF and no transitive dependency of any non-prime attribute on a candidate key. (A -> B and B -> C where B is not a key means C transitively depends on A through B.)
BCNF: for every FD X -> Y in F+ (non-trivial), X is a superkey. BCNF is stricter than 3NF. A relation in BCNF is always in 3NF; the converse is not true.
ISRO trap: a relation can be in 3NF but not BCNF. This happens when a non-prime attribute determines a prime attribute (part of a key). BCNF decomposition may not preserve all dependencies.
Lossless join decomposition
A decomposition of R into R1 and R2 is lossless if R1 intersect R2 -> R1 or R1 intersect R2 -> R2 (the common attributes form a superkey of at least one of the two parts).
Indexing
B+ tree
A B+ tree of order p (maximum p pointers per internal node) has:
- Internal nodes: at most p pointers, at least ceil(p/2) pointers.
- Leaf nodes: store actual data or pointers to data. Leaves are linked together.
Height for n records: at most ceil(log_{ceil(p/2)}(n)).
B+ trees are preferred over B-trees in databases because all data records are in leaves, so range queries only traverse leaf-level links without climbing the tree.
Hashing vs B+ tree
B+ tree: good for range queries and ordered access. O(log n) per operation. Hashing: O(1) average for point queries. Not suitable for range queries.
Transactions and concurrency control
Schedules and conflict serializability
Two operations conflict if: they are from different transactions, they access the same data item, and at least one is a write.
A schedule S is conflict-serialisable if it is conflict-equivalent to some serial schedule. Test using a precedence graph (serialisability graph): create a directed edge T_i -> T_j if T_i has a conflicting operation before T_j's conflicting operation. S is conflict-serialisable if and only if the precedence graph is acyclic.
ISRO trap: conflict serializability implies view serializability, but not vice versa. View-serialisable schedules that are not conflict-serialisable are uncommon but exist.
Two-phase locking (2PL)
Growing phase: transaction acquires locks, does not release any. Shrinking phase: transaction releases locks, does not acquire any.
2PL guarantees conflict serializability. Strict 2PL: release all exclusive locks only after commit/abort (prevents cascading rollbacks). Rigorous 2PL: release all locks (shared and exclusive) only after commit/abort.
2PL can cause deadlock. Deadlock detection uses a wait-for graph.
Recoverable and cascadeless schedules
Recoverable: if T_i reads from T_j, T_j must commit before T_i commits.
Cascadeless (avoids cascading rollbacks): if T_i reads from T_j, T_j must commit before T_i reads.
Strict: if T_i reads or writes a data item written by T_j, T_j must commit or abort before T_i's operation.
Hierarchy (strictest to least): strict subset cascadeless subset recoverable. A serial schedule is strict.
Worked examples
Example 1: Candidate key identification
R(A, B, C, D, E), FDs: AB -> C, C -> D, D -> E.
Find A+: A+ = {A}. AB+ = {A, B, C, D, E} = R. So AB is a superkey. Is AB minimal? A+ = {A} (not R), B+ = {B} (not R). Neither A nor B alone is a superkey. So AB is a candidate key.
Are there other candidate keys? B does not determine A (no FD has B on the left without A), so any key must include A. And with A, we need something to start the chain. Only AB gives us full closure. One candidate key: AB.
Example 2: Checking normal form
R(A, B, C, D), FDs: A -> B, A -> C, C -> D. Primary key: A.
1NF: assume all atomic. Yes. 2NF: no partial dependency (key is A, a single attribute, no proper subset). Yes. 3NF: check for transitive dependencies. A -> C -> D. D is non-prime (not part of any key), and it transitively depends on A via C (C is not a superkey). So R is NOT in 3NF.
Decompose for 3NF: R1(A, B, C), R2(C, D). R1 with FDs A->B, A->C is in 3NF. R2 with C->D is in 3NF (C is the key of R2).
Example 3: SQL query writing
Find names of employees who earn more than all employees in department 10.
SELECT name FROM Employee
WHERE salary > ALL (SELECT salary FROM Employee WHERE dept_id = 10);
Or equivalently:
SELECT name FROM Employee
WHERE salary > (SELECT MAX(salary) FROM Employee WHERE dept_id = 10);
Example 4: Conflict serialisability check
Schedule S: R1(A), R2(B), W1(B), R2(A), W2(A), W1(A).
Conflicts:
- W1(B) and R2(B): R2(B) is before W1(B)? No, R2(B) step 2, W1(B) step 3. R2 reads B before T1 writes B. Edge T2 -> T1.
- W1(A) (step 6) and W2(A) (step 5): W2 before W1. Edge T2 -> T1.
- R2(A) (step 4) and W1(A) (step 6): R2 before W1. Edge T2 -> T1.
- W2(A) (step 5) and W1(A) (step 6): already noted.
- R1(A) (step 1) and W2(A) (step 5): T1 reads before T2 writes. Edge T1 -> T2.
Precedence graph: T1 -> T2 and T2 -> T1 (cycle). Not conflict-serialisable.
Example 5: B+ tree height calculation
B+ tree with order p = 50 (each internal node has up to 50 pointers). Leaf nodes hold up to 49 key-pointer pairs.
Number of records: n = 1,000,000.
At height h, maximum leaf nodes = (50/2)^(h-1) * (leaves per path). Minimum leaf nodes at height h: each internal node has ceil(50/2) = 25 children (except root with at least 2). Minimum leaves = 2 * 25^(h-2) for h >= 2.
With 1,000,000 records and 49 per leaf, we need at least 1,000,000/49 approx 20,409 leaves. For height 3: minimum leaves = 2 * 25 = 50. Maximum = 50^2 = 2500. Not enough. Height 4: minimum leaves = 2 * 25^2 = 1250, max = 50^3 = 125,000. Not enough for 20,409. Height 4 max 125,000 > 20,409. So height 4 or 5 suffices. Ceiling(log_25(20409)) + 1 = ceiling(log_25(20409)) + 1. log_25(20409) = log(20409)/log(25) = 4.31/1.40 = 3.08. So height 4 is needed.
Quick-revision summary
- Relational algebra: sigma (filter rows), pi (select columns), |><| (natural join), / (division for "all" queries).
- HAVING filters groups; WHERE filters rows. Aggregate functions not allowed in WHERE.
- Candidate key: minimal superkey. Find by computing attribute closure.
- 2NF: no partial dependencies. 3NF: no transitive dependencies. BCNF: every determinant is a superkey.
- BCNF decomposition may lose some FDs. 3NF synthesis preserves all FDs.
- Conflict serializability: precedence graph acyclic. View serializability is weaker.
- 2PL guarantees conflict serializability but can deadlock.
- Strict schedules: prevent cascading rollbacks. Strict 2PL ensures strict schedules.
- B+ tree preferred over B-tree for databases due to linked leaves (efficient range queries).
- Lossless join: common attributes in decomposition form a superkey of at least one part.
How to study this unit
- Practise computing attribute closures and finding candidate keys for 5-6 sets of FDs. Time yourself to do it in under 2 minutes per question.
- Work through normalisation decomposition: given a relation and its FDs, decompose to 3NF (synthesis algorithm) and BCNF (decomposition algorithm). Verify lossless join and dependency preservation.
- Write SQL for "find X who satisfies condition involving Y from another table" type questions. Practise correlated subqueries and EXISTS.
- Trace conflict serializability using precedence graphs for 3-4 schedules. Mix serialisable and non-serialisable.
- Practise B+ tree height calculation with different orders and record counts. Know the formula for minimum and maximum leaves at each height.
- Revise ACID properties and the recoverable/cascadeless/strict hierarchy until you can classify any schedule by inspection.
Prefer watching over reading?
Subscribe for free.