Suparn Patra: CS Notes, GATE/NET, CBSE & Code Playground

Why this unit matters

DBMS is one of the most reliably tested units in NET CS. Questions are a mix of definition recall (Codd's rules, ACID properties) and analytical problem-solving (normalise a relation, write a relational algebra expression, analyse a transaction schedule for serializability). Candidates with hands-on SQL experience will find this unit approachable, but the theoretical portions, relational algebra, normalisation theory, and concurrency control, require careful study. Expect 8-12 questions per paper from this unit.

Syllabus map

Every named topic in the official UGC NET CS Code 87 syllabus for Unit 4:

Database Concepts and Architecture: data vs information, database vs file system, data models (hierarchical, network, relational, object-oriented), schema vs instance, three-schema architecture (internal, conceptual, external), logical and physical data independence, database languages (DDL, DML, DCL, TCL), interfaces (forms, GUI, NLI, programmatic), centralised vs client-server architectures, Data Modelling: ER model (entities, attributes, relationships, cardinality, participation constraints, weak entities, ISA hierarchies, aggregation), relational model (tuples, attributes, domains, relation, schema, constraints), integrity constraints (domain, key, entity integrity, referential integrity), relational algebra, tuple relational calculus, domain relational calculus, Codd's 12 rules, SQL: DDL (CREATE TABLE, ALTER TABLE, DROP TABLE), data types (INT, VARCHAR, DATE, FLOAT, BLOB), constraints (PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, CHECK, DEFAULT), DML queries (SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, DISTINCT, JOIN types, subqueries, correlated subqueries), INSERT, DELETE, UPDATE, views, stored procedures, user-defined functions, triggers (BEFORE/AFTER, INSERT/UPDATE/DELETE), SQL injection, Normalisation: functional dependencies, Armstrong's axioms, closure of attributes, candidate keys, minimal cover, 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, query processing pipeline (parsing, translation, optimisation, evaluation), query optimisation (equivalence rules, cost estimation, index selection), transaction properties (ACID), concurrency control (lock-based protocols, 2PL, timestamp ordering, MVCC), serializability (conflict, view), deadlock in DBMS, recovery (log-based, undo-redo, checkpoints, shadow paging), database security and authorisation (GRANT, REVOKE, role-based access), Enhanced Data Models: temporal databases, multimedia databases, deductive databases (Datalog), XML databases, mobile databases, GIS, genome data, distributed databases, client-server databases, Data Warehousing and Mining: data warehouse architecture, ETL, concept hierarchy, OLAP operations (roll-up, drill-down, slice, dice, pivot), OLTP vs OLAP, association rules (Apriori algorithm, support, confidence, lift), classification algorithms (decision tree, naive Bayes), clustering (k-means, hierarchical, DBSCAN), regression, SVM, kNN, HMM, summarisation, dependency modelling, link analysis, sequence analysis, social network analysis, Big Data: characteristics (volume, velocity, variety, veracity, value, the 5 Vs), types of big data, big data architecture, MapReduce programming model, Hadoop ecosystem (HDFS, YARN, MapReduce, Hive, HBase, Pig), NoSQL: types (key-value, document, column-family, graph), CAP theorem, BASE properties, NoSQL query languages, products (MongoDB, Cassandra, Redis, Neo4j), indexing in NoSQL, NoSQL on cloud platforms

Database Concepts and Architecture

Three-Schema Architecture

The three-schema (ANSI/SPARC) architecture separates the way data is physically stored from how it is conceptually organised and how individual users see it.

Internal schema: describes physical storage, file organisation, indexes, storage structures., Conceptual schema: the community view, all entities, relationships, and constraints. What a DBA works with., External schema (view): a subset of the conceptual schema tailored for a specific user or application.

Physical data independence: the conceptual schema does not change when the internal schema changes (e.g., adding an index). Logical data independence: the external schemas do not change when the conceptual schema changes (harder to achieve).

Data Models

Hierarchical: data as a tree. Records linked parent-to-child. Navigational, not flexible. Network: data as a graph. Pointers between records. More flexible than hierarchical, still navigational. Relational: data as tables. Operations via algebra. No pointers, joins are computed on the fly. Object-oriented: data as objects with identity, state, and behaviour.

Data Modelling

Entity-Relationship Model

Entity: a real-world object (Student, Course). Weak entity: cannot be uniquely identified by its own attributes alone; depends on a strong entity (e.g., a Dependent of an Employee).

Attribute types: simple, composite, multi-valued (double ellipse in ER diagram), derived (dashed ellipse).

Relationship: an association among entities. Binary (two entities), ternary (three), etc.

Cardinality: one-to-one (1:1), one-to-many (1:N), many-to-many (M:N). Participation: total participation (every entity instance must participate, double line) vs partial participation (single line).

Relational Algebra

The five basic operations:

Selection (sigma): filters rows. sigma_{condition}(R) returns tuples satisfying the condition., Projection (pi): filters columns. pi_{A1,A2,...}(R) returns only the listed attributes., Union (union): combines tuples from two compatible relations; removes duplicates., Set difference (minus): tuples in R1 but not in R2., Cartesian product (x): all combinations of tuples from R1 and R2.

Derived operations:

Natural join: join on common attributes; eliminates duplicate columns., Theta join: Cartesian product followed by selection on a condition., Division: R1 / R2 gives tuples in R1 that are associated with all tuples in R2.

Codd's 12 Rules

Codd defined 12 rules (Rule 0 to Rule 12) for a fully relational DBMS. Key ones tested in NET:

Rule 1 (Information Rule): all information is represented as values in tables., Rule 2 (Guaranteed Access): every datum is accessible by table name, primary key, and attribute name., Rule 6 (View Updating): all updatable views must be updatable by the system., Rule 9 (Logical Data Independence): application programs are unaffected by logical schema changes., Rule 12 (Nonsubversion): no low-level access should bypass integrity constraints.

SQL

DDL Examples

CREATE TABLE Student (
    SID      INT          PRIMARY KEY,
    Name     VARCHAR(50)  NOT NULL,
    DOB      DATE,
    Dept     VARCHAR(30)  DEFAULT 'CSE',
    CGPA     FLOAT        CHECK (CGPA BETWEEN 0 AND 10)
);

CREATE TABLE Enrolment (
    SID      INT,
    CID      INT,
    Grade    CHAR(1),
    PRIMARY KEY (SID, CID),
    FOREIGN KEY (SID) REFERENCES Student(SID) ON DELETE CASCADE,
    FOREIGN KEY (CID) REFERENCES Course(CID)
);

DML, Queries and Joins

-- Students in CSE with CGPA above 8, ordered by CGPA
SELECT Name, CGPA
FROM   Student
WHERE  Dept = 'CSE' AND CGPA > 8.0
ORDER BY CGPA DESC;

-- Total enrolments per course
SELECT CID, COUNT(*) AS enrolments
FROM   Enrolment
GROUP BY CID
HAVING COUNT(*) > 10;

-- Students and the courses they are enrolled in (inner join)
SELECT S.Name, C.CourseName
FROM   Student S
JOIN   Enrolment E ON S.SID = E.SID
JOIN   Course   C ON E.CID = C.CID;

-- Students not enrolled in any course (left join with NULL check)
SELECT S.Name
FROM   Student S
LEFT JOIN Enrolment E ON S.SID = E.SID
WHERE  E.SID IS NULL;

Views, Stored Procedures, Triggers

-- View: a named virtual table
CREATE VIEW CSE_Students AS
    SELECT SID, Name, CGPA FROM Student WHERE Dept = 'CSE';

-- Stored procedure
DELIMITER //
CREATE PROCEDURE GetStudentGrades(IN student_id INT)
BEGIN
    SELECT E.CID, E.Grade FROM Enrolment E WHERE E.SID = student_id;
END //
DELIMITER ;

-- Trigger: auto-log deletions
CREATE TRIGGER after_student_delete
AFTER DELETE ON Student
FOR EACH ROW
    INSERT INTO AuditLog(SID, Action, LogTime) VALUES (OLD.SID, 'DELETE', NOW());

SQL Injection

SQL injection is an attack where a user injects SQL code into an input field that is concatenated into a query.

Vulnerable code (conceptual): SELECT * FROM users WHERE name = ' + user_input + '; If user_input is ' OR '1'='1, the query returns all rows.

Prevention: use parameterised queries (prepared statements), stored procedures, input validation, and least-privilege database accounts.

Normalisation

Functional Dependencies

A functional dependency X -> Y means: for any two tuples with the same X value, they must also have the same Y value. X functionally determines Y.

Armstrong's Axioms:

Reflexivity: if Y is a subset of X, then X -> Y., Augmentation: if X -> Y, then XZ -> YZ., Transitivity: if X -> Y and Y -> Z, then X -> Z.

Derived rules: union, decomposition, pseudo-transitivity.

Closure of a set of attributes X (denoted X+): the set of all attributes functionally determined by X under a set F of FDs.

Candidate key: a minimal set of attributes whose closure is the entire relation.

Minimal cover: a canonical (minimal) set of FDs equivalent to F. No redundant FDs and each FD has a single attribute on the right.

Normal Forms

1NF: all attribute values are atomic (no multi-valued or composite attributes).

2NF: 1NF + no partial dependency. Every non-key attribute is fully functionally dependent on the entire primary key (relevant only when the primary key is composite).

3NF: 2NF + no transitive dependency. No non-key attribute depends on another non-key attribute. A relation is in 3NF if for every FD X -> A: X is a superkey, or A is a prime attribute (part of some candidate key).

BCNF (Boyce-Codd Normal Form): stronger than 3NF. For every FD X -> A: X must be a superkey. Every 3NF relation is not necessarily in BCNF. BCNF decomposition may not preserve all FDs.

4NF: BCNF + no non-trivial multi-valued dependencies. 5NF (Project-Join Normal Form): eliminates join dependencies.

In practice, 3NF is used when FD preservation is critical; BCNF is used when eliminating all redundancy is the priority.

Transactions and ACID

ACID properties:

Atomicity: a transaction is all-or-nothing. If any part fails, all changes are rolled back., Consistency: a transaction takes the database from one consistent state to another., Isolation: concurrent transactions execute as if serialised. Intermediate states are not visible to others., Durability: once committed, changes survive failures.

Concurrency Control

Schedule: the order of operations from concurrent transactions. A schedule is serialisable if its effect is equivalent to some serial schedule.

Conflict serialisability: two operations conflict if they access the same item and at least one is a write. A schedule is conflict-serialisable if its conflict graph (precedence graph) has no cycles.

Two-Phase Locking (2PL): a transaction acquires all locks before releasing any lock. Growing phase (acquire), shrinking phase (release). Guarantees conflict serialisability. Strict 2PL: release all locks only after commit/abort.

Timestamp ordering: each transaction gets a timestamp. If a later transaction tries to read or write data last updated by an earlier transaction in a conflicting way, it is rolled back.

MVCC (Multi-Version Concurrency Control): each write creates a new version; readers see a consistent snapshot. Reduces blocking; used in PostgreSQL, Oracle.

Recovery

Log-based recovery: a write-ahead log (WAL) records all changes before they are applied. On crash: redo committed transactions, undo uncommitted ones.

Undo/redo log: each log entry is (transaction, item, old value, new value).

Checkpoint: periodically flush all dirty pages and write a checkpoint record. On recovery, only start from the last checkpoint.

Data Warehousing and Mining

OLTP vs OLAP

OLTP (Online Transaction Processing): operational databases, many small concurrent transactions, normalised schema, optimised for writes and point queries.

OLAP (Online Analytical Processing): data warehouses, complex aggregation queries over large historical datasets, denormalised (star or snowflake schema), optimised for reads.

Star Schema

Fact table: central table with foreign keys to dimension tables and numeric measures (e.g., sales amount, quantity). Dimension tables: descriptive attributes (Product, Customer, Time, Store). Snowflake schema: dimension tables further normalised into multiple levels.

Association Rules

Support(A -> B) = (transactions containing both A and B) / (total transactions). Confidence(A -> B) = (transactions containing both A and B) / (transactions containing A). Lift(A -> B) = Confidence(A -> B) / Support(B).

Apriori algorithm: generate frequent itemsets level by level using the anti-monotone property (if an itemset is infrequent, all its supersets are also infrequent). Generate rules from frequent itemsets.

Classification Algorithms

Decision tree: recursively split data on the attribute that maximises information gain (ID3) or gain ratio (C4.5) or Gini impurity reduction (CART). Leaf nodes are class labels.

Naive Bayes: assumes all features are conditionally independent given the class. P(class | features) proportional to P(class) * product of P(feature_i | class). Fast, effective for text classification.

Clustering

k-means: specify k clusters. Initialise k centroids randomly. Assign each point to the nearest centroid. Recalculate centroids. Repeat until convergence. Sensitive to initialisation; requires k to be specified.

DBSCAN: density-based. A core point has at least MinPts neighbours within radius eps. Can find clusters of arbitrary shape; identifies noise points.

Big Data and NoSQL

MapReduce and Hadoop

MapReduce: a programming model for processing large datasets in parallel., Map phase: each mapper processes a subset of input records and emits (key, value) pairs., Shuffle and sort: the framework groups all values for the same key together., Reduce phase: each reducer processes all values for a key and emits the final output.

HDFS (Hadoop Distributed File System): stores large files across a cluster. Files are split into blocks (default 128 MB) and replicated (default 3 copies) for fault tolerance. NameNode manages metadata; DataNodes store blocks.

CAP Theorem and NoSQL

CAP theorem (Brewer): a distributed data store can guarantee at most two of three properties: Consistency, Availability, Partition tolerance.

Since network partitions are inevitable in distributed systems, NoSQL databases trade off either C or A.

NoSQL types:

Key-value: Redis, DynamoDB. Simple get/put by key. Very fast., Document: MongoDB. Stores JSON/BSON documents. Flexible schema., Column-family: Cassandra, HBase. Rows have many columns; sparse storage. Good for time-series., Graph: Neo4j. Stores nodes and edges. Efficient for relationship queries.

BASE (Basically Available, Soft state, Eventually consistent): the alternative to ACID for distributed NoSQL systems.

Worked Examples

Example 1: Relational Algebra

Relation Employee(EID, Name, Dept, Salary). Relation Dept(DName, ManagerEID). Query: Find names of employees in the 'CS' department with salary > 50000.

pi_{Name}(sigma_{Dept='CS' AND Salary>50000}(Employee))

Example 2: Normalisation, Identify Normal Form

Relation R(A, B, C, D) with primary key (A, B) and FDs: A -> C, AB -> D. Check 2NF: is there a partial dependency? A -> C means C depends only on A, which is part of the key. This is a partial dependency. R is in 1NF but not 2NF.

Decompose to 2NF: R1(A, C) and R2(A, B, D).

Example 3: SQL, Correlated Subquery

Find students whose CGPA is above the average CGPA of their department.

SELECT S.Name, S.Dept, S.CGPA
FROM   Student S
WHERE  S.CGPA > (
    SELECT AVG(S2.CGPA)
    FROM   Student S2
    WHERE  S2.Dept = S.Dept
);

The subquery runs once for each row of the outer query, this is a correlated subquery.

Example 4: Transaction Serializability

Transactions T1 and T2 execute the following schedule: R1(X) W2(X) W1(X) R2(Y). Conflicts: W2(X) and W1(X) conflict (both write X, different transactions). Draw precedence graph: T2 -> T1 (W2 before W1 on X). No cycle. The schedule is conflict-serialisable and equivalent to serial order T2 then T1.

Example 5: MapReduce, Word Count

The canonical MapReduce example:

Map function: for each word w in a line, emit (w, 1). Shuffle: groups all (w, 1) pairs by word. Reduce function: for each word w, sum all 1s and emit (w, total_count).

Quick-Revision Summary

Three-schema architecture: internal, conceptual, external., Three-schema enables physical and logical data independence., Codd's Rule 1: all information represented as values in tables., Relational algebra basic operations: sigma (select), pi (project), union, minus, x (Cartesian product)., 2NF eliminates partial dependencies. 3NF eliminates transitive dependencies. BCNF: every determinant is a superkey., ACID: Atomicity, Consistency, Isolation, Durability., Conflict serialisability: no cycle in precedence graph., Strict 2PL guarantees conflict serialisability., Support and confidence are the two metrics for association rules., Apriori uses anti-monotone property to prune itemsets., MapReduce: Map emits (k,v) pairs; Reduce aggregates by key., CAP: choose at most two of Consistency, Availability, Partition tolerance., NoSQL types: key-value, document, column-family, graph.

How to Study This Unit

Relational algebra is almost always tested, practise translating English queries to algebra and back.
Normalisation: work through at least 10 examples of identifying 1NF/2NF/3NF/BCNF violations and performing decomposition.
SQL: be comfortable with all JOIN types, GROUP BY with HAVING, and correlated subqueries. Write queries by hand.
Concurrency control: draw precedence graphs for at least 5 sample schedules to internalise conflict serialisability.
Data warehousing questions on NET tend to be definitional (OLAP vs OLTP, star vs snowflake), make a clear comparison table.
Big Data and NoSQL questions are increasing in frequency. Know the CAP theorem and the four NoSQL types with examples.
Do not skip Codd's rules, they are a favourite for 1-mark definitional questions.

Database Management Systems