Unit 5 of 79 min read

Database Management & Warehousing

ER, relational model, SQL, normalization, indexing, OLAP, multidimensional models.

Share:WhatsAppLinkedIn

Why this unit matters

Databases are the first point of contact with real data in any production system. GATE DA tests whether you can design schemas correctly, write SQL accurately, reason about integrity constraints, and understand how data is stored and retrieved efficiently. The warehousing portion connects directly to data analytics workflows, star schemas, fact tables, and concept hierarchies are how business intelligence systems are structured. This unit is highly formulaic: once you understand the rules, questions become predictable.

Syllabus map

Sub-topic Key concepts
ER model Entities, attributes, relationships, cardinality
Relational model Tables, tuples, domains, keys
Relational algebra Select, project, join, union, difference
Tuple calculus Declarative query notation
SQL DDL, DML, joins, aggregation, subqueries
Integrity constraints Domain, entity, referential integrity
Normal forms 1NF, 2NF, 3NF, BCNF
File organisation Heap, sequential, indexed
Indexing B+ tree, hash index
Data transformation Normalisation, discretisation, sampling, compression
Data warehousing Star/snowflake schema, fact tables, dimensions, measures

ER model

An Entity-Relationship (ER) diagram captures entities (things that exist), their attributes (properties), and relationships between them.

Cardinality ratios:

  • 1:1, each entity on both sides associates with at most one on the other., 1:N, one entity associates with many on the other side., M:N, many entities on both sides.

Participation:

  • Total (double line in ER): every entity in the set must participate., Partial (single line): participation is optional.

Weak entity: cannot be uniquely identified by its own attributes alone. It depends on a strong entity (owner) through an identifying relationship. Its primary key = partial key + owner's primary key.

Trap. Total participation does not mean the relationship is 1:1. They are independent notions.

Relational model and keys

A relation is a set of tuples (rows) over a schema (set of attribute names with domains). No two tuples are identical (set semantics).

Key types:

  • Super key: any set of attributes that uniquely identifies a tuple., Candidate key: minimal super key (removing any attribute breaks uniqueness)., Primary key: the chosen candidate key., Foreign key: an attribute (or set) in one table that references the primary key of another table.

Trap. A primary key is always a candidate key, but a candidate key is not always the primary key (the designer chooses one).

Relational algebra

The six fundamental operators:

  1. Selection (sigma): filters rows. sigma_{condition}(R)
  2. Projection (pi): filters columns. pi_{A1,A2}(R)
  3. Union (union): combines two compatible relations (same schema).
  4. Set difference (-): tuples in R but not S.
  5. Cartesian product (x): all combinations of tuples.
  6. Rename (rho): renames relation or attributes.

Join is a derived operator:

  • Natural join: joins on all common attributes and removes duplicates., Theta join: join with an arbitrary condition., Equi-join: theta join with equality condition., Left/right outer join: preserves all tuples from one side, fills NULLs.

Trap. Natural join can lose information if the common attribute has different meanings in the two relations. Always verify what attributes are being joined on.

SQL essentials

-- This is illustrative SQL (not Python)
-- Create table with integrity constraints
CREATE TABLE Student (
    sid   INT PRIMARY KEY,
    name  VARCHAR(50) NOT NULL,
    dept  VARCHAR(20),
    gpa   DECIMAL(3,2) CHECK (gpa BETWEEN 0 AND 10)
);

-- Basic query
SELECT name, gpa
FROM Student
WHERE dept = 'CS'
ORDER BY gpa DESC;

-- Aggregation with GROUP BY and HAVING
SELECT dept, AVG(gpa) AS avg_gpa
FROM Student
GROUP BY dept
HAVING AVG(gpa) > 7.5;

-- Inner join
SELECT S.name, C.course_name
FROM Student S
JOIN Enrolment E ON S.sid = E.sid
JOIN Course C ON E.cid = C.cid;

-- Correlated subquery: students with above-average GPA in their dept
SELECT name
FROM Student S1
WHERE gpa > (
    SELECT AVG(gpa)
    FROM Student S2
    WHERE S2.dept = S1.dept
);

HAVING vs WHERE. WHERE filters individual rows before grouping. HAVING filters groups after aggregation. You cannot use aggregate functions in WHERE.

Integrity constraints

  • Domain constraint: each attribute value must come from its domain (correct type and range).
  • Entity integrity: primary key attributes cannot be NULL.
  • Referential integrity: a foreign key value must either be NULL or match an existing primary key in the referenced table.

Referential integrity actions on delete: CASCADE (delete the referencing row too), SET NULL, SET DEFAULT, RESTRICT (prevent the delete).

Normalisation and normal forms

Normalisation eliminates redundancy and update anomalies by decomposing tables based on functional dependencies.

Functional dependency (FD): A -> B means knowing A uniquely determines B.

1NF. Each attribute must be atomic (no multivalued attributes, no repeating groups).

2NF. In 1NF, and no non-prime attribute is partially dependent on a candidate key. (Partial dependency: a non-prime attribute depends on only part of a composite key.)

3NF. In 2NF, and no non-prime attribute is transitively dependent on a primary key. (Transitive: A -> B -> C where B is not a key.)

BCNF. For every non-trivial FD A -> B, A must be a super key.

BCNF is strictly stronger than 3NF. A BCNF decomposition may not preserve all FDs. A 3NF decomposition always preserves FDs.

Trap. BCNF decomposition may lose FDs. GATE questions sometimes ask which normal form to use when FD preservation is required, the answer is 3NF.

Worked normalisation. Relation: StudentCourse(sid, cid, student_name, course_name, grade)., FDs: sid -> student_name; cid -> course_name; (sid, cid) -> grade., Candidate key: (sid, cid)., Violates 2NF: sid -> student_name (partial dependency)., Decompose: Student(sid, student_name), Course(cid, course_name), Enrolment(sid, cid, grade).

Indexing and file organisation

Heap file: records stored in no particular order. Insert: O(1). Search: O(n).

Sequential file: records sorted by key. Search: O(log n) with binary search. Insertion is expensive (may require shifting).

B+ tree index: balanced tree. All data in leaf nodes; leaf nodes linked for range queries. Search, insert, delete: O(log n). This is the most common index structure in practice.

Hash index: maps key to bucket using a hash function. Excellent for equality queries (O(1) average). Poor for range queries.

Primary index: on the sorted key field of the data file. One entry per data block. Secondary index: on a non-ordering field. One entry per record (or per value).

Data transformation

Before feeding data into ML models, data engineers apply:

  • Normalisation: scaling values to a standard range (e.g., 0 to 1, or z-score standardisation).
  • Discretisation: converting continuous values to discrete bins (e.g., ages into age groups).
  • Sampling: selecting a representative subset of data.
  • Compression: reducing storage size using lossless or lossy techniques.

These are tested as conceptual definitions. Know the difference between normalisation (in the ML/data sense) and normalisation (in the DB sense, reducing redundancy).

Data warehousing

A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data for decision support.

Star schema: one central fact table surrounded by dimension tables. Simple, fast for OLAP queries.

Snowflake schema: dimension tables are further normalised into sub-dimension tables. More normalised than star but more complex to query.

Fact table: contains measurable, quantitative data (measures) like sales amount, quantity sold. Has foreign keys to all dimension tables plus a composite primary key.

Dimension table: describes the context of the facts (who, what, when, where), Product, Customer, Date, Region.

Measures: the numeric values being analysed (e.g., revenue, count, average).

Concept hierarchy: organises dimension values at multiple levels of abstraction. For Date: day -> month -> quarter -> year. Enables drill-down and roll-up operations.

Worked examples

Example 1. Relational algebra: Find names of students enrolled in course C101.

pi_{name}(sigma_{cid='C101'}(Student JOIN Enrolment))

Example 2. A relation has schema R(A, B, C, D) with FDs: A -> B, B -> C, A -> D. What is the highest normal form?

Candidate key: A (since A -> B, B -> C, so A -> C, and A -> D). Is it in 3NF? A -> B: A is a super key. B -> C: B is not a super key and C is not prime. This is a transitive dependency. Not in 3NF. So the relation is in 2NF (no partial dependencies since the key is single-attribute, but has a transitive dependency).

Example 3. In a B+ tree of order 4 (max 4 pointers per node), what is the maximum number of keys a non-leaf node can hold?

Order 4 means max 4 pointers, so max 3 keys per non-leaf node.

Example 4. Which SQL query returns departments with more than 5 students?

SELECT dept, COUNT(*) AS cnt
FROM Student
GROUP BY dept
HAVING COUNT(*) > 5;

Example 5. A fact table for a retail star schema has dimensions: Product, Store, Date, Customer. The measure is sales_amount. What is the granularity if each row represents one transaction?

Granularity is at the transaction level, one row per sale, identified by (product_id, store_id, date_id, customer_id, transaction_id).

Quick-revision summary

  • ER: weak entity uses partial key + owner key. Participation (total/partial) is independent of cardinality., Candidate key is minimal; primary key is the chosen candidate key., 2NF: no partial FD on composite key. 3NF: no transitive FD. BCNF: every determinant is a super key., BCNF may not preserve FDs; 3NF always does., B+ tree: O(log n) all operations, supports range queries. Hash index: O(1) equality, no range., HAVING filters groups; WHERE filters rows., Star schema: one fact table + denormalised dimensions. Snowflake: normalised dimensions., Concept hierarchy enables drill-down and roll-up in OLAP.

How to study this unit

  1. Draw 3, 4 ER diagrams from scratch for realistic scenarios (library, hospital, university). Identify entity types, weak entities, cardinality, and participation.
  2. Practice writing relational algebra expressions for at least 10 queries. Translating English to algebra is a direct GATE skill.
  3. Write SQL for all standard patterns: GROUP BY/HAVING, self-joins, correlated subqueries, outer joins. Do 15 SQL exercises.
  4. Normalise at least 5 relations step by step: identify all FDs, find candidate keys, check each normal form, decompose.
  5. Understand B+ tree structure: how many keys per node, how search/insert/delete work. You do not need to implement it, know the operations and complexity.
  6. Study the star vs. snowflake schema difference with a concrete example. GATE DA has asked direct questions about fact tables, dimensions, and measures.

Test Your Knowledge

Quick MCQ check on this unit

Start Quiz →

Prefer watching over reading?

Subscribe for free.

Subscribe on YouTube