Unit 1 · 7 min read

Probability & Statistics

Counting, axioms, conditional probability, distributions, CLT, hypothesis testing.


Why this unit matters

Probability and statistics show up directly in about 15-18% of GATE DA questions, and indirectly in machine learning, AI inference, and data preprocessing topics. If you can reason about randomness precisely (distinguishing independence from mutual exclusivity, reading a Bayes' theorem problem without panic, computing a confidence interval from scratch), you have a huge advantage over students who treat this unit as secondary. Every ML algorithm assumes something about data distributions. Understanding those assumptions starts here.

Syllabus map

  • Counting: permutations, combinations, multinomial
  • Probability foundations: axioms, sample space, events
  • Event relationships: independent, mutually exclusive, complementary
  • Multi-variable probability: marginal, conditional, joint; Bayes' theorem
  • Descriptive statistics: mean, median, mode, standard deviation, correlation, covariance
  • Random variables: discrete and continuous; CDF, PDF
  • Named distributions: uniform, Bernoulli, binomial, Poisson, exponential, normal, t, chi-squared
  • Limit theorems: CLT, LLN
  • Inference: confidence intervals, z-test, t-test, chi-squared test

Counting: permutations and combinations

Permutations count ordered arrangements; combinations count unordered selections.

P(n, r) = n! / (n - r)! (ordered, without replacement)
C(n, r) = n! / (r! * (n - r)!) (unordered, without replacement)

Common trap. Students confuse "arrangements" with "selections". If the problem says "how many ways to choose a committee", order does not matter: use C. If it says "how many ways to assign roles", order matters: use P.

Worked example. A password uses 3 distinct digits from {0,1,...,9}. How many passwords are possible?

Answer: P(10, 3) = 10 * 9 * 8 = 720. Order matters because "123" and "321" are different passwords.
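A quick standard-library check of both counting formulas (the password scenario is the one from the example above):

```python
import math

# Ordered selection of 3 distinct digits from 10: permutations
print(math.perm(10, 3))  # 720, matching 10 * 9 * 8

# If order did not matter (choosing 3 digits as a set): combinations
print(math.comb(10, 3))  # 120
```

`math.perm` and `math.comb` (Python 3.8+) are handy for verifying counting answers during practice.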

Probability axioms and event relationships

The three axioms (Kolmogorov):

  1. P(A) >= 0 for all events A.
  2. P(sample space) = 1.
  3. For mutually exclusive A and B: P(A union B) = P(A) + P(B).

For non-exclusive events: P(A union B) = P(A) + P(B) - P(A intersection B).

Independent vs. mutually exclusive. These are different ideas that students regularly confuse.

  • Independent: P(A intersection B) = P(A) * P(B). Knowing A happened tells you nothing about B.
  • Mutually exclusive: P(A intersection B) = 0. They cannot both happen.

Two non-trivial events cannot be both independent and mutually exclusive (if they are mutually exclusive, P(A intersection B) = 0, but P(A) * P(B) > 0 for non-trivial events, a contradiction).
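A small sketch to make the distinction concrete, using my own illustrative events on two fair dice (enumerating the 36-outcome sample space with exact fractions):

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))  # all 36 outcomes of two fair dice

def prob(event):
    # Exact probability: favourable outcomes over total outcomes
    return Fraction(sum(event(o) for o in space), len(space))

A = lambda o: o[0] % 2 == 0      # first die is even
B = lambda o: o[0] + o[1] == 7   # sum is 7

pA, pB = prob(A), prob(B)
pAB = prob(lambda o: A(o) and B(o))

print(pAB == pA * pB)  # True: A and B are independent
print(pAB == 0)        # False: they are NOT mutually exclusive
```

Here A and B are independent yet overlap; mutually exclusive events (say, "sum is 7" and "sum is 11") would instead give pAB = 0 while pA * pB > 0.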

Conditional probability and Bayes' theorem

Conditional probability: P(A | B) = P(A intersection B) / P(B).

Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B).

The denominator P(B) is expanded using total probability: P(B) = sum over all partitions Ai of P(B | Ai) * P(Ai).

Classic GATE-style trap. A disease affects 1% of a population. A test is 99% sensitive (detects true positives) and 99% specific (correctly rejects negatives). You test positive. What is the probability you have the disease?

P(disease | positive) = P(positive | disease) * P(disease) / P(positive) = 0.99 * 0.01 / (0.99 * 0.01 + 0.01 * 0.99) = 0.0099 / 0.0198 = 0.5

Only 50%, not 99%. This is the base-rate fallacy. When prevalence is low, even a highly accurate test produces many false positives.
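The base-rate effect is easy to explore numerically. A minimal sketch of the Bayes computation above, with prevalence as a parameter (the 10% value is my own illustration, not from the problem):

```python
def posterior(prevalence, sensitivity=0.99, specificity=0.99):
    """P(disease | positive test) via Bayes' theorem with total probability."""
    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / p_positive

print(round(posterior(0.01), 3))  # 0.5   -- the worked example above
print(round(posterior(0.10), 3))  # 0.917 -- higher prevalence, much higher posterior
```

Varying the prevalence while holding the test's accuracy fixed shows why the posterior, not the test accuracy, answers the question.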

Descriptive statistics

For a dataset x1, x2, ..., xn:

  • Mean (mu) = sum(xi) / n
  • Variance = sum((xi - mu)^2) / n (population); divide by n-1 for sample variance
  • Standard deviation = sqrt(variance)
  • Covariance(X, Y) = E[(X - mu_X)(Y - mu_Y)]
  • Correlation = Covariance(X, Y) / (std_X * std_Y)

Correlation is bounded: -1 <= r <= 1. Zero correlation does not mean independence (only for jointly normal variables does it imply independence).
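These definitions map directly onto the standard library. A sketch with made-up data (the values of x and y are my own illustration):

```python
import statistics as st

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(st.mean(x))       # 5.0
print(st.pvariance(x))  # 4.0 (population: divide by n)
print(st.variance(x))   # ~4.571 (sample: divide by n-1)
print(st.pstdev(x))     # 2.0

# Covariance and correlation straight from the definitions above
y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 11.0]
mx, my = st.mean(x), st.mean(y)
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
corr = cov / (st.pstdev(x) * st.pstdev(y))
print(-1 <= corr <= 1)  # True: correlation is always in [-1, 1]
```

Note the n vs. n-1 distinction: `pvariance` implements the population formula, `variance` the sample formula.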

Distributions you must know cold

Binomial B(n, p). n independent Bernoulli trials, each with success probability p. P(X = k) = C(n, k) * p^k * (1-p)^(n-k) Mean = np, Variance = np(1-p).

Poisson(lambda). Counts of rare events in a fixed interval when n is large and p is small. P(X = k) = e^(-lambda) * lambda^k / k! Mean = Variance = lambda. This equality is a quick check in exams.

Exponential(lambda). Time between Poisson events. Memoryless property: P(X > s + t | X > s) = P(X > t). Mean = 1/lambda, Variance = 1/lambda^2.

Normal N(mu, sigma^2). Symmetric, bell-shaped. Standard normal Z = (X - mu) / sigma. About 68% of values fall within 1 sigma of the mean, 95% within 2 sigma, 99.7% within 3 sigma.

t-distribution. Used when population std dev is unknown and sample is small. Heavier tails than normal; approaches normal as degrees of freedom increase.

Chi-squared. Sum of squares of k independent standard normals. Used in goodness-of-fit and contingency table tests.

Central Limit Theorem

If X1, X2, ..., Xn are i.i.d. with mean mu and variance sigma^2, then the sample mean X_bar has:

X_bar ~ N(mu, sigma^2 / n) approximately, for large n.

The CLT does not require the underlying distribution to be normal. This is why z-tests are valid for large samples regardless of the original distribution.
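A simulation sketch of this claim, using the (heavily skewed) exponential distribution as the parent; the sample size and trial count are my own choices:

```python
import random
import statistics as st

random.seed(0)
lam, n, trials = 1.0, 50, 2000

# Each trial: the mean of n exponential draws. CLT predicts these means
# cluster around 1/lam = 1 with variance (1/lam^2)/n = 0.02,
# despite the skewed parent distribution.
means = [st.mean(random.expovariate(lam) for _ in range(n)) for _ in range(trials)]
print(round(st.mean(means), 2))       # close to 1.0
print(round(st.pvariance(means), 3))  # close to 1/50 = 0.02
```

Swapping in any other parent distribution with finite variance gives the same pattern, which is exactly the content of the theorem.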

Hypothesis testing in brief

  1. State null hypothesis H0 and alternative H1.
  2. Choose significance level alpha (commonly 0.05).
  3. Compute test statistic.
  4. Reject H0 if test statistic falls in the rejection region (|z| > z_critical for two-tailed).
  • z-test: population variance known, or n >= 30.
  • t-test: population variance unknown, small n.
  • chi-squared test: categorical data (goodness of fit, independence of attributes).

Common trap. A p-value of 0.03 does not mean the probability that H0 is true is 3%. It means: if H0 were true, there is a 3% chance of observing data at least as extreme as what you got.
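The four steps above can be sketched end to end for a z-test; the numbers (mu0 = 100, sigma = 15, xbar = 104, n = 36) are my own illustration:

```python
import math
from statistics import NormalDist

# Two-tailed z-test: H0: mu = 100 vs H1: mu != 100, sigma known
mu0, sigma, xbar, n = 100.0, 15.0, 104.0, 36

z = (xbar - mu0) / (sigma / math.sqrt(n))     # test statistic: 1.6
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value, ~0.11

print(round(z, 2), round(p_value, 3))
print(p_value < 0.05)  # False: fail to reject H0 at alpha = 0.05
```

Note that "fail to reject" is the correct conclusion here, and, per the trap above, the p-value of ~0.11 is P(data at least this extreme | H0), not P(H0).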

Worked examples

Example 1. X ~ Poisson(3). Find P(X = 2).

P(X = 2) = e^(-3) * 3^2 / 2! = e^(-3) * 9 / 2 = 9 * 0.0498 / 2 ≈ 0.224.

Example 2. X and Y are independent. E[X] = 2, E[Y] = 3, Var(X) = 4, Var(Y) = 5. Find Var(2X - Y + 1).

Var(aX + bY + c) = a^2 * Var(X) + b^2 * Var(Y) when X and Y are independent. Here a = 2, b = -1: Var(2X - Y + 1) = 4 * 4 + 1 * 5 = 16 + 5 = 21.

Note: constants (like +1) do not affect variance.

Example 3. A 95% confidence interval for the mean of a normal population (sigma = 10) is computed from n = 100 samples. Find the margin of error.

z_0.025 = 1.96. Margin = 1.96 * (10 / sqrt(100)) = 1.96 * 1 = 1.96.

Example 4. Correlation between X and Y is 0.6. Var(X) = 9, Var(Y) = 16. Find Cov(X, Y).

Cov(X, Y) = r * std_X * std_Y = 0.6 * 3 * 4 = 7.2.

Example 5. A fair coin is tossed 4 times. What is the probability of getting exactly 3 heads?

X ~ B(4, 0.5). P(X = 3) = C(4,3) * 0.5^3 * 0.5^1 = 4 * 0.125 * 0.5 = 0.25.
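Several of the worked examples above can be replayed with the standard library, which is a useful self-check routine during revision:

```python
import math
from statistics import NormalDist

# Example 1: Poisson(3), P(X = 2) = e^(-3) * 3^2 / 2!
print(round(math.exp(-3) * 9 / 2, 3))  # 0.224

# Example 3: 95% CI margin of error, sigma = 10, n = 100
z = NormalDist().inv_cdf(0.975)        # ~1.96
print(round(z * 10 / math.sqrt(100), 2))  # 1.96

# Example 5: B(4, 0.5), P(X = 3) = C(4,3) * 0.5^4
print(math.comb(4, 3) * 0.5**4)  # 0.25
```

`NormalDist().inv_cdf(0.975)` returns the exact z-critical value that the 1.96 in Example 3 approximates.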

Quick-revision summary

  • P(A union B) = P(A) + P(B) - P(A intersect B). Add back the overlap.
  • Independent: P(AB) = P(A)P(B). Mutually exclusive: P(AB) = 0. Never both (for non-trivial events).
  • Bayes' theorem expresses posterior = likelihood * prior / marginal.
  • Poisson mean = variance = lambda. Exponential is memoryless.
  • CLT: X_bar ~ N(mu, sigma^2/n) for large n, regardless of the original distribution.
  • Correlation = 0 does not imply independence (except for jointly normal variables).
  • t-test for unknown population variance / small n; z-test otherwise.
  • p-value is not P(H0 is true). It is P(data at least this extreme | H0).

How to study this unit

  1. Start with counting (2 days): work through 10 permutation/combination problems until you never confuse ordered vs. unordered selection again.
  2. Build the Bayes' theorem intuition with the disease-test example: vary the prevalence and watch the posterior change dramatically.
  3. Memorise the mean and variance formulas for all named distributions. Flash-card them. GATE regularly asks you to compute these directly.
  4. Do at least 5 CLT problems where you are asked to find the probability that a sample mean falls in some range.
  5. Practice writing out a full hypothesis test (state hypotheses, compute test statistic, compare to critical value) for z, t, and chi-squared scenarios.
  6. In the last week before the exam, solve 2-3 previous GATE DA papers covering this unit under time pressure.
