Class XII · Chapter 7 · NOT in current CBSE syllabus (2025-26) · 16 min read

Chapter 7: Understanding Data

CBSE Unit: NOT in current CBSE syllabus (2025-26)
Status: Supplementary, relevant to data science/AI trends in education
Priority: LOW for exam, but a good foundation for AI/data science content


Key Concepts

7.1 What is Data?

  • Data is a collection of characters, numbers, and other symbols that represent values
  • Singular: datum, Plural: data
  • Computers store data electronically, which allows faster processing than manual methods
  • The ICT revolution (computers, mobile, Internet) has led to the generation of large volumes of data at a very fast pace
  • Data by itself cannot help in decision making; it needs to be processed and analysed

Distinction: Data vs Information vs Knowledge

Term | Meaning | Example
Data | Raw, unprocessed facts | 85, 90, 78, 92, 88 (marks of 5 students)
Information | Processed data with meaning | Average marks = 86.6, Highest = 92
Knowledge | Understanding derived from information | Class performance is good; focus on weaker students
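
The step from data to information to knowledge can be illustrated with a minimal Python sketch using the marks from the table. The 80-mark threshold (`GOOD_AVERAGE`) is a hypothetical cut-off chosen only for illustration; it is not part of the chapter.

```python
# Data: raw marks of five students (from the table above)
marks = [85, 90, 78, 92, 88]

# Information: processed values that carry meaning
average = sum(marks) / len(marks)
highest = max(marks)

# Knowledge: a judgement drawn from the information.
# GOOD_AVERAGE is a hypothetical threshold used only for this example.
GOOD_AVERAGE = 80
verdict = "good" if average >= GOOD_AVERAGE else "needs attention"

print(f"Average = {average:.1f}, Highest = {highest}")  # Average = 86.6, Highest = 92
print(f"Class performance is {verdict}")
```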

7.2 Examples of Data

  • Personal data: name, age, gender, contact details
  • Transaction data: banking, shopping, ticketing (online or offline)
  • Media data: images (pixels), video (frames), audio, graphics, animations
  • Documents and web pages: text content, hyperlinks
  • Online posts: comments, messages, social media content
  • Sensor data: signals generated by IoT devices
  • Satellite data: meteorological data, communication data, earth observation data

7.3 Importance of Data

Data is crucial for decision making across various fields:

Domain | How Data is Used
College admissions | Placement data, faculty qualifications, fees, facilities
Government | Census data for planning and policy formulation
Sports | Analysing opponent team performances for strategy
Banking | Customer accounts, transactions, fraud detection
Elections | Electronic voting machines for recording and counting votes
Science | Recording experimental results, comparing outcomes
Pharmaceutical | Testing medicine effectiveness through clinical data
Libraries | Book inventory, membership management
Search engines | Analysing web data to provide relevant results
Weather | Satellite data analysis for forecasts and alerts
Business | Market analysis, customer feedback, dynamic pricing
Cab services | Demand-based dynamic pricing (surge pricing)
Restaurants | Sales data analysis for "happy hours" discounts

7.4 Types of Data

(A) Structured Data

  • Organised in a well-defined format (rows and columns)
  • Stored in tabular form: tables, databases, spreadsheets
  • Each column = attribute/parameter/variable
  • Each row = observation/record
  • Easy to process and analyse using standard tools

Example: Kitchen items inventory

ModelNo | ProductName | UnitPrice | Discount(%) | Items_in_Inventory
ABC1 | Water bottle | 126 | 8 | 13
ABC2 | Melamine Plates | 320 | 5 | 45
ABC3 | Dinner Set | 4200 | 10 | 8
GH67 | Jug | 80 | 0 | 10
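
Because every row has the same attributes, structured data like this inventory is easy to process in code. The sketch below stores each row as a dictionary; the helper `discounted_price` and the low-stock limit of 10 items are assumptions made for illustration.

```python
# Each row of the table becomes one record with the same set of attributes
inventory = [
    {"ModelNo": "ABC1", "ProductName": "Water bottle", "UnitPrice": 126, "Discount": 8, "Items_in_Inventory": 13},
    {"ModelNo": "ABC2", "ProductName": "Melamine Plates", "UnitPrice": 320, "Discount": 5, "Items_in_Inventory": 45},
    {"ModelNo": "ABC3", "ProductName": "Dinner Set", "UnitPrice": 4200, "Discount": 10, "Items_in_Inventory": 8},
    {"ModelNo": "GH67", "ProductName": "Jug", "UnitPrice": 80, "Discount": 0, "Items_in_Inventory": 10},
]

def discounted_price(item):
    """Price after applying the percentage discount (hypothetical helper)."""
    return item["UnitPrice"] * (100 - item["Discount"]) / 100

# LOW_STOCK_LIMIT is an assumed threshold, not taken from the chapter
LOW_STOCK_LIMIT = 10
low_stock = [item["ProductName"] for item in inventory
             if item["Items_in_Inventory"] <= LOW_STOCK_LIMIT]

print(f"{discounted_price(inventory[0]):.2f}")  # 115.92
print(low_stock)  # ['Dinner Set', 'Jug']
```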

More examples of structured data:

Entity/Activity | Data Fields (Attributes)
Books at a shop | BookTitle, Author, Price, YearOfPublication
School fees | StudentName, Class, RollNo, FeesAmount, DepositDate
ATM withdrawal | AccHolderName, AccountNo, TypeOfAcc, DateOfWithdrawal, AmountWithdrawn, ATMid

(B) Unstructured Data

  • No predefined format or fixed structure
  • Cannot be stored in a traditional row-and-column (tabular) format
  • Much harder to process and analyse than structured data
  • Examples: images, videos, audio files, emails, social media posts, web pages, news articles, business reports
  • A newspaper page has no fixed pattern: the number of images, articles, and ads differs each day
  • An email has no fixed structure: the number of lines, paragraphs, and attachments varies

Metadata: Unstructured data is often described using metadata (data about data).

  • Email metadata: subject, recipient, sender, date, attachment count
  • Image metadata: file size (KB/MB), image type (JPEG, PNG), resolution, date taken
  • When you click a photograph on your phone, metadata such as GPS location, date/time, and camera settings is recorded automatically
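
To make "data about data" concrete, here is a small sketch that reads the metadata of a file using Python's standard `os.stat`. The temporary text file and the keys of the `metadata` dictionary are illustrative choices; an image or email would carry richer metadata than this.

```python
import os
import tempfile
from datetime import datetime

# Create a small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Hello, data!")  # this is the content (the actual data)
    path = f.name

# os.stat returns information ABOUT the file, not its content
info = os.stat(path)
metadata = {
    "file_name": os.path.basename(path),
    "size_bytes": info.st_size,
    "last_modified": datetime.fromtimestamp(info.st_mtime).isoformat(timespec="seconds"),
}
print(metadata)

os.remove(path)  # clean up the temporary file
```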

(C) Semi-structured Data

  • Has some organizational properties but is not as rigid as structured data
  • Contains tags or markers to separate elements, but no strict tabular format
  • Examples: JSON, XML, HTML, email headers, log files
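
As an illustration of semi-structured data, the sketch below parses a short JSON text with Python's built-in `json` module. The student names, fields, and email address are invented for the example. Note that the two records do not share the same fields, which is what separates this from a strict table.

```python
import json

# Two records with keys as markers, but not the same set of fields
raw = """
[
  {"name": "Asha", "age": 17, "subjects": ["CS", "Maths"]},
  {"name": "Ravi", "age": 18, "email": "ravi@example.com"}
]
"""

students = json.loads(raw)  # JSON text -> list of Python dictionaries

for student in students:
    # .get() supplies a default when a field is absent from a record
    email = student.get("email", "not provided")
    print(student["name"], "-", email)
```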

Comparison Table:

Feature | Structured | Unstructured | Semi-structured
Format | Fixed (rows/columns) | No fixed format | Partially organized
Storage | Tables, databases | File systems, data lakes | JSON, XML files
Examples | Spreadsheets, SQL databases | Images, videos, emails | JSON, XML, HTML
Processing | Easy with SQL, spreadsheets | Needs special tools (NLP, CV) | Moderate difficulty
Volume (commonly cited estimate) | ~20% of all data | ~80% of all data | Varies

7.5 Data Collection

Data collection means identifying and gathering data from appropriate sources. The common methods are listed below.

Methods of Data Collection:

Method | Description | Example
Manual entry | Data available in diary/register, entered digitally | Shopkeeper enters sales from register into spreadsheet
Already digital | Data already in digital format | CSV file from previous system
Software-generated | Application collects data automatically | POS (Point of Sale) software recording each sale
Surveys/Questionnaires | Primary data collection from people | Google Forms survey for customer feedback
Web scraping | Extracting data from websites | Collecting product prices from e-commerce sites
Sensors/IoT | Automatic data generation by devices | Temperature sensors, fitness trackers
Social media | User-generated content | Posts, comments, likes, shares
Existing databases | Secondary data from organizations | World Bank, IMF economic data

Real-world data collection scenarios:

  • Hospitals collect patient data to improve services
  • Shopping malls track items purchased (discovering patterns like "bedsheets and groceries are frequently bought together")
  • Political analysts analyse social media posts to gauge public opinion
  • The World Bank and IMF collect economic data from countries for forecasting

7.6 Data Storage

  • Process of storing data on storage devices for future retrieval and use
  • Huge volumes of data are generated at very high rates, so storage is a challenge
  • The decreasing cost of digital storage has simplified this task

Common Storage Devices:

Device | Type | Typical Capacity
Hard Disk Drive (HDD) | Magnetic | 500 GB to 20 TB
Solid State Drive (SSD) | Flash memory | 128 GB to 8 TB
CD/DVD | Optical | 700 MB / 4.7-8.5 GB
Pen Drive (USB) | Flash memory | 8 GB to 512 GB
Memory Card | Flash memory | 16 GB to 1 TB
Tape Drive | Magnetic | Up to 30 TB
Cloud Storage | Network-based | Virtually unlimited

Storage formats:

  • Files: images, documents, audio/video stored as individual files
  • CSV files: comma-separated values for tabular data
  • Databases (DBMS): structured storage with efficient retrieval, overcomes limitations of simple file processing

7.7 Data Processing

Data processing converts raw data into meaningful information.

Steps in Data Processing:

  1. Data Collection - gather raw data
  2. Data Preparation/Entry - enter/import data into digital format
  3. Data Classification - organize/categorize data
  4. Processing - apply computations, calculations, transformations
  5. Storage - store processed data for future retrieval
  6. Output - generate results as reports, charts, tables

Data Processing Cycle:

Raw Data (Input) --> Processing --> Information (Output)
                        |
                    Store/Retrieve

Real-world Data Processing Examples:

Scenario | Input | Processing | Output
Exam admit card | Student details, photo, fees | Verify eligibility, generate roll number | Admit card with center details
ATM withdrawal | PIN, account type, amount | Verify PIN, check balance, deduct | Cash + receipt
Train ticket | Journey details, passenger info | Check availability, allocate berth | Ticket with PNR, berth number

Data Cleaning (Pre-processing): Before analysis, data often needs cleaning:

  • Remove duplicates: same record entered twice
  • Handle missing values: fill in or remove incomplete records
  • Fix errors: typos, incorrect entries
  • Standardize formats: dates (DD/MM/YYYY vs MM/DD/YYYY), units
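
The cleaning steps above could look like the following sketch in plain Python. The records, names, and the two accepted date formats are made up for illustration, and dropping incomplete records is only one of the possible ways to handle missing values.

```python
from datetime import datetime

# Hypothetical raw records with typical problems
raw_records = [
    {"name": "Asha", "marks": "85", "date": "05/03/2025"},
    {"name": "Asha", "marks": "85", "date": "05/03/2025"},   # duplicate entry
    {"name": "ravi ", "marks": "", "date": "2025-03-06"},    # missing marks
    {"name": "meena", "marks": "92", "date": "07/03/2025"},  # inconsistent case
]

def standardise_date(text):
    """Convert DD/MM/YYYY or YYYY-MM-DD text to YYYY-MM-DD; None if unreadable."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

cleaned = []
seen = set()
for record in raw_records:
    if record["marks"] == "":
        continue                              # handle missing values: drop the record
    name = record["name"].strip().title()     # fix errors: stray spaces, letter case
    row = (name, int(record["marks"]), standardise_date(record["date"]))
    if row in seen:
        continue                              # remove duplicates
    seen.add(row)
    cleaned.append(row)

print(cleaned)  # [('Asha', 85, '2025-03-05'), ('Meena', 92, '2025-03-07')]
```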

7.8 Statistical Techniques for Data Processing

Statistical techniques help us summarise and understand data. They are divided into measures of central tendency and measures of variability.

7.8.1 Measures of Central Tendency

A measure of central tendency is a single value that gives us some idea about the centre of the data.

(A) Mean (Average)

  • Sum of all values divided by the number of values
  • Formula: Mean = (x1 + x2 + ... + xn) / n
  • Sensitive to outliers - one extreme value can significantly change the mean
Example: Heights (in cm) = [90, 102, 110, 115, 85, 90, 100, 110, 110]
Mean = (90 + 102 + 110 + 115 + 85 + 90 + 100 + 110 + 110) / 9
     = 912 / 9
     = 101.33 cm

Effect of outliers on mean:

Original data:  [10, 12, 14, 11, 13]    Mean = 12.0
With outlier:   [10, 12, 14, 11, 13, 100]  Mean = 26.67  (misleading!)

The outlier (100) drastically changes the mean. Remove outliers before computing the mean, or use the median instead.

Python code to calculate mean:

data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
mean = sum(data) / len(data)
print(f"Mean = {mean:.2f}")  # Mean = 101.33

(B) Median (Middle Value)

  • When all values are sorted in ascending/descending order, the middle value is the median
  • Odd number of values: median = middle value
  • Even number of values: median = average of two middle values
  • Not affected by outliers - better than mean for skewed data
Example (odd count):
Sorted: [85, 90, 90, 100, 102, 110, 110, 110, 115]  (9 values)
Median = value at position 5 = 102 cm

Example (even count):
Data: [3, 7, 8, 12, 14, 18]  (6 values)
Median = (8 + 12) / 2 = 10.0

Python code to calculate median:

data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
data_sorted = sorted(data)  # the median requires sorted data
n = len(data_sorted)
if n % 2 == 1:
    # odd count: take the single middle value
    median = data_sorted[n // 2]
else:
    # even count: average the two middle values
    median = (data_sorted[n // 2 - 1] + data_sorted[n // 2]) / 2
print(f"Median = {median}")  # Median = 102

(C) Mode (Most Frequent)

  • Value that appears the greatest number of times in the data
  • A dataset can have no mode (all values unique), one mode, or multiple modes
  • Can be found for both numeric and non-numeric data (e.g., most popular car colour)
Example:
Heights: [85, 90, 90, 100, 102, 110, 110, 110, 115]
Mode = 110 (appears 3 times, the highest frequency)

Example (no mode):
Data: [5, 8, 12, 3, 7]   (each value appears once, so there is no mode)

Example (multiple modes):
Data: [1, 2, 2, 3, 3, 4]  (both 2 and 3 appear twice, so the data is bimodal)

Python code to calculate mode:

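from collections import Counter

data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
freq = Counter(data)  # maps each value to the number of times it occurs
max_count = max(freq.values())
# keep every value that reaches the highest frequency (handles multiple modes)
modes = [val for val, count in freq.items() if count == max_count]
print(f"Mode = {modes}")  # Mode = [110]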

When to use which measure:

Measure | Best Used When | Not Good For
Mean | Data is evenly distributed, no extreme values | Data with outliers
Median | Data has outliers or is skewed | Categorical data
Mode | Finding most common/popular value | Data where all values are unique

7.8.2 Measures of Variability (Dispersion)

Measures of variability describe the spread or variation of values around the mean. Two datasets can have the same mean but very different spreads.

(A) Range

  • Difference between maximum and minimum values
  • Formula: Range = Maximum - Minimum
  • Calculated only for numerical data
  • Tells about the coverage/spread of data
  • Sensitive to outliers (uses only two extreme values)
Example:
Heights: [85, 90, 90, 100, 102, 110, 110, 110, 115]
Range = 115 - 85 = 30 cm

Salaries: [25000, 28000, 30000, 35000, 500000]
Range = 500000 - 25000 = 475000 (misleading due to outlier!)
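
Python code to calculate range (a minimal sketch; `value_range` is a name chosen here to avoid shadowing Python's built-in `range`):

```python
def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

heights = [85, 90, 90, 100, 102, 110, 110, 110, 115]
salaries = [25000, 28000, 30000, 35000, 500000]

print(value_range(heights))        # 30
print(value_range(salaries))       # 475000
# Without the outlier (the last salary), the range is far smaller
print(value_range(salaries[:-1]))  # 10000
```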

(B) Standard Deviation

  • Measures spread of data using all values (not just the extremes, as Range does)
  • Calculated as the positive square root of the average of squared differences from the mean
  • Smaller SD = data is closely clustered around mean
  • Larger SD = data is widely spread

Formula:

SD (sigma) = sqrt( sum((xi - mean)^2) / n )

Step-by-step calculation:

Heights: [90, 102, 110, 115, 85, 90, 100, 110, 110], Mean = 101.33

Height (x) | x - mean | (x - mean)^2
90 | -11.33 | 128.37
102 | 0.67 | 0.45
110 | 8.67 | 75.17
115 | 13.67 | 186.87
85 | -16.33 | 266.67
90 | -11.33 | 128.37
100 | -1.33 | 1.77
110 | 8.67 | 75.17
110 | 8.67 | 75.17
Total | ~0 | 938.00

SD = sqrt(938.00 / 9) = sqrt(104.22) = 10.21 cm

Python code to calculate standard deviation:

import math

data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]
variance = sum(squared_diffs) / len(data)
std_dev = math.sqrt(variance)
print(f"Standard Deviation = {std_dev:.2f}")  # Standard Deviation = 10.21

Comparison of Range and Standard Deviation:

Feature | Range | Standard Deviation
Uses | Only max and min values | All values
Sensitivity to outliers | Very sensitive | Less sensitive
Information | Basic spread | Detailed spread
Calculation | Simple (max - min) | Complex (involves mean, squares, sqrt)
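
The formula in this chapter divides by n, which is usually called the population standard deviation. Python's `statistics` module offers it as `pstdev`; the related function `stdev` divides by n - 1 (the sample standard deviation) and therefore gives a slightly larger value. The sketch below shows both on the heights data, so a mismatch with a calculator or library result is easier to diagnose.

```python
import statistics

heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]

# Divides by n: matches the formula and the step-by-step table in this chapter
population_sd = statistics.pstdev(heights)

# Divides by n - 1: a different convention, not the one used in this chapter
sample_sd = statistics.stdev(heights)

print(f"{population_sd:.2f}")  # 10.21
print(f"{sample_sd:.2f}")      # 10.83
```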

7.9 Choosing the Right Statistical Technique

Problem Statement | Suitable Technique
Disparity in salaries of all employees | Standard Deviation or Range
Average performance of a class in a test | Mean
Compare height of residents of two cities | Standard Deviation
Find the dominant value from a set | Mode
Compare income of residents of two cities | Standard Deviation
Find popular car colour in a city | Mode
Middle value of exam scores | Median
Spread of temperature readings | Range or Standard Deviation
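
To see why standard deviation is suggested for comparing two cities, consider the sketch below. Both lists of heights are invented for illustration and were chosen to have the same mean, so the mean alone cannot tell the cities apart, while the standard deviation can.

```python
import statistics

# Hypothetical heights (in cm) of five residents from each city
city_a = [160, 162, 165, 168, 170]
city_b = [140, 155, 165, 175, 190]

mean_a, mean_b = statistics.mean(city_a), statistics.mean(city_b)
sd_a, sd_b = statistics.pstdev(city_a), statistics.pstdev(city_b)

print(mean_a, mean_b)              # 165 165
print(f"{sd_a:.2f}, {sd_b:.2f}")   # 3.69, 17.03
```

The equal means hide the fact that heights in city B vary far more than in city A.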

7.10 Data Visualization

  • Bar charts: compare categories (e.g., sales of different products)
  • Pie charts: show proportions (e.g., percentage of market share)
  • Line graphs: show trends over time (e.g., temperature over a week)
  • Histograms: show frequency distribution (e.g., marks distribution)
  • Visualization helps identify patterns, trends, and outliers
  • Tools: matplotlib (Python), Excel, Tableau
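
The idea behind a histogram can be shown even without a plotting library. The sketch below groups invented marks into bins of width 10 and prints one `#` per student; a tool such as matplotlib would draw the same counts as bars.

```python
from collections import Counter

# Hypothetical marks of 13 students
marks = [45, 52, 58, 61, 64, 67, 70, 72, 75, 78, 81, 88, 93]

# Map each mark to the start of its bin: 45 -> 40, 52 -> 50, and so on
bins = Counter((mark // 10) * 10 for mark in marks)

# Print a text histogram, one row per bin
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'#' * bins[start]}")
```

The tallest row (70-79, with four students) shows where most marks cluster.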

Important Definitions

# | Term | Definition
1 | Data | Collection of raw facts, numbers, characters, symbols
2 | Information | Processed data that has meaning and context
3 | Knowledge | Understanding derived from analysing information
4 | Structured data | Data organized in rows and columns (tabular format)
5 | Unstructured data | Data without predefined format (images, videos, text)
6 | Semi-structured data | Data with some organizational properties (JSON, XML)
7 | Metadata | Data about data (e.g., image file size, email subject line)
8 | Data processing | Converting raw data into meaningful information
9 | Census | Systematic collection and recording of population data
10 | Outlier | Exceptionally large or small value that can distort analysis
11 | Mean | Average of all values (sum / count)
12 | Median | Middle value when data is sorted
13 | Mode | Most frequently occurring value
14 | Range | Difference between maximum and minimum values
15 | Standard deviation | Measure of spread: square root of the average squared deviation from the mean
16 | Measure of central tendency | Single value representing the centre of data (mean, median, mode)
17 | Measure of variability | Value indicating spread of data (range, standard deviation)

Why This Chapter Matters

Even though not in the current CBSE syllabus, this chapter connects to:

  • AI/ML curriculum being introduced in many CBSE schools
  • Data Science as an elective subject in Class XI/XII
  • NEP 2020 emphasis on computational thinking and data literacy
  • Foundation for understanding pandas, NumPy, and data analysis in Python

Practice Problems

  1. Identify the type of data (structured/unstructured/semi-structured):
     a) Recording a video: unstructured
     b) Marking attendance in a register: structured
     c) Writing tweets: unstructured
     d) Filling an online application form: structured
     e) An XML configuration file: semi-structured

  2. Temperature (in Celsius) of 7 days: 34, 34, 27, 28, 27, 34, 34
     a) Mean = (34+34+27+28+27+34+34)/7 = 218/7 = 31.14
     b) Range = 34 - 27 = 7
     c) Mode = 34 (appears 4 times)
     d) Median: sorted data = [27, 27, 28, 34, 34, 34, 34], so median = 34 (4th value)

  3. Write Python code to compute mean, median, mode, and standard deviation for a given list of numbers.

  4. Differentiate between structured and unstructured data with examples.

  5. Explain the data processing cycle with a real-world example.

  6. Why is mean not suitable when data has outliers? Which measure should be used instead?
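
One possible sketch for problem 3, checked against the answers to problem 2, is shown below. The function name `summarise` and the choice to return an empty list when every value is unique (no mode) are assumptions of this sketch, and the standard deviation uses the divide-by-n formula from this chapter.

```python
import math
from collections import Counter

def summarise(values):
    """Return mean, median, modes, range and standard deviation of a list."""
    n = len(values)
    ordered = sorted(values)          # median needs sorted data
    mean = sum(values) / n

    mid = n // 2
    if n % 2 == 1:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2

    freq = Counter(values)
    top = max(freq.values())
    # If every value occurs once, report no mode (a choice made in this sketch)
    modes = [] if top == 1 else sorted(v for v, c in freq.items() if c == top)

    # Population standard deviation: divide by n, as in this chapter
    std_dev = math.sqrt(sum((x - mean) ** 2 for x in values) / n)

    return {"mean": mean, "median": median, "modes": modes,
            "range": ordered[-1] - ordered[0], "std_dev": std_dev}

temperatures = [34, 34, 27, 28, 27, 34, 34]
result = summarise(temperatures)
print(f"Mean = {result['mean']:.2f}")   # Mean = 31.14
print(f"Median = {result['median']}")   # Median = 34
print(f"Mode = {result['modes']}")      # Mode = [34]
print(f"Range = {result['range']}")     # Range = 7
```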


Key Points Students Miss

  1. Data is NOT the same as information - data is raw; information is processed data with meaning
  2. Mean is affected by outliers, median is not - choose wisely
  3. Mode can work on non-numeric data (e.g., favourite colour) but mean and median cannot
  4. Standard deviation uses ALL values while range uses only two extreme values
  5. Metadata is data about data - not the actual content, but its description
  6. About 80% of the world's data is often estimated to be unstructured - images, videos, emails dominate
  7. A dataset can have no mode, one mode, or multiple modes
  8. When computing median, you must sort the data first

Board Exam Tips

  1. For calculation questions, show all intermediate steps, not just the final answer
  2. When asked "which statistical technique to use", always justify your choice
  3. Know the formulas for mean, median, mode, range, and standard deviation
  4. For "differentiate" questions about data types, always include examples with each type
  5. The data processing cycle diagram (Input -> Processing -> Output with Storage) is commonly asked
