Chapter 7: Understanding Data

CBSE Unit: NOT in current CBSE syllabus (2025-26)
Status: Supplementary, relevant to data science/AI trends in education
Priority: LOW for exam, but good foundation for AI/data science content

Key Concepts

7.1 What is Data?

Data is a collection of characters, numbers, and other symbols that represent values
Singular: datum, Plural: data
Computers store data electronically for faster processing compared to manual methods
The ICT revolution (computers, mobile, Internet) has led to generation of large volumes of data at a very fast pace
Data by itself cannot help in decision making; it needs to be processed and analysed

Distinction: Data vs Information vs Knowledge

Term	Meaning	Example
Data	Raw, unprocessed facts	85, 90, 78, 92, 88 (marks of 5 students)
Information	Processed data with meaning	Average marks = 86.6, Highest = 92
Knowledge	Understanding derived from information	Class performance is good; focus on weaker students
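The step from data to information to knowledge in the table above can be sketched in a few lines of Python. The marks come from the table; PASS_MARK is an assumed threshold added only for illustration.

```python
# Raw data: marks of 5 students (from the table above)
marks = [85, 90, 78, 92, 88]

# Information: processed data with meaning
average = sum(marks) / len(marks)
highest = max(marks)
print(f"Average marks = {average:.1f}, Highest = {highest}")  # Average marks = 86.6, Highest = 92

# Knowledge: a judgement drawn from the information.
# PASS_MARK is an assumed threshold, not part of the original example.
PASS_MARK = 80
needs_support = [m for m in marks if m < PASS_MARK]
print(f"Marks needing attention: {needs_support}")  # Marks needing attention: [78]
```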

7.2 Examples of Data

Personal data: name, age, gender, contact details
Transaction data: banking, shopping, ticketing (online or offline)
Media data: images (pixels), video (frames), audio, graphics, animations
Documents and web pages: text content, hyperlinks
Online posts: comments, messages, social media content
Sensor data: signals generated by IoT devices
Satellite data: meteorological data, communication data, earth observation data

7.3 Importance of Data

Data is crucial for decision making across various fields:

Domain	How Data is Used
College admissions	Placement data, faculty qualifications, fees, facilities
Government	Census data for planning and policy formulation
Sports	Analysing opponent team performances for strategy
Banking	Customer accounts, transactions, fraud detection
Elections	Electronic voting machines for recording and counting votes
Science	Recording experimental results, comparing outcomes
Pharmaceutical	Testing medicine effectiveness through clinical data
Libraries	Book inventory, membership management
Search engines	Analysing web data to provide relevant results
Weather	Satellite data analysis for forecasts and alerts
Business	Market analysis, customer feedback, dynamic pricing
Cab services	Demand-based dynamic pricing (surge pricing)
Restaurants	Sales data analysis for "happy hours" discounts

7.4 Types of Data

(A) Structured Data

Organised in a well-defined format (rows and columns)
Stored in tabular format: tables, databases, spreadsheets
Each column = attribute/parameter/variable
Each row = observation/record
Easy to process and analyse using standard tools

Example: Kitchen items inventory

ModelNo	ProductName	UnitPrice	Discount(%)	Items_in_Inventory
ABC1	Water bottle	126	8	13
ABC2	Melamine Plates	320	5	45
ABC3	Dinner Set	4200	10	8
GH67	Jug	80	0	10
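Because every row has the same columns, structured data is easy to process in code. A minimal sketch, storing each row of the inventory table as a dictionary; the key names are shortened here, and LOW_STOCK_LIMIT is an assumed value used only for illustration.

```python
# Each row (record) is a dictionary; each key is a column (attribute).
inventory = [
    {"ModelNo": "ABC1", "ProductName": "Water bottle", "UnitPrice": 126, "Discount": 8, "Items": 13},
    {"ModelNo": "ABC2", "ProductName": "Melamine Plates", "UnitPrice": 320, "Discount": 5, "Items": 45},
    {"ModelNo": "ABC3", "ProductName": "Dinner Set", "UnitPrice": 4200, "Discount": 10, "Items": 8},
    {"ModelNo": "GH67", "ProductName": "Jug", "UnitPrice": 80, "Discount": 0, "Items": 10},
]

def discounted_price(item):
    """Unit price after applying the percentage discount."""
    return item["UnitPrice"] * (100 - item["Discount"]) / 100

for item in inventory:
    print(f"{item['ProductName']}: {discounted_price(item):.2f}")

# LOW_STOCK_LIMIT is an assumed threshold for this example.
LOW_STOCK_LIMIT = 10
low_stock = [item["ProductName"] for item in inventory if item["Items"] <= LOW_STOCK_LIMIT]
print(f"Low stock: {low_stock}")  # Low stock: ['Dinner Set', 'Jug']
```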

More examples of structured data:

Entity/Activity	Data Fields (Attributes)
Books at a shop	BookTitle, Author, Price, YearOfPublication
School fees	StudentName, Class, RollNo, FeesAmount, DepositDate
ATM withdrawal	AccHolderName, AccountNo, TypeOfAcc, DateOfWithdrawal, AmountWithdrawn, ATMid

(B) Unstructured Data

No predefined format or fixed structure
Cannot be stored in traditional row-and-column (tabular) format
Much harder to process and analyse than structured data
Examples: images, videos, audio files, emails, social media posts, web pages, news articles, business reports
A newspaper page has no fixed pattern: different number of images, articles, ads each day
An email has no fixed structure: varying number of lines, paragraphs, attachments

Metadata: Unstructured data is often described using metadata (data about data).
Email metadata: subject, recipient, sender, date, attachment count
Image metadata: file size (KB/MB), image type (JPEG, PNG), resolution, date taken
When you click a photograph on your phone, metadata like GPS location, date/time, and camera settings is automatically recorded
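File metadata can be read without opening the content itself. A minimal sketch using Python's standard os.stat; the temporary text file and the dictionary keys are assumptions made so the example runs on its own.

```python
import os
import tempfile
from datetime import datetime

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Hello, data!")
    path = f.name

info = os.stat(path)  # metadata kept by the file system
metadata = {
    "file_type": os.path.splitext(path)[1],
    "size_bytes": info.st_size,
    "last_modified": datetime.fromtimestamp(info.st_mtime).isoformat(timespec="seconds"),
}
print(metadata)

os.remove(path)  # clean up the temporary file
```

The file's content ("Hello, data!") is the data; its type, size, and modification time are the metadata.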

(C) Semi-structured Data

Has some organizational properties but not as rigid as structured data
Contains tags or markers to separate elements, but no strict tabular format
Examples: JSON, XML, HTML, email headers, log files
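A short JSON example may make the idea concrete: keys act as markers, but records need not share the same fields. The student records below are made up for illustration.

```python
import json

# Two records with different fields: allowed in JSON, awkward in a strict table.
text = """
[
  {"name": "Asha", "class": 11, "subjects": ["CS", "Maths"]},
  {"name": "Ravi", "class": 11, "email": "ravi@example.com"}
]
"""

students = json.loads(text)  # parse JSON text into Python lists and dictionaries
for s in students:
    # .get() copes with fields that are missing from some records
    print(s["name"], s.get("subjects", "no subjects listed"))

names = [s["name"] for s in students]
```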

Comparison Table:

Feature	Structured	Unstructured	Semi-structured
Format	Fixed (rows/columns)	No fixed format	Partially organized
Storage	Tables, databases	File systems, data lakes	JSON, XML files
Examples	Spreadsheets, SQL databases	Images, videos, emails	JSON, XML, HTML
Processing	Easy with SQL, spreadsheets	Needs special tools (NLP, CV)	Moderate difficulty
Volume (commonly cited estimate)	~20% of all data	~80% of all data	Varies

7.5 Data Collection

Data collection means identifying and gathering data from appropriate sources. These sources may be manual records, existing digital files, software, people, devices, or other organizations.

Methods of Data Collection:

Method	Description	Example
Manual entry	Data available in diary/register, entered digitally	Shopkeeper enters sales from register into spreadsheet
Already digital	Data already in digital format	CSV file from previous system
Software-generated	Application collects data automatically	POS (Point of Sale) software recording each sale
Surveys/Questionnaires	Primary data collection from people	Google Forms survey for customer feedback
Web scraping	Extracting data from websites	Collecting product prices from e-commerce sites
Sensors/IoT	Automatic data generation by devices	Temperature sensors, fitness trackers
Social media	User-generated content	Posts, comments, likes, shares
Existing databases	Secondary data from organizations	World Bank, IMF economic data

Real-world data collection scenarios:

Hospitals collect patient data for improving services
Shopping malls track items purchased (discovering patterns like "bedsheets and groceries are frequently bought together")
Political analysts analyse social media posts for public opinion
World Bank and IMF collect economic data from countries for forecasting

7.6 Data Storage

Process of storing data on storage devices for future retrieval and use
Huge volumes of data are generated at very high rates, so storage is a challenge
Decreasing cost of digital storage has simplified this task

Common Storage Devices:

Device	Type	Typical Capacity
Hard Disk Drive (HDD)	Magnetic	500 GB to 20 TB
Solid State Drive (SSD)	Flash memory	128 GB to 8 TB
CD/DVD	Optical	700 MB / 4.7-8.5 GB
Pen Drive (USB)	Flash memory	8 GB to 512 GB
Memory Card	Flash memory	16 GB to 1 TB
Tape Drive	Magnetic	Up to 30 TB
Cloud Storage	Network-based	Virtually unlimited

Storage formats:

Files: images, documents, audio/video stored as individual files
CSV files: comma-separated values for tabular data
Databases (DBMS): structured storage with efficient retrieval, overcomes limitations of simple file processing
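As a rough illustration of the CSV format, the sketch below writes two records and reads them back with Python's csv module. The book records are made up, and io.StringIO stands in for a file on disk.

```python
import csv
import io

# Hypothetical records using the "Books at a shop" attributes
rows = [
    {"BookTitle": "Book A", "Author": "Author X", "Price": 250, "YearOfPublication": 2019},
    {"BookTitle": "Book B", "Author": "Author Y", "Price": 400, "YearOfPublication": 2021},
]

# io.StringIO behaves like an open text file, so nothing is written to disk here.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["BookTitle", "Author", "Price", "YearOfPublication"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading back: every value arrives as a string, so numbers must be converted.
reader = csv.DictReader(io.StringIO(csv_text))
prices = [int(row["Price"]) for row in reader]
print(prices)  # [250, 400]
```

With a real file, open("books.csv", "w", newline="") would take the place of the StringIO buffer.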

7.7 Data Processing

Data processing converts raw data into meaningful information.

Steps in Data Processing:

Data Collection - gather raw data
Data Preparation/Entry - enter/import data into digital format
Data Classification - organize/categorize data
Processing - apply computations, calculations, transformations
Storage - store processed data for future retrieval
Output - generate results as reports, charts, tables

Data Processing Cycle:

Raw Data (Input) --> Processing --> Information (Output)
                        |
                    Store/Retrieve

Real-world Data Processing Examples:

Scenario	Input	Processing	Output
Exam admit card	Student details, photo, fees	Verify eligibility, generate roll number	Admit card with center details
ATM withdrawal	PIN, account type, amount	Verify PIN, check balance, deduct	Cash + receipt
Train ticket	Journey details, passenger info	Check availability, allocate berth	Ticket with PNR, berth number
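The ATM row of the table can be sketched as a small function that follows Input -> Processing -> Output. The account dictionary, PIN, and messages are hypothetical; a real ATM system is far more involved.

```python
def withdraw(account, entered_pin, amount):
    """Input: PIN and amount. Processing: verify, check balance, deduct. Output: a message."""
    if entered_pin != account["pin"]:
        return "Incorrect PIN"
    if amount > account["balance"]:
        return "Insufficient balance"
    account["balance"] -= amount  # processing: deduct the amount
    return f"Dispensed {amount}. Remaining balance: {account['balance']}"

# Stored data (hypothetical account)
account = {"pin": "1234", "balance": 5000}

print(withdraw(account, "1234", 1500))  # Dispensed 1500. Remaining balance: 3500
print(withdraw(account, "0000", 100))   # Incorrect PIN
print(withdraw(account, "1234", 9000))  # Insufficient balance
```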

Data Cleaning (Pre-processing): Before analysis, data often needs cleaning:

Remove duplicates: same record entered twice
Handle missing values: fill in or remove incomplete records
Fix errors: typos, incorrect entries
Standardize formats: dates (DD/MM/YYYY vs MM/DD/YYYY), units
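The four cleaning steps above might look like this in plain Python. The records are made up, the two date formats are assumed to be the only ones present, and dropping incomplete records is just one of the possible ways to handle missing values.

```python
from datetime import datetime

raw = [
    {"name": "Asha", "marks": "85", "dob": "05/03/2008"},
    {"name": "Asha", "marks": "85", "dob": "05/03/2008"},   # duplicate record
    {"name": "ravi ", "marks": "78", "dob": "2008-07-21"},  # stray space, different date format
    {"name": "Meena", "marks": "", "dob": "14/11/2007"},    # missing marks
]

def parse_date(text):
    """Convert a date to YYYY-MM-DD (assumed input formats: DD/MM/YYYY or YYYY-MM-DD)."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

cleaned, seen = [], set()
for rec in raw:
    key = tuple(rec.items())
    if key in seen:             # remove duplicates
        continue
    seen.add(key)
    if rec["marks"] == "":      # handle missing values (here: drop the record)
        continue
    cleaned.append({
        "name": rec["name"].strip().title(),  # fix stray spaces and capitalisation
        "marks": int(rec["marks"]),
        "dob": parse_date(rec["dob"]),        # standardize the date format
    })

for rec in cleaned:
    print(rec)
```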

7.8 Statistical Techniques for Data Processing

Statistical techniques help us summarise and understand data. They are divided into measures of central tendency and measures of variability.

7.8.1 Measures of Central Tendency

A measure of central tendency is a single value that gives us some idea about the centre of the data.

(A) Mean (Average)

Sum of all values divided by the number of values
Formula: Mean = (x1 + x2 + ... + xn) / n
Sensitive to outliers - one extreme value can significantly change the mean

Example: Heights (in cm) = [90, 102, 110, 115, 85, 90, 100, 110, 110]
Mean = (90 + 102 + 110 + 115 + 85 + 90 + 100 + 110 + 110) / 9
     = 912 / 9
     = 101.33 cm

Effect of outliers on mean:

Original data:  [10, 12, 14, 11, 13]    Mean = 12.0
With outlier:   [10, 12, 14, 11, 13, 100]  Mean = 26.67  (misleading!)

The outlier (100) drastically changes the mean. When data contains outliers, check whether they are errors before removing them, or use the median instead.
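The effect can be checked with Python's built-in statistics module. This sketch compares mean and median on the two lists above.

```python
import statistics

original = [10, 12, 14, 11, 13]
with_outlier = original + [100]

for label, values in [("Original", original), ("With outlier", with_outlier)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{label}: mean = {mean:.2f}, median = {median}")

# The mean jumps from 12.00 to 26.67, while the median only moves from 12 to 12.5.
```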

Python code to calculate mean:

data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
mean = sum(data) / len(data)
print(f"Mean = {mean:.2f}")  # Mean = 101.33

(B) Median (Middle Value)

When all values are sorted in ascending/descending order, the middle value is the median
Odd number of values: median = middle value
Even number of values: median = average of two middle values
Barely affected by outliers, so better than mean for skewed data

Example (odd count):
Sorted: [85, 90, 90, 100, 102, 110, 110, 110, 115]  (9 values)
Median = value at position 5 = 102 cm

Example (even count):
Data: [3, 7, 8, 12, 14, 18]  (6 values)
Median = (8 + 12) / 2 = 10.0

Python code to calculate median:

data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
data_sorted = sorted(data)
n = len(data_sorted)
if n % 2 == 1:
    median = data_sorted[n // 2]
else:
    median = (data_sorted[n // 2 - 1] + data_sorted[n // 2]) / 2
print(f"Median = {median}")  # Median = 102

(C) Mode (Most Frequent)

Value that appears the most number of times in the data
A dataset can have no mode (all values unique), one mode, or multiple modes
Can be found for both numeric and non-numeric data (e.g., most popular car colour)

Example:
Heights: [85, 90, 90, 100, 102, 110, 110, 110, 115]
Mode = 110 (appears 3 times, the highest frequency)

Example (no mode):
Data: [5, 8, 12, 3, 7]   (each value appears once, so there is no mode)

Example (multiple modes):
Data: [1, 2, 2, 3, 3, 4]  (both 2 and 3 appear twice, so the data is bimodal)

Python code to calculate mode:

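from collections import Counter

data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
freq = Counter(data)             # maps each value to how often it occurs
max_count = max(freq.values())   # the highest frequency
# Keep every value that reaches the highest frequency (there may be several)
modes = [val for val, count in freq.items() if count == max_count]
print(f"Mode = {modes}")  # Mode = [110]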
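Note that when every value is unique, the code above lists all values as modes. A possible variant, following these notes' convention that such data has no mode, returns an empty list in that case; find_modes is a name chosen here for illustration.

```python
from collections import Counter

def find_modes(values):
    """Return the list of modes; an empty list if every value occurs only once."""
    freq = Counter(values)
    max_count = max(freq.values())
    if max_count == 1:  # all values unique: treated as "no mode"
        return []
    return [val for val, count in freq.items() if count == max_count]

print(find_modes([85, 90, 90, 100, 102, 110, 110, 110, 115]))  # [110]
print(find_modes([5, 8, 12, 3, 7]))                            # []
print(find_modes([1, 2, 2, 3, 3, 4]))                          # [2, 3]
print(find_modes(["red", "blue", "red", "white"]))             # ['red']
```

The last call shows that mode also works for non-numeric data.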

When to use which measure:

Measure	Best Used When	Not Good For
Mean	Data is evenly distributed, no extreme values	Data with outliers
Median	Data has outliers or is skewed	Categorical data
Mode	Finding most common/popular value	Data where all values are unique

7.8.2 Measures of Variability (Dispersion)

Measures of variability describe the spread or variation of values around the mean. Two datasets can have the same mean but very different spreads.
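A small sketch of that last point, using two made-up sets of marks with the same mean. It uses statistics.pstdev, which divides by n.

```python
import statistics

# Hypothetical marks for two classes
class_a = [48, 49, 50, 51, 52]
class_b = [10, 30, 50, 70, 90]

for name, marks in [("Class A", class_a), ("Class B", class_b)]:
    mean = statistics.mean(marks)
    spread = statistics.pstdev(marks)  # population standard deviation (divides by n)
    print(f"{name}: mean = {mean}, SD = {spread:.2f}")

# Both means are 50, but the SD is about 1.41 for Class A and 28.28 for Class B.
```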

(A) Range

Difference between maximum and minimum values
Formula: Range = Maximum - Minimum
Calculated only for numerical data
Tells about the coverage/spread of data
Sensitive to outliers (uses only two extreme values)

Example:
Heights: [85, 90, 90, 100, 102, 110, 110, 110, 115]
Range = 115 - 85 = 30 cm

Salaries: [25000, 28000, 30000, 35000, 500000]
Range = 500000 - 25000 = 475000 (misleading due to outlier!)
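Python code to calculate range, as a minimal sketch. value_range is a helper name chosen here, and the last line drops the outlier to show how much it inflates the result.

```python
def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

heights = [85, 90, 90, 100, 102, 110, 110, 110, 115]
salaries = [25000, 28000, 30000, 35000, 500000]

print(value_range(heights))        # 30
print(value_range(salaries))       # 475000
print(value_range(salaries[:-1]))  # 10000 (without the outlier 500000)
```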

(B) Standard Deviation

Measures spread of data using all values (not just extremes like Range)
Calculated as the positive square root of the average of squared differences from the mean
Smaller SD = data is closely clustered around mean
Larger SD = data is widely spread

Formula:

SD (sigma) = sqrt( sum((xi - mean)^2) / n )

Step-by-step calculation:

Heights: [90, 102, 110, 115, 85, 90, 100, 110, 110], Mean = 101.33

Height (x)	x - mean	(x - mean)^2
90	-11.33	128.37
102	0.67	0.45
110	8.67	75.17
115	13.67	186.87
85	-16.33	266.67
90	-11.33	128.37
100	-1.33	1.77
110	8.67	75.17
110	8.67	75.17
Total	~0	938.00

SD = sqrt(938.00 / 9) = sqrt(104.22) = 10.21 cm

Python code to calculate standard deviation:

import math

data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]
variance = sum(squared_diffs) / len(data)
std_dev = math.sqrt(variance)
print(f"Standard Deviation = {std_dev:.2f}")  # Standard Deviation = 10.21
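The result can be cross-checked with the statistics module. The formula in these notes divides by n, which matches statistics.pstdev (population standard deviation); statistics.stdev divides by n - 1 (sample standard deviation) and gives a slightly larger value.

```python
import statistics

data = [90, 102, 110, 115, 85, 90, 100, 110, 110]

population_sd = statistics.pstdev(data)  # divides by n, same as the formula above
sample_sd = statistics.stdev(data)       # divides by n - 1

print(f"Population SD = {population_sd:.2f}")  # Population SD = 10.21
print(f"Sample SD     = {sample_sd:.2f}")      # Sample SD     = 10.83
```

If a calculator or spreadsheet gives 10.83 for this data, it is using the n - 1 version.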

Comparison of Range and Standard Deviation:

Feature	Range	Standard Deviation
Uses	Only max and min values	All values
Sensitivity to outliers	Very sensitive	Less sensitive
Information	Basic spread	Detailed spread
Calculation	Simple (max - min)	Complex (involves mean, squares, sqrt)

7.9 Choosing the Right Statistical Technique

Problem Statement	Suitable Technique
Disparity in salaries of all employees	Standard Deviation or Range
Average performance of a class in a test	Mean
Compare height of residents of two cities	Mean
Find the dominant value from a set	Mode
Compare income of residents of two cities	Median (incomes are usually skewed by a few very high values)
Find popular car colour in a city	Mode
Middle value of exam scores	Median
Spread of temperature readings	Range or Standard Deviation

7.10 Data Visualization

Bar charts: compare categories (e.g., sales of different products)
Pie charts: show proportions (e.g., percentage of market share)
Line graphs: show trends over time (e.g., temperature over a week)
Histograms: show frequency distribution (e.g., marks distribution)
Visualization helps identify patterns, trends, and outliers
Tools: matplotlib (Python), Excel, Tableau
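To show what a histogram counts, here is a text-only sketch that groups made-up marks into bins of width 20 and prints one # per student. The marks and BIN_WIDTH are assumptions for illustration; a plotting tool such as matplotlib would draw the same counts as bars.

```python
from collections import Counter

# Hypothetical marks of 14 students
marks = [35, 42, 58, 61, 67, 72, 75, 78, 81, 88, 93, 55, 64, 70]
BIN_WIDTH = 20  # assumed bin size: 20-39, 40-59, 60-79, 80-99

# Map each mark to the start of its bin, then count marks per bin
bins = Counter((m // BIN_WIDTH) * BIN_WIDTH for m in marks)

for start in sorted(bins):
    print(f"{start:3d}-{start + BIN_WIDTH - 1:3d} | {'#' * bins[start]}")
```

The longest row (60-79) shows where most marks are concentrated.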

Important Definitions

#	Term	Definition
1	Data	Collection of raw facts, numbers, characters, symbols
2	Information	Processed data that has meaning and context
3	Knowledge	Understanding derived from analysing information
4	Structured data	Data organized in rows and columns (tabular format)
5	Unstructured data	Data without predefined format (images, videos, text)
6	Semi-structured data	Data with some organizational properties (JSON, XML)
7	Metadata	Data about data (e.g., image file size, email subject line)
8	Data processing	Converting raw data into meaningful information
9	Census	Systematic collection and recording of population data
10	Outlier	Exceptionally large or small value that can distort analysis
11	Mean	Average of all values (sum / count)
12	Median	Middle value when data is sorted
13	Mode	Most frequently occurring value
14	Range	Difference between maximum and minimum values
15	Standard deviation	Measure of spread, square root of average squared deviations from mean
16	Measure of central tendency	Single value representing the centre of data (mean, median, mode)
17	Measure of variability	Value indicating spread of data (range, standard deviation)

Why This Chapter Matters

Even though not in the current CBSE syllabus, this chapter connects to:

AI/ML curriculum being introduced in many CBSE schools
Data Science as an elective subject in Class XI/XII
NEP 2020 emphasis on computational thinking and data literacy
Foundation for understanding pandas, NumPy, and data analysis in Python

Practice Problems

Identify the type of data (structured/unstructured/semi-structured):
a) Recording a video: unstructured
b) Marking attendance in a register: structured
c) Writing tweets: unstructured
d) Filling an online application form: structured
e) An XML configuration file: semi-structured
Temperature (in Celsius) of 7 days: 34, 34, 27, 28, 27, 34, 34
a) Mean = (34+34+27+28+27+34+34)/7 = 218/7 = 31.14
b) Range = 34 - 27 = 7
c) Mode = 34 (appears 4 times)
d) Median: sorted data = [27, 27, 28, 34, 34, 34, 34], so median = 34 (4th value)
Write Python code to compute mean, median, mode, and standard deviation for a given list of numbers.
Differentiate between structured and unstructured data with examples.
Explain the data processing cycle with a real-world example.
Why is mean not suitable when data has outliers? Which measure should be used instead?
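One possible solution sketch for the Python practice problem, checked against the temperature data from the problem before it. The function name summarise and the dictionary keys are choices made here; the standard deviation divides by n, as in section 7.8.2.

```python
import math
from collections import Counter

def summarise(values):
    """Return mean, median, modes, range and (population) standard deviation."""
    n = len(values)
    ordered = sorted(values)
    mean = sum(values) / n

    mid = n // 2
    if n % 2 == 1:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2

    counts = Counter(values)
    top = max(counts.values())
    modes = [v for v, c in counts.items() if c == top]

    variance = sum((x - mean) ** 2 for x in values) / n
    return {
        "mean": mean,
        "median": median,
        "modes": modes,
        "range": ordered[-1] - ordered[0],
        "sd": math.sqrt(variance),
    }

temps = [34, 34, 27, 28, 27, 34, 34]
result = summarise(temps)
print(f"Mean = {result['mean']:.2f}")    # Mean = 31.14
print(f"Median = {result['median']}")    # Median = 34
print(f"Mode = {result['modes']}")       # Mode = [34]
print(f"Range = {result['range']}")      # Range = 7
print(f"SD = {result['sd']:.2f}")        # SD = 3.31
```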

Key Points Students Miss

Data is NOT the same as information - data is raw; information is processed data with meaning
Mean is strongly affected by outliers; median is largely unaffected. Choose wisely.
Mode can work on non-numeric data (e.g., favourite colour) but mean and median cannot
Standard deviation uses ALL values while range uses only two extreme values
Metadata is data about data - not the actual content, but its description
An often-quoted estimate is that ~80% of the world's data is unstructured: images, videos, emails dominate
A dataset can have no mode, one mode, or multiple modes
When computing median, you must sort the data first

Board Exam Tips

For calculation questions, show all intermediate steps, not just the final answer
When asked "which statistical technique to use", always justify your choice
Know the formulas for mean, median, mode, range, and standard deviation
For "differentiate" questions about data types, always include examples with each type
The data processing cycle diagram (Input -> Processing -> Output with Storage) is commonly asked