Chapter 7: Understanding Data
CBSE Unit: NOT in current CBSE syllabus (2025-26)
Status: Supplementary, relevant to data science/AI trends in education
Priority: LOW for exam, but good foundation for AI/data science content
Key Concepts
7.1 What is Data?
- Data is a collection of characters, numbers, and other symbols that represent values
- Singular: datum; plural: data
- Computers store data electronically for faster processing compared to manual methods
- The ICT revolution (computers, mobile, Internet) has led to the generation of large volumes of data at a very fast pace
- Data by itself cannot help in decision making; it needs to be processed and analysed
Distinction: Data vs Information vs Knowledge
| Term | Meaning | Example |
|---|---|---|
| Data | Raw, unprocessed facts | 85, 90, 78, 92, 88 (marks of 5 students) |
| Information | Processed data with meaning | Average marks = 86.6, Highest = 92 |
| Knowledge | Understanding derived from information | Class performance is good; focus on weaker students |
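The step from data to information to knowledge in the table above can be sketched in a few lines of Python. This is only an illustration: the marks come from the table, while the threshold `GOOD_AVERAGE` is a hypothetical value chosen for the example and is not part of the chapter.

```python
# Data: raw, unprocessed facts (marks of 5 students, from the table above)
marks = [85, 90, 78, 92, 88]

# Information: processed data with meaning
average = sum(marks) / len(marks)
highest = max(marks)

# Knowledge: a judgement drawn from the information.
# GOOD_AVERAGE is an assumed cut-off used only for this illustration.
GOOD_AVERAGE = 75
conclusion = "good" if average >= GOOD_AVERAGE else "needs attention"

print(f"Average marks = {average:.1f}, Highest = {highest}")
print(f"Class performance is {conclusion}")
```

The same list of numbers is useless for a decision until it is summarised; the summary only becomes knowledge once someone interprets it against a goal.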
7.2 Examples of Data
- Personal data: name, age, gender, contact details
- Transaction data: banking, shopping, ticketing (online or offline)
- Media data: images (pixels), video (frames), audio, graphics, animations
- Documents and web pages: text content, hyperlinks
- Online posts: comments, messages, social media content
- Sensor data: signals generated by IoT devices
- Satellite data: meteorological data, communication data, earth observation data
7.3 Importance of Data
Data is crucial for decision making across various fields:
| Domain | How Data is Used |
|---|---|
| College admissions | Placement data, faculty qualifications, fees, facilities |
| Government | Census data for planning and policy formulation |
| Sports | Analysing opponent team performances for strategy |
| Banking | Customer accounts, transactions, fraud detection |
| Elections | Electronic voting machines for recording and counting votes |
| Science | Recording experimental results, comparing outcomes |
| Pharmaceutical | Testing medicine effectiveness through clinical data |
| Libraries | Book inventory, membership management |
| Search engines | Analysing web data to provide relevant results |
| Weather | Satellite data analysis for forecasts and alerts |
| Business | Market analysis, customer feedback, dynamic pricing |
| Cab services | Demand-based dynamic pricing (surge pricing) |
| Restaurants | Sales data analysis for "happy hours" discounts |
7.4 Types of Data
(A) Structured Data
- Organised in a well-defined format (rows and columns)
- Stored in tabular format: tables, databases, spreadsheets
- Each column = attribute/parameter/variable
- Each row = observation/record
- Easy to process and analyse using standard tools
Example: Kitchen items inventory
| ModelNo | ProductName | UnitPrice | Discount(%) | Items_in_Inventory |
|---|---|---|---|---|
| ABC1 | Water bottle | 126 | 8 | 13 |
| ABC2 | Melamine Plates | 320 | 5 | 45 |
| ABC3 | Dinner Set | 4200 | 10 | 8 |
| GH67 | Jug | 80 | 0 | 10 |
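Because every row has the same attributes, structured data maps naturally onto a list of records. The sketch below stores the inventory table as a list of dictionaries; the key `Discount` (for the `Discount(%)` column) and the helper `discounted_price` are names assumed for this example.

```python
# Each dictionary is one row (record); each key is one column (attribute).
inventory = [
    {"ModelNo": "ABC1", "ProductName": "Water bottle", "UnitPrice": 126, "Discount": 8, "Items_in_Inventory": 13},
    {"ModelNo": "ABC2", "ProductName": "Melamine Plates", "UnitPrice": 320, "Discount": 5, "Items_in_Inventory": 45},
    {"ModelNo": "ABC3", "ProductName": "Dinner Set", "UnitPrice": 4200, "Discount": 10, "Items_in_Inventory": 8},
    {"ModelNo": "GH67", "ProductName": "Jug", "UnitPrice": 80, "Discount": 0, "Items_in_Inventory": 10},
]

def discounted_price(record):
    """Return the unit price after applying the percentage discount."""
    return record["UnitPrice"] * (100 - record["Discount"]) / 100

# Because the structure is fixed, the same operation works on every row.
for record in inventory:
    print(f"{record['ProductName']}: {discounted_price(record):.2f}")

total_items = sum(record["Items_in_Inventory"] for record in inventory)
print(f"Total items in inventory = {total_items}")
```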
More examples of structured data:
| Entity/Activity | Data Fields (Attributes) |
|---|---|
| Books at a shop | BookTitle, Author, Price, YearOfPublication |
| School fees | StudentName, Class, RollNo, FeesAmount, DepositDate |
| ATM withdrawal | AccHolderName, AccountNo, TypeOfAcc, DateOfWithdrawal, AmountWithdrawn, ATMid |
(B) Unstructured Data
- No predefined format or fixed structure
- Cannot be stored in traditional row-and-column (tabular) format
- Much harder to process and analyse than structured data
- Examples: images, videos, audio files, emails, social media posts, web pages, news articles, business reports
- A newspaper page has no fixed pattern: the number of images, articles, and ads differs each day
- An email has no fixed structure: the number of lines, paragraphs, and attachments varies
Metadata: Unstructured data is often described using metadata (data about data).
- Email metadata: subject, recipient, sender, date, attachment count
- Image metadata: file size (KB/MB), image type (JPEG, PNG), resolution, date taken
- When you click a photograph on your phone, metadata such as GPS location, date/time, and camera settings is automatically recorded
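To see metadata separately from content, the following sketch creates a small throwaway file and reads its file-system metadata using Python's standard `os` and `tempfile` modules. The file content and the dictionary keys are assumptions made for the example; real image or email metadata would need format-specific tools.

```python
import os
import tempfile
from datetime import datetime

# Create a small temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"hello data")  # the content itself: 10 bytes
    path = handle.name

# os.stat returns data ABOUT the file, not the data inside it.
info = os.stat(path)
metadata = {
    "file_type": os.path.splitext(path)[1],
    "size_bytes": info.st_size,
    "last_modified": datetime.fromtimestamp(info.st_mtime),
}
os.remove(path)  # clean up the temporary file

for key, value in metadata.items():
    print(f"{key}: {value}")
```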
(C) Semi-structured Data
- Has some organizational properties but is not as rigid as structured data
- Contains tags or markers to separate elements, but no strict tabular format
- Examples: JSON, XML, HTML, email headers, log files
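A short JSON example may make the idea of "tags without a strict table" more concrete. The record below is hypothetical (the names and fields are invented for illustration); it is parsed with Python's standard `json` module.

```python
import json

# Hypothetical semi-structured record: keys act as markers, but fields
# can be nested, repeated, or missing, unlike fixed table columns.
text = """
{
  "name": "Asha",
  "contacts": [
    {"type": "email", "value": "asha@example.com"},
    {"type": "phone", "value": "0000000000"}
  ],
  "notes": "Prefers morning classes"
}
"""

record = json.loads(text)

# A list nested inside a record does not fit one cell of a flat table.
contact_types = [contact["type"] for contact in record["contacts"]]

# Optional fields are allowed: this record simply has no "address".
has_address = "address" in record

print(f"Contact types: {contact_types}")
print(f"Has address field: {has_address}")
```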
Comparison Table:
| Feature | Structured | Unstructured | Semi-structured |
|---|---|---|---|
| Format | Fixed (rows/columns) | No fixed format | Partially organized |
| Storage | Tables, databases | File systems, data lakes | JSON, XML files |
| Examples | Spreadsheets, SQL databases | Images, videos, emails | JSON, XML, HTML |
| Processing | Easy with SQL, spreadsheets | Needs special tools (NLP, CV) | Moderate difficulty |
| Volume | Commonly estimated at ~20% of all data | Commonly estimated at ~80% of all data | Varies |
7.5 Data Collection
Data collection means identifying and gathering data from appropriate sources. Data can come from many sources, both manual and automated.
Methods of Data Collection:
| Method | Description | Example |
|---|---|---|
| Manual entry | Data available in diary/register, entered digitally | Shopkeeper enters sales from register into spreadsheet |
| Already digital | Data already in digital format | CSV file from previous system |
| Software-generated | Application collects data automatically | POS (Point of Sale) software recording each sale |
| Surveys/Questionnaires | Primary data collection from people | Google Forms survey for customer feedback |
| Web scraping | Extracting data from websites | Collecting product prices from e-commerce sites |
| Sensors/IoT | Automatic data generation by devices | Temperature sensors, fitness trackers |
| Social media | User-generated content | Posts, comments, likes, shares |
| Existing databases | Secondary data from organizations | World Bank, IMF economic data |
Real-world data collection scenarios:
- Hospitals collect patient data for improving services
- Shopping malls track items purchased (discovering patterns like "bedsheets and groceries are frequently bought together")
- Political analysts analyse social media posts for public opinion
- World Bank and IMF collect economic data from countries for forecasting
7.6 Data Storage
- Process of storing data on storage devices for future retrieval and use
- Huge volumes of data are generated at very high rates, so storage is a challenge
- Decreasing cost of digital storage has simplified this task
Common Storage Devices:
| Device | Type | Typical Capacity |
|---|---|---|
| Hard Disk Drive (HDD) | Magnetic | 500 GB to 20 TB |
| Solid State Drive (SSD) | Flash memory | 128 GB to 8 TB |
| CD/DVD | Optical | 700 MB / 4.7-8.5 GB |
| Pen Drive (USB) | Flash memory | 8 GB to 512 GB |
| Memory Card | Flash memory | 16 GB to 1 TB |
| Tape Drive | Magnetic | Up to 30 TB |
| Cloud Storage | Network-based | Virtually unlimited |
Storage formats:
- Files: images, documents, audio/video stored as individual files
- CSV files: comma-separated values for tabular data
- Databases (DBMS): structured storage with efficient retrieval, overcomes limitations of simple file processing
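As a rough illustration of the CSV format, the sketch below writes two records and reads them back with Python's standard `csv` module. The column names follow the "Books at a shop" example earlier in the chapter; the book titles, authors, and prices are made up, and an in-memory buffer stands in for a real file.

```python
import csv
import io

fieldnames = ["BookTitle", "Author", "Price", "YearOfPublication"]
rows = [
    {"BookTitle": "Book A", "Author": "Author X", "Price": 250, "YearOfPublication": 2019},
    {"BookTitle": "Book B", "Author": "Author Y", "Price": 400, "YearOfPublication": 2021},
]

# Write: one header line, then one comma-separated line per record.
buffer = io.StringIO()  # stands in for open("books.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read: every value comes back as a string, so convert before computing.
loaded = list(csv.DictReader(io.StringIO(csv_text)))
total_price = sum(int(row["Price"]) for row in loaded)
print(f"Total price = {total_price}")
```

CSV keeps no type information, which is one of the limitations of plain files that a DBMS addresses.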
7.7 Data Processing
Data processing converts raw data into meaningful information.
Steps in Data Processing:
- Data Collection - gather raw data
- Data Preparation/Entry - enter/import data into digital format
- Data Classification - organize/categorize data
- Processing - apply computations, calculations, transformations
- Storage - store processed data for future retrieval
- Output - generate results as reports, charts, tables
Data Processing Cycle:
Raw Data (Input) --> Processing --> Information (Output)
                         |
                   Store/Retrieve
Real-world Data Processing Examples:
| Scenario | Input | Processing | Output |
|---|---|---|---|
| Exam admit card | Student details, photo, fees | Verify eligibility, generate roll number | Admit card with center details |
| ATM withdrawal | PIN, account type, amount | Verify PIN, check balance, deduct | Cash + receipt |
| Train ticket | Journey details, passenger info | Check availability, allocate berth | Ticket with PNR, berth number |
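The ATM row of the table can be sketched as a small function that takes the input, applies the processing steps, and returns the output. Everything here is hypothetical: the account store, the account number, the PIN, and the message texts are invented for illustration and do not reflect how a real banking system works.

```python
# Hypothetical stored data that the processing step retrieves and updates.
accounts = {"ACC001": {"pin": "1234", "balance": 5000}}

def withdraw(accounts, account_no, pin, amount):
    """Input: account number, PIN, amount. Output: a receipt message."""
    account = accounts.get(account_no)
    # Processing step 1: verify the PIN
    if account is None or account["pin"] != pin:
        return "Invalid account or PIN"
    # Processing step 2: check the balance
    if amount <= 0 or amount > account["balance"]:
        return "Invalid amount or insufficient balance"
    # Processing step 3: deduct the amount and store the new balance
    account["balance"] -= amount
    return f"Dispensed {amount}. Remaining balance: {account['balance']}"

receipt = withdraw(accounts, "ACC001", "1234", 1500)
rejected = withdraw(accounts, "ACC001", "0000", 500)
print(receipt)
print(rejected)
```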
Data Cleaning (Pre-processing): Before analysis, data often needs cleaning:
- Remove duplicates: same record entered twice
- Handle missing values: fill in or remove incomplete records
- Fix errors: typos, incorrect entries
- Standardize formats: dates (DD/MM/YYYY vs MM/DD/YYYY), units
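The cleaning steps above can be sketched on a tiny hand-made dataset. The records, the two accepted date formats, and the choice to drop (rather than fill in) incomplete records are all assumptions made for this example.

```python
from datetime import datetime

# Hypothetical raw records with typical problems.
raw_records = [
    {"name": "Asha", "marks": 85, "date": "05/03/2024"},
    {"name": "Asha", "marks": 85, "date": "05/03/2024"},    # duplicate
    {"name": "Ravi", "marks": None, "date": "06/03/2024"},  # missing value
    {"name": "Meena", "marks": 92, "date": "2024-03-07"},   # different date format
]

def standardize_date(text):
    """Return the date as YYYY-MM-DD, trying the formats assumed here."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text}")

cleaned = []
seen = set()
for record in raw_records:
    if record["marks"] is None:
        continue  # handle missing values: here we drop the record
    fixed = {**record, "date": standardize_date(record["date"])}
    key = (fixed["name"], fixed["marks"], fixed["date"])
    if key in seen:
        continue  # remove duplicates
    seen.add(key)
    cleaned.append(fixed)

for record in cleaned:
    print(record)
```

Whether to drop or fill in a missing value depends on the analysis; dropping is simply the shortest option to show.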
7.8 Statistical Techniques for Data Processing
Statistical techniques help us summarise and understand data. They are divided into measures of central tendency and measures of variability.
7.8.1 Measures of Central Tendency
A measure of central tendency is a single value that gives us some idea about the centre of the data.
(A) Mean (Average)
- Sum of all values divided by the number of values
- Formula: Mean = (x1 + x2 + ... + xn) / n
- Sensitive to outliers - one extreme value can significantly change the mean
Example: Heights (in cm) = [90, 102, 110, 115, 85, 90, 100, 110, 110]
Mean = (90 + 102 + 110 + 115 + 85 + 90 + 100 + 110 + 110) / 9
= 912 / 9
= 101.33 cm
Effect of outliers on mean:
Original data: [10, 12, 14, 11, 13] Mean = 12.0
With outlier: [10, 12, 14, 11, 13, 100] Mean = 26.67 (misleading!)
The outlier (100) drastically changes the mean. When outliers are present, check whether they are data-entry errors, and consider reporting the median instead of the mean.
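The effect can be checked with Python's standard `statistics` module. This sketch reuses the two lists from the example above and also computes the median for comparison.

```python
import statistics

original = [10, 12, 14, 11, 13]
with_outlier = original + [100]

mean_original = statistics.mean(original)
mean_with_outlier = statistics.mean(with_outlier)
median_original = statistics.median(original)
median_with_outlier = statistics.median(with_outlier)

# The mean jumps when the outlier is added; the median barely moves.
print(f"Mean:   {mean_original:.2f} -> {mean_with_outlier:.2f}")
print(f"Median: {median_original:.2f} -> {median_with_outlier:.2f}")
```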
Python code to calculate mean:
data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
mean = sum(data) / len(data)
print(f"Mean = {mean:.2f}") # Mean = 101.33
(B) Median (Middle Value)
- When all values are sorted in ascending/descending order, the middle value is the median
- Odd number of values: median = middle value
- Even number of values: median = average of two middle values
- Far less affected by outliers than the mean, so it is a better choice for skewed data
Example (odd count):
Sorted: [85, 90, 90, 100, 102, 110, 110, 110, 115] (9 values)
Median = value at position 5 = 102 cm
Example (even count):
Data: [3, 7, 8, 12, 14, 18] (6 values)
Median = (8 + 12) / 2 = 10.0
Python code to calculate median:
data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
data_sorted = sorted(data)
n = len(data_sorted)
if n % 2 == 1:
    median = data_sorted[n // 2]
else:
    median = (data_sorted[n // 2 - 1] + data_sorted[n // 2]) / 2
print(f"Median = {median}") # Median = 102
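The same logic can be wrapped in a reusable function and checked against both worked examples (odd and even counts). The function name `median_of` is an assumption for this sketch; the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics

def median_of(values):
    """Return the middle value of the sorted data (average of two if even)."""
    ordered = sorted(values)  # sorting must come first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]  # odd count (9 values)
even_data = [3, 7, 8, 12, 14, 18]                     # even count (6 values)

median_heights = median_of(heights)
median_even = median_of(even_data)

print(f"Median of heights = {median_heights}")
print(f"Median of even-count data = {median_even}")
print(f"Matches statistics.median: {median_even == statistics.median(even_data)}")
```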
(C) Mode (Most Frequent)
- Value that appears the most number of times in the data
- A dataset can have no mode (all values unique), one mode, or multiple modes
- Can be found for both numeric and non-numeric data (e.g., most popular car colour)
Example:
Heights: [85, 90, 90, 100, 102, 110, 110, 110, 115]
Mode = 110 (appears 3 times - highest frequency)
Example (no mode):
Data: [5, 8, 12, 3, 7] (each value appears once - no mode)
Example (multiple modes):
Data: [1, 2, 2, 3, 3, 4] (both 2 and 3 appear twice - bimodal)
Python code to calculate mode:
from collections import Counter
data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
freq = Counter(data)
max_count = max(freq.values())
modes = [val for val, count in freq.items() if count == max_count]
print(f"Mode = {modes}") # Mode = [110]
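The code above reports every value as a mode when all values are unique. A small variation, sketched below, follows the "no mode" convention used in this chapter and also works on non-numeric data. The function name `modes_of` and the car-colour list are assumptions for the example.

```python
from collections import Counter

def modes_of(values):
    """Return all most-frequent values; an empty list means 'no mode'."""
    freq = Counter(values)
    if not freq:
        return []
    max_count = max(freq.values())
    if max_count == 1:
        return []  # every value occurs once: treated as no mode
    return [value for value, count in freq.items() if count == max_count]

no_mode = modes_of([5, 8, 12, 3, 7])
bimodal = modes_of([1, 2, 2, 3, 3, 4])
colour_mode = modes_of(["red", "white", "red", "blue"])  # non-numeric data

print(f"No mode:     {no_mode}")
print(f"Bimodal:     {bimodal}")
print(f"Car colours: {colour_mode}")
```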
When to use which measure:
| Measure | Best Used When | Not Good For |
|---|---|---|
| Mean | Data is evenly distributed, no extreme values | Data with outliers |
| Median | Data has outliers or is skewed | Categorical data |
| Mode | Finding most common/popular value | Data where all values are unique |
7.8.2 Measures of Variability (Dispersion)
Measures of variability describe the spread or variation of values around the mean. Two datasets can have the same mean but very different spreads.
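A quick illustration of "same mean, different spread": the two score lists below are hypothetical and were chosen so that both average 50. The spread is measured with `statistics.pstdev`, the standard library's population standard deviation.

```python
import statistics

# Hypothetical test scores for two classes with the same mean.
class_a = [48, 49, 50, 51, 52]
class_b = [10, 30, 50, 70, 90]

mean_a = statistics.mean(class_a)
mean_b = statistics.mean(class_b)
sd_a = statistics.pstdev(class_a)
sd_b = statistics.pstdev(class_b)

print(f"Class A: mean = {mean_a}, standard deviation = {sd_a:.2f}")
print(f"Class B: mean = {mean_b}, standard deviation = {sd_b:.2f}")
```

The mean alone would suggest the two classes performed identically; the measure of variability shows that they did not.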
(A) Range
- Difference between maximum and minimum values
- Formula: Range = Maximum - Minimum
- Calculated only for numerical data
- Tells about the coverage/spread of data
- Sensitive to outliers (uses only two extreme values)
Example:
Heights: [85, 90, 90, 100, 102, 110, 110, 110, 115]
Range = 115 - 85 = 30 cm
Salaries: [25000, 28000, 30000, 35000, 500000]
Range = 500000 - 25000 = 475000 (misleading due to outlier!)
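Both range examples can be reproduced in a couple of lines. The helper name `value_range` is an assumption for this sketch; the last line shows how much of the salary range is due to the single outlier.

```python
def value_range(values):
    """Range = maximum - minimum (numerical data only)."""
    return max(values) - min(values)

heights = [85, 90, 90, 100, 102, 110, 110, 110, 115]
salaries = [25000, 28000, 30000, 35000, 500000]

range_heights = value_range(heights)
range_salaries = value_range(salaries)
range_without_outlier = value_range(salaries[:-1])  # drop the 500000 entry

print(f"Range of heights = {range_heights} cm")
print(f"Range of salaries = {range_salaries}")
print(f"Range of salaries without the outlier = {range_without_outlier}")
```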
(B) Standard Deviation
- Measures spread of data using all values (not just extremes like Range)
- Calculated as the positive square root of the average of squared differences from the mean
- Smaller SD = data is closely clustered around mean
- Larger SD = data is widely spread
Formula:
SD (sigma) = sqrt( sum((xi - mean)^2) / n )
Step-by-step calculation:
Heights: [90, 102, 110, 115, 85, 90, 100, 110, 110], Mean = 101.33
| Height (x) | x - mean | (x - mean)^2 |
|---|---|---|
| 90 | -11.33 | 128.44 |
| 102 | 0.67 | 0.44 |
| 110 | 8.67 | 75.11 |
| 115 | 13.67 | 186.78 |
| 85 | -16.33 | 266.78 |
| 90 | -11.33 | 128.44 |
| 100 | -1.33 | 1.78 |
| 110 | 8.67 | 75.11 |
| 110 | 8.67 | 75.11 |
| Total | ~0 | 938.00 |
SD = sqrt(938.00 / 9) = sqrt(104.22) = 10.21 cm
Python code to calculate standard deviation:
import math
data = [90, 102, 110, 115, 85, 90, 100, 110, 110]
mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]
variance = sum(squared_diffs) / len(data)
std_dev = math.sqrt(variance)
print(f"Standard Deviation = {std_dev:.2f}") # Standard Deviation = 10.21
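The formula in this chapter divides by n, which is the population standard deviation. Python's standard `statistics` module offers this as `pstdev`; its `stdev` function divides by n - 1 instead (the sample standard deviation), so the two give slightly different answers. The comparison below is a sketch to help avoid mixing them up.

```python
import statistics

data = [90, 102, 110, 115, 85, 90, 100, 110, 110]

# Population standard deviation: divides by n (matches the formula above).
population_sd = statistics.pstdev(data)

# Sample standard deviation: divides by n - 1, so it is slightly larger.
sample_sd = statistics.stdev(data)

print(f"Population SD (divide by n)     = {population_sd:.2f}")
print(f"Sample SD     (divide by n - 1) = {sample_sd:.2f}")
```

Calculators and spreadsheet tools often provide both versions, so it is worth checking which one a tool uses before comparing answers.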
Comparison of Range and Standard Deviation:
| Feature | Range | Standard Deviation |
|---|---|---|
| Uses | Only max and min values | All values |
| Sensitivity to outliers | Very sensitive | Less sensitive |
| Information | Basic spread | Detailed spread |
| Calculation | Simple (max - min) | Complex (involves mean, squares, sqrt) |
7.9 Choosing the Right Statistical Technique
| Problem Statement | Suitable Technique |
|---|---|
| Disparity in salaries of all employees | Standard Deviation or Range |
| Average performance of a class in a test | Mean |
| Compare height of residents of two cities | Standard Deviation |
| Find the dominant value from a set | Mode |
| Compare income of residents of two cities | Standard Deviation |
| Find popular car colour in a city | Mode |
| Middle value of exam scores | Median |
| Spread of temperature readings | Range or Standard Deviation |
7.10 Data Visualization
- Bar charts: compare categories (e.g., sales of different products)
- Pie charts: show proportions (e.g., percentage of market share)
- Line graphs: show trends over time (e.g., temperature over a week)
- Histograms: show frequency distribution (e.g., marks distribution)
- Visualization helps identify patterns, trends, and outliers
- Tools: matplotlib (Python), Excel, Tableau
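To show what a histogram computes, here is a dependency-free sketch that groups marks into bins and prints a text bar per bin. The marks and the bin width of 20 are assumptions for illustration; a charting tool such as matplotlib would draw the same counts as bars.

```python
from collections import Counter

# Hypothetical marks of 14 students.
marks = [35, 42, 47, 51, 55, 58, 62, 64, 67, 71, 73, 78, 85, 91]
BIN_WIDTH = 20  # assumed bin size: 0-19, 20-39, 40-59, ...

def bin_label(mark):
    """Return the label of the bin that a mark falls into, e.g. '40-59'."""
    start = (mark // BIN_WIDTH) * BIN_WIDTH
    return f"{start}-{start + BIN_WIDTH - 1}"

# Frequency distribution: how many marks fall into each bin.
counts = Counter(bin_label(mark) for mark in marks)

# Print bins in ascending order, one '#' per student.
for label in sorted(counts, key=lambda text: int(text.split("-")[0])):
    print(f"{label:>6} | {'#' * counts[label]}")
```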
Important Definitions
| # | Term | Definition |
|---|---|---|
| 1 | Data | Collection of raw facts, numbers, characters, symbols |
| 2 | Information | Processed data that has meaning and context |
| 3 | Knowledge | Understanding derived from analysing information |
| 4 | Structured data | Data organized in rows and columns (tabular format) |
| 5 | Unstructured data | Data without predefined format (images, videos, text) |
| 6 | Semi-structured data | Data with some organizational properties (JSON, XML) |
| 7 | Metadata | Data about data (e.g., image file size, email subject line) |
| 8 | Data processing | Converting raw data into meaningful information |
| 9 | Census | Systematic collection and recording of population data |
| 10 | Outlier | Exceptionally large or small value that can distort analysis |
| 11 | Mean | Average of all values (sum / count) |
| 12 | Median | Middle value when data is sorted |
| 13 | Mode | Most frequently occurring value |
| 14 | Range | Difference between maximum and minimum values |
| 15 | Standard deviation | Measure of spread, square root of average squared deviations from mean |
| 16 | Measure of central tendency | Single value representing the centre of data (mean, median, mode) |
| 17 | Measure of variability | Value indicating spread of data (range, standard deviation) |
Why This Chapter Matters
Even though not in the current CBSE syllabus, this chapter connects to:
- AI/ML curriculum being introduced in many CBSE schools
- Data Science as an elective subject in Class XI/XII
- NEP 2020 emphasis on computational thinking and data literacy
- Foundation for understanding pandas, NumPy, and data analysis in Python
Practice Problems
- Identify the type of data (structured/unstructured/semi-structured):
  - a) Recording a video: unstructured
  - b) Marking attendance in a register: structured
  - c) Writing tweets: unstructured
  - d) Filling an online application form: structured
  - e) An XML configuration file: semi-structured
- Temperature (in Celsius) of 7 days: 34, 34, 27, 28, 27, 34, 34
  - a) Mean = (34+34+27+28+27+34+34)/7 = 218/7 = 31.14
  - b) Range = 34 - 27 = 7
  - c) Mode = 34 (appears 4 times)
  - d) Median: sorted data is [27, 27, 28, 34, 34, 34, 34], so median = 34 (4th value)
- Write Python code to compute mean, median, mode, and standard deviation for a given list of numbers.
- Differentiate between structured and unstructured data with examples.
- Explain the data processing cycle with a real-world example.
- Why is mean not suitable when data has outliers? Which measure should be used instead?
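One possible answer to the Python practice problem is sketched below, using the standard `statistics` module and the temperature data from the practice problems as test input. The function name `summarise` and the dictionary keys are assumptions; a hand-written version using loops, as shown earlier in the chapter, would be equally acceptable.

```python
import statistics

def summarise(values):
    """Return the common summary statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),    # list of most frequent values
        "range": max(values) - min(values),
        "std_dev": statistics.pstdev(values),     # divides by n, as in this chapter
    }

temperatures = [34, 34, 27, 28, 27, 34, 34]
summary = summarise(temperatures)

for name, value in summary.items():
    shown = round(value, 2) if isinstance(value, float) else value
    print(f"{name}: {shown}")
```

Note that `statistics.multimode` needs Python 3.8 or later, and that it returns every value when all values are unique rather than reporting "no mode".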
Key Points Students Miss
- Data is NOT the same as information - data is raw; information is processed data with meaning
- Mean is strongly affected by outliers, while median is far more robust; choose the measure that suits the data
- Mode can work on non-numeric data (e.g., favourite colour) but mean and median cannot
- Standard deviation uses ALL values while range uses only two extreme values
- Metadata is data about data - not the actual content, but its description
- An estimated ~80% of the world's data is unstructured: images, videos, and emails dominate
- A dataset can have no mode, one mode, or multiple modes
- When computing median, you must sort the data first
Board Exam Tips
- For calculation questions, show all intermediate steps, not just the final answer
- When asked "which statistical technique to use", always justify your choice
- Know the formulas for mean, median, mode, range, and standard deviation
- For "differentiate" questions about data types, always include examples with each type
- The data processing cycle diagram (Input -> Processing -> Output with Storage) is commonly asked