Computer Organization & Architecture
Machine instructions, pipelining, memory hierarchy, I/O.
Why this unit matters
Computer Organization and Architecture (CO&A) is among the most numerically heavy units in GATE CS. Pipelining speedup, cache hit-time calculations, and effective memory access time (EMAT) questions appear almost every year and are worth 2 marks each. The underlying concepts (how instructions execute, how memory is organised, how pipelines work) also underpin OS, networks, and compiler design questions, so getting CO&A right means you can pick up marks in other units too.
Expect 8-10 marks from: instruction set design, pipeline execution timelines and hazards, cache mapping and miss/hit calculations, and interrupt vs DMA I/O comparisons.
Syllabus map
Machine Instructions and Addressing Modes
- Instruction format: opcode, operand, address fields
- Addressing modes: immediate, direct, indirect, register, register-indirect, displacement (indexed, based), relative, auto-increment/decrement
- Instruction types: data transfer, arithmetic/logic, branch, I/O
ALU, Data-path and Control Unit
- ALU design: arithmetic operations, logical operations, shifters
- Datapath components: registers, buses, ALU, MAR, MDR, PC, IR
- Control unit: hardwired vs microprogrammed
- Micro-operations and register transfer language (RTL)
Instruction Pipelining
- Stages: IF, ID, EX, MEM, WB (5-stage RISC pipeline)
- Pipeline performance: throughput, speedup, efficiency
- Pipeline hazards: structural, data (RAW, WAR, WAW), control
- Hazard mitigation: stalls/bubbles, data forwarding, branch prediction
Memory Hierarchy
- Cache memory: direct mapped, set-associative, fully associative
- Cache parameters: block size, number of sets, associativity
- Replacement policies: LRU, FIFO, optimal
- Write policies: write-through, write-back, write-allocate
- Effective Memory Access Time (EMAT)
- Virtual memory: paging, TLB, page tables
- Secondary storage: magnetic disk, seek time, rotational latency, transfer time
I/O Interface
- Programmed I/O (polling)
- Interrupt-driven I/O: interrupt service routine (ISR), priority, vectored interrupts
- Direct Memory Access (DMA): cycle stealing, burst mode
Machine Instructions and Addressing Modes
Instruction Format
An instruction typically contains:
- Opcode field: specifies the operation
- Operand/address field(s): specifies where data lives
If an instruction has k address bits and memory has 2^k locations, a single address field can access the entire memory. Modern ISAs (RISC-V, ARM) use fixed-length 32-bit instructions for simplicity of decode; x86 uses variable-length for code density.
Addressing Modes
| Mode | Effective Address / Value | Notes |
|---|---|---|
| Immediate | operand IS the value | fastest; no memory access |
| Direct | EA = address field | one memory access |
| Indirect | EA = M[address field] | two memory accesses |
| Register | operand in register | no memory access |
| Register Indirect | EA = contents of register | one memory access |
| Displacement (Indexed) | EA = base register + offset | array access |
| Relative | EA = PC + offset | branch instructions |
| Auto-increment | EA = register; then register++ | stack/array traversal |
Typical GATE trap: Indirect addressing requires TWO memory accesses to fetch the actual data (one to get the address, one to get the data). Students often count only one.
Worked example: An instruction uses register-indirect addressing. Register R1 = 2000, M[2000] = 42. What value is fetched?
Effective address = R1 = 2000. Value fetched = M[2000] = 42. If it were indirect (non-register) with address field = 2000, we'd first go to M[2000] = 42, then to M[42].
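The one-access vs two-access distinction can be sketched in Python (illustrative only; memory is modelled as a dict and the contents are hypothetical):

```python
# Hypothetical memory contents chosen to match the worked example
memory = {2000: 42, 42: 7}
R1 = 2000

def register_indirect(reg):
    # EA = contents of register; ONE memory access
    return memory[reg]

def indirect(address_field):
    # EA = M[address field]; TWO memory accesses
    ea = memory[address_field]
    return memory[ea]

print(register_indirect(R1))  # 42
print(indirect(2000))         # 7 (first M[2000] = 42, then M[42] = 7)
```

Counting the `memory[...]` lookups in each function is exactly the counting the GATE trap above is about.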
ALU and Control Unit
Datapath
Key registers in a basic datapath:
- PC (Program Counter): address of next instruction
- IR (Instruction Register): currently executing instruction
- MAR (Memory Address Register): address to read/write
- MDR (Memory Data Register): data read from or to be written to memory
- ACC (Accumulator): ALU operand/result in accumulator architectures
Control Unit Styles
Hardwired control: Boolean logic circuits directly generate control signals from the opcode and a step counter. Fast but inflexible: changing the ISA requires redesigning the circuit.
Microprogrammed control: Each machine instruction is implemented as a microprogram (sequence of micro-operations stored in a control memory). Slower but easy to modify.
Typical GATE trap: Microprogrammed control is NOT faster. It introduces an extra level of indirection (accessing control memory) but is easier to design and modify.
Instruction Pipelining
Five-Stage Pipeline
IF → ID → EX → MEM → WB
- IF: Instruction Fetch from memory using PC
- ID: Instruction Decode and register file read
- EX: Execute ALU operation or compute effective address
- MEM: Memory access (load/store)
- WB: Write result back to register file
Pipeline performance formulas:
Without pipelining: time for n instructions = n * k * t (k stages, t = stage time). With an ideal pipeline: time = (k + n - 1) * t. Speedup = (n * k * t) / ((k + n - 1) * t) = nk / (k + n - 1). As n → infinity, speedup → k (one instruction completes per cycle).
Worked example: A 5-stage pipeline with stage time 10 ns executes 100 instructions. What is the speedup over a non-pipelined processor?
Non-pipelined: 100 * 5 * 10 = 5000 ns. Pipelined: (5 + 100 - 1) * 10 = 104 * 10 = 1040 ns. Speedup = 5000 / 1040 ≈ 4.81.
Typical GATE trap: Speedup is NOT equal to the number of stages because of pipeline fill/drain overhead. For large n it approaches k, but for small n it is noticeably less.
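The formulas above are easy to drill with a two-line helper (a sketch; the function names are my own, not a standard library):

```python
def pipeline_time_ns(n, k, t):
    """Ideal k-stage pipeline: (k + n - 1) cycles of t ns each."""
    return (k + n - 1) * t

def speedup(n, k, t):
    """Non-pipelined time n*k*t divided by pipelined time."""
    return (n * k * t) / pipeline_time_ns(n, k, t)

# Worked example from the text: k = 5 stages, t = 10 ns, n = 100 instructions
print(pipeline_time_ns(100, 5, 10))   # 1040
print(round(speedup(100, 5, 10), 2))  # 4.81
```

Try `speedup(10**6, 5, 10)` to see the speedup creep toward k = 5, confirming the fill/drain point above.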
Pipeline Hazards
Structural hazard: Two instructions need the same hardware resource in the same cycle. Solution: duplicate resources (e.g., separate instruction and data caches) or stall.
Data hazards (most common in GATE):
- RAW (Read After Write): instruction j reads a value that instruction i has not yet written. This is a true dependency.
- WAR (Write After Read): instruction j writes a location that instruction i has not yet read. Anti-dependency.
- WAW (Write After Write): instruction j writes to the same location as instruction i. Output dependency.
RAW is the only hazard that causes real problems in a standard in-order pipeline. WAR and WAW appear in out-of-order processors.
Mitigation:
- Stalling (inserting bubbles): pause pipeline until the result is available. Simple but slow.
- Data forwarding (bypassing): route result directly from EX/MEM stage output to EX stage input without waiting for WB. Eliminates most RAW stalls.
- Compiler scheduling: reorder independent instructions to fill delay slots.
Worked example, Stall count: Instructions: I1: ADD R1, R2, R3 then I2: SUB R4, R1, R5. In a 5-stage pipeline with no forwarding, how many stall cycles does the RAW hazard cause?
I1 writes R1 at the end of WB (cycle 5). I2 reads R1 in ID and uses it in EX. Tracing the cycles:

Cycle: 1 2 3 4 5 6 7 8
I1: IF ID EX MEM WB
I2: IF ID **stall** **stall** EX MEM WB

With the standard assumption that the register file is written in the first half of a cycle and read in the second half, I2 can pick up R1 during cycle 5 (the same cycle as I1's WB), so its EX runs in cycle 6. That is 2 stall cycles after I2's ID. Without the split-cycle assumption, 3 stalls would be needed; GATE questions normally intend the 2-stall convention, and well-posed ones state it.
With forwarding: the result is routed from the EX/MEM pipeline register of I1 straight to the EX input of I2, so 0 stalls.
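The schedule above can be generated mechanically, which is handy for checking longer sequences. A minimal sketch (the function and its bubble-insertion convention are my own, matching the 2-stall example):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def timeline(start_cycle, stalls_after_id=0):
    """Return {cycle: stage} for one instruction entering IF at start_cycle.

    Bubbles are inserted between ID and EX, as in the RAW example above."""
    sched, cycle = {}, start_cycle
    for stage in STAGES:
        sched[cycle] = stage
        cycle += 1
        if stage == "ID":
            cycle += stalls_after_id
    return sched

i1 = timeline(1)                     # I1: IF@1 ... WB@5
i2 = timeline(2, stalls_after_id=2)  # I2: EX delayed to cycle 6
print(i1)  # {1: 'IF', 2: 'ID', 3: 'EX', 4: 'MEM', 5: 'WB'}
print(i2)  # {2: 'IF', 3: 'ID', 6: 'EX', 7: 'MEM', 8: 'WB'}
```

The last cycle of `i2` (cycle 8) is the total execution time for the two-instruction sequence, matching the hand-drawn diagram.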
Control hazards: Branch target is unknown until EX or MEM stage. Solutions: branch prediction (static: predict not taken; dynamic: branch history table), delayed branching (execute instruction after branch regardless).
Memory Hierarchy
Cache Organisation
Direct mapped: Each main memory block maps to exactly one cache line. Index = (block address) mod (number of cache lines). Simple but suffers from conflict misses.
Fully associative: A block can go into any cache line. Maximum flexibility, but requires searching all tags, expensive.
Set associative (k-way): Cache is divided into sets of k lines each. A block maps to exactly one set but can go into any of the k lines within that set. Balances flexibility and cost.
Address partitioning (direct mapped, block size B, C lines):
- Block offset bits = log2(B), Index bits = log2(C), Tag bits = remaining address bits
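The split can be computed directly (a sketch; `cache_fields` is my own helper, with `ways=1` giving the direct-mapped case):

```python
from math import log2

def cache_fields(addr_bits, block_bytes, cache_bytes, ways=1):
    """Return (offset_bits, index_bits, tag_bits) for a set-associative cache."""
    offset = int(log2(block_bytes))
    sets = cache_bytes // (block_bytes * ways)  # direct mapped: sets = lines
    index = int(log2(sets))
    tag = addr_bits - index - offset
    return offset, index, tag

# Example 3 later in this unit: 4 KB direct-mapped cache, 16 B blocks, 64 KB memory
print(cache_fields(16, 16, 4 * 1024))  # (4, 8, 4)
```

Note how raising `ways` shrinks the index field and grows the tag, which is a frequent GATE variation.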
Effective Memory Access Time (EMAT):
EMAT = hit_rate * cache_time + (1 - hit_rate) * (cache_time + memory_time)
or if memory access overlaps (simultaneous):
EMAT = hit_rate * cache_time + miss_rate * memory_time
Read the question carefully; GATE has used both models.
Worked example: Cache access time = 10 ns, main memory = 100 ns, hit rate = 90%. What is EMAT (non-overlapping model)?
EMAT = 0.9 * 10 + 0.1 * (10 + 100) = 9 + 11 = 20 ns.
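Both EMAT variants side by side, as a small sketch (function names are my own):

```python
def emat_hierarchical(h, t_cache, t_mem):
    # Non-overlapping: a miss pays the cache access AND the memory access
    return h * t_cache + (1 - h) * (t_cache + t_mem)

def emat_simultaneous(h, t_cache, t_mem):
    # Overlapping: memory access starts in parallel with the cache lookup
    return h * t_cache + (1 - h) * t_mem

print(round(emat_hierarchical(0.9, 10, 100), 2))  # 20.0
print(round(emat_simultaneous(0.9, 10, 100), 2))  # 19.0
```

The two models differ by exactly `miss_rate * t_cache`, which is why the answers are close but not equal; picking the wrong model is a classic way to land on a tempting wrong option.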
Replacement policies:
- LRU (Least Recently Used): replace the line not used for the longest time. Performs well in practice but requires tracking usage order.
- FIFO: replace the line that arrived earliest. Can suffer Belady's anomaly.
- Optimal: replace the line needed farthest in the future. Only theoretical, requires future knowledge.
Typical GATE trap: Belady's anomaly (more frames → more page faults) occurs with FIFO but NOT with LRU.
Virtual Memory and TLB
The Translation Lookaside Buffer (TLB) caches page table entries. EMAT with TLB:
EMAT = TLB_hit_rate * (TLB_time + memory_time)
+ TLB_miss_rate * (TLB_time + page_table_access + memory_time)
If the page table is itself in cache, add cache access times accordingly.
Worked example, EMAT with TLB: TLB access = 20 ns (always), TLB hit rate = 90%, main memory = 200 ns. On a TLB miss, one additional memory access for page table.
EMAT = 0.9 * (20 + 200) + 0.1 * (20 + 200 + 200) = 0.9 * 220 + 0.1 * 420 = 198 + 42 = 240 ns.
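The TLB formula generalises to multi-level page tables by changing one parameter. A sketch (my own helper, assuming the TLB is always probed first):

```python
def emat_tlb(h_tlb, t_tlb, t_mem, page_table_accesses=1):
    """EMAT with a TLB; a miss costs extra memory accesses to walk the page table."""
    hit_time = t_tlb + t_mem
    miss_time = t_tlb + page_table_accesses * t_mem + t_mem
    return h_tlb * hit_time + (1 - h_tlb) * miss_time

# Worked example above: 90% TLB hit rate, 20 ns TLB, 200 ns memory
print(round(emat_tlb(0.9, 20, 200), 2))  # 240.0
```

Setting `page_table_accesses=2` models a two-level page table, another standard GATE twist.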
Secondary Storage
Disk access time = seek time + rotational latency + transfer time.
- Seek time: time to move the read/write head to the correct track; given directly or as an average T_seek.
- Rotational latency: average = half the time for one full rotation = 1 / (2 * RPM / 60) seconds.
- Transfer time = (data size) / (transfer rate) = (sector bytes) / (bytes_per_track * RPM / 60).
Worked example: A disk rotates at 6000 RPM, average seek time = 5 ms, sector size = 512 bytes, track capacity = 50 KB. What is average disk access time for one sector?
Rotational latency = 1 / (2 * 6000/60) = 1/200 s = 5 ms. Transfer rate: 50 KB per track and 100 rotations/sec (6000/60), so rate = 50 * 1024 * 100 = 5,120,000 bytes/sec. Transfer time = 512 / 5,120,000 ≈ 0.1 ms. Total ≈ 5 + 5 + 0.1 = 10.1 ms.
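The three-term sum is mechanical enough to script (a sketch; `disk_access_ms` is my own helper, assuming one track is read per rotation):

```python
def disk_access_ms(seek_ms, rpm, sector_bytes, track_bytes):
    """Average access time = seek + half-rotation latency + sector transfer."""
    rot_per_sec = rpm / 60
    rotational_ms = 1000 / (2 * rot_per_sec)      # half a rotation, in ms
    transfer_rate = track_bytes * rot_per_sec     # bytes/sec: one track per rotation
    transfer_ms = 1000 * sector_bytes / transfer_rate
    return seek_ms + rotational_ms + transfer_ms

# Worked example above: 6000 RPM, 5 ms seek, 512 B sector, 50 KB track
print(round(disk_access_ms(5, 6000, 512, 50 * 1024), 2))  # 10.1
```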
I/O Interface
Programmed I/O (Polling)
The CPU continuously checks the status register of the I/O device to see if it is ready. Simple to implement but wastes CPU cycles: the CPU is busy-waiting.
Interrupt-Driven I/O
The CPU initiates an I/O transfer and continues executing other instructions. When the device is ready, it sends an interrupt signal. The CPU suspends current execution, saves state (pushes PC and flags), and executes the Interrupt Service Routine (ISR). After ISR, CPU restores state and resumes.
Priority interrupts: Multiple devices may interrupt. A priority encoder or daisy-chain selects the highest-priority interrupt. Vectored interrupts provide an interrupt vector (starting address of ISR) directly.
Direct Memory Access (DMA)
DMA allows an I/O device to transfer data directly to/from main memory without CPU intervention. The CPU programs the DMA controller with source, destination, and count, then does other work. DMA "steals" memory bus cycles from the CPU (cycle stealing) or takes the bus for an entire block transfer (burst mode). The CPU is interrupted only at the end of the transfer.
Typical GATE trap: DMA does NOT eliminate CPU involvement entirely; the CPU still sets up the DMA transfer and handles the completion interrupt. DMA reduces, not eliminates, CPU overhead.
Worked Examples
Example 1 (Addressing modes): An instruction is 16 bits: 4-bit opcode, 12-bit address field. Memory word size = 16 bits. How many distinct operations can the ISA support?
2^4 = 16 distinct operations.
Example 2 (Pipeline speedup): A 4-stage pipeline, each stage 5 ns, executes 1000 instructions. Non-pipelined time per instruction = 20 ns. What is speedup?
Non-pipelined: 1000 * 20 = 20,000 ns. Pipelined: (4 + 1000 - 1) * 5 = 1003 * 5 = 5015 ns. Speedup = 20,000 / 5015 ≈ 3.99.
Example 3 (Cache): A 4 KB direct-mapped cache with block size 16 bytes. Main memory is 64 KB. How many bits are in the tag?
Block offset bits = log2(16) = 4. Number of cache lines = 4096/16 = 256, so index bits = log2(256) = 8. Memory address bits = log2(64 * 1024) = 16. Tag bits = 16 - 8 - 4 = 4.
Example 4 (DMA): A DMA controller transfers 4 KB in burst mode at 32 bits per cycle, with bus cycle time 10 ns. How long does the burst take?
4 KB = 4096 bytes = 1024 words of 32 bits. Time = 1024 * 10 ns = 10,240 ns = 10.24 microseconds.
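The word-count arithmetic generalises to any bus width (a sketch; `dma_burst_ns` is my own helper, assuming the bus is held for the whole burst):

```python
def dma_burst_ns(total_bytes, bus_width_bits, cycle_ns):
    """Burst-mode DMA time: one bus cycle per bus-width word."""
    words = total_bytes // (bus_width_bits // 8)
    return words * cycle_ns

# Example 4 above: 4 KB over a 32-bit bus at 10 ns per cycle
print(dma_burst_ns(4 * 1024, 32, 10))  # 10240
```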
Example 5 (Interrupts): Explain the difference between vectored and non-vectored interrupts.
In a non-vectored interrupt, all devices jump to the same ISR address; software must poll to find the source. In a vectored interrupt, each device supplies an interrupt vector (address or index) so the CPU can jump directly to the correct ISR: faster response, no polling overhead.
Quick-revision Summary
- Indirect addressing = 2 memory accesses; register-indirect = 1
- Microprogrammed control is flexible, not fast; hardwired is fast but rigid
- Ideal pipeline speedup = k (number of stages); actual speedup < k due to fill/drain
- RAW is the only true data hazard in in-order pipelines; forwarding eliminates most RAW stalls
- EMAT (non-overlapping) = h*Tc + (1-h)(Tc + Tm)
- Direct-mapped cache: index = block_address mod num_lines
- LRU does not suffer Belady's anomaly; FIFO can
- TLB reduces effective address-translation time; a TLB miss costs an extra memory access for the page table
- Disk access = seek + rotational latency + transfer; average rotational latency = half a rotation
- DMA: CPU sets up the transfer, DMA handles data movement, CPU is interrupted at completion
- Interrupt-driven I/O: CPU is notified when the device is ready; better than polling (no busy-wait)
- Structural hazard: resource conflict; data hazard: dependency; control hazard: branch
How to Study this Unit
Pipeline calculations are the highest-return topic. Learn the formula for pipeline execution time and speedup, then practice drawing pipeline stage diagrams with stalls and forwarding. GATE questions often show a code sequence and ask for cycles taken.
For cache questions, always identify the three address fields first. Write down "block offset = log2(block size), index = log2(number of sets), tag = remaining bits" before doing any calculation. Most errors come from confusing index bits and tag bits.
The EMAT formula exists in two variants, so read the problem carefully. Some problems say "on a miss, you access main memory additionally" (non-overlapping); others say memory access starts in parallel. Identify which model the question uses.
Draw the pipeline diagram for hazard questions. Do not try to count stalls in your head. A 5-column by n-row table with IF/ID/EX/MEM/WB headings makes stall cycles visually obvious.
Addressing modes: practice naming the mode from the syntax. Most GATE questions describe a mode in English ("the instruction contains the address of the memory location that contains the operand") and ask you to name it (here: indirect). Drill these one-liners until they are instant.
For memory hierarchy, solve PYQs from 2008-2024. Cache and virtual memory questions repeat formulaically. The numbers change but the structure is identical: given capacity, block size, associativity, find tag bits, index bits, offset bits, and then a miss penalty calculation.