COEN6741 Computer Architecture and Design
Chapter 5: Memory Hierarchy Design (Dr. Sofiène Tahar)

Outline
• Introduction • Memory Hierarchy • Cache Memory • Cache Performance • Main Memory • Virtual Memory • Translation Lookaside Buffer • Alpha 21064 Example
Who Cares About the Memory Hierarchy?

Computer Architecture Topics
• Input/Output and Storage: Disks, WORM, Tape; RAID, Emerging Technologies, Interleaving, Bus protocols
• DRAM / Memory Hierarchy: Coherence, Bandwidth, Latency; Addressing, Protection, Exception Handling
• L2 Cache, L1 Cache (VLSI)
• Instruction Set Architecture: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, DSP; Pipelining and Instruction Level Parallelism

Processor-DRAM Memory Gap (latency)
[Figure: CPU vs. DRAM performance, 1980-2000. µProc improves 60%/yr (2X/1.5 yr); DRAM improves 9%/yr (2X/10 yrs); the Processor-Memory Performance Gap grows about 50% per year.]
Levels of the Memory Hierarchy
(Upper level: smaller, faster; lower level: bigger, slower)

Level           Capacity          Access Time            Cost            Staging Xfer Unit            Managed by
CPU Registers   100s Bytes        <1 ns                  -               Instr. Operands, 1-8 bytes   prog./compiler
Cache           10s-100s KBytes   1-10 ns                $10/MByte       Blocks, 8-128 bytes          cache controller
Main Memory     MBytes            100-300 ns             $1/MByte        Pages, 512-4K bytes          OS
Disk            10s GBytes        10 ms (10,000,000 ns)  $0.0031/MByte   Files, MBytes                user/operator
Tape            infinite          sec-min                $0.0014/MByte   -                            -
The Principle of Locality
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
  – Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)
• Last 15 years, HW (hardware) relied on locality for speed

Relationship of Caching and Pipelining
[Figure: the five-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB latches, register file, ALU, muxes, sign extend), with the I-Cache feeding instruction fetch and the D-Cache serving the memory stage, backed by L1-Cache, L2-Cache, and Memory.]

What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers: "a cache" on variables – software managed
  – First-level cache: a cache on second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on page table
  – Branch-prediction: a cache on prediction information?
A Modern Memory Hierarchy
• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.
[Figure: Processor (control, datapath, registers) -> on-chip cache -> second-level cache (SRAM) -> main memory (DRAM) -> secondary storage (disk) -> tertiary storage (tape). Speed: 1 ns, 10s ns, 100s ns, 10s ms, 10s sec. Size: 100s bytes, Ks, Ms, Gs, Ts.]

The Memory Abstraction
• Association of <address (name), data value> pairs
  – typically named as byte addresses
  – often values aligned on multiples of size
• Sequence of Reads and Writes
• Write binds a value to an address
• Read of addr returns the most recently written value bound to that address
• Requires servicing faults on the processor
[Interface: processor issues command (R/W), address (name), and data (W); memory returns data (R) and done.]
Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level
  – Hit Time: time to access the upper level, which consists of access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate)
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the 21264!)
[Figure: a block (Blk X) in the upper-level memory is exchanged with a block (Blk Y) in the lower-level memory, to/from the processor.]
Cache Measures
• Hit rate: fraction of accesses found in that level
  – So high that we usually talk about the Miss rate instead
  – Miss rate fallacy: miss rate is as incomplete a proxy for average memory access time as MIPS is for CPU performance
• Average Memory-Access Time (AMAT) = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including time to deliver it to the CPU
  – access time: time to lower level = f(latency to lower level)
  – transfer time: time to transfer block = f(BW between upper & lower levels)
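A minimal sketch of the AMAT formula in C (the 1 ns hit time, 2% miss rate, and 100 ns miss penalty are illustrative numbers, not figures from the slides):

#include <stdio.h>

/* AMAT = hit time + miss rate x miss penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* hypothetical L1 parameters: 1 ns hit, 2% miss rate, 100 ns penalty */
    printf("AMAT = %.2f ns\n", amat(1.0, 0.02, 100.0));   /* prints 3.00 ns */
    return 0;
}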
The Cache Design Space
• Several interacting dimensions:
  – cache size
  – block size
  – associativity
  – replacement policy
  – write-through vs. write-back
• The optimal choice is a compromise
  – depends on access characteristics
    » workload
    » use (I-cache, D-cache, TLB)
  – depends on technology / cost
• Simplicity often wins
[Figure: design-space sketches of cache size, associativity, and block size, plus a "Good"/"Bad" trade-off curve of Factor A vs. Factor B (Less/More).]

Traditional Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
  – Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
  – Tag/Block
• Q3: Which block should be replaced on a miss? (Block replacement)
  – Random, LRU, FIFO
• Q4: What happens on a write? (Write strategy)
  – Write Back or Write Through (with Write Buffer)
Q1: Where can a block be placed in the upper level?
[Figure: an example cache with 8 block frames and a memory with 32 blocks, showing block placement under fully associative, set associative, and direct mapped organizations.]

Q2: How is a block found if it is in the upper level?
• Tag on each block
  – No need to check the index or block offset
• Increasing associativity shrinks the index and expands the tag

Block address = | Tag | Index | Block offset |
The three portions of an address in a set-associative or direct-mapped cache.

Q3: Which block should be replaced on a miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
  – Random
  – LRU (Least Recently Used)

Data cache miss rates, LRU vs. Random replacement:

                2-way               4-way               8-way
  Size          LRU     Random      LRU     Random      LRU     Random
  16 KB         5.2%    5.7%        4.7%    5.3%        4.4%    5.0%
  64 KB         1.9%    2.0%        1.5%    1.7%        1.4%    1.5%
  256 KB        1.15%   1.17%       1.13%   1.13%       1.12%   1.12%

Q4: What happens on a write?
• Write-through: all writes update the cache and the underlying memory/cache
  – Can always discard cached data - the most up-to-date data is in memory
  – Cache control bit: only a valid bit
• Write-back: all writes simply update the cache
  – Can't just discard cached data - may have to write it back to memory
  – Cache control bits: both valid and dirty bits
• Other Advantages:
  – Write-through:
    » memory (or other processors) always have the latest data
    » simpler management of the cache
  – Write-back:
    » much lower bandwidth, since data is often overwritten multiple times
    » better tolerance to long-latency memory?
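To make the valid/dirty-bit bookkeeping concrete, here is a minimal sketch of one cache line under the two write policies (the struct layout and function names are illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;       /* meaningful only for write-back */
    uint32_t tag;
    uint8_t  data[64];    /* one block; 64 bytes is an illustrative size */
} cache_line_t;

/* Write-through: update the line and immediately propagate the write,
 * so a later eviction never needs to write the block back. */
void write_through(cache_line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    /* memory_write(addr, value);  -- hypothetical call to the next level */
}

/* Write-back: update only the line and mark it dirty; the block is
 * written to memory only when it is evicted. */
void write_back(cache_line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;
}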
Write Buffer for Write Through
[Figure: Processor writes go to the Cache and to a Write Buffer; the Write Buffer drains to DRAM.]
• A Write Buffer is needed between the Cache and Memory
  – Processor: writes data into the cache and the write buffer
  – Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO:
  – Typical number of entries: 4
  – Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
• Memory system designer's nightmare:
  – Store frequency (w.r.t. time) -> 1 / DRAM write cycle
  – Write buffer saturation

Write Policy: What happens on a write miss?
• Write allocate: allocate a new cache line in the cache
  – Usually means that you have to do a "read miss" to fill in the rest of the cache line!
  – Alternative: per-word valid bits
• Write non-allocate (or "write-around"):
  – Simply send the write data through to the underlying memory/cache - don't allocate a new cache line!
Simplest Cache: Direct Mapped
[Figure: a memory of 16 locations (0-F) feeding a 4-byte direct mapped cache with cache indices 0-3.]
• Location 0 can be occupied by data from:
  – Memory location 0, 4, 8, ... etc.
  – In general: any memory location whose 2 LSBs of the address are 0s
  – Address<1:0> => cache index
• Which one should we place in the cache?
• How can we tell which one is in the cache?

Example: 1 KB Direct Mapped Cache with 32 B Blocks
• For a 2**N byte cache:
  – The uppermost (32 - N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2**M)
[Figure: the 32-bit address is split into Cache Tag (ex: 0x50), Cache Index (ex: 0x01), and Byte Select (ex: 0x00); the tag is stored as part of the cache "state" alongside a Valid Bit, and the Cache Data holds bytes 0-1023 as 32 blocks of 32 bytes (Byte 0-31, Byte 32-63, ..., Byte 992-1023).]
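A small sketch of the address breakdown for this 1 KB direct mapped cache with 32-byte blocks (the constants follow the slide's example; the address value is chosen so the fields come out to the tag 0x50, index 0x01, byte 0x00 shown above):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5   /* 32-byte blocks -> 5 byte-select bits     */
#define INDEX_BITS 5   /* 1 KB / 32 B = 32 blocks -> 5 index bits  */

int main(void) {
    uint32_t addr = 0x14020;
    uint32_t byte_sel = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index    = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag      = addr >> (BLOCK_BITS + INDEX_BITS);
    printf("tag=0x%x index=0x%x byte=0x%x\n", tag, index, byte_sel);  /* 0x50 0x1 0x0 */
    return 0;
}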
Set Associative Cache
• N-way set associative: N entries for each Cache Index
  – N direct mapped caches operate in parallel
• Example: Two-way set associative cache
  – Cache Index selects a "set" from the cache
  – The two tags in the set are compared to the input in parallel
  – Data is selected based on the tag result
[Figure: two-way set associative lookup: the Cache Index selects one set in each way (Valid, Cache Tag, Cache Data); the Adr Tag is compared against both stored tags; the compare outputs (Sel0, Sel1) are OR'd into Hit and drive a mux that selects the Cache Block.]

Disadvantage of Set Associative Cache
• N-way Set Associative Cache versus Direct Mapped Cache:
  – N comparators vs. 1
  – Extra MUX delay for the data
  – Data comes AFTER the Hit/Miss decision and set selection
• In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  – Possible to assume a hit and continue. Recover later on a miss.
[Figure: the organization of the data cache in the Alpha AXP 21064 microprocessor. The CPU address (data in/out) is split into a <21>-bit Tag, an <8>-bit Index (256 blocks), and a <5>-bit block offset; each block holds a <1>-bit Valid flag, a <21>-bit Tag, and <256> bits of Data; a 4:1 mux selects the word on a read, and a write buffer connects to the lower-level memory.]

Cache Performance
• Miss-oriented approach to memory access:
  CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime
  CPUtime = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime
  – CPI_Execution includes ALU and Memory instructions
• Separating out the memory component entirely:
  – AMAT = Average Memory Access Time
  – CPI_AluOps does not include memory instructions
  CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime
  AMAT = HitTime + MissRate x MissPenalty
       = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
         + (HitTime_Data + MissRate_Data x MissPenalty_Data)
Impact on Performance
• Suppose a processor executes at
  – Clock Rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get a 50 cycle miss penalty
• Suppose that 1% of instructions get the same miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
        + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
        + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
      = (1.1 + 1.5 + 0.5) cycle/ins = 3.1
• About 65% of the time the processor is stalled waiting for memory!
• AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54
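A quick check of the arithmetic above in C (the figures are the slide's own assumptions):

#include <stdio.h>

int main(void) {
    double ideal_cpi   = 1.1;
    double data_stalls = 0.30 * 0.10 * 50;   /* ld/st freq x miss rate x penalty = 1.5 */
    double inst_stalls = 1.00 * 0.01 * 50;   /* every instruction is fetched     = 0.5 */
    double cpi = ideal_cpi + data_stalls + inst_stalls;

    /* 1.3 memory accesses per instruction: 1 fetch + 0.3 data */
    double amat = (1.0/1.3) * (1 + 0.01*50) + (0.3/1.3) * (1 + 0.10*50);

    printf("CPI = %.1f\n", cpi);                                        /* 3.1      */
    printf("stalled fraction = %.0f%%\n", 100*(cpi - ideal_cpi)/cpi);   /* about 65 */
    printf("AMAT = %.2f cycles\n", amat);                               /* 2.54     */
    return 0;
}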
Unified vs. Split Caches
• Unified vs. Separate I&D
[Figure: Proc -> Unified Cache-1 -> Unified Cache-2, versus Proc -> I-Cache-1 and D-Cache-1 -> Unified Cache-2.]
• Example:
  – 16KB I&D: Inst miss rate = 0.64%, Data miss rate = 6.47%
  – 32KB unified: Aggregate miss rate = 1.99%
• Which is better (ignore the L2 cache)?
  – Assume 33% data ops => 75% of accesses are from instructions (1.0/1.33)
  – hit time = 1, miss time = 50
  – Note that a data hit has 1 extra stall for the unified cache (only one port)
• AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
• AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
How to Improve Cache Performance?
CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
AMAT = HitTime + MissRate x MissPenalty
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Miss Rate Reduction
• 3 Cs: Compulsory, Capacity, Conflict
0. Larger cache
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
• Danger of concentrating on just one parameter!
• Prefetching comes in two flavors:
  – Binding prefetch: requests load directly into a register.
    » Must be the correct address and register!
  – Non-binding prefetch: load into the cache.
    » Can be incorrect. Frees HW/SW to guess!
Where do misses come from?
• Classifying Misses: 3 Cs
  – Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
  – Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a Fully Associative, Size X Cache)
  – Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way Associative, Size X Cache)
• 4th "C": Coherence - misses caused by cache coherence.

3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity, with the miss rate decomposed into Conflict, Capacity, and Compulsory components.]
0. Cache Size
[Figure: 3Cs miss-rate breakdown (0 to 0.14) vs. cache size (1 KB to 128 KB), 1-way to 8-way.]
• Old rule of thumb: 2x size => 25% cut in miss rate
• What does it reduce? Thrashing reduction!!!

Cache Organization?
• Assume the total cache size is not changed. What happens if we:
  1) Change Block Size:
  2) Change Associativity:
  3) Change Compiler:
• Which of the 3Cs is obviously affected?
1. Larger Block Size (fixed size & assoc)
[Figure: miss rate (0-25%) vs. block size (16 to 256 bytes) for 1K, 4K, 16K, 64K, and 256K caches.]
• Reduced compulsory misses
• Increased conflict misses
• What else drives up block size?

2. Higher Associativity
[Figure: 3Cs miss-rate breakdown (0 to 0.14) vs. cache size (1 KB to 128 KB), 1-way to 8-way: higher associativity removes conflict misses.]
3Cs Relative Miss Rate
[Figure: the same SPEC92 data plotted as relative miss rate (0-100%) vs. cache size (1 KB to 128 KB), 1-way to 8-way, split into Conflict, Capacity, and Compulsory. Flaws: fixed block size. Good: insight => invention.]

Associativity vs. Cycle Time
• Beware: execution time is the only final measure!
• Why is cycle time tied to hit time?
• Will clock cycle time increase?
  – Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
  – suggested big and dumb caches
• Effective cycle time vs. associativity (Przybylski, ISCA)
Example: Avg. Memory Access Time vs. Miss Rate
• Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped

  Cache Size (KB)   1-way   2-way   4-way   8-way
  1                 2.33    2.15    2.07    2.01
  2                 1.98    1.86    1.76    1.68
  4                 1.72    1.67    1.61    1.53
  8                 1.46    1.48    1.47    1.43
  16                1.29    1.32    1.32    1.32
  32                1.20    1.24    1.25    1.27
  64                1.14    1.20    1.21    1.23
  128               1.10    1.17    1.18    1.20

(Red in the original marks cases where A.M.A.T. is not improved by more associativity, i.e., cache sizes of 8 KB and above.)
3. Victim Cache
• Fast Hit Time + Low Conflict => Victim Cache
• How to combine the fast hit time of direct mapped yet still avoid conflict misses?
• Add a small buffer to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
• Used in Alpha, HP machines
[Figure: a direct mapped cache (TAGS/DATA) backed by a small fully associative victim cache of four tag+comparator entries, each holding one cache line of data, sitting between the cache and the next lower level in the hierarchy.]
4. Pseudo-Associativity
• How to combine the fast hit time of Direct Mapped with the lower conflict misses of a 2-way SA cache?
• Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit)
[Timing: Hit Time < Pseudo Hit Time < Miss Penalty]
• Drawback: hard on the CPU pipeline if a hit can take 1 or 2 cycles
  – Better for caches not tied directly to the processor (L2)
  – Used in the MIPS R10000 L2 cache, similar in UltraSPARC

5. Hardware Prefetching of Instructions & Data
• E.g., Instruction Prefetching
  – Alpha 21064 fetches 2 blocks on a miss
  – Extra block placed in a "stream buffer"
  – On a miss, check the stream buffer
• Works with data blocks too:
  – Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty
6. Software Prefetching Data
• Data Prefetch
  – Load data into a register (HP PA-RISC loads)
  – Cache Prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
  – Special prefetching instructions cannot cause faults; a form of speculative execution
• Prefetching comes in two flavors:
  – Binding prefetch: requests load directly into a register.
    » Must be the correct address and register!
  – Non-binding prefetch: load into the cache.
    » Can be incorrect. Faults?
• Issuing prefetch instructions takes time
  – Is the cost of prefetch issues < savings in reduced misses?
  – Wider superscalar reduces the difficulty of issue bandwidth
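As a sketch of a non-binding software prefetch, here is how it might look in C with the GCC/Clang __builtin_prefetch intrinsic (the loop and the prefetch distance of 16 elements are illustrative, not tuned values from the slides):

/* Prefetch a block that will be needed a few iterations ahead.
 * __builtin_prefetch arguments: address, read(0)/write(1), temporal locality hint 0-3. */
void scale(double *x, long n) {
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&x[i + 16], 1, 1);
        x[i] = 2.0 * x[i];
    }
}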
7. Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software
• Instructions
  – Reorder procedures in memory so as to reduce conflict misses
  – Profiling to look at conflicts (using tools they developed)
• Data
  – Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
  – Loop Interchange: change the nesting of loops to access data in the order stored in memory
  – Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
  – Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improves spatial locality

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

2 misses per access to a & c vs. one miss per access; improve spatial locality

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

• Two Inner Loops:
  – Read all NxN elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity Misses are a function of N & Cache Size:
  – 2N^3 + N^2 words accessed => (assuming no conflicts; otherwise ...)
• Idea: compute on a BxB submatrix that fits

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B-1,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B-1,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

• B is called the Blocking Factor
• Capacity Misses go from 2N^3 + N^2 to 2N^3/B + N^2
• Conflict Misses Too?

Reducing Conflict Misses by Blocking
[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a direct mapped cache vs. a fully associative cache: conflict misses in caches that are not fully associative depend on the blocking size.]
• Lam et al [1991]: a blocking factor of 24 had a fifth the misses vs. 48, despite both fitting in the cache

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking, applied by hand to vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]

Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Miss Penalty
CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
• Four techniques
  – Read priority over write on miss
  – Early Restart and Critical Word First on miss
  – Non-blocking Caches (Hit under Miss, Miss under Miss)
  – Second Level Cache
• Can be applied recursively to Multilevel Caches
  – The danger is that the time to DRAM will grow with multiple levels in between
  – First attempts at L2 caches can make things worse, since the increased worst case is worse
• An out-of-order CPU can hide an L1 data cache miss (3-5 clocks), but stalls on an L2 miss (40-100 clocks)?

1. Read Priority over Write on Miss
[Figure: CPU reads/writes go through the cache; a write buffer sits between the cache and DRAM (or lower memory).]
• Write-through with write buffers => RAW conflicts with main memory reads on cache misses
  – If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
  – Check the write buffer contents before a read; if there are no conflicts, let the memory access continue
• Write-back: want the buffer to hold displaced blocks
  – Read miss replacing a dirty block
  – Normal: write the dirty block to memory, and then do the read
  – Instead: copy the dirty block to a write buffer, then do the read, and then do the write
  – The CPU stalls less since it restarts as soon as the read is done
2. Early Restart and Critical Word First
• Don't wait for the full block to be loaded before restarting the CPU
  – Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  – Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
• Generally useful only with large blocks
• Spatial locality => we tend to want the next sequential word anyway, so it is not clear how much early restart benefits
3. Non-blocking Caches
• A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss
  – requires F/E bits on registers or out-of-order execution
  – requires multi-bank memories
• "Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
• "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  – Requires multiple memory banks (otherwise it cannot be supported)
  – Pentium Pro allows 4 outstanding memory misses

Value of Hit Under Miss for SPEC
[Figure: ratio of average memory stall time under "Hit under n Misses" (0->1, 1->2, 2->64, Base) for SPEC integer and floating-point benchmarks.]
• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss

4. Add a Second-level Cache
• L2 Equations
  AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
• Definitions:
  – Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
  – Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU
  – The Global Miss Rate is what matters
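A small numeric sketch of the L2 equations (the parameter values are made up for illustration, not taken from the slides):

#include <stdio.h>

int main(void) {
    double hit_l1 = 1, miss_rate_l1 = 0.04;      /* L1: 1 cycle hit, 4% misses     */
    double hit_l2 = 10, local_miss_l2 = 0.25;    /* L2: 10 cycle hit, 25% local    */
    double mem_penalty = 100;                    /* main memory: 100 cycles        */

    double miss_penalty_l1 = hit_l2 + local_miss_l2 * mem_penalty;  /* 35 cycles   */
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;          /* 2.4 cycles  */
    double global_miss_l2 = miss_rate_l1 * local_miss_l2;           /* 1%          */

    printf("Miss Penalty_L1 = %.1f, AMAT = %.1f, global L2 miss = %.1f%%\n",
           miss_penalty_l1, amat, 100*global_miss_l2);
    return 0;
}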
Comparing Local and Global Miss Rates
• 32 KByte 1st-level cache; increasing 2nd-level cache
• The global miss rate is close to the single-level cache rate provided L2 >> L1
• Don't use the local miss rate
• L2 is not tied to the CPU clock cycle!
• Cost & A.M.A.T.
• Generally: fast hit times and fewer misses
• Since hits are few, target miss reduction
[Figure: local and global miss rates vs. L2 cache size, plotted on linear and log scales.]
Reducing Misses: Which apply to the L2 Cache?
• Reducing Miss Rate
  1. Reduce Misses via Larger Block Size
  2. Reduce Conflict Misses via Higher Associativity
  3. Reducing Conflict Misses via Victim Cache
  4. Reducing Conflict Misses via Pseudo-Associativity
  5. Reducing Misses by HW Prefetching Instr, Data
  6. Reducing Misses by SW Prefetching Data
  7. Reducing Capacity/Conflict Misses by Compiler Optimizations

L2 Cache Block Size & A.M.A.T.
• 32KB L1, 8 byte path to memory

  Block Size (bytes)   16     32     64     128    256    512
  Relative CPU Time    1.95   1.54   1.36   1.28   1.27   1.34

Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
1. Small and Simple Caches
• Why does the Alpha 21164 have 8KB Instruction and 8KB Data caches + a 96KB second-level cache?
  – A small data cache keeps the clock rate high
• Direct Mapped, on chip
2. Avoiding Address Translation
• Send the virtual address to the cache? Called a Virtually Addressed Cache or just Virtual Cache, vs. a Physical Cache
  – Every time the process is switched, logically we must flush the cache; otherwise we get false hits
    » Cost is the time to flush + "compulsory" misses from an empty cache
  – Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
  – I/O must interact with the cache, so it needs virtual addresses
• Solution to cache flush
  – Add a process identifier tag that identifies the process as well as the address within the process: can't get a hit if it is the wrong process
• Solution to aliases
  – HW guarantees every cache block has a unique physical address
  – SW guarantee: the lower n bits must have the same address; as long as they cover the index field and the cache is direct mapped, they must be unique; called page coloring

Virtually Addressed Caches
[Figure: three organizations of CPU, TB (translation buffer), and cache:
 (1) Conventional Organization: CPU -> VA -> TB -> PA -> $ -> MEM;
 (2) Virtually Addressed Cache: CPU -> VA -> $ (VA tags); translate only on a miss to the L2 $/MEM (synonym problem);
 (3) Overlap $ access with VA translation: TB and $ accessed in parallel (PA tags); requires the $ index to remain invariant across translation.]

3. Pipelined Writes
• Pipeline the Tag Check and the Cache Update as separate stages; the current write does its tag check while the previous write updates the cache
• Only STORES are in the pipeline; it is empty during a miss
    Store r2, (r1)    Check r1
    Add
    Sub
    Store r4, (r3)    M[r1]<-r2 & check r3
• "Delayed Write Buffer": must be checked on reads; either complete the write or read from the buffer
Case Study: MIPS R4000 • 8 Stage Pipeline: – IF–first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access. – IS–second half of access to instruction cache. – RF–instruction decode and register fetch, hazard checking and also instruction cache hit detection. – EX–execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation. – DF–data fetch, first half of access to data cache. – DS–second half of access to data cache. – TC–tag check, determine whether the data cache access hit. – WB–write back for loads and register-register operations.
• What is impact on Load delay? – Need 2 instructions between a load and its use!
Case Study: MIPS R4000
• TWO Cycle Load Latency
[Figure: pipelined instruction flow through IF IS RF EX DF DS TC WB, showing that load data is not available until after DS, two cycles after a dependent instruction would want it.]
• THREE Cycle Branch Latency (conditions evaluated during the EX phase)
  – Delay slot plus two stalls
  – Branch likely cancels the delay slot if not taken

R4000 Performance
• Not the ideal CPI of 1:
  – Load stalls (1 or 2 clock cycles)
  – Branch stalls (2 cycles + unfilled slots)
  – FP result stalls: RAW data hazard (latency)
  – FP structural stalls: not enough FP hardware (parallelism)
[Figure: CPI (0 to 4.5) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, and tomcatv, broken down into Base, Load stalls, Branch stalls, FP result stalls, and FP structural stalls.]
What is the Impact of What You've Learned About Caches?
• 1960-1985: Speed = f(no. operations)
• 1990s:
  – Pipelined Execution & Fast Clock Rate
  – Out-of-Order execution
  – Superscalar Instruction Issue
• 1998: Speed = f(non-cached memory accesses)
[Figure: the Processor-Memory Performance Gap, 1980-2000, growing 50% per year.]
• What does this mean for
  – Compilers? Operating Systems? Algorithms? Data Structures?

Cache Optimization Summary

Technique                           Miss Rate  Miss Penalty  Hit Time  Complexity
Larger Block Size                   +          –                       0
Higher Associativity                +                        –         1
Victim Caches                       +                                  2
Pseudo-Associative Caches           +                                  2
HW Prefetching of Instr/Data        +                                  2
Compiler Controlled Prefetching     +                                  3
Compiler Reduce Misses              +                                  0
Priority to Read Misses                        +                       1
Early Restart & Critical Word 1st              +                       2
Non-Blocking Caches                            +                       3
Second Level Caches                            +                       2
Better memory system                           +                       3
Small & Simple Caches               –                        +         0
Avoiding Address Translation                                 +         2
Pipelining Caches                                            +         2
Cache Cross Cutting Issues
• Superscalar CPU & Number of Cache Ports must match: how many memory accesses per cycle?
• Speculative Execution and the non-faulting option on memory/TLB
• Parallel Execution vs. Cache locality
  – Want far separation to find independent operations vs. want reuse of data accesses to avoid misses
• I/O and consistency of data between cache and memory
  – Caches => multiple copies of data
  – Consistency by HW or by SW?
  – Where to connect I/O to the computer?

Alpha Memory Performance: Miss Rates of SPEC92
• 8K I$, 8K D$, 2M L2
[Figure: miss rates (0.01% to 100%, log scale) of the I-cache, D-cache, and L2 for AlphaSort, Espresso, Sc, Mdljsp2, Ear, Alvinn, Mdljp2, and Nasa7, with callouts such as:
  – I$ miss = 6%, D$ miss = 32%, L2 miss = 10%
  – I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%
  – I$ miss = 1%, D$ miss = 21%, L2 miss = 0.3%]

Alpha CPI Components
• Instruction stall: branch mispredict (green); Data cache (blue); Instruction cache (yellow); L2$ (pink)
• Other: compute + register conflicts, structural conflicts
[Figure: Alpha CPI components (0 to 5.0) for AlphaSort, Espresso, Sc, Mdljsp2, Ear, Alvinn, and Mdljp2, broken into L2, I$, D$, I Stall, and Other.]

Predicting Cache Performance from Different Programs (ISA, compiler, ...)
• 4KB Data cache miss rate: 8%, 12%, or 28%?
• 1KB Instr cache miss rate: 0%, 3%, or 10%?
• Alpha vs. MIPS for an 8KB Data $: 17% vs. 10%
• Why 2X Alpha v. MIPS?
[Figure: miss rate (0% to 35%) vs. cache size (1 to 128 KB) for the D$ and I$ of gcc, espresso, and tomcatv.]
Main Memory Background
• Performance of Main Memory:
  – Latency: Cache Miss Penalty
    » Access Time: time between request and word arriving
    » Cycle Time: time between requests
  – Bandwidth: I/O & Large Block Miss Penalty (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
  – Dynamic since it needs to be refreshed periodically (8 ms, 1% of time)
  – Addresses divided into 2 halves (Memory as a 2D matrix):
    » RAS or Row Access Strobe
    » CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor/bit; area is 10X)
  – Address not divided: full address
• Size: DRAM/SRAM 4-8; Cost/Cycle time: SRAM/DRAM 8-16

Main Memory Deep Background
• "Out-of-Core", "In-Core," "Core Dump"? "Core memory"?
• Non-volatile, magnetic
• Lost to 4 Kbit DRAM (today using 64 Mbit DRAM)
• Access time 750 ns, cycle time 1500-3000 ns
DRAM Logical Organization (4 Mbit)
[Figure: an 11-bit address A0..A10 is used twice: the row address selects a word line in a 2,048 x 2,048 memory array of storage cells; the column address drives the column decoder over the sense amps & I/O to read or write the data bit (D in, Q out).]
• Square root of bits per RAS/CAS

DRAM Physical Organization (4 Mbit)
[Figure: the array is physically split into blocks (Block 0 ... Block 3), each with its own Block Row Decoder (9:512) and I/O, together delivering 8 I/Os.]
4 Key DRAM Timing Parameters
• tRAC: minimum time from the RAS line falling to valid data output.
  – Quoted as the speed of a DRAM when you buy it
  – A typical 4Mb DRAM has tRAC = 60 ns
• tRC: minimum time from the start of one row access to the start of the next.
  – tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60 ns
• tCAC: minimum time from the CAS line falling to valid data output.
  – 15 ns for a 4Mbit DRAM with a tRAC of 60 ns
• tPC: minimum time from the start of one column access to the start of the next.
  – 35 ns for a 4Mbit DRAM with a tRAC of 60 ns

DRAM Performance
• A 60 ns (tRAC) DRAM can
  – perform a row access only every 110 ns (tRC)
  – perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC).
    » In practice, external address delays and turning around buses make it 40 to 50 ns
• These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead!
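A back-of-the-envelope sketch of what these timings imply for peak access rates, using the slide's 110 ns row cycle and 35 ns column cycle (the x8 output width is an assumption for illustration):

#include <stdio.h>

int main(void) {
    double t_rc = 110e-9;    /* row cycle time (tRC)         */
    double t_pc = 35e-9;     /* column/page cycle time (tPC) */
    double width_bytes = 1;  /* assume an x8 DRAM part       */

    printf("random accesses : %.1f M/s (%.1f MB/s)\n", 1e-6/t_rc, width_bytes*1e-6/t_rc);
    printf("page-mode access: %.1f M/s (%.1f MB/s)\n", 1e-6/t_pc, width_bytes*1e-6/t_pc);
    return 0;
}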
DRAM History
• DRAMs: capacity +60%/yr, cost –30%/yr
  – 2.5X cells/area, 1.5X die size in 3 years
• A '98 DRAM fab line costs $2B
  – DRAM only: density, leakage v. speed
• Rely on an increasing number of computers & memory per computer (60% market)
  – SIMM or DIMM is the replaceable unit => computers use any generation DRAM
• Commodity, second-source industry => high volume, low profit, conservative
  – Little organization innovation in 20 years
• Order of importance: 1) Cost/bit 2) Capacity
  – First RAMBUS: 10X BW, +30% cost => little impact

DRAM Future: 1 Gbit DRAM

                 Mitsubishi        Samsung
  Blocks         512 x 2 Mbit      1024 x 1 Mbit
  Clock          200 MHz           250 MHz
  Data Pins      64                16
  Die Size       24 x 24 mm        31 x 21 mm
  Metal Layers   3                 4
  Technology     0.15 micron       0.16 micron

• Wish we could do this for Microprocessors!
Main Memory Performance
• Simple:
  – CPU, Cache, Bus, Memory the same width (32 or 64 bits)
• Wide:
  – CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
• Interleaved:
  – CPU, Cache, Bus 1 word; Memory N modules (4 modules); the example is word interleaved
• Timing model (word size is 32 bits)
  – 1 cycle to send the address
  – 6 cycles access time, 1 cycle to send data
  – Cache Block is 4 words
• Simple M.P.      = 4 x (1 + 6 + 1) = 32
• Wide M.P.        = 1 + 6 + 1      = 8
• Interleaved M.P. = 1 + 6 + 4 x 1  = 11
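The same miss-penalty arithmetic as a sketch in C (the timing parameters are the slide's: 1 cycle address, 6 cycles access, 1 cycle per word, 4-word block):

#include <stdio.h>

int main(void) {
    int addr = 1, access = 6, xfer = 1, words = 4;

    int simple      = words * (addr + access + xfer);   /* one word at a time        */
    int wide        = addr + access + xfer;             /* whole block at once       */
    int interleaved = addr + access + words * xfer;     /* banks overlap the access,
                                                            data returns word by word */

    printf("simple=%d wide=%d interleaved=%d\n", simple, wide, interleaved);  /* 32 8 11 */
    return 0;
}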
Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses
  – Multiprocessor
  – I/O
  – CPU with Hit under n Misses, Non-blocking Cache
• Superbank: all memory active on one block transfer (or Bank)
• Bank: portion within a superbank that is word interleaved (or Subbank)
• How many banks? number of banks >= number of clocks to access a word in a bank
  – For sequential accesses; otherwise we return to the original bank before it has the next word ready
  – (like in the vector case)
• Increasing DRAM capacity => fewer chips => harder to have many banks

Avoiding Bank Conflicts
• Lots of banks
  int x[256][512];
  for (j = 0; j < 512; j = j+1)
      for (i = 0; i < 256; i = i+1)
          x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, the inner-loop word accesses conflict on the same bank
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: Prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide on every memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number? easy if 2^N words per bank
Fast Bank Number
• Chinese Remainder Theorem: as long as two sets of integers a_i and b_i follow these rules
    b_i = x mod a_i,  0 <= b_i < a_i,  0 <= x < a_0 x a_1 x a_2 x ...
  and the a_i are pairwise co-prime (a_i, a_j co-prime if i != j), then the integer x has only one solution (an unambiguous mapping):
  – bank number = b_0, number of banks = a_0 (= 3 in the example)
  – address within bank = b_1, number of words in bank = a_1 (= 8 in the example)
  – N-word address 0 to N-1, prime number of banks, words per bank a power of 2

Example, 3 banks of 8 words each:

                       Seq. Interleaved      Modulo Interleaved
  Bank Number:          0    1    2            0    1    2
  Address within Bank:
  0                     0    1    2            0   16    8
  1                     3    4    5            9    1   17
  2                     6    7    8           18   10    2
  3                     9   10   11            3   19   11
  4                    12   13   14           12    4   20
  5                    15   16   17           21   13    5
  6                    18   19   20            6   22   14
  7                    21   22   23           15    7   23
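A sketch of the modulo-interleaved mapping in C; printing all 24 addresses reproduces the "Modulo Interleaved" columns of the table above (3 banks x 8 words, matching the slide's example):

#include <stdio.h>

#define NBANKS 3          /* prime number of banks        */
#define WORDS_PER_BANK 8  /* power-of-two words per bank  */

int main(void) {
    for (int addr = 0; addr < NBANKS * WORDS_PER_BANK; addr++) {
        int bank   = addr % NBANKS;          /* bank number                          */
        int offset = addr % WORDS_PER_BANK;  /* just the low bits, no divide needed  */
        printf("addr %2d -> bank %d, word %d\n", addr, bank, offset);
    }
    return 0;
}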
Fast Memory Systems: DRAM specific
• Multiple CAS accesses: several names (page mode)
  – Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address the gap; what will they cost, will they survive?
  – RAMBUS: startup company; reinvented the DRAM interface
    » Each chip a module vs. a slice of memory
    » Short bus between CPU and chips
    » Does its own refresh
    » Variable amount of data returned
    » 1 byte / 2 ns (500 MB/s per chip)
  – Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66 - 150 MHz)
  – Intel claims RAMBUS Direct (16 b wide) is the future PC memory
• Niche memory or main memory? e.g., Video RAM for frame buffers: DRAM + fast serial output

DRAM Latency >> BW
[Figure: Proc with I$, D$, and L2$ on a bus to multiple DRAMs.]
• More application bandwidth => more cache misses => more DRAM RAS/CAS
• Application BW => lower DRAM latency
• RAMBUS and Synchronous DRAM increase BW but have higher latency
• EDO DRAM < 5% in PCs
Main Memory Summary
• Wider Memory
• Interleaved Memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAM
• DRAM future less rosy?

DRAM Crossroads?
• After 20 years of 4X every 3 years, are we running into a wall? (64Mb - 1 Gb)
• How can we keep $1B fab lines full if we buy fewer DRAMs per computer?
• Cost/bit –30%/yr if we stop 4X/3 yr?
• What will happen to the $40B/yr DRAM industry?
DRAMs per PC over Time

  Minimum          DRAM Generation
  Memory Size      '86 1Mb   '89 4Mb   '92 16Mb   '96 64Mb   '99 256Mb   '02 1Gb
  4 MB             32        8
  8 MB                       16        4
  16 MB                                8          2
  32 MB                                           4          1
  64 MB                                           8          2
  128 MB                                                     4           1
  256 MB                                                     8           2

Virtual Memory
• A virtual memory is a memory hierarchy, usually consisting of at least main memory and disk, in which the processor issues all memory references as effective addresses in a flat address space. • All translations to primary and secondary addresses are handled transparently, thus providing the illusion of a flat address space. • Recall that disk accesses may require 100,000 clock cycles to complete, due to the slow access time of the disk subsystem.
Basic Issues in VM System Design
• The size of the information blocks that are transferred from secondary to main storage (M)
• If a block of information is brought into M and M is full, then some region of M must be released to make room for the new block --> replacement policy
• Which region of M is to hold the new block --> placement policy
• A missing item is fetched from secondary memory only on the occurrence of a fault --> demand load policy
[Hierarchy: disk <-> mem <-> cache <-> reg, with pages as the unit moved between disk and memory and frames as the regions of memory that hold them.]

Addressing and Accessing a Two-Level Hierarchy
• The computer system, HW or SW, must perform any address translation that is required.
[Figure: the memory management unit (MMU) applies a translation function (mapping tables, permissions, etc.) to a system address; on a hit it yields the block/word address in primary memory, on a miss it yields the address in secondary memory.]
Paging Organization
• Virtual and physical address spaces are partitioned into blocks of equal size:
  – page frames (in physical memory)
  – pages (in virtual memory)
• Two ways of forming the address: Segmentation and Paging. Paging is more common. Sometimes the two are used together, one "on top of" the other. More about address translation and paging next ...

Paging vs. Segmentation
• In both, a system address is split into a block part and a word part; the block part indexes a lookup table that yields a base address in physical memory, to which the word offset is added (for paging, the low-order bits are simply concatenated).
[Figure: System address = | Block | Word | -> lookup table -> base address + Word -> primary address.]

Paging Organization (example)
• 1K-byte pages: the virtual address space has pages 0..31 (V.A. 0, 1024, ..., 31744) mapped through the Addr Trans MAP to physical page frames 0..7 (P.A. 0, 1024, ..., 7168)
• The page is also the unit of transfer from virtual to physical memory

Address Mapping
• Virtual address = | page no. | disp (10 bits for 1K pages) |
• The page number indexes into the page table (starting at the Page Table Base Reg); each entry holds a valid bit (V), Access Rights, and the physical page address (PA)
• The page table is located in physical memory
• PA combined with disp => physical memory address (actually, concatenation is more likely than addition)
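A minimal sketch of the paged address mapping described above, with 1K pages and a flat page table indexed by the virtual page number (the table contents are made-up illustrative values):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 10                          /* 1K-byte pages */

typedef struct { bool valid; uint32_t frame; } pte_t;

static pte_t page_table[32] = {               /* 32 virtual pages, as in the example */
    [0] = { true, 7 },                        /* virtual page 0 -> physical frame 7  */
    [1] = { true, 2 },                        /* virtual page 1 -> physical frame 2  */
};

int main(void) {
    uint32_t va   = (1u << PAGE_BITS) + 123;                     /* page 1, offset 123 */
    uint32_t vpn  = va >> PAGE_BITS;
    uint32_t disp = va & ((1u << PAGE_BITS) - 1);

    if (!page_table[vpn].valid) { puts("page fault"); return 1; }
    uint32_t pa = (page_table[vpn].frame << PAGE_BITS) | disp;   /* concatenation */
    printf("VA 0x%x -> PA 0x%x\n", va, pa);
    return 0;
}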
Segmentation Organization
[Figure: segments 1, 3, 5, 6, and 9 placed in physical memory (0000 to FFF) with gaps between them; each segment's virtual addresses start at 0.]
• Notice that each segment's virtual address starts at 0, different from its physical address.
• Repeated movement of segments into and out of physical memory will result in gaps between segments. This is called external fragmentation.
• Compaction routines must occasionally be run to remove these fragments.

Translation Lookaside Buffer
• A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB
• A TLB entry holds: Virtual Address, Physical Address, Dirty, Ref, Valid, and Access bits
• Really just a cache on the page table mappings
• TLB access time is comparable to cache access time (much less than main memory access time)
Translation Lookaside Buffers
• Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
• TLBs are usually small, typically not more than 128-256 entries even on high-end machines. This permits a fully associative lookup on those machines.
  – Most mid-range machines use small n-way set associative organizations.
[Figure: Translation with a TLB: the CPU issues a VA to the TLB lookup; on a TLB hit the PA goes to the cache (and on to main memory on a cache miss); on a TLB miss the translation is done from the page table, then the cache is accessed; data returns to the CPU.]

Address Translation and Cache
• The page table is a large data structure in memory
• Two memory accesses for every load, store, or instruction fetch!!!
• Virtually addressed cache?
  – synonym problem
• Cache the address translations?
• If the index is the physical part of the address, we can start the tag access in parallel with translation, so that we can compare against the physical tag
Overlapped Cache & TLB Access
[Figure: the 32-bit virtual address is split into a 20-bit page # and a 12-bit disp; the TLB does an associative lookup on the page # while the cache (1K sets x 4 bytes) is indexed with the low-order bits; both produce a PA and a Hit/Miss, and data is delivered if the tags match.]

  IF cache hit AND (cache tag = PA) THEN deliver data to the CPU
  ELSE IF [cache miss OR (cache tag != PA)] AND TLB hit THEN
      access memory with the PA from the TLB
  ELSE do the standard VA translation

Memory Hierarchy (Summary)
• The memory hierarchy: from fast and expensive to slow and cheap:
  Registers -> Cache -> Main Memory -> Disk
• At first, consider just two adjacent levels in the hierarchy
• The cache: high speed and expensive
  – Direct mapped, associative, set associative
• Virtual memory makes the hierarchy transparent
  – Translate the address from the CPU's logical address to the physical address where the information is actually stored
  – The "TLB" helps in speeding up the address translation process
• Memory management: how to move information back and forth
TLB and Virtual Memory
• Caches, TLBs, and Virtual Memory are all understood by examining how they deal with 4 questions:
  1) Where can a block be placed?
  2) How is a block found?
  3) What block is replaced on a miss?
  4) How are writes handled?
• Page tables map virtual addresses to physical addresses
• TLBs make virtual memory practical
  – Locality in data => locality in addresses of data, temporal and spatial
• TLB misses are significant in processor performance
  – funny times, as most systems can't access all of the 2nd-level cache without TLB misses!
• Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy

Practical Memory Hierarchy
• The issue is NOT inventing new mechanisms
• The issue is taste in selecting between many alternatives in putting together a memory hierarchy that fits well together
  – e.g., L1 Data cache write through, L2 Write back
  – e.g., L1 small for fast hit time/clock cycle
  – e.g., L2 big enough to avoid going to DRAM?
Alpha 21064
• Separate Instr & Data TLBs & Caches
• TLBs fully associative
• TLB updates in SW ("Priv Arch Libr")
• Caches 8KB direct mapped, write through
• Critical 8 bytes first
• Prefetch instruction stream buffer
• 2 MB L2 cache, direct mapped, WB (off-chip)
• 256-bit path to main memory, 4 x 64-bit modules
• Victim Buffer: to give reads priority over writes
• 4-entry write buffer between D$ & L2$
[Figure: the 21064 instruction and data pipelines with their stream buffer, write buffer, and victim buffer.]