COEN6741 Computer Architecture and Design
Chapter 5: Memory Hierarchy Design (Dr. Sofiène Tahar)

Outline
• Introduction • Memory Hierarchy • Cache Memory • Cache Performance • Main Memory • Virtual Memory • Translation Lookaside Buffer • Alpha 21064 Example
Who Cares About the Memory Hierarchy?

Computer Architecture Topics
• Input/Output and Storage: Disks, WORM, Tape; RAID, Emerging Technologies, Interleaving, Bus protocols
• DRAM / Memory Hierarchy: Coherence, Bandwidth, Latency; Addressing, Protection, Exception Handling
• L2 Cache, L1 Cache (VLSI)
• Instruction Set Architecture: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, DSP; Pipelining and Instruction Level Parallelism

Processor-DRAM Memory Gap (latency)
[Figure: CPU vs. DRAM performance, 1980-2000. µProc improves 60%/yr (2X/1.5 yr); DRAM improves 9%/yr (2X/10 yrs); the Processor-Memory Performance Gap grows about 50% per year.]
Levels of the Memory Hierarchy
(Upper level: smaller, faster; lower level: bigger, slower)

Level           Capacity          Access Time            Cost            Staging Xfer Unit            Managed by
CPU Registers   100s Bytes        <1 ns                  -               Instr. Operands, 1-8 bytes   prog./compiler
Cache           10s-100s KBytes   1-10 ns                $10/MByte       Blocks, 8-128 bytes          cache controller
Main Memory     MBytes            100-300 ns             $1/MByte        Pages, 512-4K bytes          OS
Disk            10s GBytes        10 ms (10,000,000 ns)  $0.0031/MByte   Files, MBytes                user/operator
Tape            infinite          sec-min                $0.0014/MByte   -                            -
The Principle of Locality
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
  – Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)
• Last 15 years, HW (hardware) relied on locality for speed

Relationship of Caching and Pipelining
[Figure: the five-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB latches, register file, ALU, muxes, sign extend), with the I-Cache feeding instruction fetch and the D-Cache serving the memory stage, backed by L1-Cache, L2-Cache, and Memory.]

What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers: "a cache" on variables – software managed
  – First-level cache: a cache on second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on page table
  – Branch-prediction: a cache on prediction information?
A Modern Memory Hierarchy
• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.
[Figure: Processor (control, datapath, registers) -> on-chip cache -> second-level cache (SRAM) -> main memory (DRAM) -> secondary storage (disk) -> tertiary storage (tape). Speed: 1 ns, 10s ns, 100s ns, 10s ms, 10s sec. Size: 100s bytes, Ks, Ms, Gs, Ts.]

The Memory Abstraction
• Association of <address (name), data value> pairs
  – typically named as byte addresses
  – often values aligned on multiples of size
• Sequence of Reads and Writes
• Write binds a value to an address
• Read of addr returns the most recently written value bound to that address
• Requires servicing faults on the processor
[Interface: processor issues command (R/W), address (name), and data (W); memory returns data (R) and done.]
Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level
  – Hit Time: time to access the upper level, which consists of access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate)
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the 21264!)
[Figure: a block (Blk X) in the upper-level memory is exchanged with a block (Blk Y) in the lower-level memory, to/from the processor.]
Cache Measures
• Hit rate: fraction of accesses found in that level
  – So high that we usually talk about the Miss rate instead
  – Miss rate fallacy: miss rate is as incomplete a proxy for average memory access time as MIPS is for CPU performance
• Average Memory-Access Time (AMAT) = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including time to deliver it to the CPU
  – access time: time to lower level = f(latency to lower level)
  – transfer time: time to transfer block = f(BW between upper & lower levels)
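A minimal sketch of the AMAT formula in C (the 1 ns hit time, 2% miss rate, and 100 ns miss penalty are illustrative numbers, not figures from the slides):

#include <stdio.h>

/* AMAT = hit time + miss rate x miss penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* hypothetical L1 parameters: 1 ns hit, 2% miss rate, 100 ns penalty */
    printf("AMAT = %.2f ns\n", amat(1.0, 0.02, 100.0));   /* prints 3.00 ns */
    return 0;
}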
The Cache Design Space
• Several interacting dimensions:
  – cache size
  – block size
  – associativity
  – replacement policy
  – write-through vs. write-back
• The optimal choice is a compromise
  – depends on access characteristics
    » workload
    » use (I-cache, D-cache, TLB)
  – depends on technology / cost
• Simplicity often wins
[Figure: design-space sketches of cache size, associativity, and block size, plus a "Good"/"Bad" trade-off curve of Factor A vs. Factor B (Less/More).]

Traditional Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
  – Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
  – Tag/Block
• Q3: Which block should be replaced on a miss? (Block replacement)
  – Random, LRU, FIFO
• Q4: What happens on a write? (Write strategy)
  – Write Back or Write Through (with Write Buffer)
Q1: Where can a block be placed in the upper level?
[Figure: an example cache with 8 block frames and a memory with 32 blocks, showing block placement under fully associative, set associative, and direct mapped organizations.]

Q2: How is a block found if it is in the upper level?
• Tag on each block
  – No need to check the index or block offset
• Increasing associativity shrinks the index and expands the tag

Block address = | Tag | Index | Block offset |
The three portions of an address in a set-associative or direct-mapped cache.

Q3: Which block should be replaced on a miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
  – Random
  – LRU (Least Recently Used)

Data cache miss rates, LRU vs. Random replacement:

                2-way               4-way               8-way
  Size          LRU     Random      LRU     Random      LRU     Random
  16 KB         5.2%    5.7%        4.7%    5.3%        4.4%    5.0%
  64 KB         1.9%    2.0%        1.5%    1.7%        1.4%    1.5%
  256 KB        1.15%   1.17%       1.13%   1.13%       1.12%   1.12%

Q4: What happens on a write?
• Write-through: all writes update the cache and the underlying memory/cache
  – Can always discard cached data - the most up-to-date data is in memory
  – Cache control bit: only a valid bit
• Write-back: all writes simply update the cache
  – Can't just discard cached data - may have to write it back to memory
  – Cache control bits: both valid and dirty bits
• Other Advantages:
  – Write-through:
    » memory (or other processors) always have the latest data
    » simpler management of the cache
  – Write-back:
    » much lower bandwidth, since data is often overwritten multiple times
    » better tolerance to long-latency memory?
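To make the valid/dirty-bit bookkeeping concrete, here is a minimal sketch of one cache line under the two write policies (the struct layout and function names are illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;       /* meaningful only for write-back */
    uint32_t tag;
    uint8_t  data[64];    /* one block; 64 bytes is an illustrative size */
} cache_line_t;

/* Write-through: update the line and immediately propagate the write,
 * so a later eviction never needs to write the block back. */
void write_through(cache_line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    /* memory_write(addr, value);  -- hypothetical call to the next level */
}

/* Write-back: update only the line and mark it dirty; the block is
 * written to memory only when it is evicted. */
void write_back(cache_line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;
}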
Write Buffer for Write Through
[Figure: Processor writes go to the Cache and to a Write Buffer; the Write Buffer drains to DRAM.]
• A Write Buffer is needed between the Cache and Memory
  – Processor: writes data into the cache and the write buffer
  – Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO:
  – Typical number of entries: 4
  – Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
• Memory system designer's nightmare:
  – Store frequency (w.r.t. time) -> 1 / DRAM write cycle
  – Write buffer saturation

Write Policy: What happens on a write miss?
• Write allocate: allocate a new cache line in the cache
  – Usually means that you have to do a "read miss" to fill in the rest of the cache line!
  – Alternative: per-word valid bits
• Write non-allocate (or "write-around"):
  – Simply send the write data through to the underlying memory/cache - don't allocate a new cache line!
Simplest Cache: Direct Mapped
[Figure: a memory of 16 locations (0-F) feeding a 4-byte direct mapped cache with cache indices 0-3.]
• Location 0 can be occupied by data from:
  – Memory location 0, 4, 8, ... etc.
  – In general: any memory location whose 2 LSBs of the address are 0s
  – Address<1:0> => cache index
• Which one should we place in the cache?
• How can we tell which one is in the cache?

Example: 1 KB Direct Mapped Cache with 32 B Blocks
• For a 2**N byte cache:
  – The uppermost (32 - N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2**M)
[Figure: the 32-bit address is split into Cache Tag (ex: 0x50), Cache Index (ex: 0x01), and Byte Select (ex: 0x00); the tag is stored as part of the cache "state" alongside a Valid Bit, and the Cache Data holds bytes 0-1023 as 32 blocks of 32 bytes (Byte 0-31, Byte 32-63, ..., Byte 992-1023).]
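A small sketch of the address breakdown for this 1 KB direct mapped cache with 32-byte blocks (the constants follow the slide's example; the address value is chosen so the fields come out to the tag 0x50, index 0x01, byte 0x00 shown above):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5   /* 32-byte blocks -> 5 byte-select bits     */
#define INDEX_BITS 5   /* 1 KB / 32 B = 32 blocks -> 5 index bits  */

int main(void) {
    uint32_t addr = 0x14020;
    uint32_t byte_sel = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index    = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag      = addr >> (BLOCK_BITS + INDEX_BITS);
    printf("tag=0x%x index=0x%x byte=0x%x\n", tag, index, byte_sel);  /* 0x50 0x1 0x0 */
    return 0;
}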
Set Associative Cache
• N-way set associative: N entries for each Cache Index
  – N direct mapped caches operate in parallel
• Example: Two-way set associative cache
  – Cache Index selects a "set" from the cache
  – The two tags in the set are compared to the input in parallel
  – Data is selected based on the tag result
[Figure: two-way set associative lookup: the Cache Index selects one set in each way (Valid, Cache Tag, Cache Data); the Adr Tag is compared against both stored tags; the compare outputs (Sel0, Sel1) are OR'd into Hit and drive a mux that selects the Cache Block.]

Disadvantage of Set Associative Cache
• N-way Set Associative Cache versus Direct Mapped Cache:
  – N comparators vs. 1
  – Extra MUX delay for the data
  – Data comes AFTER the Hit/Miss decision and set selection
• In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  – Possible to assume a hit and continue. Recover later on a miss.
[Figure: the organization of the data cache in the Alpha AXP 21064 microprocessor. The CPU address (data in/out) is split into a <21>-bit Tag, an <8>-bit Index (256 blocks), and a <5>-bit block offset; each block holds a <1>-bit Valid flag, a <21>-bit Tag, and <256> bits of Data; a 4:1 mux selects the word on a read, and a write buffer connects to the lower-level memory.]

Cache Performance
• Miss-oriented approach to memory access:
  CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime
  CPUtime = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime
  – CPI_Execution includes ALU and Memory instructions
• Separating out the memory component entirely:
  – AMAT = Average Memory Access Time
  – CPI_AluOps does not include memory instructions
  CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime
  AMAT = HitTime + MissRate x MissPenalty
       = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
         + (HitTime_Data + MissRate_Data x MissPenalty_Data)
Impact on Performance
• Suppose a processor executes at
  – Clock Rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get a 50 cycle miss penalty
• Suppose that 1% of instructions get the same miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
        + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
        + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
      = (1.1 + 1.5 + 0.5) cycle/ins = 3.1
• About 65% of the time the processor is stalled waiting for memory!
• AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54
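A quick check of the arithmetic above in C (the figures are the slide's own assumptions):

#include <stdio.h>

int main(void) {
    double ideal_cpi   = 1.1;
    double data_stalls = 0.30 * 0.10 * 50;   /* ld/st freq x miss rate x penalty = 1.5 */
    double inst_stalls = 1.00 * 0.01 * 50;   /* every instruction is fetched     = 0.5 */
    double cpi = ideal_cpi + data_stalls + inst_stalls;

    /* 1.3 memory accesses per instruction: 1 fetch + 0.3 data */
    double amat = (1.0/1.3) * (1 + 0.01*50) + (0.3/1.3) * (1 + 0.10*50);

    printf("CPI = %.1f\n", cpi);                                        /* 3.1      */
    printf("stalled fraction = %.0f%%\n", 100*(cpi - ideal_cpi)/cpi);   /* about 65 */
    printf("AMAT = %.2f cycles\n", amat);                               /* 2.54     */
    return 0;
}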
Unified vs. Split Caches
• Unified vs. Separate I&D
[Figure: Proc -> Unified Cache-1 -> Unified Cache-2, versus Proc -> I-Cache-1 and D-Cache-1 -> Unified Cache-2.]
• Example:
  – 16KB I&D: Inst miss rate = 0.64%, Data miss rate = 6.47%
  – 32KB unified: Aggregate miss rate = 1.99%
• Which is better (ignore the L2 cache)?
  – Assume 33% data ops => 75% of accesses are from instructions (1.0/1.33)
  – hit time = 1, miss time = 50
  – Note that a data hit has 1 extra stall for the unified cache (only one port)
• AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
• AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
How to Improve Cache Performance?
CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
AMAT = HitTime + MissRate x MissPenalty
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Miss Rate Reduction
• 3 Cs: Compulsory, Capacity, Conflict
0. Larger cache
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
• Danger of concentrating on just one parameter!
• Prefetching comes in two flavors:
  – Binding prefetch: requests load directly into a register.
    » Must be the correct address and register!
  – Non-binding prefetch: load into the cache.
    » Can be incorrect. Frees HW/SW to guess!
Where do misses come from?
• Classifying Misses: 3 Cs
  – Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
  – Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a Fully Associative, Size X Cache)
  – Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way Associative, Size X Cache)
• 4th "C": Coherence - misses caused by cache coherence.

3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity, with the miss rate decomposed into Conflict, Capacity, and Compulsory components.]
0. Cache Size
[Figure: 3Cs miss-rate breakdown (0 to 0.14) vs. cache size (1 KB to 128 KB), 1-way to 8-way.]
• Old rule of thumb: 2x size => 25% cut in miss rate
• What does it reduce? Thrashing reduction!!!

Cache Organization?
• Assume the total cache size is not changed. What happens if we:
  1) Change Block Size:
  2) Change Associativity:
  3) Change Compiler:
• Which of the 3Cs is obviously affected?
1. Larger Block Size (fixed size & assoc)
[Figure: miss rate (0-25%) vs. block size (16 to 256 bytes) for 1K, 4K, 16K, 64K, and 256K caches.]
• Reduced compulsory misses
• Increased conflict misses
• What else drives up block size?

2. Higher Associativity
[Figure: 3Cs miss-rate breakdown (0 to 0.14) vs. cache size (1 KB to 128 KB), 1-way to 8-way: higher associativity removes conflict misses.]
3Cs Relative Miss Rate
[Figure: the same SPEC92 data plotted as relative miss rate (0-100%) vs. cache size (1 KB to 128 KB), 1-way to 8-way, split into Conflict, Capacity, and Compulsory. Flaws: fixed block size. Good: insight => invention.]

Associativity vs. Cycle Time
• Beware: execution time is the only final measure!
• Why is cycle time tied to hit time?
• Will clock cycle time increase?
  – Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
  – suggested big and dumb caches
• Effective cycle time vs. associativity (Przybylski, ISCA)
Example: Avg. Memory Access Time vs. Miss Rate
• Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped

  Cache Size (KB)   1-way   2-way   4-way   8-way
  1                 2.33    2.15    2.07    2.01
  2                 1.98    1.86    1.76    1.68
  4                 1.72    1.67    1.61    1.53
  8                 1.46    1.48    1.47    1.43
  16                1.29    1.32    1.32    1.32
  32                1.20    1.24    1.25    1.27
  64                1.14    1.20    1.21    1.23
  128               1.10    1.17    1.18    1.20

(Red in the original marks cases where A.M.A.T. is not improved by more associativity, i.e., cache sizes of 8 KB and above.)
3. Victim Cache
• Fast Hit Time + Low Conflict => Victim Cache
• How to combine the fast hit time of direct mapped yet still avoid conflict misses?
• Add a small buffer to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
• Used in Alpha, HP machines
[Figure: a direct mapped cache (TAGS/DATA) backed by a small fully associative victim cache of four tag+comparator entries, each holding one cache line of data, sitting between the cache and the next lower level in the hierarchy.]
4. Pseudo-Associativity
• How to combine the fast hit time of Direct Mapped with the lower conflict misses of a 2-way SA cache?
• Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit)
[Timing: Hit Time < Pseudo Hit Time < Miss Penalty]
• Drawback: hard on the CPU pipeline if a hit can take 1 or 2 cycles
  – Better for caches not tied directly to the processor (L2)
  – Used in the MIPS R10000 L2 cache, similar in UltraSPARC

5. Hardware Prefetching of Instructions & Data
• E.g., Instruction Prefetching
  – Alpha 21064 fetches 2 blocks on a miss
  – Extra block placed in a "stream buffer"
  – On a miss, check the stream buffer
• Works with data blocks too:
  – Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty
6. Software Prefetching Data
• Data Prefetch
  – Load data into a register (HP PA-RISC loads)
  – Cache Prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
  – Special prefetching instructions cannot cause faults; a form of speculative execution
• Prefetching comes in two flavors:
  – Binding prefetch: requests load directly into a register.
    » Must be the correct address and register!
  – Non-binding prefetch: load into the cache.
    » Can be incorrect. Faults?
• Issuing prefetch instructions takes time
  – Is the cost of prefetch issues < savings in reduced misses?
  – Wider superscalar reduces the difficulty of issue bandwidth
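As a sketch of a non-binding software prefetch, here is how it might look in C with the GCC/Clang __builtin_prefetch intrinsic (the loop and the prefetch distance of 16 elements are illustrative, not tuned values from the slides):

/* Prefetch a block that will be needed a few iterations ahead.
 * __builtin_prefetch arguments: address, read(0)/write(1), temporal locality hint 0-3. */
void scale(double *x, long n) {
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&x[i + 16], 1, 1);
        x[i] = 2.0 * x[i];
    }
}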
7. Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software
• Instructions
  – Reorder procedures in memory so as to reduce conflict misses
  – Profiling to look at conflicts (using tools they developed)
• Data
  – Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
  – Loop Interchange: change the nesting of loops to access data in the order stored in memory
  – Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
  – Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improves spatial locality

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

2 misses per access to a & c vs. one miss per access; improve spatial locality

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

• Two Inner Loops:
  – Read all NxN elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity Misses are a function of N & Cache Size:
  – 2N^3 + N^2 words accessed => (assuming no conflicts; otherwise ...)
• Idea: compute on a BxB submatrix that fits

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B-1,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B-1,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

• B is called the Blocking Factor
• Capacity Misses go from 2N^3 + N^2 to 2N^3/B + N^2
• Conflict Misses Too?

Reducing Conflict Misses by Blocking
[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a direct mapped cache vs. a fully associative cache: conflict misses in caches that are not fully associative depend on the blocking size.]
• Lam et al [1991]: a blocking factor of 24 had a fifth the misses vs. 48, despite both fitting in the cache

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking, applied by hand to vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]

Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Miss Penalty
CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
• Four techniques
  – Read priority over write on miss
  – Early Restart and Critical Word First on miss
  – Non-blocking Caches (Hit under Miss, Miss under Miss)
  – Second Level Cache
• Can be applied recursively to Multilevel Caches
  – The danger is that the time to DRAM will grow with multiple levels in between
  – First attempts at L2 caches can make things worse, since the increased worst case is worse
• An out-of-order CPU can hide an L1 data cache miss (3-5 clocks), but stalls on an L2 miss (40-100 clocks)?

1. Read Priority over Write on Miss
[Figure: CPU reads/writes go through the cache; a write buffer sits between the cache and DRAM (or lower memory).]
• Write-through with write buffers => RAW conflicts with main memory reads on cache misses
  – If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
  – Check the write buffer contents before a read; if there are no conflicts, let the memory access continue
• Write-back: want the buffer to hold displaced blocks
  – Read miss replacing a dirty block
  – Normal: write the dirty block to memory, and then do the read
  – Instead: copy the dirty block to a write buffer, then do the read, and then do the write
  – The CPU stalls less since it restarts as soon as the read is done
2. Early Restart and Critical Word First
• Don't wait for the full block to be loaded before restarting the CPU
  – Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  – Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
• Generally useful only with large blocks
• Spatial locality => we tend to want the next sequential word anyway, so it is not clear how much early restart benefits
3. Non-blocking Caches
• A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss
  – requires F/E bits on registers or out-of-order execution
  – requires multi-bank memories
• "Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
• "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  – Requires multiple memory banks (otherwise it cannot be supported)
  – Pentium Pro allows 4 outstanding memory misses

Value of Hit Under Miss for SPEC
[Figure: ratio of average memory stall time under "Hit under n Misses" (0->1, 1->2, 2->64, Base) for SPEC integer and floating-point benchmarks.]
• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss

4. Add a Second-level Cache
• L2 Equations
  AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
• Definitions:
  – Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
  – Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU
  – The Global Miss Rate is what matters
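A small numeric sketch of the L2 equations (the parameter values are made up for illustration, not taken from the slides):

#include <stdio.h>

int main(void) {
    double hit_l1 = 1, miss_rate_l1 = 0.04;      /* L1: 1 cycle hit, 4% misses     */
    double hit_l2 = 10, local_miss_l2 = 0.25;    /* L2: 10 cycle hit, 25% local    */
    double mem_penalty = 100;                    /* main memory: 100 cycles        */

    double miss_penalty_l1 = hit_l2 + local_miss_l2 * mem_penalty;  /* 35 cycles   */
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;          /* 2.4 cycles  */
    double global_miss_l2 = miss_rate_l1 * local_miss_l2;           /* 1%          */

    printf("Miss Penalty_L1 = %.1f, AMAT = %.1f, global L2 miss = %.1f%%\n",
           miss_penalty_l1, amat, 100*global_miss_l2);
    return 0;
}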
Comparing Local and Global Miss Rates
• 32 KByte 1st-level cache; increasing 2nd-level cache
• The global miss rate is close to the single-level cache rate provided L2 >> L1
• Don't use the local miss rate
• L2 is not tied to the CPU clock cycle!
• Cost & A.M.A.T.
• Generally: fast hit times and fewer misses
• Since hits are few, target miss reduction
[Figure: local and global miss rates vs. L2 cache size, plotted on linear and log scales.]
Reducing Misses: Which apply to the L2 Cache?
• Reducing Miss Rate
  1. Reduce Misses via Larger Block Size
  2. Reduce Conflict Misses via Higher Associativity
  3. Reducing Conflict Misses via Victim Cache
  4. Reducing Conflict Misses via Pseudo-Associativity
  5. Reducing Misses by HW Prefetching Instr, Data
  6. Reducing Misses by SW Prefetching Data
  7. Reducing Capacity/Conflict Misses by Compiler Optimizations

L2 Cache Block Size & A.M.A.T.
• 32KB L1, 8 byte path to memory

  Block Size (bytes)   16     32     64     128    256    512
  Relative CPU Time    1.95   1.54   1.36   1.28   1.27   1.34

Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
1. Small and Simple Caches
• Why does the Alpha 21164 have 8KB Instruction and 8KB Data caches + a 96KB second-level cache?
  – A small data cache keeps the clock rate high
• Direct Mapped, on chip
2. Avoiding Address Translation
• Send the virtual address to the cache? Called a Virtually Addressed Cache or just Virtual Cache, vs. a Physical Cache
  – Every time the process is switched, logically we must flush the cache; otherwise we get false hits
    » Cost is the time to flush + "compulsory" misses from an empty cache
  – Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
  – I/O must interact with the cache, so it needs virtual addresses
• Solution to cache flush
  – Add a process identifier tag that identifies the process as well as the address within the process: can't get a hit if it is the wrong process
• Solution to aliases
  – HW guarantees every cache block has a unique physical address
  – SW guarantee: the lower n bits must have the same address; as long as they cover the index field and the cache is direct mapped, they must be unique; called page coloring

Virtually Addressed Caches
[Figure: three organizations of CPU, TB (translation buffer), and cache:
 (1) Conventional Organization: CPU -> VA -> TB -> PA -> $ -> MEM;
 (2) Virtually Addressed Cache: CPU -> VA -> $ (VA tags); translate only on a miss to the L2 $/MEM (synonym problem);
 (3) Overlap $ access with VA translation: TB and $ accessed in parallel (PA tags); requires the $ index to remain invariant across translation.]

3. Pipelined Writes
• Pipeline the Tag Check and the Cache Update as separate stages; the current write does its tag check while the previous write updates the cache
• Only STORES are in the pipeline; it is empty during a miss
    Store r2, (r1)    Check r1
    Add
    Sub
    Store r4, (r3)    M[r1]<-r2 & check r3
• "Delayed Write Buffer": must be checked on reads; either complete the write or read from the buffer
Case Study: MIPS R4000 • 8 Stage Pipeline: – IF–first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access. – IS–second half of access to instruction cache. – RF–instruction decode and register fetch, hazard checking and also instruction cache hit detection. – EX–execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation. – DF–data fetch, first half of access to data cache. – DS–second half of access to data cache. – TC–tag check, determine whether the data cache access hit. – WB–write back for loads and register-register operations.
• What is impact on Load delay? – Need 2 instructions between a load and its use!
Case Study: MIPS R4000
• TWO Cycle Load Latency
[Figure: pipelined instruction flow through IF IS RF EX DF DS TC WB, showing that load data is not available until after DS, two cycles after a dependent instruction would want it.]
• THREE Cycle Branch Latency (conditions evaluated during the EX phase)
  – Delay slot plus two stalls
  – Branch likely cancels the delay slot if not taken

R4000 Performance
• Not the ideal CPI of 1:
  – Load stalls (1 or 2 clock cycles)
  – Branch stalls (2 cycles + unfilled slots)
  – FP result stalls: RAW data hazard (latency)
  – FP structural stalls: not enough FP hardware (parallelism)
[Figure: CPI (0 to 4.5) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, and tomcatv, broken down into Base, Load stalls, Branch stalls, FP result stalls, and FP structural stalls.]
What is the Impact of What You've Learned About Caches?
• 1960-1985: Speed = f(no. operations)
• 1990s:
  – Pipelined Execution & Fast Clock Rate
  – Out-of-Order execution
  – Superscalar Instruction Issue
• 1998: Speed = f(non-cached memory accesses)
[Figure: the Processor-Memory Performance Gap, 1980-2000, growing 50% per year.]
• What does this mean for
  – Compilers? Operating Systems? Algorithms? Data Structures?

Cache Optimization Summary

Technique                           Miss Rate  Miss Penalty  Hit Time  Complexity
Larger Block Size                   +          –                       0
Higher Associativity                +                        –         1
Victim Caches                       +                                  2
Pseudo-Associative Caches           +                                  2
HW Prefetching of Instr/Data        +                                  2
Compiler Controlled Prefetching     +                                  3
Compiler Reduce Misses              +                                  0
Priority to Read Misses                        +                       1
Early Restart & Critical Word 1st              +                       2
Non-Blocking Caches                            +                       3
Second Level Caches                            +                       2
Better memory system                           +                       3
Small & Simple Caches               –                        +         0
Avoiding Address Translation                                 +         2
Pipelining Caches                                            +         2
Cache Cross Cutting Issues
• Superscalar CPU & Number of Cache Ports must match: how many memory accesses per cycle?
• Speculative Execution and the non-faulting option on memory/TLB
• Parallel Execution vs. Cache locality
  – Want far separation to find independent operations vs. want reuse of data accesses to avoid misses
• I/O and consistency of data between cache and memory
  – Caches => multiple copies of data
  – Consistency by HW or by SW?
  – Where to connect I/O to the computer?

Alpha Memory Performance: Miss Rates of SPEC92
• 8K I$, 8K D$, 2M L2
[Figure: miss rates (0.01% to 100%, log scale) of the I-cache, D-cache, and L2 for AlphaSort, Espresso, Sc, Mdljsp2, Ear, Alvinn, Mdljp2, and Nasa7, with callouts such as:
  – I$ miss = 6%, D$ miss = 32%, L2 miss = 10%
  – I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%
  – I$ miss = 1%, D$ miss = 21%, L2 miss = 0.3%]

Alpha CPI Components
• Instruction stall: branch mispredict (green); Data cache (blue); Instruction cache (yellow); L2$ (pink)
• Other: compute + register conflicts, structural conflicts
[Figure: Alpha CPI components (0 to 5.0) for AlphaSort, Espresso, Sc, Mdljsp2, Ear, Alvinn, and Mdljp2, broken into L2, I$, D$, I Stall, and Other.]

Predicting Cache Performance from Different Programs (ISA, compiler, ...)
• 4KB Data cache miss rate: 8%, 12%, or 28%?
• 1KB Instr cache miss rate: 0%, 3%, or 10%?
• Alpha vs. MIPS for an 8KB Data $: 17% vs. 10%
• Why 2X Alpha v. MIPS?
[Figure: miss rate (0% to 35%) vs. cache size (1 to 128 KB) for the D$ and I$ of gcc, espresso, and tomcatv.]
Main Memory Background
• Performance of Main Memory:
  – Latency: Cache Miss Penalty
    » Access Time: time between request and word arriving
    » Cycle Time: time between requests
  – Bandwidth: I/O & Large Block Miss Penalty (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
  – Dynamic since it needs to be refreshed periodically (8 ms, 1% of time)
  – Addresses divided into 2 halves (Memory as a 2D matrix):
    » RAS or Row Access Strobe
    » CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor/bit; area is 10X)
  – Address not divided: full address
• Size: DRAM/SRAM 4-8; Cost/Cycle time: SRAM/DRAM 8-16

Main Memory Deep Background
• "Out-of-Core", "In-Core," "Core Dump"? "Core memory"?
• Non-volatile, magnetic
• Lost to 4 Kbit DRAM (today using 64 Mbit DRAM)
• Access time 750 ns, cycle time 1500-3000 ns
DRAM Logical Organization (4 Mbit)
[Figure: an 11-bit address A0..A10 is used twice: the row address selects a word line in a 2,048 x 2,048 memory array of storage cells; the column address drives the column decoder over the sense amps & I/O to read or write the data bit (D in, Q out).]
• Square root of bits per RAS/CAS

DRAM Physical Organization (4 Mbit)
[Figure: the array is physically split into blocks (Block 0 ... Block 3), each with its own Block Row Decoder (9:512) and I/O, together delivering 8 I/Os.]
4 Key DRAM Timing Parameters
• tRAC: minimum time from the RAS line falling to valid data output.
  – Quoted as the speed of a DRAM when you buy it
  – A typical 4Mb DRAM has tRAC = 60 ns
• tRC: minimum time from the start of one row access to the start of the next.
  – tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60 ns
• tCAC: minimum time from the CAS line falling to valid data output.
  – 15 ns for a 4Mbit DRAM with a tRAC of 60 ns
• tPC: minimum time from the start of one column access to the start of the next.
  – 35 ns for a 4Mbit DRAM with a tRAC of 60 ns

DRAM Performance
• A 60 ns (tRAC) DRAM can
  – perform a row access only every 110 ns (tRC)
  – perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC).
    » In practice, external address delays and turning around buses make it 40 to 50 ns
• These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead!
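A back-of-the-envelope sketch of what these timings imply for peak access rates, using the slide's 110 ns row cycle and 35 ns column cycle (the x8 output width is an assumption for illustration):

#include <stdio.h>

int main(void) {
    double t_rc = 110e-9;    /* row cycle time (tRC)         */
    double t_pc = 35e-9;     /* column/page cycle time (tPC) */
    double width_bytes = 1;  /* assume an x8 DRAM part       */

    printf("random accesses : %.1f M/s (%.1f MB/s)\n", 1e-6/t_rc, width_bytes*1e-6/t_rc);
    printf("page-mode access: %.1f M/s (%.1f MB/s)\n", 1e-6/t_pc, width_bytes*1e-6/t_pc);
    return 0;
}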
DRAM History
• DRAMs: capacity +60%/yr, cost –30%/yr
  – 2.5X cells/area, 1.5X die size in 3 years
• A '98 DRAM fab line costs $2B
  – DRAM only: density, leakage v. speed
• Rely on an increasing number of computers & memory per computer (60% market)
  – SIMM or DIMM is the replaceable unit => computers use any generation DRAM
• Commodity, second-source industry => high volume, low profit, conservative
  – Little organization innovation in 20 years
• Order of importance: 1) Cost/bit 2) Capacity
  – First RAMBUS: 10X BW, +30% cost => little impact

DRAM Future: 1 Gbit DRAM

                 Mitsubishi        Samsung
  Blocks         512 x 2 Mbit      1024 x 1 Mbit
  Clock          200 MHz           250 MHz
  Data Pins      64                16
  Die Size       24 x 24 mm        31 x 21 mm
  Metal Layers   3                 4
  Technology     0.15 micron       0.16 micron

• Wish we could do this for Microprocessors!
Main Memory Performance
• Simple:
  – CPU, Cache, Bus, Memory the same width (32 or 64 bits)
• Wide:
  – CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
• Interleaved:
  – CPU, Cache, Bus 1 word; Memory N modules (4 modules); the example is word interleaved
• Timing model (word size is 32 bits)
  – 1 cycle to send the address
  – 6 cycles access time, 1 cycle to send data
  – Cache Block is 4 words
• Simple M.P.      = 4 x (1 + 6 + 1) = 32
• Wide M.P.        = 1 + 6 + 1      = 8
• Interleaved M.P. = 1 + 6 + 4 x 1  = 11
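The same miss-penalty arithmetic as a sketch in C (the timing parameters are the slide's: 1 cycle address, 6 cycles access, 1 cycle per word, 4-word block):

#include <stdio.h>

int main(void) {
    int addr = 1, access = 6, xfer = 1, words = 4;

    int simple      = words * (addr + access + xfer);   /* one word at a time        */
    int wide        = addr + access + xfer;             /* whole block at once       */
    int interleaved = addr + access + words * xfer;     /* banks overlap the access,
                                                            data returns word by word */

    printf("simple=%d wide=%d interleaved=%d\n", simple, wide, interleaved);  /* 32 8 11 */
    return 0;
}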
Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses
  – Multiprocessor
  – I/O
  – CPU with Hit under n Misses, Non-blocking Cache
• Superbank: all memory active on one block transfer (or Bank)
• Bank: portion within a superbank that is word interleaved (or Subbank)
• How many banks? number of banks >= number of clocks to access a word in a bank
  – For sequential accesses; otherwise we return to the original bank before it has the next word ready
  – (like in the vector case)
• Increasing DRAM capacity => fewer chips => harder to have many banks

Avoiding Bank Conflicts
• Lots of banks
  int x[256][512];
  for (j = 0; j < 512; j = j+1)
      for (i = 0; i < 256; i = i+1)
          x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, the inner-loop word accesses conflict on the same bank
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: Prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide on every memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number? easy if 2^N words per bank
Fast Bank Number
• Chinese Remainder Theorem: as long as two sets of integers a_i and b_i follow these rules
    b_i = x mod a_i,  0 <= b_i < a_i,  0 <= x < a_0 x a_1 x a_2 x ...
  and the a_i are pairwise co-prime (a_i, a_j co-prime if i != j), then the integer x has only one solution (an unambiguous mapping):
  – bank number = b_0, number of banks = a_0 (= 3 in the example)
  – address within bank = b_1, number of words in bank = a_1 (= 8 in the example)
  – N-word address 0 to N-1, prime number of banks, words per bank a power of 2

Example, 3 banks of 8 words each:

                       Seq. Interleaved      Modulo Interleaved
  Bank Number:          0    1    2            0    1    2
  Address within Bank:
  0                     0    1    2            0   16    8
  1                     3    4    5            9    1   17
  2                     6    7    8           18   10    2
  3                     9   10   11            3   19   11
  4                    12   13   14           12    4   20
  5                    15   16   17           21   13    5
  6                    18   19   20            6   22   14
  7                    21   22   23           15    7   23
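A sketch of the modulo-interleaved mapping in C; printing all 24 addresses reproduces the "Modulo Interleaved" columns of the table above (3 banks x 8 words, matching the slide's example):

#include <stdio.h>

#define NBANKS 3          /* prime number of banks        */
#define WORDS_PER_BANK 8  /* power-of-two words per bank  */

int main(void) {
    for (int addr = 0; addr < NBANKS * WORDS_PER_BANK; addr++) {
        int bank   = addr % NBANKS;          /* bank number                          */
        int offset = addr % WORDS_PER_BANK;  /* just the low bits, no divide needed  */
        printf("addr %2d -> bank %d, word %d\n", addr, bank, offset);
    }
    return 0;
}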
Fast Memory Systems: DRAM specific
• Multiple CAS accesses: several names (page mode)
  – Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address the gap; what will they cost, will they survive?
  – RAMBUS: startup company; reinvented the DRAM interface
    » Each chip a module vs. a slice of memory
    » Short bus between CPU and chips
    » Does its own refresh
    » Variable amount of data returned
    » 1 byte / 2 ns (500 MB/s per chip)
  – Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66 - 150 MHz)
  – Intel claims RAMBUS Direct (16 b wide) is the future PC memory
• Niche memory or main memory? e.g., Video RAM for frame buffers: DRAM + fast serial output

DRAM Latency >> BW
[Figure: Proc with I$, D$, and L2$ on a bus to multiple DRAMs.]
• More application bandwidth => more cache misses => more DRAM RAS/CAS
• Application BW => lower DRAM latency
• RAMBUS and Synchronous DRAM increase BW but have higher latency
• EDO DRAM < 5% in PCs
Main Memory Summary
• Wider Memory
• Interleaved Memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAM
• DRAM future less rosy?

DRAM Crossroads?
• After 20 years of 4X every 3 years, are we running into a wall? (64Mb - 1 Gb)
• How can we keep $1B fab lines full if we buy fewer DRAMs per computer?
• Cost/bit –30%/yr if we stop 4X/3 yr?
• What will happen to the $40B/yr DRAM industry?
DRAMs per PC over Time

  Minimum          DRAM Generation
  Memory Size      '86 1Mb   '89 4Mb   '92 16Mb   '96 64Mb   '99 256Mb   '02 1Gb
  4 MB             32        8
  8 MB                       16        4
  16 MB                                8          2
  32 MB                                           4          1
  64 MB                                           8          2
  128 MB                                                     4           1
  256 MB                                                     8           2

Virtual Memory
• A virtual memory is a memory hierarchy, usually consisting of at least main memory and disk, in which the processor issues all memory references as effective addresses in a flat address space. • All translations to primary and secondary addresses are handled transparently, thus providing the illusion of a flat address space. • Recall that disk accesses may require 100,000 clock cycles to complete, due to the slow access time of the disk subsystem.
Basic Issues in VM System Design
• The size of the information blocks that are transferred from secondary to main storage (M)
• If a block of information is brought into M and M is full, then some region of M must be released to make room for the new block --> replacement policy
• Which region of M is to hold the new block --> placement policy
• A missing item is fetched from secondary memory only on the occurrence of a fault --> demand load policy
[Hierarchy: disk <-> mem <-> cache <-> reg, with pages as the unit moved between disk and memory and frames as the regions of memory that hold them.]

Addressing and Accessing a Two-Level Hierarchy
• The computer system, HW or SW, must perform any address translation that is required.
[Figure: the memory management unit (MMU) applies a translation function (mapping tables, permissions, etc.) to a system address; on a hit it yields the block/word address in primary memory, on a miss it yields the address in secondary memory.]
Paging Organization
• Virtual and physical address spaces are partitioned into blocks of equal size:
  – page frames (in physical memory)
  – pages (in virtual memory)
• Two ways of forming the address: Segmentation and Paging. Paging is more common. Sometimes the two are used together, one "on top of" the other. More about address translation and paging next ...

Paging vs. Segmentation
• In both, a system address is split into a block part and a word part; the block part indexes a lookup table that yields a base address in physical memory, to which the word offset is added (for paging, the low-order bits are simply concatenated).
[Figure: System address = | Block | Word | -> lookup table -> base address + Word -> primary address.]

Paging Organization (example)
• 1K-byte pages: the virtual address space has pages 0..31 (V.A. 0, 1024, ..., 31744) mapped through the Addr Trans MAP to physical page frames 0..7 (P.A. 0, 1024, ..., 7168)
• The page is also the unit of transfer from virtual to physical memory

Address Mapping
• Virtual address = | page no. | disp (10 bits for 1K pages) |
• The page number indexes into the page table (starting at the Page Table Base Reg); each entry holds a valid bit (V), Access Rights, and the physical page address (PA)
• The page table is located in physical memory
• PA combined with disp => physical memory address (actually, concatenation is more likely than addition)
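A minimal sketch of the paged address mapping described above, with 1K pages and a flat page table indexed by the virtual page number (the table contents are made-up illustrative values):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 10                          /* 1K-byte pages */

typedef struct { bool valid; uint32_t frame; } pte_t;

static pte_t page_table[32] = {               /* 32 virtual pages, as in the example */
    [0] = { true, 7 },                        /* virtual page 0 -> physical frame 7  */
    [1] = { true, 2 },                        /* virtual page 1 -> physical frame 2  */
};

int main(void) {
    uint32_t va   = (1u << PAGE_BITS) + 123;                     /* page 1, offset 123 */
    uint32_t vpn  = va >> PAGE_BITS;
    uint32_t disp = va & ((1u << PAGE_BITS) - 1);

    if (!page_table[vpn].valid) { puts("page fault"); return 1; }
    uint32_t pa = (page_table[vpn].frame << PAGE_BITS) | disp;   /* concatenation */
    printf("VA 0x%x -> PA 0x%x\n", va, pa);
    return 0;
}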
Segmentation Organization
[Figure: segments 1, 3, 5, 6, and 9 placed in physical memory (0000 to FFF) with gaps between them; each segment's virtual addresses start at 0.]
• Notice that each segment's virtual address starts at 0, different from its physical address.
• Repeated movement of segments into and out of physical memory will result in gaps between segments. This is called external fragmentation.
• Compaction routines must occasionally be run to remove these fragments.

Translation Lookaside Buffer
• A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB
• A TLB entry holds: Virtual Address, Physical Address, Dirty, Ref, Valid, and Access bits
• Really just a cache on the page table mappings
• TLB access time is comparable to cache access time (much less than main memory access time)
Translation Lookaside Buffers
• Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
• TLBs are usually small, typically not more than 128-256 entries even on high-end machines. This permits a fully associative lookup on those machines.
  – Most mid-range machines use small n-way set associative organizations.
[Figure: Translation with a TLB: the CPU issues a VA to the TLB lookup; on a TLB hit the PA goes to the cache (and on to main memory on a cache miss); on a TLB miss the translation is done from the page table, then the cache is accessed; data returns to the CPU.]

Address Translation and Cache
• The page table is a large data structure in memory
• Two memory accesses for every load, store, or instruction fetch!!!
• Virtually addressed cache?
  – synonym problem
• Cache the address translations?
• If the index is the physical part of the address, we can start the tag access in parallel with translation, so that we can compare against the physical tag
Overlapped Cache & TLB Access
[Figure: the 32-bit virtual address is split into a 20-bit page # and a 12-bit disp; the TLB does an associative lookup on the page # while the cache (1K sets x 4 bytes) is indexed with the low-order bits; both produce a PA and a Hit/Miss, and data is delivered if the tags match.]

  IF cache hit AND (cache tag = PA) THEN deliver data to the CPU
  ELSE IF [cache miss OR (cache tag != PA)] AND TLB hit THEN
      access memory with the PA from the TLB
  ELSE do the standard VA translation

Memory Hierarchy (Summary)
• The memory hierarchy: from fast and expensive to slow and cheap:
  Registers -> Cache -> Main Memory -> Disk
• At first, consider just two adjacent levels in the hierarchy
• The cache: high speed and expensive
  – Direct mapped, associative, set associative
• Virtual memory makes the hierarchy transparent
  – Translate the address from the CPU's logical address to the physical address where the information is actually stored
  – The "TLB" helps in speeding up the address translation process
• Memory management: how to move information back and forth
TLB and Virtual Memory
• Caches, TLBs, and Virtual Memory are all understood by examining how they deal with 4 questions:
  1) Where can a block be placed?
  2) How is a block found?
  3) What block is replaced on a miss?
  4) How are writes handled?
• Page tables map virtual addresses to physical addresses
• TLBs make virtual memory practical
  – Locality in data => locality in addresses of data, temporal and spatial
• TLB misses are significant in processor performance
  – funny times, as most systems can't access all of the 2nd-level cache without TLB misses!
• Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy

Practical Memory Hierarchy
• The issue is NOT inventing new mechanisms
• The issue is taste in selecting between many alternatives in putting together a memory hierarchy that fits well together
  – e.g., L1 Data cache write through, L2 Write back
  – e.g., L1 small for fast hit time/clock cycle
  – e.g., L2 big enough to avoid going to DRAM?
Alpha 21064
• Separate Instr & Data TLBs & Caches
• TLBs fully associative
• TLB updates in SW ("Priv Arch Libr")
• Caches 8KB direct mapped, write through
• Critical 8 bytes first
• Prefetch instruction stream buffer
• 2 MB L2 cache, direct mapped, WB (off-chip)
• 256-bit path to main memory, 4 x 64-bit modules
• Victim Buffer: to give reads priority over writes
• 4-entry write buffer between D$ & L2$
[Figure: the 21064 instruction and data pipelines with their stream buffer, write buffer, and victim buffer.]