!"#$#%&'' )##*!
Computer Systems Architecture
Winter 2015 Midterm Exam II
Instructor: Prof. Lei He
Solution
Problem 1. (10 points) Explain terms or answer short problems. For example, the program counter (PC) is the register containing the address of the instruction being executed. (Hint: if you do not know how to explain a term precisely, you may use examples.)

(1) Explain the concept of delayed load AND give an example piece of code.

The loaded data is available one clock cycle after the load instruction. For example, in the code:

    lw  $1, 0($2)
    add $3, $1, $4

the loaded data in $1 from the first instruction is not available right after the instruction. For the add instruction, we need to insert a nop before it or stall one clock cycle to wait for $1 to be available.

(2) Explain the concept of loop unrolling and why we perform loop unrolling.

Unroll the loop body n times and rename the registers to form one larger loop body. Loop unrolling exposes more instruction-level parallelism and therefore improves performance.

(3) Name three techniques (in either software or hardware) to resolve branch hazards or reduce the performance loss of branch hazards.

Stall until the branch outcome is known, predict the branch direction, reduce the branch delay, or use delayed branches (always execute the instruction after the branch).
(4) State the conditions to share hardware between different stages of a multi-cycle implementation.

A piece of hardware can be shared if it is used by the same instruction in different clock cycles. For instance, PC increment and R-type execution both make use of the same ALU, but they do so in different cycles.
(5) A single-cycle implementation may be divided into five stages for pipelining. Compare the average CPI between single-cycle and ideal pipelined implementations and explain why pipelining may improve performance.

The average CPI of both the single-cycle and the ideal pipelined implementation is 1. But the clock period of the single-cycle implementation is much longer (usually N times longer, where N is the number of pipeline stages; N = 5 in this problem) than that of the ideal pipeline. Therefore, pipelining can improve performance.
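As a quick sanity check, the argument above can be expressed numerically. This is a sketch with an assumed single-cycle period of 1000 ps (a hypothetical figure, not from the exam):

```python
# Compare a single-cycle datapath with an ideal 5-stage pipeline.
# Both have CPI = 1, but the ideal pipelined clock is N times shorter,
# so throughput improves by a factor of up to N.
N_STAGES = 5
single_cycle_period_ps = 1000                            # assumed critical path
pipeline_period_ps = single_cycle_period_ps / N_STAGES   # ideal: balanced stages

instructions = 1_000_000
time_single = instructions * 1 * single_cycle_period_ps  # CPI = 1
time_pipelined = instructions * 1 * pipeline_period_ps   # CPI = 1 (ideal)

speedup = time_single / time_pipelined
print(speedup)  # 5.0: ideal speedup equals the number of stages
```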
Problem 2. (10 points) In this exercise, we examine how data dependences affect execution in the basic 5-stage pipeline described in the textbook. Problems in this exercise refer to the following sequence of instructions:

    lw  $5, -16($5)
    sw  $5, -16($5)
    add $5, $5, $5

Also, assume the following cycle times for each of the options related to forwarding:

    Without Forwarding:            220 ps
    With Full Forwarding:          240 ps
    With ALU-ALU Forwarding Only:  230 ps
1) Indicate dependences and their type.

    I1: lw  $5, -16($5)
    I2: sw  $5, -16($5)
    I3: add $5, $5, $5

RAW on $5 from I1 to I2 and I3
WAR on $5 from I1 and I2 to I3
WAW on $5 from I1 to I3
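The classification above can be reproduced with a short script. This is a sketch (find_dependences is a hypothetical helper, not part of the exam); each instruction is encoded as a pair of write/read register sets:

```python
# Detect RAW/WAR/WAW dependences in a short MIPS-like sequence.
def find_dependences(instrs):
    deps = []
    for i, (w_i, r_i) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            w_j, r_j = instrs[j]
            for r in r_j:
                if r in w_i:
                    deps.append(("RAW", r, i, j))  # read after write
            for w in w_j:
                if w in r_i:
                    deps.append(("WAR", w, i, j))  # write after read
                if w in w_i:
                    deps.append(("WAW", w, i, j))  # write after write
    return deps

# I1: lw $5,-16($5)   I2: sw $5,-16($5)   I3: add $5,$5,$5
seq = [({"$5"}, {"$5"}),   # lw writes $5, reads $5 (base register)
       (set(),  {"$5"}),   # sw reads $5 (data and base), writes nothing
       ({"$5"}, {"$5"})]   # add writes $5, reads $5
for kind, reg, i, j in find_dependences(seq):
    print(kind, reg, f"I{i + 1}->I{j + 1}")
```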
2) Assume there is no forwarding in this pipelined processor. Indicate hazards and add NOP instructions to eliminate them.

In the basic five-stage pipeline, WAR and WAW dependences do not cause any hazards. Without forwarding, any RAW dependence between an instruction and either of the next two instructions causes a hazard (assuming the register read happens in the second half of the clock cycle and the register write happens in the first half). The code that eliminates these hazards by inserting nop instructions is:

    lw  $5, -16($5)
    nop               # delay I2 two cycles to avoid the RAW hazard on $5 from I1
    nop
    sw  $5, -16($5)
    add $5, $5, $5    # no RAW hazard on $5 from I1 now
3) Assume there is full forwarding. Indicate hazards and add NOP instructions to eliminate them.

With full forwarding, an ALU instruction can forward a value to the EX stage of the next instruction without a hazard. However, a load cannot forward to the EX stage of the next instruction (but it can to the instruction after that). The code that eliminates these hazards by inserting nop instructions is:

    lw  $5, -16($5)
    nop               # delay I2 one cycle to avoid the RAW hazard on $5 from I1
    sw  $5, -16($5)   # the value of $5 is forwarded from I1 now
    add $5, $5, $5    # no RAW hazard on $5 from I1 now
4) What is the total execution time of this instruction sequence WITHOUT forwarding and WITH full forwarding? What is the speedup achieved by adding full forwarding to a pipeline that had no forwarding?

The total execution time is the clock cycle time times the number of cycles. Without any stalls, a three-instruction sequence executes in 7 cycles (5 to complete the first instruction, then one per remaining instruction). Execution without forwarding must add a stall cycle for every nop inserted in part 2), and execution with full forwarding must add a stall cycle for every nop inserted in part 3). Overall, we get:

    Without Forwarding:        (7 + 2) x 220 ps = 1980 ps
    With Full Forwarding:      (7 + 1) x 240 ps = 1920 ps
    Speedup due to forwarding: 1980/1920 = 1.03
5) Add NOP instructions to this code to eliminate hazards if there is ALU-ALU forwarding only (no forwarding from the MEM to the EX stage).

    lw  $5, -16($5)
    nop               # can't use ALU-ALU forwarding ($5 is loaded in MEM)
    nop
    sw  $5, -16($5)
    add $5, $5, $5
6) What is the total execution time of this instruction sequence with only ALU-ALU forwarding? What is the speedup over a no-forwarding pipeline?

    Without Forwarding:             1980 ps
    With ALU-ALU Forwarding Only:   (7 + 2) x 230 ps = 2070 ps
    Speedup with ALU-ALU forwarding: 1980/2070 = 0.96 (this is really a slowdown)
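The timing arithmetic in parts 4) and 6) can be checked in a few lines (a sketch; exec_time is a hypothetical helper using the cycle times from the table above):

```python
# 3 instructions finish in 5 + (3 - 1) = 7 cycles on an ideal 5-stage
# pipeline; each nop/stall adds one cycle.
def exec_time(stalls, cycle_ps, n_instr=3, stages=5):
    return (stages + n_instr - 1 + stalls) * cycle_ps

no_fwd   = exec_time(stalls=2, cycle_ps=220)  # two nops needed
full_fwd = exec_time(stalls=1, cycle_ps=240)  # one nop (load-use)
alu_alu  = exec_time(stalls=2, cycle_ps=230)  # load value can't be forwarded

print(no_fwd, full_fwd, alu_alu)              # 1980 1920 2070
print(round(no_fwd / full_fwd, 2))            # 1.03 (speedup)
print(round(no_fwd / alu_alu, 2))             # 0.96 (a slowdown)
```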
Problem 3. (15 points)

Assume that we have a five-stage machine the same as the one in the textbook, and consider the following code:

    (a) sub $2, $5, $4
    (b) add $4, $2, $5
    (c) lw  $2, 100($4)
    (d) add $5, $2, $4

(1) Name all data dependences.

In (b), $2 depends on (a); in (c), $4 depends on (b); in (d), $2 depends on (c) and $4 depends on (b).
(2) Which data hazards can be resolved by renaming? Write down the code after renaming, with minimal data hazards.

Write-after-write and write-after-read hazards can be resolved by renaming. After renaming, the code will look like the following ($2 in (c) and (d) is renamed to $6):

    (a) sub $2, $5, $4
    (b) add $4, $2, $5
    (c) lw  $6, 100($4)
    (d) add $5, $6, $4

(3)
After renaming, which data hazard can be resolved via forwarding? Illustrate all the forwarding using 5-stage pipelining figures similar to those in the textbook.
[Pipeline diagram omitted: a 5-stage (IM, Reg, ALU, DM, Reg) figure for instructions (a)-(d), showing the forwarding of $2, $4, and $6 between instructions.]
Problem 4. (10 points) Data forwarding

Considering data forwarding for the pipeline below, state how to generate the control signal for MUX A. I.e., use plain English AND a logic function such as EX/MEM.RegisterRd != 0 to explain when the control signal for MUX A should be 00, 01, and 10, respectively.

I. Control signal = 00: no data forwarding. Neither condition below holds.

II. Control signal = 01: forward the result from the MEM/WB register if
((MEM/WB.RegWrite) && (MEM/WB.RegisterRd != 0) && (MEM/WB.RegisterRd == ID/EX.RegisterRs))

III. Control signal = 10: forward the result from the EX/MEM register if
((EX/MEM.RegWrite) && (EX/MEM.RegisterRd != 0) && (EX/MEM.RegisterRd == ID/EX.RegisterRs))
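The conditions above can be sketched as a forwarding-unit function. This is a hypothetical Python model (not from the exam); the argument names mirror the pipeline-register fields, and EX/MEM is checked first so the most recent result wins when both conditions hold:

```python
# Select signal for MUX A on the first ALU operand (ID/EX.RegisterRs).
def forward_a(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd, id_ex_rs):
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return 0b10   # forward the ALU result from the EX/MEM register
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return 0b01   # forward the result from the MEM/WB register
    return 0b00       # no forwarding: use the register-file value

print(forward_a(True, 5, False, 0, 5))   # 2 -> select 10
print(forward_a(False, 0, True, 5, 5))   # 1 -> select 01
print(forward_a(False, 0, False, 0, 5))  # 0 -> select 00
```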
Problem 5. (15 points) Media applications that play audio or video files are part of a class of workloads called "streaming" workloads; i.e., they bring in large amounts of data but do not reuse much of it. Consider a video streaming workload that accesses a 512 KB working set sequentially with the following address stream:
a. Assume a 64 KB direct-mapped cache with a 32-byte line. What is the miss rate for the address stream above? How is this miss rate sensitive to the size of the cache or the working set? How would you categorize the misses this workload is experiencing, and what causes them?

6.25% miss rate. The miss rate does not change with the cache size or working-set size. These are cold (compulsory) misses: each miss occurs because the data is being brought in from memory for the first time.
b. Re-compute the miss rate when the cache line (block) size is 16 bytes, 64 bytes, and 128 bytes. What kind of locality is this workload exploiting?

12.5%, 3.125%, and 1.5625% miss rates for 16-byte, 64-byte, and 128-byte blocks, respectively (1/8, 1/32, and 1/64). The workload is exploiting spatial locality.
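These rates follow from one miss per cache line for a unit-stride stream: miss rate = access size / block size. A quick check (assuming 2-byte accesses, which is an inference from the 6.25% answer in part a, not stated in the problem text above):

```python
# Miss rate of a sequential unit-stride stream on a direct-mapped cache:
# one cold miss per block, so miss_rate = access_bytes / block_bytes.
ACCESS_BYTES = 2  # assumed element size, inferred from 6.25% = 2/32
for block in (16, 32, 64, 128):
    miss_rate = ACCESS_BYTES / block
    print(block, f"{miss_rate:.4%}")
# 16 -> 12.5%, 32 -> 6.25%, 64 -> 3.125%, 128 -> 1.5625%
```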
c. "Prefetching" is a technique that leverages predictable address patterns to speculatively bring in additional cache lines when a particular cache line is accessed. One example of prefetching is a stream buffer that prefetches sequentially adjacent cache lines into a separate buffer when a particular cache line is brought in. If the data is found in the prefetch buffer, it is considered a hit, moved into the cache, and the next cache line is prefetched. Assume a two-entry stream buffer, and assume that the cache latency is such that a cache line can be loaded before the computation on the previous cache line is completed. What is the miss rate for the address stream above?

With next-line prefetching, the miss rate will be near 0%: after the initial cold miss, every subsequent line is found in the stream buffer.
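A toy simulation illustrates why. This sketch assumes 2-byte sequential accesses, 32-byte lines, a cache large enough to avoid evictions, and models the stream buffer as always holding the next two lines:

```python
# Stream-buffer prefetching over a sequential 512 KB working set.
LINE = 32
accesses = range(0, 512 * 1024, 2)   # sequential 2-byte accesses (assumed)
cache, buffer, misses = set(), set(), 0
for addr in accesses:
    line = addr // LINE
    if line in cache:
        continue                      # ordinary cache hit
    if line in buffer:
        buffer.discard(line)          # prefetch hit: promote into the cache
    else:
        misses += 1                   # true miss (only the cold start here)
    cache.add(line)
    buffer.update({line + 1, line + 2})   # keep the next two lines prefetched
miss_rate = misses / len(accesses)
print(misses, f"{miss_rate:.6%}")     # 1 miss out of 262144 accesses
```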
Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI machine with an average of 1.35 references (both instruction and data) per instruction, help find the optimal block size given the following miss rates for various block sizes.

    Size (bytes):  8     16    32    64     128
    Miss rate:     4%    3%    3%    1.5%   1%
d. What is the optimal block size for a miss latency of 20 x B cycles? 8-byte.

e. What is the optimal block size for a miss latency of 24 + B cycles? 16-byte.

f. For a constant miss latency, what is the optimal block size? 128-byte: with a constant latency C, stall time is proportional to miss rate x C, so the block size with the lowest miss rate is optimal.
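Parts d-f can be verified by computing memory stall cycles per instruction for each block size (a sketch using the miss-rate table above; the 1.35 references/instruction factor scales every entry equally, so it does not affect which block size wins):

```python
# stall cycles per instruction = 1.35 refs/instr * miss_rate * miss_latency(B)
miss_rate = {8: 0.04, 16: 0.03, 32: 0.03, 64: 0.015, 128: 0.01}

def best_block(latency):  # latency: miss latency as a function of block size B
    return min(miss_rate, key=lambda b: 1.35 * miss_rate[b] * latency(b))

print(best_block(lambda b: 20 * b))  # 8   (latency grows linearly with B)
print(best_block(lambda b: 24 + b))  # 16  (mostly fixed latency)
print(best_block(lambda b: 100))     # 128 (constant latency: minimize miss rate)
```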