DLX Architecture

The architecture of DLX was chosen based on observations about the most frequently used primitives in programs. DLX provides a good architectural model for study, not only because of the recent popularity of this type of machine, but also because it is easy to understand. Like most recent load/store machines, DLX emphasizes:
- a simple load/store instruction set
- design for pipelining efficiency
- an easily decoded instruction set
- efficiency as a compiler target

Registers for DLX
- thirty-two 32-bit general-purpose registers (GPRs), named R0, R1, ..., R31. The value of R0 is always 0.
- thirty-two floating-point registers (FPRs), which can be used as 32 single-precision (32-bit) registers or as even-odd pairs holding double-precision values. Thus, the 64-bit FPRs are named F0, F2, ..., F30.
- a few special registers, which can be transferred to and from the integer registers.
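The register conventions above can be sketched in a few lines of Python. This is only an illustrative model, not part of any DLX tool: the names RegisterFile and fp_pair are hypothetical, and truncating written values to 32 bits is an assumption of the sketch.

```python
class RegisterFile:
    """Illustrative model of the 32 GPRs; R0 always reads as 0."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, n):
        return 0 if n == 0 else self._regs[n]

    def write(self, n, value):
        if n != 0:                               # writes to R0 have no effect
            self._regs[n] = value & 0xFFFFFFFF   # keep only 32 bits (assumption)


def fp_pair(n):
    """Return the even-odd pair of single-precision FPRs behind double Fn."""
    if n % 2 != 0 or not 0 <= n <= 30:
        raise ValueError("double-precision registers are F0, F2, ..., F30")
    return (n, n + 1)
```

For example, fp_pair(4) gives the pair (4, 5), meaning the double-precision register F4 occupies F4 and F5.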

Data types for DLX
For integer data:
- 8-bit bytes
- 16-bit half words
- 32-bit words
For floating point:
- 32-bit single precision
- 64-bit double precision
The DLX operations work on 32-bit integers and on 32- or 64-bit floating point. Bytes and half words are loaded into registers with either zeros or the sign bit replicated to fill the 32 bits of the register.
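The two ways of filling a 32-bit register from a byte or half word can be sketched as follows. The function names are hypothetical and the code is only a model of the behavior described above.

```python
def zero_extend(value, bits):
    """Fill the upper bits of the 32-bit register with zeros."""
    return value & ((1 << bits) - 1)


def sign_extend(value, bits):
    """Replicate the sign bit of a `bits`-wide value across a 32-bit register."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    # XOR-and-subtract turns the unsigned field into a signed integer,
    # which is then wrapped back into 32 bits.
    return ((value ^ sign) - sign) & 0xFFFFFFFF
```

With the byte 0x80, zero extension leaves 0x00000080 in the register, while sign extension produces 0xFFFFFF80.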

Memory
- byte addressable
- Big Endian mode
- 32-bit address
- two addressing modes (immediate and displacement). Register deferred and absolute addressing with a 16-bit field are accomplished using R0: a displacement of 0 gives register deferred addressing, and R0 as the base register gives absolute addressing.
- memory references are loads and stores between memory and the GPRs or FPRs
- memory accesses involving the GPRs can be to a byte, to a half word, or to a word
- all memory accesses must be aligned
- there are instructions for moving between an FPR and a GPR
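A minimal sketch of big-endian, aligned memory access, assuming memory is modeled as a Python bytearray. The helper names check_access, load, and store are hypothetical; raising an error on a misaligned access is a choice of this sketch, since the text only states that accesses must be aligned.

```python
def check_access(address, size):
    """Reject sizes other than byte/half word/word and misaligned addresses."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes")
    if address % size != 0:
        raise ValueError("misaligned %d-byte access at address %d" % (size, address))


def store(memory, address, size, value):
    """Write the low `size` bytes of value, most significant byte first."""
    check_access(address, size)
    mask = (1 << (8 * size)) - 1
    memory[address:address + size] = (value & mask).to_bytes(size, "big")


def load(memory, address, size):
    """Read `size` bytes as an unsigned big-endian integer."""
    check_access(address, size)
    return int.from_bytes(memory[address:address + size], "big")
```

After storing the word 0x11223344 at address 4, the byte at address 4 is 0x11 (the most significant byte comes first), and the half word at address 6 is 0x3344.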

Instructions
- 32 bits (fixed length)
- must be aligned
The complete list of instructions is given under DLX Instruction Set.

Operations
There are four classes of instructions:

Load/Store
Any of the GPRs or FPRs may be loaded and stored, except that loading R0 has no effect.

ALU Operations
All ALU instructions are register-register instructions. The operations are:
- add
- subtract
- AND
- OR
- XOR
- shifts
Compare instructions compare two registers (=, !=, <, >, <=, >=). If the condition is true, these instructions place a 1 in the destination register; otherwise they place a 0.

Branches/Jumps
All branches are conditional. The branch condition is specified by the instruction, which may test the source register for zero or nonzero.

Floating-Point Operations
- add
- subtract
- multiply
- divide
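The interplay between compare instructions and branches can be illustrated with a short sketch: a compare leaves 1 or 0 in a register, and a branch then tests that register for zero or nonzero. The function names are hypothetical, operands are modeled as ordinary signed Python integers, and the mnemonics BEQZ and BNEZ are the ones listed in the instruction set table.

```python
import operator

# The six relations a compare instruction may test.
SET_CONDITIONS = {
    "EQ": operator.eq, "NE": operator.ne,
    "LT": operator.lt, "GT": operator.gt,
    "LE": operator.le, "GE": operator.ge,
}


def set_conditional(cond, a, b):
    """Return 1 if the relation holds between a and b, otherwise 0."""
    return 1 if SET_CONDITIONS[cond](a, b) else 0


def branch_taken(opcode, value):
    """Branches test a single source register for zero or nonzero."""
    if opcode == "BEQZ":
        return value == 0
    if opcode == "BNEZ":
        return value != 0
    raise ValueError("unknown branch opcode: " + opcode)
```

A test such as "branch if R1 < R2" therefore takes two instructions: a compare that sets a register, followed by a BNEZ on that register.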

DLX Instruction Set
Complete list of the instructions in DLX.

Instruction type/opcode | Instruction meaning

Data transfers | Move data between registers and memory, or between the integer and FP or special registers; the only memory address mode is 16-bit displacement + contents of a GPR
LB, LBU, SB | Load byte, load byte unsigned, store byte
LH, LHU, SH | Load halfword, load halfword unsigned, store halfword
LW, SW | Load word, store word (to/from integer registers)
LF, LD, SF, SD | Load SP float, load DP float, store SP float, store DP float (SP - single precision, DP - double precision)
MOVI2S, MOVS2I | Move from/to GPR to/from a special register
MOVF, MOVD | Copy one floating-point register or a DP pair to another register or pair
MOVFP2I, MOVI2FP | Move 32 bits from/to FP register to/from integer registers

Arithmetic / Logical | Operations on integer or logical data in GPRs; signed arithmetic traps on overflow
ADD, ADDI, ADDU, ADDUI | Add, add immediate (all immediates are 16 bits); signed and unsigned
SUB, SUBI, SUBU, SUBUI | Subtract, subtract immediate; signed and unsigned
MULT, MULTU, DIV, DIVU | Multiply and divide, signed and unsigned; operands must be floating-point registers; all operations take and yield 32-bit values
AND, ANDI | And, and immediate
OR, ORI, XOR, XORI | Or, or immediate, exclusive or, exclusive or immediate
LHI | Load high immediate - loads upper half of register with immediate
SLL, SRL, SRA, SLLI, SRLI, SRAI | Shifts: both immediate (S__I) and variable form (S__); shifts are shift left logical, right logical, right arithmetic
S__, S__I | Set conditional: "__" may be LT, GT, LE, GE, EQ, NE

Control | Conditional branches and jumps; PC-relative or through register
BEQZ, BNEZ | Branch GPR equal/not equal to zero; 16-bit offset from PC
BFPT, BFPF | Test comparison bit in the FP status register and branch; 16-bit offset from PC
J, JR | Jumps: 26-bit offset from PC (J) or target in register (JR)
JAL, JALR | Jump and link: save PC+4 to R31, target is PC-relative (JAL) or a register (JALR)
TRAP | Transfer to operating system at a vectored address
RFE | Return to user code from an exception; restore user mode

Floating point | Floating-point operations on DP and SP formats
ADDD, ADDF | Add DP, SP numbers
SUBD, SUBF | Subtract DP, SP numbers
MULTD, MULTF | Multiply DP, SP floating point
DIVD, DIVF | Divide DP, SP floating point
CVTF2D, CVTF2I, CVTD2F, CVTD2I, CVTI2F, CVTI2D | Convert instructions: CVTx2y converts from type x to type y, where x and y are one of I (Integer), D (Double precision), or F (Single precision). Both operands are in the FP registers.
__D, __F | DP and SP compares: "__" may be LT, GT, LE, GE, EQ, NE; set comparison bit in FP status register.
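Several table entries are patterns rather than single opcodes (S__, S__I, __D, __F, CVTx2y). As a small illustration, the patterns can be expanded mechanically; the function names below are hypothetical and the sketch only spells out the mnemonics implied by the table.

```python
from itertools import permutations

CONDITIONS = ["LT", "GT", "LE", "GE", "EQ", "NE"]


def set_mnemonics():
    """Expand S__ (register form) and S__I (immediate form)."""
    register_forms = ["S" + c for c in CONDITIONS]
    immediate_forms = ["S" + c + "I" for c in CONDITIONS]
    return register_forms + immediate_forms


def fp_compare_mnemonics():
    """Expand __D and __F, the double- and single-precision compares."""
    return [c + suffix for c in CONDITIONS for suffix in ("D", "F")]


def convert_mnemonics():
    """Expand CVTx2y for every ordered pair of distinct types F, D, I."""
    return ["CVT" + x + "2" + y for x, y in permutations("FDI", 2)]
```

Calling convert_mnemonics() yields exactly the six convert instructions named in the table.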

Addressing Modes
Addressing modes are the ways in which architectures specify the address of an object they want to access. In GPR machines, an addressing mode can specify a constant, a register, or a location in memory.

The most common names for addressing modes (names may differ among architectures):

Addressing mode | Example instruction | Meaning | When used
Register | Add R4, R3 | R4 <- R4 + R3 | When a value is in a register
Immediate | Add R4, #3 | R4 <- R4 + 3 | For constants
Displacement | Add R4, 100(R1) | R4 <- R4 + M[100+R1] | Accessing local variables
Register deferred | Add R4, (R1) | R4 <- R4 + M[R1] | Accessing using a pointer or a computed address
Indexed | Add R3, (R1 + R2) | R3 <- R3 + M[R1+R2] | Useful in array addressing: R1 - base of array, R2 - index amount
Direct | Add R1, (1001) | R1 <- R1 + M[1001] | Useful in accessing static data
Memory deferred | Add R1, @(R3) | R1 <- R1 + M[M[R3]] | If R3 is the address of a pointer p, then the mode yields *p
Autoincrement | Add R1, (R2)+ | R1 <- R1 + M[R2]; R2 <- R2 + d | Useful for stepping through arrays in a loop. R2 - start of array, d - size of an element
Autodecrement | Add R1, -(R2) | R2 <- R2 - d; R1 <- R1 + M[R2] | Same as autoincrement. Both can also be used to implement a stack as push and pop
Scaled | Add R1, 100(R2)[R3] | R1 <- R1 + M[100+R2+R3*d] | Used to index arrays. May be applied to any base addressing mode in some machines.
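DLX itself offers only displacement addressing for memory, so register deferred and absolute (direct) addressing are special cases of one address calculation. The sketch below illustrates this, with a second helper for the scaled mode of the table. The function names are hypothetical and the 32-bit wraparound is an assumption of the sketch.

```python
def effective_address(displacement, base_reg, regs):
    """Displacement mode: displacement + contents of a GPR (R0 reads as 0)."""
    base = 0 if base_reg == 0 else regs[base_reg]
    return (base + displacement) & 0xFFFFFFFF


def scaled_address(displacement, base, index, d):
    """Scaled mode from the table: displacement + base + index * d."""
    return (displacement + base + index * d) & 0xFFFFFFFF


regs = [0] * 32
regs[1] = 0x1000

displacement_ea = effective_address(100, 1, regs)     # 100(R1)
register_deferred_ea = effective_address(0, 1, regs)  # 0(R1), i.e. (R1)
absolute_ea = effective_address(1001, 0, regs)        # 1001(R0), i.e. (1001)
```

Using a displacement of 0 reproduces register deferred addressing, and using R0 as the base reproduces direct addressing, so no separate hardware modes are needed.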

An Implementation of DLX
This unpipelined implementation is not the most economical or the highest-performance implementation without pipelining. Instead, it is designed to lead naturally to a pipelined implementation. Implementing the instruction set requires the introduction of several temporary registers that are not part of the architecture. Every DLX instruction can be implemented in at most five clock cycles. The five clock cycles are:
- Instruction fetch cycle (IF)
- Instruction decode/register fetch cycle (ID)
- Execution/Effective address cycle (EX)
- Memory access/branch completion cycle (MEM)
- Write-back cycle (WB)

Instruction fetch cycle (IF):
IR <- Mem[PC]
NPC <- PC + 4
Operation:
- Send out the PC and fetch the instruction from memory into the instruction register (IR).
- Increment the PC by 4 to address the next sequential instruction.
- The IR holds the instruction that will be needed on subsequent clock cycles.
- The NPC holds the next sequential PC (program counter).

Instruction decode/register fetch cycle (ID):
A <- Regs[IR 6..10]
B <- Regs[IR 11..15]
Imm <- ((IR 16)^16 ## IR 16..31)
Here (IR 16)^16 denotes bit 16 of the IR replicated 16 times, and ## denotes concatenation.
Operation:
- Decode the instruction and access the register file to read the registers.
- The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles.
- The lower 16 bits of the IR are also sign-extended and stored into the temporary register Imm, for use in the next cycle.
- Decoding is done in parallel with reading registers, which is possible because these fields are at a fixed location in the DLX instruction format. This technique is known as fixed-field decoding.
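The field extraction of the ID cycle can be sketched as below. Bits are numbered with bit 0 as the most significant bit, which follows from IR 16..31 being the lower 16 bits. The names field and decode are hypothetical, and the immediate is returned as a signed Python integer rather than a 32-bit pattern.

```python
def field(ir, first, last):
    """Extract bits first..last of a 32-bit word, bit 0 being the MSB."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)


def decode(ir, regs):
    """Return the temporaries (A, B, Imm) produced by the ID cycle."""
    a = regs[field(ir, 6, 10)]
    b = regs[field(ir, 11, 15)]
    imm16 = field(ir, 16, 31)
    # Sign-extend: a set bit 16 of the IR means the immediate is negative.
    imm = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    return a, b, imm


regs = [0] * 32
regs[1], regs[2] = 10, 20
# Arbitrary example word: register fields 1 and 2, immediate 0xFFFC (-4).
ir = (1 << 21) | (2 << 16) | 0xFFFC
a, b, imm = decode(ir, regs)
```

Because the register fields sit at fixed positions, both registers can be read before the opcode has been fully decoded; an unneeded value is simply ignored later.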

Execution/Effective address cycle (EX):
The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the DLX instruction type.

Memory reference:
ALUOutput <- A + Imm
Operation: The ALU adds the operands to form the effective address and places the result into the register ALUOutput.

Register-Register ALU instruction:
ALUOutput <- A op B
Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register B. The result is placed in the register ALUOutput.

Register-Immediate ALU instruction:
ALUOutput <- A op Imm
Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register Imm. The result is placed in the register ALUOutput.

Branch:
ALUOutput <- NPC + Imm
Cond <- (A op 0)
Operation:
- The ALU adds the NPC to the sign-extended immediate value in Imm to compute the address of the branch target.
- Register A, which has been read in the prior cycle, is checked to determine whether the branch is taken.
- The comparison operation op is the relational operator determined by the branch opcode (e.g. op is "==" for the instruction BEQZ).
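The four EX functions can be summarized in one dispatch function. This is only a sketch: the name execute, the string tags for the instruction kinds, and passing the operation as a Python callable are all assumptions made for illustration.

```python
import operator


def execute(kind, a, b, imm, npc, op=None):
    """Model the EX cycle; returns the temporaries it writes."""
    if kind == "memory":        # effective address for a load or store
        return {"ALUOutput": a + imm}
    if kind == "reg-reg":       # A op B
        return {"ALUOutput": op(a, b)}
    if kind == "reg-imm":       # A op Imm
        return {"ALUOutput": op(a, imm)}
    if kind == "branch":        # branch target and branch condition
        return {"ALUOutput": npc + imm, "Cond": op(a, 0)}
    raise ValueError("unknown instruction kind: " + kind)


# A BEQZ whose register A holds 0, with NPC = 104 and Imm = -8.
result = execute("branch", a=0, b=0, imm=-8, npc=104, op=operator.eq)
```

Here result holds the branch target 96 in ALUOutput and a true Cond, so the branch would be taken in the following cycle.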

Memory access/branch completion cycle (MEM):
The only DLX instructions active in this cycle are loads, stores, and branches.

Memory reference:
LMD <- Mem[ALUOutput] or Mem[ALUOutput] <- B
Operation:
- Access memory if needed.
- If the instruction is a load, data returns from memory and is placed in the LMD (load memory data) register.
- If the instruction is a store, data from the B register is written into memory.
- In either case the address used is the one computed during the prior cycle and stored in the register ALUOutput.

Branch:
if (cond) PC <- ALUOutput else PC <- NPC
Operation:
- If the instruction branches, the PC is replaced with the branch destination address in the register ALUOutput.
- Otherwise, the PC is replaced with the incremented PC in the register NPC.
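A sketch of the MEM cycle, with memory modeled as a Python dict from addresses to values. The name mem_stage and the instruction tags are hypothetical; treating every non-branch instruction as continuing at NPC is an assumption of the sketch.

```python
def mem_stage(instr, alu_output, b, npc, cond, memory):
    """Model the MEM cycle; returns (LMD, new PC)."""
    lmd = None
    pc = npc                          # default: next sequential instruction
    if instr == "load":
        lmd = memory[alu_output]      # LMD <- Mem[ALUOutput]
    elif instr == "store":
        memory[alu_output] = b        # Mem[ALUOutput] <- B
    elif instr == "branch" and cond:
        pc = alu_output               # taken branch: PC <- ALUOutput
    return lmd, pc


memory = {100: 42}
loaded = mem_stage("load", 100, 0, 8, False, memory)
taken = mem_stage("branch", 40, 0, 8, True, memory)
not_taken = mem_stage("branch", 40, 0, 8, False, memory)
```

The load returns the value 42 in LMD, the taken branch continues at address 40, and the untaken branch continues at NPC.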

Write-back cycle (WB):
Register-Register ALU instruction:
Regs[IR 16..20] <- ALUOutput
Register-Immediate ALU instruction:
Regs[IR 11..15] <- ALUOutput
Load instruction:
Regs[IR 11..15] <- LMD
Operation:
- Write the result into the register file, whether it comes from memory (LMD) or from the ALU (ALUOutput).
- The register destination field is in one of two positions depending on the opcode.
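The choice between the two destination field positions can be sketched as follows. The names field and write_back and the instruction tags are hypothetical; bits are numbered with bit 0 as the most significant bit, and writes to R0 are dropped because R0 is always 0.

```python
def field(ir, first, last):
    """Extract bits first..last of a 32-bit word, bit 0 being the MSB."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)


def write_back(kind, ir, alu_output, lmd, regs):
    """Model the WB cycle: pick the destination field and the result source."""
    if kind == "reg-reg":
        dest, value = field(ir, 16, 20), alu_output
    elif kind == "reg-imm":
        dest, value = field(ir, 11, 15), alu_output
    elif kind == "load":
        dest, value = field(ir, 11, 15), lmd
    else:
        return                        # stores and branches write no register
    if dest != 0:                     # R0 stays 0
        regs[dest] = value


regs = [0] * 32
write_back("reg-reg", 3 << 11, 99, None, regs)   # destination in bits 16..20
write_back("load", 6 << 16, 0, 55, regs)         # destination in bits 11..15
```

After these two calls, R3 holds the ALU result 99 and R6 holds the loaded value 55.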

The Basic Pipeline for DLX
We can pipeline the DLX datapath with almost no changes by starting a new instruction on each clock cycle. Each of the clock cycles of the DLX datapath now becomes a pipe stage: a cycle in the pipeline. While each instruction takes five clock cycles to complete, during each clock cycle the hardware will initiate a new instruction and will be executing some part of five different instructions. The typical way to show what is going on is:

              Clock cycle
Instr Num     1    2    3    4    5    6    7    8    9
instr i       IF   ID   EX   MEM  WB
instr i+1          IF   ID   EX   MEM  WB
instr i+2               IF   ID   EX   MEM  WB
instr i+3                    IF   ID   EX   MEM  WB
instr i+4                         IF   ID   EX   MEM  WB

Pipeline Hazards

There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes of hazards:
- Structural Hazards. They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
- Data Hazards. They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
- Control Hazards. They arise from the pipelining of branches and other instructions that change the PC.

Hazards in pipelines can make it necessary to stall the pipeline. The processor can stall on different events:
- A cache miss. A cache miss stalls all the instructions in the pipeline, both before and after the instruction causing the miss.
- A hazard in the pipeline. Eliminating a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed. When an instruction is stalled, all the instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction must continue, since otherwise the hazard will never clear.
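How stalls eat into the ideal speedup can be made concrete with a little arithmetic. The sketch below assumes a five-stage pipeline that starts one instruction per cycle and, as an upper bound, an unpipelined machine that spends five cycles on every instruction; the function names are hypothetical.

```python
def unpipelined_cycles(n, cycles_per_instr=5):
    """Upper bound: every instruction takes all five clock cycles."""
    return n * cycles_per_instr


def pipelined_cycles(n, stalls=0, depth=5):
    """The first instruction needs `depth` cycles, each later one adds a cycle."""
    if n == 0:
        return 0
    return depth + (n - 1) + stalls


def speedup(n, stalls=0):
    """Ratio of unpipelined to pipelined execution time, in clock cycles."""
    return unpipelined_cycles(n) / pipelined_cycles(n, stalls)
```

For five instructions without stalls, pipelined_cycles(5) is 9, matching the nine clock cycles of the basic pipeline. Every stall cycle adds directly to the denominator, which is why hazards pull the speedup below its ideal value.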

Data Hazards
A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine. Consider the pipelined execution of these instructions:

                   1    2    3      4      5      6      7    8    9
ADD R1, R2, R3     IF   ID   EX     MEM    WB
SUB R4, R5, R1          IF   IDsub  EX     MEM    WB
AND R6, R1, R7               IF     IDand  EX     MEM    WB
OR  R8, R1, R9                      IF     IDor   EX     MEM  WB
XOR R10, R1, R11                           IF     IDxor  EX   MEM  WB

The subscripted ID stages mark where each later instruction reads R1. ADD writes R1 in its WB stage (cycle 5), while SUB reads R1 in its ID stage (cycle 3), so SUB would read the old value of R1 unless the hazard is handled.
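The timing argument can be checked mechanically. The sketch below assumes an unstalled five-stage pipeline in which instruction number k (counting from 0) reads its registers in cycle k + 2 (ID) and writes its result in cycle k + 5 (WB). The function name and the split_cycle_regfile flag are hypothetical; the flag models the assumption that a register written and read in the same cycle delivers the new value.

```python
def raw_hazards(program, split_cycle_regfile=False):
    """List (producer, consumer, register) triples that read a stale value.

    program is a list of (opcode, dest, sources) tuples.
    """
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        wb_cycle = i + 5                      # producer writes dest here
        for j in range(i + 1, len(program)):
            _, later_dest, sources = program[j]
            id_cycle = j + 2                  # consumer reads its sources here
            too_early = id_cycle < wb_cycle or (
                id_cycle == wb_cycle and not split_cycle_regfile)
            if too_early and dest != "R0" and dest in sources:
                hazards.append((i, j, dest))
            if later_dest == dest:            # dest redefined: later readers
                break                         # depend on instruction j instead
    return hazards


program = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R5", "R1")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
    ("XOR", "R10", ("R1", "R11")),
]
hazards = raw_hazards(program)
```

For the sequence above, SUB and AND always read R1 too early, OR does so only when same-cycle write and read is not supported, and XOR reads R1 after it has been written.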

Control Hazards
Control hazards can cause a greater performance loss for the DLX pipeline than data hazards. When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4. If a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken. If instruction i is a taken branch, then the PC is normally not changed until the end of the MEM stage, after the completion of the address calculation and comparison (see diagram). The simplest method of dealing with branches is to stall the pipeline as soon as the branch is detected until we reach the MEM stage, which determines the new PC. The pipeline behavior looks like:

Branch                IF   ID         EX     MEM    WB
Branch successor           IF(stall)  stall  stall  IF   ID   EX   MEM  WB
Branch successor+1                                       IF   ID   EX   MEM  WB
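In the stall scheme above, the branch successor could ideally be fetched in cycle 2 but is usefully fetched only in cycle 5, so each branch costs three clock cycles. A rough sketch of the effect on cycles per instruction follows; the function name is hypothetical, an ideal CPI of 1 is assumed, and the branch fraction is a free parameter, not a measured value.

```python
BRANCH_STALL_CYCLES = 3   # useful fetch of the successor moves from cycle 2 to 5


def cpi_with_branch_stalls(branch_fraction, penalty=BRANCH_STALL_CYCLES,
                           ideal_cpi=1.0):
    """Average cycles per instruction when every branch stalls the pipeline."""
    return ideal_cpi + branch_fraction * penalty


# Hypothetical workload in which one instruction in five is a branch.
example_cpi = cpi_with_branch_stalls(0.2)
```

Under these assumptions the hypothetical workload needs about 1.6 cycles per instruction instead of 1, which illustrates why control hazards can be costly.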
