INSTRUCTION LEVEL PARALLELISM

INTRODUCTION
Computer designers and computer architects have been striving to improve uniprocessor performance since the first computer was designed. The most significant advances in uniprocessor performance have come from exploiting advances in implementation technology. Architectural innovations have also played a part, and one of the most significant of these over the last decade has been the rediscovery of RISC architectures. Now that RISC architectures have gained acceptance in both scientific and marketing circles, computer architects have been thinking of new ways to improve uniprocessor performance. Many of these proposals, such as VLIW, superscalar, and even relatively old ideas such as vector processing, try to improve computer performance by exploiting instruction-level parallelism. They take advantage of this parallelism by issuing more than one instruction per cycle, either explicitly (as in VLIW or superscalar machines) or implicitly (as in vector machines). In this paper we limit ourselves to improving uniprocessor performance and do not discuss methods of improving application performance by using multiple processors in parallel. The amount of instruction-level parallelism varies widely depending on the type of code being executed. When we consider uniprocessor performance improvements due to the exploitation of instruction-level parallelism, it is important to keep in mind the type of application environment. If the applications are dominated by highly parallel code (e.g., weather forecasting), any of a number of different parallel computers (e.g., vector, MIMD) would improve application performance. However, if the dominant applications have little instruction-level parallelism (e.g., compilers, editors, event-driven simulators, Lisp interpreters), the performance improvements will be much smaller.
ILP HISTORY
Instruction-level parallelism (ILP) is a critical technique in computer architecture for processor and compiler design. ILP can improve program execution performance by causing individual machine operations to execute in parallel. ILP appeared in the field of computer design about 30 years ago; however, it did not play an important role in computer architecture design until the 1980s. With the rapid development of computer architecture technologies, ILP came to play a major role in the design of computer systems, including both the hardware and the software. By now, more CPU manufacturers have incorporated ILP into their CPUs, and a range of new hardware and software techniques for ILP have become popular research topics. When we talk about ILP, we mean how many instructions can be executed or issued at one time. Although hardware support is needed to exploit ILP, the amount of available ILP is largely exposed by the compiler before execution, so the available ILP in a program has become one of the central topics in compiler design in recent years. The study of how much instruction-level parallelism actually exists in programs is an interesting field in processor architecture: it aims to boost the performance of a single processor by overlapping the execution of multiple instructions, using parallel processing models such as VLIW, superscalar, etc. Such a study attempts to measure the available parallelism in a program and tries to indicate whether the performance bottleneck is insufficient parallelism in the instruction stream. Its results can guide the reduction of instruction dependencies in the program through appropriate compiler optimizations. The study of ILP therefore greatly helps improve the performance of today's processors, and it can also make a program's ILP independent of the machine architecture. Program parallelism is very different from machine parallelism: if the program parallelism is low relative to the machine parallelism, overall performance is limited by the program parallelism.
PARALLELISM
With the era of increasing processor speeds slowly coming to an end, computer architects are exploring new ways of increasing throughput. One of the most promising is to look for and exploit different types of parallelism in code.
TYPES OF PARALLELISM
There are three main types of parallelism.

Instruction Level Parallelism
Instruction level parallelism (ILP) takes advantage of sequences of instructions that require different functional units (such as the load unit, ALU, FP multiplier, etc.). Different architectures approach this in different ways, but the idea is to have these non-dependent instructions executing simultaneously to keep the functional units busy as often as possible.

Data Level Parallelism
Data level parallelism (DLP) is more of a special case than instruction level parallelism. DLP refers to the act of performing the same operation on multiple data items simultaneously. A classic example of DLP is performing an operation on an image in which processing each pixel is independent of the ones around it (such as brightening). This type of image processing lends itself well to having multiple pixels modified simultaneously using the same modification function. Other types of operations that allow the exploitation of DLP are matrix, array, and vector processing.
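As a concrete illustration, the brightening example above can be written as a loop in which every pixel update is independent of the others; a minimal C sketch, assuming an 8-bit grayscale image and a saturating add:

    #include <stdint.h>
    #include <stddef.h>

    /* Brighten an 8-bit grayscale image by 'amount', saturating at 255.
     * Every iteration is independent of the others, so vectorizing
     * compilers or SIMD hardware can process many pixels at once. */
    void brighten(uint8_t *pixels, size_t n, uint8_t amount) {
        for (size_t i = 0; i < n; i++) {
            unsigned v = pixels[i] + amount;
            pixels[i] = (v > 255) ? 255 : (uint8_t)v;
        }
    }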
Thread Level Parallelism
Thread level parallelism (TLP) is the act of running multiple flows of execution of a single process simultaneously. TLP is most often found in applications that need to run independent, unrelated tasks (such as computing, memory accesses, and IO) simultaneously. These types of applications are often found on machines that have a high workload, such as web servers. TLP is a popular ground for current research due to the rising popularity of multi-core and multiprocessor systems, which allow different threads to truly execute in parallel.
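A minimal POSIX-threads sketch of TLP, running two unrelated placeholder tasks that a busy server might overlap (compile with -pthread; the task bodies are illustrative assumptions):

    #include <pthread.h>
    #include <stdio.h>

    /* Two unrelated tasks that a web-server-like workload might run at once. */
    static void *compute_task(void *arg) { (void)arg; puts("computing");         return NULL; }
    static void *io_task(void *arg)      { (void)arg; puts("serving a request"); return NULL; }

    int main(void) {
        pthread_t t1, t2;
        /* On a multi-core machine the two threads may truly run in parallel. */
        pthread_create(&t1, NULL, compute_task, NULL);
        pthread_create(&t2, NULL, io_task, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }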
INSTRUCTION LEVEL PARALLELISM DEFINITION
Abbreviated as ILP, Instruction-Level Parallelism is a measurement of the number of operations that can be performed simultaneously in a computer program. Microprocessors exploit ILP by executing multiple instructions from a single program in a single cycle.
EXPLANATION
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. Consider the following program:
1. e = a + b
2. f = c + d
3. g = e * f

Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously. If we assume that each operation can be completed in one unit of time, then these three instructions can be completed in a total of two units of time, giving an ILP of 3/2. A goal of compiler and processor designers is to identify and take advantage of as much ILP as possible. Ordinary programs are typically written under a sequential execution model where instructions execute one after the other and in the order specified by the programmer. ILP allows the compiler and the processor to overlap the execution of multiple instructions or even to change the order in which instructions are executed. How much ILP exists in programs is very application specific. In certain fields, such as graphics and scientific computing, the amount can be very large. However, workloads such as cryptography exhibit much less parallelism.
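The same example, written as a small C program, shows which statements are independent; the variable values and the assumption of one cycle per operation are purely illustrative:

    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2, c = 3, d = 4;

        int e = a + b;   /* operation 1: independent of operation 2        */
        int f = c + d;   /* operation 2: can issue in the same cycle as 1  */
        int g = e * f;   /* operation 3: must wait for e and f (true dep.) */

        /* 3 operations finish in 2 cycles on an ideal machine -> ILP = 3/2 */
        printf("g = %d, ILP = %.1f\n", g, 3.0 / 2.0);
        return 0;
    }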
DATA DEPENDENCY
A data dependency in computer science is a situation in which a program statement (instruction) refers to the data of a preceding statement. Dependences are a property of programs. If two instructions are data dependent, they cannot execute simultaneously. A dependence results in a hazard, and the hazard causes a stall. Data dependences may occur through registers or memory.
TYPES OF DEPENDENCY
1. Name dependence
Two instructions use the same register/memory location (name), but there is no flow of data. There are two further kinds of name dependency:
1) Anti-dependence
2) Output dependence

Output dependence
An output dependency occurs when the ordering of instructions will affect the final output value of a variable. In the example below, there is an output dependency between instructions 3 and 1; changing the ordering of instructions in this example will change the final value of A, thus these instructions cannot be executed in parallel.
1. A = 2 * X
2. B = A / 3
3. A = 9 * Y
Anti-dependence
An anti-dependency occurs when an instruction requires a value that is later updated. In the following example, instruction 3 anti-depends on instruction 2: the ordering of these instructions cannot be changed, nor can they be executed in parallel (possibly changing the instruction ordering), as this would affect the final value of A.
1. B = 3
2. A = B + 1
3. B = 7
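Since an anti-dependence is only a name dependence, it can be removed by renaming the later write to a fresh location; a minimal C-style sketch, where b2 is a hypothetical renamed variable:

    int b = 3, a, b2;
    a  = b + 1;   /* still reads the old value of b (true dependence kept) */
    b2 = 7;       /* the later write goes to a renamed location, so it can
                     now move before the read or execute in parallel       */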
2. True dependence (data dependence)
A true dependency, also known as a data dependency, occurs when an instruction depends on the result of a previous instruction:
1. A = 3
2. B = A
3. C = B
Instruction 3 is truly dependent on instruction 2, as the final value of C depends on the instruction updating B. Instruction 2 is truly dependent on instruction 1, as the final value of B depends on the instruction updating A. Since instruction 3 is truly dependent upon instruction 2 and instruction 2 is truly dependent on instruction 1, instruction 3 is also truly dependent on instruction 1. Instruction level parallelism is therefore not an option in this example.
3. Control dependence
Any instruction B is control dependent on a preceding instruction A if A determines whether B should execute or not. In the following example, instruction 2 is control dependent on instruction 1.
1. if a == b goto AFTER
2. A = 2 * X
3. AFTER:
4. Resource dependence
An instruction is resource-dependent on a previously issued instruction if it requires a hardware resource which is still being used by that previously issued instruction, e.g.:
div r1, r2, r3
div r4, r2, r5
TYPES OF HAZARDS
1. Data hazards
A hazard is created whenever there is a dependence between instructions and they are close enough that the overlap caused by pipelining would change the order of access to an operand. In other words, data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Data hazards lower performance. The situation in which the next instruction depends on the result of the previous one occurs very often, and it means that these instructions cannot be executed together. There are three situations in which a data hazard can occur:
1. Read after write (RAW), a true dependency
2. Write after read (WAR)
3. Write after write (WAW)
Read After Write (RAW)
A RAW data hazard refers to a situation where an instruction refers to a result that has not yet been calculated or retrieved. It is the most common type of data hazard. It arises when the next instruction tries to read a source before the previous instruction writes to it, so the next instruction incorrectly gets the old value. For example:
1. R2 <- R1 + R3
2. R4 <- R2 + R3
The first instruction is calculating a value to be saved in register 2, and the second is going to use this value to compute a result for register 4. However, in a pipeline, when we fetch the operands for the second operation, the results from the first will not yet have been saved, and hence we have a data dependency. We say that instruction 2 has a data dependency, as it is dependent on the completion of instruction 1.

Write After Read (WAR)
A WAR data hazard represents a problem with concurrent execution. A WAR hazard arises when the next instruction writes to a destination before the previous instruction reads it; in this case, the previous instruction incorrectly gets the new value. For example:
1. R4 <- R1 + R3
2. R3 <- R1 + R2
If there is a chance that instruction 2 may complete before instruction 1 (i.e., with concurrent execution), we must ensure that we do not store the result to register 3 before instruction 1 has had a chance to fetch its operands.

Write After Write (WAW)
A WAW Data Hazard is another situation which may occur in a concurrent execution environment.
For example:
1. R2 <- R1 + R2
2. R2 <- R4 + R7
We must delay the WB (write-back) of instruction 2 until instruction 1 has executed.
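The three data-hazard cases can be summarized by comparing the source and destination registers of a pair of instructions. A minimal C sketch, assuming a simplified instruction format with one destination and two sources:

    #include <stdio.h>

    /* Simplified instruction: one destination register, two source registers. */
    struct insn { int dst, src1, src2; };

    /* Classify the hazard the later instruction has on the earlier one. */
    const char *hazard(struct insn earlier, struct insn later) {
        if (later.src1 == earlier.dst || later.src2 == earlier.dst)
            return "RAW (true dependence)";
        if (later.dst == earlier.src1 || later.dst == earlier.src2)
            return "WAR (anti-dependence)";
        if (later.dst == earlier.dst)
            return "WAW (output dependence)";
        return "none";
    }

    int main(void) {
        struct insn i1 = { 2, 1, 3 };   /* R2 <- R1 + R3              */
        struct insn i2 = { 4, 2, 3 };   /* R4 <- R2 + R3: reads R2    */
        printf("%s\n", hazard(i1, i2)); /* prints the RAW case        */
        return 0;
    }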
2. Structural hazards
A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A canonical example is a single memory unit that is accessed both in the fetch stage, where an instruction is retrieved from memory, and in the memory stage, where data is written to and/or read from memory.

3. Control hazards (branch hazards)
Branching hazards (also known as control hazards) occur with branches. On many instruction pipeline microarchitectures, the processor will not know the outcome of a branch when it needs to insert a new instruction into the pipeline.
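Dynamic branch prediction (listed among the ILP techniques below) is a common way to reduce these control-hazard stalls. A minimal sketch of a classic 2-bit saturating-counter predictor; the table size and index function are illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define PHT_ENTRIES 1024   /* pattern history table size (illustrative) */

    /* 2-bit saturating counters: 0,1 = predict not-taken; 2,3 = predict taken. */
    static uint8_t pht[PHT_ENTRIES];

    static bool predict(uint32_t pc) {
        return pht[(pc >> 2) % PHT_ENTRIES] >= 2;
    }

    /* Called when the branch actually resolves, to train the counter. */
    static void update(uint32_t pc, bool taken) {
        uint8_t *ctr = &pht[(pc >> 2) % PHT_ENTRIES];
        if (taken) { if (*ctr < 3) (*ctr)++; }
        else       { if (*ctr > 0) (*ctr)--; }
    }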
HARDWARE SUPPORT FOR ILP
There are several mechanisms for supporting ILP in hardware: first, using multiple parallel functional units, and second, pipelining the functional units. We can also use dynamic scheduling, such as scoreboarding or the Tomasulo approach, to reduce the effect of dependences in the program. Superscalar and VLIW are the basic hardware techniques for exploiting ILP. Superscalar and VLIW processors can potentially provide large performance improvements over their scalar predecessors by providing multiple data paths and functional units. These parallel resources are exploited by concurrently executing independent instructions from the instruction stream. However, conditional branch instructions pose difficult problems for all types of processors that exploit ILP. Recent studies have shown that by using conventional code optimization and scheduling methods, superscalar and VLIW processors cannot produce a sustained speedup of more than two for nonnumeric programs. For such programs, conventional architectural and compilation methods do not provide enough support to utilize these processors.
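As an illustration of dynamic scheduling, a scoreboard's issue stage only lets an instruction proceed when its functional unit is free (no structural hazard) and no in-flight instruction writes the same destination (no WAW hazard). A minimal C sketch with deliberately simplified data structures:

    #include <stdbool.h>

    #define NUM_FUS     4    /* number of functional units (illustrative) */
    #define MAX_ACTIVE 16

    struct active_insn { bool valid; int dst; };  /* issued, not yet completed */

    static bool fu_busy[NUM_FUS];
    static struct active_insn active[MAX_ACTIVE];

    /* Scoreboard issue check: stall unless the functional unit is free and
     * no in-flight instruction has the same destination register. */
    static bool can_issue(int fu, int dst) {
        if (fu_busy[fu])                       /* structural hazard */
            return false;
        for (int i = 0; i < MAX_ACTIVE; i++)
            if (active[i].valid && active[i].dst == dst)
                return false;                  /* WAW hazard */
        return true;
    }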
COMPILER SUPPORT FOR ILP
ILP is exploited by both the compiler and the hardware; the compiler exposes the inherent, implicit ILP in a program to the hardware through compile-time optimization. There are many compiler techniques for extracting the available ILP in programs: instruction scheduling, register allocation and renaming, loop unrolling, control-flow analysis and optimization (branch prediction and speculation), memory access optimization, and value prediction.
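Loop unrolling is one of the simplest of these techniques: replicating the loop body exposes several independent operations per iteration that the hardware can overlap. A small C sketch; the unroll factor of 4 and the loop body are illustrative assumptions:

    /* Original loop: one multiply per iteration, plus loop overhead. */
    void scale(float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] * 2.0f;
    }

    /* Unrolled by 4: the four multiplies are independent of each other,
     * so a superscalar or VLIW machine can issue them in parallel. */
    void scale_unrolled(float *a, const float *b, int n) {
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            a[i]     = b[i]     * 2.0f;
            a[i + 1] = b[i + 1] * 2.0f;
            a[i + 2] = b[i + 2] * 2.0f;
            a[i + 3] = b[i + 3] * 2.0f;
        }
        for (; i < n; i++)          /* clean-up loop for the remainder */
            a[i] = b[i] * 2.0f;
    }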
TECHNIQUES FOR ILP

Technique                                     Reduces
Forwarding and bypassing                      Potential data hazard stalls
Delayed branches and branch scheduling        Control hazard stalls
Basic dynamic scheduling (scoreboarding)      Data hazards from true dependences
Dynamic scheduling with renaming              Data hazards from anti- and output dependences
Dynamic branch prediction                     Control stalls
Speculation                                   Data and control hazard stalls
Dynamic memory disambiguation                 Data hazard stalls with memory
ILP Architectures
The information embedded in a program pertaining to the available parallelism between its instructions and operations is referred to as the ILP architecture. An ILP architecture is a contract (the instruction format and the interpretation of the bits that constitute an instruction) between the class of programs that are written for the architecture and the set of processor implementations of that architecture.
ILP Architectures Classifications
Sequential architectures: the program is not expected to convey any explicit information regarding parallelism (superscalar processors).
Dependence architectures: the program explicitly indicates the dependences that exist between operations (dataflow processors).
Independence architectures: the program provides information as to which operations are independent of one another (VLIW processors).
Sequential architecture and superscalar processors
The program contains no explicit information regarding the dependences that exist between instructions, so dependences must be determined by the hardware. It is only necessary to determine dependences with sequentially preceding instructions that have been issued but not yet completed. The compiler may re-order instructions to facilitate the hardware's task of extracting parallelism.
Superscalar Processors
A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate.
Dependence architecture and dataflow processors
The compiler (or programmer) identifies the parallelism in the program and communicates it to the hardware by specifying the dependences between operations. The hardware determines at run time when each operation is independent of the others and performs the scheduling; no scanning of a sequential program is needed to determine dependences. The objective is to execute each instruction at the earliest possible time, as soon as its input operands and a functional unit are available.
Independence architecture and VLIW processors
By knowing which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle. Since the set of independent operations is far larger than the set of dependent operations, only a subset of the independent operations is specified. The compiler may additionally specify on which functional unit and in which cycle an operation is executed, so the hardware needs to make no run-time decisions.
VLIW Processor
Very long instruction word (VLIW) refers to a CPU architecture designed to take advantage of instruction level parallelism (ILP). A processor that executes every instruction one after the other (i.e., a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance. The performance can be improved by executing different sub-steps of sequential instructions simultaneously (this is pipelining), or even executing multiple instructions entirely simultaneously, as in superscalar architectures. Further improvement can be achieved by executing instructions in an order different from the order in which they appear in the program; this is called out-of-order execution.
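Conceptually, a VLIW instruction bundles one operation per functional-unit slot, and the compiler, not the hardware, decides what goes in each slot. A toy C sketch of such a bundle; the three-slot layout and opcode set are invented purely for illustration:

    /* Toy VLIW bundle: one slot per functional unit, filled by the compiler.
     * The slot layout and opcode values are illustrative only. */
    enum opcode { NOP, ADD, MUL, LOAD, STORE, BRANCH };

    struct slot { enum opcode op; int dst, src1, src2; };

    struct vliw_bundle {
        struct slot alu;     /* integer ALU slot */
        struct slot mem;     /* load/store slot  */
        struct slot branch;  /* branch slot      */
    };

    /* Example: an ADD occupying only the ALU slot; unused slots hold NOPs. */
    static const struct vliw_bundle example = {
        .alu    = { ADD, 1, 2, 3 },
        .mem    = { NOP, 0, 0, 0 },
        .branch = { NOP, 0, 0, 0 },
    };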
Limits to ILP
- Hardware sophistication is one limit that arises when we use ILP in a system.
- Compiler sophistication is another.

Overcoming Limits
- Compiler technology must be advanced.
- Significantly new and different hardware techniques may be able to overcome limitations that were assumed in the studies.
- However, it is unlikely that such advances, coupled with realistic hardware, will overcome these limits in the near future.
Fields where ILP is being used
Micro-architectural techniques that are used to exploit ILP include:
- Instruction pipelining, where the execution of multiple instructions can be partially overlapped.
- Superscalar execution, VLIW, and the closely related Explicitly Parallel Instruction Computing concepts, in which multiple execution units are used to execute multiple instructions in parallel.
- Out-of-order execution, where instructions execute in any order that does not violate data dependencies. Note that this technique is independent of both pipelining and superscalar execution. Current implementations of out-of-order execution dynamically (i.e., while the program is executing and without any help from the compiler) extract ILP from ordinary programs. An alternative is to extract this parallelism at compile time and somehow convey this information to the hardware. Due to the complexity of scaling the out-of-order execution technique, the industry has re-examined instruction sets which explicitly encode multiple independent operations per instruction.
- Register renaming, a technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations, used to enable out-of-order execution (see the sketch after this list).
- Dataflow architectures, another class of architectures where ILP is explicitly specified, but they have not been actively researched since the 1980s.
- Speculative execution, which allows the execution of complete instructions or parts of instructions before it is certain whether this execution should take place. A commonly used form of speculative execution is control-flow speculation, where instructions past a control-flow instruction (e.g., a branch) are executed before the target of the control-flow instruction is determined.
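As a sketch of register renaming from the list above: each architectural destination is mapped to a fresh physical register, which removes WAR and WAW hazards while true dependences are preserved through the sources. The map-table scheme, the register counts, and the lack of free-list recycling are simplifying assumptions:

    #define ARCH_REGS  32
    #define PHYS_REGS 128   /* illustrative physical register file size */

    static int map_table[ARCH_REGS];   /* architectural -> physical mapping  */
    static int next_free = ARCH_REGS;  /* naive allocator: hand out in order */

    static void rename_init(void) {
        for (int i = 0; i < ARCH_REGS; i++)
            map_table[i] = i;           /* identity mapping at start */
    }

    /* Rename one instruction "dst <- src1 op src2".
     * Sources read the current mapping (true dependences preserved);
     * the destination gets a fresh physical register (WAR/WAW removed). */
    static void rename_insn(int dst, int src1, int src2,
                            int *pdst, int *psrc1, int *psrc2) {
        *psrc1 = map_table[src1];
        *psrc2 = map_table[src2];
        *pdst  = next_free++;           /* no recycling in this sketch */
        map_table[dst] = *pdst;
    }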
CONCLUSION
Instruction-level parallelism is mainly used to increase a processor's performance; however, parallelism can also be used to increase the energy efficiency of a system. Instruction-level parallelism makes it possible to execute more than one instruction per cycle. Today's processors use more than one pipeline, which means that they have a superscalar architecture. Instruction-level parallelism increases performance, but an ideal sequence of uniform instructions is rare; the execution of one instruction often depends on the result of the previous instruction's execution. This situation is a data hazard, and data hazards reduce the architecture's performance. ILP techniques are used to expose independent instructions in a sequential program. With adequate hardware support, the execution of such independent instructions can be parallelized, reducing the program execution time. The performance improvement that instruction-level parallelism provides strongly depends on the ability to find independent instructions. Data dependences, control dependences, and resource conflicts are the fundamental limitations that bound the amount of available parallelism, and therefore the potential increase in performance; the preceding sections described these dependences and ways to reduce their impact on performance.