Superscalar processor

Figure: Processor board of a CRAY T3e supercomputer with four superscalar Alpha 21164 processors

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows for more throughput (the number of instructions that can be executed in a unit of time) than would otherwise be possible at a given clock rate. A superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. Each execution unit is not a separate processor (or a core if the processor is a multi-core processor), but an execution resource within a single CPU such as an arithmetic logic unit.

Except for CPUs used in low-power applications, embedded systems, and battery-powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar. The P5 Pentium was the first superscalar x86 processor; the Nx586, P6 Pentium Pro and AMD K5 were among the first designs which decode x86 instructions asynchronously into dynamic microcode-like micro-op sequences prior to actual execution on a superscalar microarchitecture; this opened the way for dynamic scheduling of buffered partial instructions and enabled more parallelism to be extracted compared to the more rigid methods used in the simpler P5 Pentium; it also simplified speculative execution and allowed higher clock frequencies compared to designs such as the advanced Cyrix 6x86.

In Flynn's taxonomy, a single-core superscalar processor is classified as an SISD processor (Single Instruction stream, Single Data stream), though many superscalar processors support short vector operations and so could be classified as SIMD (Single Instruction stream, Multiple Data streams). A multi-core superscalar processor is classified as an MIMD processor (Multiple Instruction streams, Multiple Data streams).

While a superscalar CPU is typically also pipelined, pipelining and superscalar execution are considered different performance enhancement techniques. The former executes multiple instructions in the same execution unit in parallel by dividing the execution unit into different phases, whereas the latter executes multiple instructions in parallel by using multiple execution units. The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU):

• Instructions are issued from a sequential instruction stream
• The CPU dynamically checks for data dependencies between instructions at run time (versus software checking at compile time)
• The CPU can execute multiple instructions per clock cycle

Figure: Simple superscalar pipeline. By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed. (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back, i = Instruction number, t = Clock cycle [i.e., time])
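For illustration, the timing implied by the caption can be written out as a minimal Python sketch of an ideal two-wide, five-stage pipeline with no stalls or hazards (the schedule function and the six-instruction example are invented for this illustration, not taken from the article):

```python
# Ideal two-wide, five-stage pipeline timing (no stalls, no hazards).
# Instructions are fetched and dispatched in pairs, so instruction i
# belongs to issue group i // 2 and enters IF on that cycle.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(num_instructions, width=2):
    """Return, for each instruction, a map of clock cycle -> stage name."""
    timeline = []
    for i in range(num_instructions):
        start = i // width                    # issue group number = first cycle
        timeline.append({start + s: name for s, name in enumerate(STAGES)})
    return timeline

if __name__ == "__main__":
    timeline = schedule(6)
    last_cycle = max(max(row) for row in timeline)
    print("     " + " ".join(f"t{t:<3}" for t in range(last_cycle + 1)))
    for i, row in enumerate(timeline):
        print(f"i={i}  " + " ".join(f"{row.get(t, '-'):<4}" for t in range(last_cycle + 1)))
```

In this idealized model two instructions reach WB in every cycle once the pipeline is full; real designs fall short of that whenever dependencies or resource conflicts force stalls.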
1 History

Seymour Cray's CDC 6600 from 1966 is often mentioned as the first superscalar design. The Motorola MC88100 (1988), the Intel i960CA (1989) and the AMD 29000-series 29050 (1990) microprocessors were the first commercial single-chip superscalar microprocessors. RISC microprocessors like these were the first to have superscalar execution, because RISC architectures free up transistors and die area which could be used to include multiple execution units (this was why RISC designs were faster than CISC designs through the 1980s and into the 1990s).
2 Scalar to superscalar
The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy is the difference between scalar and vector arithmetic. A superscalar processor is a mixture of the two: each instruction processes one data item, but there are multiple execution units within each CPU, so multiple instructions can be processing separate data items concurrently.
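The distinction can be sketched informally in code (the snippets below are illustrative analogies written for this example, not hardware behavior):

```python
# Scalar view: one instruction manipulates one or two data items.
a = 3 + 4                      # a single add producing a single result

# Vector view: one instruction operates on many data items at once.
b = [x + y for x, y in zip([1, 2, 3, 4], [10, 20, 30, 40])]

# Superscalar view: the instructions are still scalar, but because the two
# adds below are independent, a two-way superscalar CPU could dispatch them
# to two different ALUs in the same clock cycle.
c = 5 + 6
d = 7 + 8
```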
Superscalar CPU design emphasizes improving the accuracy of the instruction dispatcher and allowing it to keep the multiple execution units in use at all times. This has become increasingly important as the number of units has increased. While early superscalar CPUs would have two ALUs and a single FPU, a modern design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will be no better than that of a simpler, cheaper design.

A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that, but with different methods.

In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching each to one of the several execution units contained inside a single CPU. A superscalar processor can therefore be envisioned as having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread.
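The dispatch decision can be approximated by a toy model: issue the next instruction together with its successor only if the two do not touch the same registers. This is a simplified sketch under strong assumptions (in-order, two-wide, register names only, no renaming); the Instr class and the four-instruction program are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Instr:
    dest: str        # register written
    srcs: tuple      # registers read
    text: str

def independent(a, b):
    """True if b may issue in the same cycle as a (no shared registers)."""
    return a.dest not in b.srcs and a.dest != b.dest and b.dest not in a.srcs

def dual_issue(program):
    """Greedy in-order, two-wide issue: group instructions cycle by cycle."""
    cycles, i = [], 0
    while i < len(program):
        group = [program[i]]
        if i + 1 < len(program) and independent(program[i], program[i + 1]):
            group.append(program[i + 1])
        cycles.append(group)
        i += len(group)
    return cycles

program = [
    Instr("a", ("b", "c"), "a = b + c"),
    Instr("d", ("e", "f"), "d = e + f"),  # independent of the first add
    Instr("g", ("a", "d"), "g = a + d"),  # needs both earlier results
    Instr("h", ("g", "c"), "h = g + c"),  # needs g
]

for t, group in enumerate(dual_issue(program)):
    print(f"cycle {t}: " + " | ".join(instr.text for instr in group))
```

Running the sketch issues the two independent adds together in cycle 0 and serializes the dependent ones; a real dispatcher additionally renames registers so that only true data dependencies block parallel issue.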
3 Limitations

Available performance improvement from superscalar techniques is limited by three key areas:

1. the degree of intrinsic parallelism in the instruction stream (instructions requiring the same computational resources from the CPU);
2. the complexity and time cost of the dependency-checking logic and register renaming circuitry;
3. the processing of branch instructions.

Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either the resources or the results of the other. The instructions a = b + c; d = e + f can be run in parallel because none of the results depend on other calculations. However, the instructions a = b + c; b = e + f might not be runnable in parallel, depending on the order in which the instructions complete while they move through the units.
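The run-time check described here amounts to comparing the register names used by the instructions in a candidate issue group. A minimal sketch follows (the tuple encoding and the hazard labels RAW/WAR/WAW follow common textbook usage and are assumptions of this example, not wording from the article):

```python
def hazards(first, second):
    """Classify how `second` depends on `first`.

    Each instruction is encoded as (destination, (source1, source2, ...)).
    Returns the set of hazard types that would prevent naive same-cycle issue.
    """
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 in s2:
        found.add("RAW")  # true dependency: second reads what first writes
    if d2 in s1:
        found.add("WAR")  # anti-dependency: second overwrites a source of first
    if d1 == d2:
        found.add("WAW")  # output dependency: both write the same register
    return found

# a = b + c ; d = e + f  -> no shared registers, safe to run in parallel
print(hazards(("a", ("b", "c")), ("d", ("e", "f"))))  # set()

# a = b + c ; b = e + f  -> the write to b must not land before the first
# instruction has read b, so naive parallel issue is unsafe
print(hazards(("a", ("b", "c")), ("b", ("e", "f"))))  # {'WAR'}
```

Register renaming can remove the WAR and WAW cases, but a RAW (true) dependency still forces the dependent instruction to wait for its operand.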
When the number of simultaneously issued instructions increases, the cost of dependency checking increases extremely rapidly. This is exacerbated by the need to perform the checks at run time and at the CPU's clock rate. The cost includes the additional logic gates required to implement the checks and the time delays through those gates. Research shows that in some cases the gate cost may be nk gates, and the delay cost k² log n, where n is the number of instructions in the processor's instruction set and k is the number of simultaneously dispatched instructions. Even though the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, since there is no assurance otherwise and failure to detect a dependency would produce incorrect results.
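To get a rough feel for the growth, note that in a k-wide issue group every instruction must be checked against every earlier instruction in the same group, giving k(k-1)/2 pairwise comparisons per cycle. This is a simplified counting argument, not the cost model of the cited paper:

```python
# Pairwise dependency checks needed inside one issue group of width k.
for k in (2, 4, 8, 16):
    pairs = k * (k - 1) // 2
    print(f"issue width {k:2d}: {pairs:3d} instruction pairs to check each cycle")
```

Doubling the issue width roughly quadruples the number of comparisons, which is one reason wide in-order issue becomes expensive so quickly.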
No matter how advanced the semiconductor process or how fast the switching speed, these costs place a practical limit on how many instructions can be simultaneously dispatched. While process advances will allow ever greater numbers of execution units (e.g., ALUs), the burden of checking instruction dependencies grows rapidly, as does the complexity of the register renaming circuitry used to mitigate some of those dependencies. Collectively, the power consumption, complexity and gate-delay costs limit the achievable superscalar speedup to roughly eight simultaneously dispatched instructions.

However, even given infinitely fast dependency-checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus the degree of intrinsic parallelism in the code stream forms a second limitation.

4 Alternatives

Collectively, these limits drive investigation into alternative architectural changes such as very long instruction word (VLIW), explicitly parallel instruction computing (EPIC), simultaneous multithreading (SMT), and multi-core computing.

With VLIW, the burdensome task of dependency checking by hardware logic at run time is removed and delegated to the compiler. Explicitly parallel instruction computing (EPIC) is like VLIW, with extra cache-prefetching instructions.

Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar processors. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

Superscalar processors differ from multi-core processors in that the several execution units are not entire processors. A single processor is composed of finer-grained execution units such as the ALU, integer multiplier, integer shifter, FPU, etc. There may be multiple versions of each execution unit to enable the execution of many instructions in parallel. This differs from a multi-core processor, which concurrently processes instructions from multiple threads, one thread per processing unit (called a “core”). It also differs from a pipelined processor, where multiple instructions can concurrently be in various stages of execution, assembly-line fashion.

The various alternative techniques are not mutually exclusive; they can be (and frequently are) combined in a single processor. Thus a multi-core CPU is possible where each core is an independent processor containing multiple parallel pipelines, each pipeline being superscalar. Some processors also include vector capability.
5 See also

• Out-of-order execution
• Super-threading
• Simultaneous multithreading (SMT)
• Speculative execution / Eager execution
• Software lockout, a multiprocessor issue similar to logic dependencies on superscalars
• Shelving buffer
6 References

• Mike Johnson, Superscalar Microprocessor Design, Prentice-Hall, 1991, ISBN 0-13-875634-1
• Sorin Cotofana, Stamatis Vassiliadis, “On the Design Complexity of the Issue Logic of Superscalar Machines”, EUROMICRO 1998: 10277–10284
• Steven McGeady, “The i960CA SuperScalar Implementation of the 80960 Architecture”, IEEE 1990, pp. 232–240
• Steven McGeady et al., “Performance Enhancements in the Superscalar i960MM Embedded Microprocessor”, ACM Proceedings of the 1991 Conference on Computer Architecture (Compcon), 1991, pp. 4–7
7 External links

• Eager Execution / Dual Path / Multiple Path, by Mark Smotherman