Pipelining. March 17, Howard Huang 1

Pipelining We ve seen two possible implementations of the IPS architectre. A single-cycle path eectes each instrction in jst one clock cycle, bt the cycle time may be very long. A mlticycle path has mch shorter cycle times and higher clock rates, bt each instrction reqires many cycles to eecte. A third approach, pipelining, yields the best of both worlds and is sed in every modern desktop processor. Cycle times are short so clock rates are high. Bt we can still eecte an instrction in abot one clock cycle! Today we introdce pipelining and its benefits, and on Wednesday we ll show a pipelined path and control nit. After spring break we ll talk abot what makes pipelining difficlt in real life and what to do abot it. arch 7, 23 2-23 Howard Hang

Instrction eection review Eecting a IPS instrction can take p to five steps. Step Instrction Fetch Instrction Decode Eecte emory back Name Description an instrction from. sorce registers and generate control signals. Compte an R-type reslt or a branch target. or write the. Store a reslt in the destination register. However, as we saw, not all instrctions need all five steps. Instrction Steps reqired beq R-type sw lw arch 7, 23 Pipelining 2

Single-cycle path diagram PC 4 Add Reg Shift left 2 Add PCSrc Instrction [3-] Instrction I [25-2] I [2-6] I [5 - ] register register 2 register 2 Registers ALU Zero Reslt ALUOp em Data emtoreg RegDst ALUSrc em I [5 - ] Sign etend arch 7, 23 Pipelining 3

Single-cycle review All five eection steps occr in one clock cycle. This means the cycle time mst be long enogh to accommodate all the steps of the most comple instrction a lw in or instrction set. If the register file has a ns latency and the memories and ALU have a 2ns latency, lw will reqire 8ns. Ths all instrctions will take 8ns to eecte. Each hardware element can only be sed once per clock cycle. A lw or sw mst access twice (in the and stages), so there are separate instrction and memories. There are mltiple adders, since each instrction increments the PC () and performs another comptation (). On top of that, branches also need to compte a target. arch 7, 23 Pipelining 4

lticycle path diagram PC PC IorD ALUSrcA em Address emory em em Data IR [3-26] [25-2] [2-6] [5-] [5-] RegDst register register 2 register Reg 2 Registers A B 4 2 3 ALU Zero Reslt ALUOp ALU Ot PCSorce Instrction register emory register Sign etend Shift left 2 ALUSrcB emtoreg arch 7, 23 Pipelining 5

lticycle review Instrction eection is split into five stages, each taking one clock cycle. The cycle time is shorter, since each stage is relatively simple. With a ns delay for registers and 2ns for the and ALU, the cycle time wold be 2ns. Only necessary stages are eected, so more comple instrctions will not slow down simpler ones. Instrction Steps reqired Cycles beq 3 R-type 4 sw 4 lw 5 The actal CPI will depend on the particlar instrction mi. We can get by with only one and one ALU, becase they can be resed in different clock cycles of a single instrction eection. arch 7, 23 Pipelining 6

Eample: Instrction Fetch () Let s qickly review how lw is eected in the single-cycle path. We ll ignore PC incrementing and branching for now. In the Instrction Fetch () step, we read the instrction. Reg Instrction [3-] Instrction I [25-2] I [2-6] I [5 - ] register register 2 register 2 Registers ALU Zero Reslt ALUOp em Data emtoreg RegDst ALUSrc em I [5 - ] Sign etend arch 7, 23 Pipelining 7

Instrction Decode () The Instrction Decode () step reads the sorce register from the register file. Reg Instrction [3-] Instrction I [25-2] I [2-6] I [5 - ] register register 2 register 2 Registers ALU Zero Reslt ALUOp em Data emtoreg RegDst ALUSrc em I [5 - ] Sign etend arch 7, 23 Pipelining 8

Eecte () The third step, Eecte (), comptes the effective from the sorce register and the instrction s constant field. Reg Instrction [3-] Instrction I [25-2] I [2-6] I [5 - ] register register 2 register 2 Registers ALU Zero Reslt ALUOp em Data emtoreg RegDst ALUSrc em I [5 - ] Sign etend arch 7, 23 Pipelining 9

emory () The emory () step involves reading the, from the compted by the ALU. Reg Instrction [3-] Instrction I [25-2] I [2-6] I [5 - ] register register 2 register 2 Registers ALU Zero Reslt ALUOp em Data emtoreg RegDst ALUSrc em I [5 - ] Sign etend arch 7, 23 Pipelining

back () Finally, in the back () step, the vale is stored into the destination register. Reg Instrction [3-] Instrction I [25-2] I [2-6] I [5 - ] register register 2 register 2 Registers ALU Zero Reslt ALUOp em Data emtoreg RegDst ALUSrc em I [5 - ] Sign etend arch 7, 23 Pipelining

A bnch of lazy fnctional nits Notice that each eection step ses a different fnctional nit. In other words, the main nits are idle for most of the 8ns cycle! The instrction RA is sed for jst 2ns at the start of the cycle. Registers are read once in (ns), and written once in (ns). The ALU is sed for 2ns near the middle of the cycle. ing the only takes 2ns as well. That s a lot of epensive hardware sitting arond doing nothing. arch 7, 23 Pipelining 2

Ptting those slackers to work We sholdn t have to wait for the entire instrction to complete before we can re-se the fnctional nits. For eample, the instrction is free in the Instrction Decode step as shown below, so... Idle Instrction Decode () Reg Instrction [3-] Instrction I [25-2] I [2-6] I [5 - ] register register 2 register 2 Registers ALU Zero Reslt ALUOp em Data emtoreg RegDst ALUSrc em I [5 - ] Sign etend arch 7, 23 Pipelining 3

Decoding and fetching together Why don t we go ahead and fetch the net instrction while we re decoding the first one? Fetch 2nd Decode st instrction Reg Instrction [3-] Instrction I [25-2] I [2-6] I [5 - ] register register 2 register 2 Registers ALU Zero Reslt ALUOp em Data emtoreg RegDst ALUSrc em I [5 - ] Sign etend arch 7, 23 Pipelining 4

Eecting, decoding and fetching Similarly, once the first instrction enters its Eecte stage, we can go ahead and decode the second instrction. Bt now the instrction is free again, so we can fetch the third instrction! Fetch 3rd Decode 2nd Eecte st Reg Instrction [3-] Instrction I [25-2] I [2-6] I [5 - ] register register 2 register 2 Registers ALU Zero Reslt ALUOp em Data emtoreg RegDst ALUSrc em I [5 - ] Sign etend arch 7, 23 Pipelining 5

Working hard The idea behind pipelining is to maimize the sage of the hardware by overlapping the eection of several instrctions. Each of or five eection steps ses a different fnctional nit. Conceivably we can have p to five instrctions eecting at the same time one in each of the,,, and stages. To make this work, we will go back to the single-cycle path with its mltiple adders and memories, so many instrctions can eecte together withot interference. arch 7, 23 Pipelining 6

A pipeline diagram lw $t, 4($sp) sb $v, $a, $a and $t, $t2, $t3 or $s, $s, $s2 add $sp, $sp, -4 2 3 Clock cycle 4 5 6 7 8 9 A pipeline diagram shows the eection of a series of instrctions. The instrction seqence is shown vertically, from top to bottom. Clock cycles are shown horizontally, from left to right. Each instrction is divided into its component stages. (We show five stages for every instrction, which will make the control nit easier.) This clearly indicates the overlapping of instrctions. For eample, there are three instrctions active in the third cycle above. The lw instrction is in its Eecte stage. Simltaneosly, the sb is in its Instrction Decode stage. Also, the and instrction is jst being fetched. arch 7, 23 Pipelining 7

Pipeline terminology lw $t, 4($sp) sb $v, $a, $a and $t, $t2, $t3 or $s, $s, $s2 add $sp, $sp, -4 2 3 Clock cycle 4 5 6 7 8 9 filling fll emptying The pipeline depth is the nmber of stages in this case, five. In the first for cycles here, the pipeline is filling, since there are nsed fnctional nits. In cycle 5, the pipeline is fll. Five instrctions are being eected simltaneosly, so all hardware nits are in se. In cycles 6-9, the pipeline is emptying. arch 7, 23 Pipelining 8

Performance benefits of pipelining lw $t, 4($sp) sb $v, $a, $a and $t, $t2, $t3 or $s, $s, $s2 add $sp, $sp, -4 2 3 Clock cycle 4 5 6 7 8 9 How mch time wold it take to eecte these five instrctions? We wold need 4ns with a single-cycle CPU with an 8ns cycle time. In a mlticycle design, these instrctions reqire 42ns. With pipelining, we need jst 9 cycles, and 8ns, as shown above! What kind of speedps are we talking abot? Compared to a single-cycle design, 4ns/8ns = 2.2. Verss the mlticycle machine, the speedp is 42ns/8ns = 2.3! arch 7, 23 Pipelining 9

It gets better! lw $t, 4($sp) sb $v, $a, $a and $t, $t2, $t3 or $s, $s, $s2 add $sp, $sp, -4 2 3 Clock cycle 4 5 6 7 8 9 This instrction seqence is too short. ost of the cycles are spent filling or emptying the pipeline. We only achieve fll tilization for one clock cycle. Imagine a longer program with instrctions. It takes 4 cycles to fill the pipeline, and more cycles to complete all the instrctions. So it takes 4 cycles to eecte instrctions. That s almost one cycle per instrction, and a speedp of 3.98 over the single-cycle path! arch 7, 23 Pipelining 2

Almost one cycle per instrction lw $t, 4($sp) sb $v, $a, $a and $t, $t2, $t3 or $s, $s, $s2 add $sp, $sp, -4 2 3 Clock cycle 4 5 6 7 8 9 Here s another way to visalize the benefits of pipelining. In the first for stages of the diagram above, the pipeline is filling. Bt after that, while the pipeline is fll or emptying, yo can see that one instrction completes on every cycle. So ecept for some initial overhead, we really are eecting at the rate of one cycle per instrction. arch 7, 23 Pipelining 2

It s GRRReeeaaAAT! Tony the Tiger is trademark of Kellogg. Don t se me. So pipelining can really deliver the goods. The CPI is nearly, like in a single-cycle processor. Bt the cycle time is very short (2ns), like a mlticycle path. This adds p to ecellent performance! arch 7, 23 Pipelining 22

Ideal speedp lw $t, 4($sp) sb $v, $a, $a and $t, $t2, $t3 or $s, $s, $s2 add $sp, $sp, -4 2 3 Clock cycle 4 5 6 7 8 9 In or pipeline, we can eecte p to five instrctions simltaneosly. This implies that the maimm speedp is 5 times. In general, the ideal speedp eqals the pipeline depth. Why was or speedp on the previos slide only 3.98 times? The pipeline stages are imbalanced: a register file operation can be done in ns, bt we mst stretch that ot to 2ns to keep the and stages synchronized with, and. Balancing the stages is one of the many hard parts in designing a pipelined processor. arch 7, 23 Pipelining 23

The pipelining parado lw $t, 4($sp) sb $v, $a, $a and $t, $t2, $t3 or $s, $s, $s2 add $sp, $sp, -4 2 3 Clock cycle 4 5 6 7 8 9 Pipelining does not improve the eection time of any single instrction. Each instrction here actally takes longer to eecte than in a singlecycle path (ns vs. 8ns)! Instead, pipelining increases the throghpt, or the amont of work done per nit time. Here, several instrctions are eected together in each clock cycle. The reslt is improved eection time for a seqence of instrctions, sch as an entire program. arch 7, 23 Pipelining 24

Smmary Pipelining attempts to maimize hardware sage by overlapping the eection stages of several different instrctions. Pipelining offers amazing speedp. The CPU throghpt is dramatically improved, becase several instrctions can be eecting concrrently. In the best case, one instrction finishes on every cycle, and the speedp is eqal to the pipeline depth. Net time we ll see an actal path and control nit for a pipelined processor, so we can see how to make all these wild ideas work. arch 7, 23 Pipelining 25