[lore] article for lore
Davyd Madeley
davyd at madeley.id.au
Thu Aug 17 00:50:37 WST 2006
encl. text and diagrams.
[Attachment: delay-slots.svg (image/svg+xml), diagrams for this article:
http://lists.ucc.gu.uwa.edu.au/pipermail/lore/attachments/20060817/caa70d7b/delay-slots-0001.svg]
Delay Slots - What Are They and Why are They Way Cool?
======================================================
-- Davyd Madeley --
I had no idea what to write for this, so I thought I might write about something
in computer architecture that not everyone knows. Specifically: what is a delay
slot, and why is it my favourite piece of computer architecture?
Delay slots appear in several common (RISC) architectures, including MIPS,
PA-RISC and SPARC CPUs. In brief, they are a way to optimise the pipelining of
processor instructions by preventing "bubbles" in the pipeline. But what is a
pipeline? Well, let's go back to the beginning...
Computers execute instructions they read from memory. We can imagine a simple
case where all instructions, including their arguments, are one word long (a
word is 32 bits on a 32-bit machine and 64 bits on a 64-bit machine). Some
architectures have variable-length instructions, but we won't let that confuse
the issue. Our program may look something like this:
Address Instruction
100 LOAD X, A
101 ADD #1, A
102 JUMP 105
103 ADD A, B
104 SUBTRT C, B
105 STORE A, Z
106 LOAD X, B
This program loads some memory from location X into a register (A), increments
the register by one, then jumps to 105 and stores the value into location Z.
It all seems simple enough. The problem is that for each line in this program
we have to carry out a number of stages: we have to fetch the instruction from
memory and decode it, we have to execute the instruction, and, if appropriate,
we have to access memory (to load and store values).
Remember that CPUs are clocked, that is, they only carry out an operation on
the electric clock pulse (in your 3.0GHz Pentium, these clock pulses arrive
3 billion times a second). We could do all of the fetching, decoding, executing
and memory access within a single clock pulse, but then we wouldn't be able to
have a very fast clock. Instead, we can do all of these things in parallel,
passing each instruction in turn down the "pipeline". Our hypothetical CPU has
only four stages in its pipeline; real CPUs have anywhere up to 40 stages in
the pipeline.
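The overlap can be sketched in a few lines of Python. This is not from the
original article, just a rough model of the four-stage pipeline described
above; the instruction names and the `pipeline_trace` helper are made up for
illustration.

```python
# Sketch: cycle-by-cycle occupancy of a 4-stage pipeline
# (Fetch, Decode, Execute, Memory). Each instruction enters
# Fetch one cycle after the one before it, so up to four
# instructions are in flight at once.

STAGES = ["Fetch", "Decode", "Execute", "Memory"]

def pipeline_trace(instructions):
    """Return, for each clock cycle, which instruction is in each stage."""
    n = len(instructions)
    total = n + len(STAGES) - 1          # cycles until the pipeline drains
    cycles = []
    for cycle in range(total):
        row = {}
        for s, stage in enumerate(STAGES):
            i = cycle - s                # index of the instruction in this stage
            if 0 <= i < n:
                row[stage] = instructions[i]
        cycles.append(row)
    return cycles

trace = pipeline_trace(["LOAD", "ADD", "JUMP", "STORE"])
for cycle, row in enumerate(trace, 1):
    print(cycle, row)
```

Four instructions take 7 cycles instead of 16, because by cycle 4 every stage
is doing useful work on a different instruction.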
The more stages in the pipeline, the more operations we can conduct in
parallel, but there is a trade-off. Branches (such as our JUMP statement)
introduce "bubbles" into the pipeline. In the diagram you can see the fetch
stage loads addresses 103 and 104, which will never be executed, so they have
to be thrown away. In short, branches and jumps invalidate the pipeline,
causing bubbles and reducing performance.
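To make the cost concrete, here is a rough sketch (my own, not the article's)
that walks the example program and counts the wrong-path fetches discarded by
each taken jump. The 2-cycle resolve delay and the instruction encoding are
assumptions for illustration.

```python
# Sketch: count wasted fetches ("bubbles") when a taken jump
# flushes the instructions fetched behind it. We assume the jump
# resolves 2 stages after fetch, so 2 wrong-path instructions
# (here, addresses 103 and 104) are discarded.

def count_bubbles(program, resolve_delay=2):
    """program: {address: ('JUMP', target) or ('OP',)}.
    Returns (executed_addresses, bubbles)."""
    pc = min(program)
    executed, bubbles = [], 0
    while pc in program:
        op = program[pc]
        executed.append(pc)
        if op[0] == "JUMP":
            bubbles += resolve_delay    # fetched, then thrown away
            pc = op[1]
        else:
            pc += 1
    return executed, bubbles

program = {
    100: ("OP",), 101: ("OP",), 102: ("JUMP", 105),
    103: ("OP",), 104: ("OP",), 105: ("OP",), 106: ("OP",),
}
executed, bubbles = count_bubbles(program)
# addresses 103 and 104 are fetched but never executed
```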
So what's the solution? One solution is to have delay slots. Delay slots are
instructions that are executed after a branch or jump statement. They will
always be executed, regardless of the outcome of the branch. A CPU with one
delay slot executes one instruction after a branch. One or two delay slots are
common; some CPUs have as many as five.
If we implement our CPU with a single delay slot, we can rewrite our program
like this:
Address Instruction
100 LOAD X, A
101 ADD #1, A
102 JUMP 106
103 STORE A, Z <-- delay slot
104 ADD A, B
105 SUBTRT C, B
106 LOAD X, B
You can see that our program is the same length, all we have done is moved
the STORE instruction from address 105 to address 103, and changed the address
of the JUMP instruction to point to the next instruction we want to execute.
You'll notice, however, that this minor change has already reduced the number
of clock cycles required to execute the program by one (the last instruction
is one stage further along in the pipeline).
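The rewritten program can be checked with a tiny interpreter. This is a sketch
under assumed semantics (after a JUMP, the one instruction in the delay slot
always executes before control transfers); the instruction tuples and the
`step`/`run` helpers are my own encoding of the example above.

```python
# Sketch interpreter for the one-delay-slot rewrite above.

def step(op, regs, mem):
    """Execute a single non-branch instruction."""
    kind = op[0]
    if kind == "LOAD":     regs[op[2]] = mem[op[1]]
    elif kind == "ADDI":   regs[op[2]] += op[1]       # ADD #imm, reg
    elif kind == "ADD":    regs[op[2]] += regs[op[1]]
    elif kind == "SUBTRT": regs[op[2]] -= regs[op[1]]
    elif kind == "STORE":  mem[op[2]] = regs[op[1]]

def run(program, regs, mem, start):
    executed, pc = [], start
    while pc in program:
        op = program[pc]
        executed.append(pc)
        if op[0] == "JUMP":
            # the delay slot always executes before control transfers
            executed.append(pc + 1)
            step(program[pc + 1], regs, mem)
            pc = op[1]
        else:
            step(op, regs, mem)
            pc += 1
    return executed

program = {
    100: ("LOAD", "X", "A"),
    101: ("ADDI", 1, "A"),
    102: ("JUMP", 106),
    103: ("STORE", "A", "Z"),   # delay slot
    104: ("ADD", "A", "B"),
    105: ("SUBTRT", "C", "B"),
    106: ("LOAD", "X", "B"),
}
mem = {"X": 7, "Z": None}
regs = {"A": 0, "B": 0}
executed = run(program, regs, mem, 100)
# the STORE in the delay slot runs; 104 and 105 never execute
```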
If we go up to 2 delay slots:
Address Instruction
100 LOAD X, A
101 ADD #1, A
102 JUMP 107
103 STORE A, Z <-- delay slot #1
104 LOAD X, B <-- delay slot #2
105 ADD A, B
106 SUBTRT C, B
You can see that the program is still the same length, but again the execution
time has been decreased.
So what's the catch? The catch is that not all branches are unconditional. The
work in placing instructions in these delay slots rests upon the compiler.
This means that your compiler has to be that little bit smarter. If your branch
statement depends on a value you need to have already calculated, you can't put
that calculation in a delay slot. Take this example for a CPU with no delay
slots:
Address Instruction
100 LOAD #20, A
101 LOAD #0, B
102 SUBTRT #1, A
103 ADD A, B <-- add A to B, store in B
104 BRA A, 102 <-- branch to 102 if A is positive
105 STORE B, X
In our CPU with 2 delay slots, things get a bit trickier. Our branch statement
depends on A, so we may not be able to move our SUBTRT. We can change B
without affecting the branch, so we can move that instruction into a delay slot.
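The test the compiler applies here can be sketched very simply: an instruction
may move into a branch's delay slot only if it does not write a register the
branch condition reads. The dict encoding below is made up for illustration.

```python
# Sketch of the dependence check described above. An instruction
# that writes a register the branch reads (like SUBTRT #1, A when
# the branch tests A) cannot be hoisted into the delay slot.

def can_fill_delay_slot(instr, branch):
    """instr, branch: dicts with 'reads' and 'writes' register sets."""
    return not (instr["writes"] & branch["reads"])

subtrt = {"reads": {"A"}, "writes": {"A"}}       # SUBTRT #1, A
add    = {"reads": {"A", "B"}, "writes": {"B"}}  # ADD A, B
bra    = {"reads": {"A"}, "writes": set()}       # BRA A, 102

# SUBTRT writes A, which BRA tests, so it must stay put;
# ADD only writes B, so it can move into a delay slot.
```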
Address Instruction
100 LOAD #20, A
101 LOAD #0, B
102 SUBTRT #1, A
103 BRA A, 102
104 ADD A, B <-- delay slot #1
105 NOOP <-- delay slot #2, no operation
106 STORE B, X
You can see our program got slightly longer, because we had to add a "no
operation" instruction in the second delay slot. However, in this particular
case, it is possible to optimise this even further:
Address Instruction
100 LOAD #19, A <-- predecrement
101 LOAD #0, B
102 BRA A, 102
103 SUBTRT #1, A <-- delay slot #1
104 ADD A, B <-- delay slot #2
105 STORE B, X
In this case, by decrementing the initial value of A by one (so that A has the
correct value when BRA evaluates it) we were able to move the subtraction into
the delay slot. A side effect is that the final value of A after the loop will
be -1 rather than 0. We might be using this value later on, so the compiler
has to be aware of that. The compiler has to be increasingly clever to find
such optimisations.
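We can check the side effect on A with a rough simulation of the optimised
loop. This is my own model, under the assumption that the branch reads A
before the delay-slot instructions write it:

```python
# Rough simulation of the predecremented loop above. The branch
# condition (BRA A, 102) is read *before* the two delay-slot
# instructions execute, and those instructions run whether or
# not the branch is taken.

def run_loop(initial_a):
    a, b = initial_a, 0
    while True:
        taken = a > 0        # BRA A, 102: condition read here
        a -= 1               # delay slot #1: SUBTRT #1, A
        b += a               # delay slot #2: ADD A, B
        if not taken:
            break            # fall through to STORE B, X
    return a, b

a, b = run_loop(19)
# A finishes at -1, not 0, exactly as noted above: the delay
# slots run one final time after the branch falls through.
```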
That's pretty much it. Delay slots are an excellent design feature for
optimising the pipelining of CPU instructions, relying on the smarts of the
compiler rather than the hardware (which is much harder to debug, incrementally
improve and implement). For the record, recent GCC does support delay slots
(and will use them on appropriate architectures), although some commercial,
architecture-specific compilers will do a better job.
Attribution:
Some of the examples in this article came from _Computer Organization &
Architecture_ 6th ed. by William Stallings, published by Prentice Hall.