[lore] article for lore

Davyd Madeley davyd at madeley.id.au
Thu Aug 17 00:50:37 WST 2006


encl. text and diagrams.
Delay Slots - What Are They and Why are They Way Cool?
======================================================
  -- Davyd Madeley --

I had no idea what to write for this, so I thought I might write about something
in computer architecture that not everyone knows. Specifically, what is a delay
slot, and why is it my favourite piece of computer architecture?

Delay slots appear in several common (RISC) architectures, including MIPS,
PA-RISC and SPARC CPUs. In brief, they are a way to optimise the pipelining of
processor instructions, to prevent "bubbles" in the pipeline. But what is a
pipeline? Well, let's go back to the beginning...

Computers execute instructions they read from memory. We can imagine a simple
case where all instructions, including their arguments, are one word long (a
word is 32 bits on a 32-bit machine and 64 bits on a 64-bit machine). Some
architectures have variable length instructions, but we won't let that confuse
the issue. Our program may look something like this:

	Address		Instruction
	100		LOAD	X, A
	101		ADD	#1, A
	102		JUMP	105
	103		ADD	A, B
	104		SUBTRT	C, B
	105		STORE	A, Z
	106		LOAD    X, B

This program loads a value from memory location X into a register (A),
increments the register by one, then jumps to 105 and stores the value into
location Z.

It all seems simple enough. The problem is that for each line in this
program we have to carry out a number of stages: we have to fetch the
instruction from memory and decode it, we have to execute the instruction,
and if appropriate we have to access memory (to load and store values).
Remember that CPUs are clocked, that is, they only carry out an operation
on each electric clock pulse (in your 3.0GHz Pentium, these clock pulses
arrive 3 billion times a second). We could spend time in each clock pulse
fetching, decoding, executing and accessing memory, but then we wouldn't be
able to have a very fast clock. Instead, we can do all of these things in
parallel, passing each instruction in turn down the "pipeline". Our
hypothetical CPU only has four stages in its pipeline; real CPUs have
anywhere up to 40 stages in their pipelines.
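The saving from overlapping the stages can be sketched with a little arithmetic.
(The 4-stage pipeline and the cycle counts here are assumptions about our
hypothetical CPU, not about any real one.)

```python
# Hypothetical 4-stage pipeline: Fetch, Decode, Execute, Memory.
# With no stalls, a new instruction enters the pipeline every cycle,
# so N instructions finish in (stages - 1) + N cycles rather than
# stages * N cycles.

STAGES = ["Fetch", "Decode", "Execute", "Memory"]

def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to run n instructions when the stages overlap."""
    return (n_stages - 1) + n_instructions

def unpipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles if each instruction finishes all stages before the next starts."""
    return n_stages * n_instructions

print(pipelined_cycles(7))    # 10 cycles for our 7-instruction program
print(unpipelined_cycles(7))  # 28 cycles without pipelining
```

The gap only widens as the program gets longer: with the stages overlapped, the
clock can also tick much faster, because each stage does less work per pulse.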

The more stages in the pipeline, the more operations we can conduct in parallel,
but there is a trade-off. Branches (such as our JUMP statement) introduce
"bubbles" into the pipeline. You can see that the Fetch stage loads addresses
103 and 104, which will not be executed, so they have to be thrown away. In
short, branches and jumps invalidate the pipeline, causing bubbles
and reducing performance.
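The cost of a taken branch can be sketched the same way. Here I assume the
branch outcome is only known once the JUMP reaches the Execute stage of our
hypothetical 4-stage pipeline:

```python
# By the time a taken branch reaches Execute, the two instructions
# behind it (here, addresses 103 and 104) are already sitting in Fetch
# and Decode and must be thrown away.

STAGES = ["Fetch", "Decode", "Execute", "Memory"]

def bubbles_per_taken_branch(resolve_stage="Execute"):
    # One discarded instruction for every stage before the resolve stage
    return STAGES.index(resolve_stage)

print(bubbles_per_taken_branch())  # 2 bubbles per taken branch
```

On a deep 40-stage pipeline that resolves branches late, the same arithmetic
gives a much larger penalty, which is why branches matter so much.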

So what's the solution? One solution is to have delay slots. A delay slot holds
an instruction that is executed after a branch or jump statement. It will
always be executed, regardless of the outcome of the branch. A CPU with one
delay slot executes one instruction after a branch. One or two delay slots are
common; some CPUs have as many as five.

If we implement our CPU with a single delay slot, we can rewrite our program
like this:

	Address		Instruction
	100		LOAD	X, A
	101		ADD	#1, A
	102		JUMP	106
	103		STORE	A, Z	<-- delay slot
	104		ADD	A, B
	105		SUBTRT	C, B
	106		LOAD    X, B

You can see that our program is the same length; all we have done is moved
the STORE instruction from address 105 to address 103, and changed the address
of the JUMP instruction to point to the next instruction we want to execute.
You'll notice, however, that this minor change has already reduced the number
of clock cycles required to execute the program by one (the last instruction
is one stage further on in the pipeline).

If we go up to 2 delay slots:
	
	Address		Instruction
	100		LOAD	X, A
	101		ADD	#1, A
	102		JUMP	107
	103		STORE	A, Z	<-- delay slot #1
	104		LOAD    X, B	<-- delay slot #2
	105		ADD	A, B
	106		SUBTRT	C, B

You can see that the program is still the same length, but again the execution
time has been decreased.
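Putting the two sketches together gives a rough cycle estimate for all three
versions of the program (still assuming our hypothetical 4-stage pipeline, a
2-bubble penalty for a taken jump, and that each filled delay slot turns one
bubble into useful work):

```python
# Each version executes 5 instructions (the jump skips the rest);
# delay slots convert bubbles into useful work, one for one.

def total_cycles(executed, bubbles, stages=4):
    # (stages - 1) cycles to fill the pipeline, then one instruction
    # retires per cycle; bubbles are wasted cycles
    return (stages - 1) + executed + bubbles

print(total_cycles(executed=5, bubbles=2))  # 10: no delay slots
print(total_cycles(executed=5, bubbles=1))  # 9: one delay slot filled
print(total_cycles(executed=5, bubbles=0))  # 8: both delay slots filled
```

One cycle saved per filled slot, exactly as the rewritten listings show.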

So what's the catch? The catch is that not all branches are unconditional, and
the work of placing instructions in these delay slots rests with the compiler.
This means that your compiler has to be that little bit smarter. If your branch
depends on a value you need to have already calculated, you can't put that
calculation in a delay slot. Take this example, for a CPU with no delay
slots:

	Address		Instruction
	100		LOAD	#20, A
	101		LOAD	#0, B
	102		SUBTRT	#1, A
	103		ADD	A, B	<-- add A to B, store in B
	104		BRA	A, 102  <-- branch to 102 if A is positive
	105		STORE	B, X
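To see what this loop actually computes, here is the same program sketched in
Python, with the instructions in the order they execute:

```python
# The loop from the listing above: decrement A, accumulate into B,
# branch back while A is still positive.
A, B = 20, 0
while True:
    A -= 1            # 102: SUBTRT #1, A
    B += A            # 103: ADD A, B
    if not A > 0:     # 104: BRA A, 102 (taken while A is positive)
        break
print(A, B)  # 0 190: B ends up as 19 + 18 + ... + 1 + 0
```

Keeping these final values in mind will matter when we start shuffling
instructions into delay slots below.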

In our CPU with 2 delay slots, things get a bit trickier. Our branch
depends on A, so we may not be able to move our SUBTRT. We can change B
without affecting the branch, so we can move that instruction into a delay slot.

	Address		Instruction
	100		LOAD	#20, A
	101		LOAD	#0, B
	102		SUBTRT	#1, A
	103		BRA	A, 102
	104		ADD	A, B	<-- delay slot #1
	105		NOOP		<-- delay slot #2, no operation
	106		STORE	B, X

You can see our program got slightly longer, because we had to add a "no
operation" instruction in the second delay slot. However, in this particular
case, it is possible to optimise this even further:
	
	Address		Instruction
	100		LOAD	#19, A	<-- predecrement
	101		LOAD	#0, B
	102		BRA	A, 102
	103		ADD	A, B	<-- delay slot #1
	104		SUBTRT	#1, A	<-- delay slot #2
	105		STORE	B, X

In this case, by decrementing the initial value of A by one (so that A has the
correct value when evaluating BRA), we were able to move the subtraction into
a delay slot. Because the delay slot instructions still execute on the final
pass, when the branch is not taken, the final value of A after the loop will
be -1 rather than 0. We might be using this value later on, so the compiler
has to be aware of that. The compiler has to be increasingly clever to find
such optimisations.
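The same loop can be sketched in Python with the delay-slot behaviour made
explicit. The key assumption here is the slot ordering: the ADD fills slot #1
and the SUBTRT slot #2, and both run once more on the final pass even though
the branch is not taken:

```python
# Delay-slot version: BRA tests A first, then both delay slots run,
# whether or not the branch is taken.
A, B = 19, 0         # predecremented starting value
taken = True
while taken:
    taken = A > 0    # 102: BRA A, 102 (decided before the slots run)
    B += A           # delay slot #1: ADD A, B
    A -= 1           # delay slot #2: SUBTRT #1, A
print(A, B)  # -1 190: same B as before, but A ends at -1, not 0
```

B still sums 19 down to 0, so the result is unchanged; only A's final value
differs, which is exactly the side effect the compiler has to track.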

That's pretty much it. Delay slots are an excellent design feature for
optimising the pipelining of CPU instructions, relying on the smarts of the
compiler rather than on hardware (which is much harder to debug, incrementally
improve and implement). For the record, (recent) GCC does support delay slots
(and will use them on appropriate architectures), although some commercial,
architecture-specific compilers will do a better job.

Attribution:
 Some of the examples in this article came from _Computer Organization &
 Architecture_ 6th ed. by William Stallings, published by Prentice Hall.

