For the purposes of this document a bit is a bit, a byte is 8 bits, a word is 16 bits, and a dword is 32 bits.
The shader processor has a Harvard architecture. A 32 bit instruction (d)word is used. There are eight general purpose 32 bit registers and a single hidden 16 bit program counter register. The processor addresses ram at a granularity of 4 bytes (!not byte addressable!). The processor does not implement a branch delay slot. There is no condition code register. Exceptions are not triggered for overflow; overflow detection requires an explicit test.
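Because overflow never traps, a program that cares about it must test explicitly. Below is a minimal C sketch of the kind of post-add tests a shader program would need to perform; the helper names are illustrative and not part of any tool in this project.

    #include <stdint.h>

    /* Illustrative software overflow tests (hypothetical helper names);
       the processor itself only produces the wrapped 32 bit result. */

    /* Unsigned add overflow: the wrapped sum is smaller than an operand. */
    static int uadd_overflowed(uint32_t a, uint32_t sum)
    {
        return sum < a;
    }

    /* Signed add overflow: the operands share a sign but the sum does not. */
    static int sadd_overflowed(int32_t a, int32_t b, int32_t sum)
    {
        return ((~(a ^ b)) & (a ^ sum)) < 0;
    }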
The data memory of the shader processor is very small and expected to be very local to the processor. The framebuffer is also only addressable at dword granularity, is presumed to be located 'far' from the shader processor, and can be accessed only by DMAing data into or out of the shader local memory. It is anticipated that shader programs will manually prefetch required data to hide memory latency. Shader program execution is not affected by DMA operation, except by inadvertent (silent) memory collisions.
Due to the FPGA target, instruction selection was not deemed critical. At the inception of the processor design it was expected that shader programs would be hand written in assembly, so instructions that simplify hand assembler coding, like bit sets/clears and branches on bit states, were added. Instructions were selected while writing the first programs, which became the floating point support library. Eight registers were settled on after development of these first programs was complicated by the initial choice of four GP registers.
Hardware floating point support was omitted in this implementation for several reasons:
Previous TVC releases successfully used only integer math in the pixel pipeline
The developer's unfamiliarity with the data format (partially learned during this implementation)
The primary challenge in implementing a graphics core appears to be exploiting massive parallelism at the graphics primitive level (polygons, fragments, etc.), not maximizing work at the instruction level, so there was little reason to complicate this first implementation with fp support.
Critical: add instructions and dedicated hardware to support synchronized operation between shader processors. Explore vector ops and registers, i.e. add some SIMD instructions. Maybe add fp. Decisions relating to how best to integrate the DSP elements within Xilinx series-7 FPGAs are expected to heavily affect instruction selection and overall architecture in future revisions.
Shader program length, working set size, and mix of available logic resources on future FPGA targets will determine if a new memory scheme is needed. One possible alternative is for instruction and data rams to be switchable between the current internal only mode and a conventional cache mode. I think this could be implemented by small tweaks to the DMA unit(s).
A far future design question is whether it would be beneficial to dynamically partition a sub-array of shader cores to create small pools of coherence for certain tasks.
The upper 7 bits are the instruction opcode (O), the next 3 bits are source register a (A), the next 3 bits are source register b (B), and the next 3 bits are the destination register (D). The final 16 bits of the instruction dword are the immediate word (I).
So, grouped by [] into 8 bit fields:
(31) [OOOOOOOA] [AABBBDDD] [IIIIIIII] [IIIIIIII] (0)
If the immediate word is used as an address, this allows 65536*4 bytes (256KB) of native address space for both the data and program memories. Given that many of these processors are supposed to be tiled onto a single FPGA, this address space seems large enough.
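As a concrete illustration of the field layout, here is a small C sketch that packs the five fields into one instruction dword. The helper is illustrative only and is not part of the assembler.

    #include <stdint.h>

    /* Pack one instruction dword: 7 bit opcode, 3 bit A/B/D register
       fields, 16 bit immediate, in the bit positions shown above. */
    static uint32_t encode(uint32_t op, uint32_t a, uint32_t b,
                           uint32_t d, uint32_t imm)
    {
        return ((op  & 0x7Fu)   << 25) |
               ((a   & 0x07u)   << 22) |
               ((b   & 0x07u)   << 19) |
               ((d   & 0x07u)   << 16) |
                (imm & 0xFFFFu);
    }

    /* Example: an ADD (opcode 0x03) of registers 1 and 2 into register 3
       would be encode(0x03, 1, 2, 3, 0). */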
Note: In the table below, %A, %B, %D, and %R refer to any GP register. The register letters in this section refer to the instruction encoding documented in the previous section.
Mnemonic | Desc | Enc | Notes
NOP | No operation | 0x00 | Lower bits ignored.
LUI %R IMM | Load Upper from Immediate | 0x01 | Sets the highest 16 bits of the register to the immediate; the assembler encodes %A and %D as the specified %R
LLI %R IMM | Load Lower from Immediate | 0x02 | Sets the lowest 16 bits of the register to the immediate; the assembler encodes %A and %D as the specified %R
ADD %A %B %D | Add %A to %B and store in %D | 0x03 |
SUB %A %B %D | Subtract %B from %A and store in %D | 0x04 | Inputs and outputs are signed.
ADDL %A %D IMM | Add sign-extended IMM to %A and store in %D | 0x05 |
AND %A %B %D | Logical AND %A with %B and store in %D | 0x06 |
OR %A %B %D | Logical OR %A with %B and store in %D | 0x07 |
XOR %A %B %D | Logical XOR %A with %B and store in %D | 0x08 |
NOT %A %D | Logical NOT %A and store in %D | 0x09 |
BSET %A %D IMM | Set bit IMM of %A and store in %D | 0x0a | IMM cannot be larger than 31
BCLR %A %D IMM | Clear bit IMM of %A and store in %D | 0x0b | IMM cannot be larger than 31
RSL %A %D IMM | Left shift %A by IMM bits and store in %D | 0x0c | IMM cannot be larger than 31 and must be positive; only a logical shift is performed
RSR %A %D IMM | Right shift %A by IMM bits and store in %D | 0x0d | IMM cannot be larger than 31 and must be positive; only a logical shift is performed
MUL %A %B %D | Multiply %A by %B and store the result in %D | 0x10 | Only implement a 16 bit multiply? For now, yes; the upper source bits are ignored.
CMP %A %B %D | Compare %A to %B and put the results in %D | 0x30 | Bit 0: equal; bit 1: %A > %B unsigned; bit 2: %A > %B signed; bit 3: %A < %B unsigned; bit 4: %A < %B signed
SRI %A IMM | Store Register %A to Immediate | 0x40 | Store register %A to the address held in the immediate word
SRR %A %B IMM | Store Register %A to the address generated by %B + sign-extended IMM | 0x41 | Store register %A to the address formed by the value in register %B plus the sign-extended IMM
LRI %D IMM | Load %D from the address in the Immediate | 0x42 | Load %D from the address held in the immediate word
LRR %B %D IMM | Load %D from the address formed by %B plus the sign-extended IMM | 0x43 | Not encoded using %A to save an adder in load-store.vhd
SEQZ %A | Skip next instruction if %A is zero | 0x50 |
SNEQZ %A | Skip next instruction if %A is not zero | 0x51 |
SBSET %A IMM | Skip next instruction if bit IMM of %A is set | 0x52 | IMM cannot be larger than 31
SBCLR %A IMM | Skip next instruction if bit IMM of %A is clear | 0x53 | IMM cannot be larger than 31
JI IMM | Jump to the address in the Immediate | 0x60 |
JR %A | Jump to the address in register %A | 0x61 | Perhaps add a variant with a sign-extended IMM added to the address in %A, but this seems too complex for now.
HLT | Halt | 0x70 |
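To make the CMP flag bits in the table above concrete, here is a small C model of the result word; this is a reading of the documented bit assignments, not code taken from the hardware source.

    #include <stdint.h>

    /* Model of the CMP result word written to %D:
       bit 0 equal, bits 1/2 greater-than (unsigned/signed),
       bits 3/4 less-than (unsigned/signed). */
    static uint32_t cmp_flags(uint32_t a, uint32_t b)
    {
        uint32_t r = 0;
        if (a == b)                   r |= 1u << 0;
        if (a > b)                    r |= 1u << 1;  /* unsigned > */
        if ((int32_t)a > (int32_t)b)  r |= 1u << 2;  /* signed   > */
        if (a < b)                    r |= 1u << 3;  /* unsigned < */
        if ((int32_t)a < (int32_t)b)  r |= 1u << 4;  /* signed   < */
        return r;
    }

A conditional branch is then a CMP followed by SBSET or SBCLR on the relevant bit and a JI to the target.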
Register 5 (%F) is used as a stack frame pointer (by convention only, see below).
Register 6 (%G) is an assembler temporary. As it is only used during the call and return assembler macros, it can normally be used outside of these calls, provided the programmer knows its value will not be preserved through a call or return.
Register 7 (%H) is treated as the stack pointer. The stack grows down. The stack pointer points to the next open slot; in other words, when writing to the stack the write occurs before the SP is decremented.
LI %R c
does a LUI and an LLI of the constant c into register %R
PUSH %R
Pushes register %R's contents to the stack and decrements the stack pointer by 1 (dword addressable)
PULL %R
Pulls the value from the stack, inserts it into register %R, and increments the stack pointer by 1 (see the sketch after this list)
CALL <addr>
Computes the proper return address and puts it in %G (AT), then jumps to addr. See Notes Section.
FNSETUP
The second half of a function call. FNSETUP pushes all register contents to the stack (except %H). Register %G (the CALL-calculated return address) is saved first, at stack offset '0'. It then updates the stack pointer to account for all values added; independent PUSHs are not performed. Expected to be called as the first 'instruction' of a function. See Notes Section.
RETURN
restores (PULLs) all register contents from the stack (except registers %G and %H)
PULLs the return address from the stack
jumps to that address
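A minimal C model of the PUSH/PULL discipline described above (SP in %H points at the next open slot, the stack grows down, and memory is dword addressed); the increment-before-read in pull() is an assumption that follows from the 'next open slot' convention.

    #include <stdint.h>

    /* mem is the shader data ram viewed as dwords; sp models %H. */
    static void push(uint32_t *mem, uint16_t *sp, uint32_t val)
    {
        mem[*sp] = val;   /* write into the open slot first...        */
        *sp -= 1;         /* ...then decrement (the stack grows down) */
    }

    static uint32_t pull(uint32_t *mem, uint16_t *sp)
    {
        *sp += 1;         /* step back up to the last written slot */
        return mem[*sp];
    }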
The assembler macro CALL computes the desired return address and leaves it in the %G (AT) register, then branches to the call address. The callee is expected to save processor register state on the stack via FNSETUP, using 7 stack locations. The compiler knows this and writes function arguments into the space below this -7 offset, so that after the CALL/FNSETUP pair the new function's frame pointer is set up pointing to these locations.
Procedure for a Function call:
The Caller:
considers the space required for all registers to be saved (space on the stack to be used by the CALL and FNSETUP macros)
considers the space required for the return value (1 more gap below the CALL/FNSETUP-modified stack pointer)
writes the function parameters into the space below the SP value calculated above.
performs the CALL (expected behavior – nothing tricky here).
the final instruction of the CALL macro jumps to the function address
The Callee:
Calls FNSETUP to save caller's register state. (expected behavior)
The first instructions of a function are compiler generated (not user code).
The stack pointer is copied into the %F register, forming the frame pointer for all of the function's variables. The compiler uses this frame pointer to access the return value, all named local variables, and compiler temporaries (the resulting layout is sketched after this procedure).
The stack pointer is decremented by the space required by all of the above, so the next PUSH does not stomp on anything.
Upon a compiled code return <value>
The <value> is copied to the reserved space in the stack frame.
restores the stack pointer from our 'frame' pointer
issues the RETURN assembler macro
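The frame layout implied by this procedure can be summarized as dword offsets from the %F frame pointer (the copy of SP taken right after FNSETUP). The C sketch below is my reading of the text, not values taken from the compiler source; in particular the exact slot of the return value relative to %F should be treated as an assumption.

    #include <stdint.h>

    /* One possible frame layout, as dword offsets from %F (assumption:
       derived from the prose above, not from the compiler source).
         fp + 7 .. fp + 1 : the 7 registers saved by FNSETUP
                            (%G, the return address, at fp + 7)
         fp + 0           : reserved return-value slot
         fp - 1, fp - 2,  : arguments written by the caller
         below that       : named locals and compiler temporaries */

    static uint16_t arg_slot(uint16_t fp, unsigned n)
    {
        return (uint16_t)(fp - 1u - n);  /* address of argument n (0-based) */
    }

    static uint16_t return_value_slot(uint16_t fp)
    {
        return fp;                       /* where `return <value>` is copied */
    }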
The shader core has an integrated 4 entry scatter gather DMA unit that allows the shader to transfer data between the shader memory and the framebuffer. The DMA unit operates independently of the execution of the shader core so the shader can save results and/or load new raw data while operating on some previously fetched data. This is the mechanism by which the shader core can hide memory latencies.
The DMA unit is interfaced by writing DMA instructions into special addresses at the end of the data memory address space. These memory locations allow the shader programs to activate or check the status of the DMA unit.
The DMA unit has slots for up to four DMA operations. Each operation is programmed by two dword values. The first dword is the DMA_CMD and the second dword is the DMA_FB_ADDR (the framebuffer address). The lowest word of the DMA_CMD dword is the 16 bit address for the DMA in the shader processor's data ram. The lowest 12 bits of the upper word in the DMA_CMD are the number of transfers. The highest bit in the DMA_CMD is the direction: if the highest bit is set, the transfer is into the shader; if it is not set, the transfer is out of the shader.
DMA commands are executed in order. The DMA is triggered by writing the number of DMA operations to perform (i.e. 1, 2, 3, or 4) to the address one past the end of the 8 configuration registers. Reading from this same location returns a non-zero value if the DMA unit is busy, or zero if the DMA unit is idle.
DMA_CMD_X | Lower 16 bits are the transfer address (A). The high bit (D) is the direction bit (set to 1 for reading into shader ram from the framebuffer). Bits (C) are the number of dwords transferred. Bits (X) are not examined. So: [D][XXX][CCCCCCCCCCCC][AAAAAAAAAAAAAAAA]
DMA_FB_ADDR_X | Starting address in the framebuffer
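A C sketch of how a slot's DMA_CMD dword could be assembled from these fields; the helper name is illustrative, and a shader program would build the same value with LUI/LLI and store it to the DMA configuration addresses.

    #include <stdint.h>

    /* Build a DMA_CMD dword: bit 31 = direction (1 = framebuffer into
       shader ram), bits 27..16 = dword count, bits 15..0 = address in
       the shader data ram.  Bits 30..28 are not examined. */
    static uint32_t dma_cmd(int into_shader, uint16_t count, uint16_t local_addr)
    {
        return ((into_shader ? 1u : 0u) << 31) |
               ((uint32_t)(count & 0x0FFFu) << 16) |
               (uint32_t)local_addr;
    }

    /* The slot's second dword, DMA_FB_ADDR, is just the starting
       framebuffer address and needs no packing. */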
See Illustration 1 for a diagram. The shader processor v2 has four stages:
Fetch
Buffer
Compute & Load/Store (to ram)
Retire (register operations)
Illustration 1: Pipeline Diagram v2a
As stated above, the instruction and data address spaces are 65536 dwords long. The implemented data space in each ram is expected to be significantly less than this value. The 256 addresses 0xFF00 – 0xFFFF in the data address space are considered to be external to the shader core. The shader's load_store unit traps reads and writes to this address range and performs reads/writes on a processor local bus that has 8 bit addresses and 32 bit data. It is through this local data bus that the DMA unit is activated by the shader core. This local data bus is very similar to the TVC's top level command bus. It is expected that specialized functional units will be designed and implemented first on the command bus, then moved from the top-level command bus to these processor local data buses. It is through this architectural feature that accelerated processing units will be added to the shader cores.
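A sketch of that address decode in C; the helper names are illustrative, only the 0xFF00 – 0xFFFF range and the 8 bit local bus address width come from the text above, and the use of the low 8 bits as the bus address is an assumption.

    #include <stdint.h>

    /* Data addresses 0xFF00 - 0xFFFF are trapped by load_store and
       forwarded to the 8 bit address / 32 bit data processor local bus
       (where the DMA unit and future functional units live). */
    static int is_local_bus(uint16_t data_addr)
    {
        return data_addr >= 0xFF00u;
    }

    /* Assumption: the low 8 bits of the trapped address select the
       device register on the local bus. */
    static uint8_t local_bus_addr(uint16_t data_addr)
    {
        return (uint8_t)(data_addr & 0x00FFu);
    }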
The buffer stage was added pre-release to double the clock speed from 40 to 80 MHz (Spartan IIIe).
Pre-release versions of the assembler combined FNSETUP into CALL, making the caller responsible for saving its state. Because there are normally many more function calls than function definitions in a program, it makes sense to keep most of the overhead of a function call in the callee, shrinking the emitted code size.
The first optimization was to have CALL only compute the return address and push it onto the stack, then jump to the callee, which would first call FNSETUP, thereby pushing the rest of the registers to the stack. The cost of this optimization is a single instruction: the stack pointer is modified once in CALL and once in FNSETUP.
The next optimization was to have CALL compute the return address and put it at the top of the stack without modifying the stack pointer; the matching FNSETUP assumes the return address is already there and that the stack pointer has not been modified. This adds no cost and reduces the binary size.
The final optimization is just to keep the return address in %G (Assembler temp) for CALL, a very conventional design.
These machinations were done to squeeze more functionality into the fixed size instruction ram.