For the purposes of this document a bit is a bit, a byte is 8 bits, a word is 16 bits, and a dword is 32 bits.
The shader processor has a Harvard architecture. A 32 bit instruction (d)word is used. There are eight general purpose 32 bit registers and a single hidden 16 bit program counter register. The processor addresses ram at a granularity of 4 bytes (!not byte addressable!). The processor does not implement a branch delay slot. There is no condition code register. Exceptions are not triggered for overflow; overflow detection requires an explicit test.
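Because overflow never traps, a program that cares about it must test explicitly. Below is a minimal C sketch of the kind of post-add tests a shader program would need to perform; the helper names are illustrative and not part of any tool in this project.

    #include <stdint.h>

    /* Illustrative software overflow tests (hypothetical helper names);
       the processor itself only produces the wrapped 32 bit result. */

    /* Unsigned add overflow: the wrapped sum is smaller than an operand. */
    static int uadd_overflowed(uint32_t a, uint32_t sum)
    {
        return sum < a;
    }

    /* Signed add overflow: the operands share a sign but the sum does not. */
    static int sadd_overflowed(int32_t a, int32_t b, int32_t sum)
    {
        return ((~(a ^ b)) & (a ^ sum)) < 0;
    }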
The data memory of the shader processor is very small and expected to be very local to the processor. The framebuffer is also only addressable at dword granularity, is presumed to be located 'far' from the shader processor, and can be accessed only by DMAing data into or out of the shader local memory. It is anticipated that shader programs will manually prefetch required data to hide memory latency. Shader program execution is not affected by DMA operation, except by inadvertent (silent) memory collisions.
Due to the FPGA target, instruction selection was not deemed critical. At the inception of the processor design it was expected that shader programs would be hand written in assembly, so instructions that simplify hand assembler coding, like bit sets/clears and branches on bit states, were added. Instructions were selected while writing the first programs, which became the floating point support library. Eight registers were settled on after development of these first programs was complicated by the initial choice of four GP registers.
Hardware floating point support was omitted in this implementation for several reasons:
Previous TVC releases successfully used only integer math in the pixel pipeline
The developer's unfamiliarity with the data format (partially learned during this implementation)
The primary challenge in implementing a graphics core appears to be exploiting massive parallelism at the graphics primitive level (polygons, fragments, etc.), not maximizing work at the instruction level, so there was little reason to complicate this first implementation with fp support.
Critical: add instructions and dedicated hardware to support synchronized operation between shader processors. Explore vector ops and registers, i.e. add some SIMD instructions. Maybe add fp. Decisions relating to how best to integrate the DSP elements within Xilinx series-7 FPGAs are expected to heavily affect instruction selection and overall architecture in future revisions.
Shader program length, working set size, and mix of available logic resources on future FPGA targets will determine if a new memory scheme is needed. One possible alternative is for instruction and data rams to be switchable between the current internal only mode and a conventional cache mode. I think this could be implemented by small tweaks to the DMA unit(s).
A far future design question is whether it would be beneficial to dynamically partition a sub-array of shader cores to create small pools of coherence for certain tasks.
The upper 7 bits are the instruction opcode (O), the next 3 bits are source register a (A), the next 3 bits are source register b (B), and the next 3 bits are the destination register (D). The final 16 bits of the instruction dword are the immediate word (I).
So, grouped by [] into 8 bit fields:
(31) [OOOOOOOA] [AABBBDDD] [IIIIIIII] [IIIIIIII] (0)
If the immediate word is used as an address, this allows 65536*4 bytes (256KB) of native address space for both the data and program memories. Given that many of these processors are supposed to be tiled onto a single FPGA, this address space seems large enough.
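As a concrete illustration of the field layout, here is a small C sketch that packs the five fields into one instruction dword. The helper is illustrative only and is not part of the assembler.

    #include <stdint.h>

    /* Pack one instruction dword: 7 bit opcode, 3 bit A/B/D register
       fields, 16 bit immediate, in the bit positions shown above. */
    static uint32_t encode(uint32_t op, uint32_t a, uint32_t b,
                           uint32_t d, uint32_t imm)
    {
        return ((op  & 0x7Fu)   << 25) |
               ((a   & 0x07u)   << 22) |
               ((b   & 0x07u)   << 19) |
               ((d   & 0x07u)   << 16) |
                (imm & 0xFFFFu);
    }

    /* Example: an ADD (opcode 0x03) of registers 1 and 2 into register 3
       would be encode(0x03, 1, 2, 3, 0). */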
Note: In the table below, %A, %B, %D, and %R refer to any GP register. The register letters in this section refer to the instruction encoding documented in the previous section.
Mnemonic | Desc | Enc | Notes
NOP | No operation | 0x00 | Lower bits ignored.
LUI %R IMM | Load Upper from Immediate | 0x01 | Sets the highest 16 bits of the register to the immediate; the assembler encodes %A and %D as the specified %R
LLI %R IMM | Load Lower from Immediate | 0x02 | Sets the lowest 16 bits of the register to the immediate; the assembler encodes %A and %D as the specified %R
ADD %A %B %D | Add %A to %B and store in %D | 0x03 |
SUB %A %B %D | Subtract %B from %A and store in %D | 0x04 | Inputs and outputs are signed.
ADDL %A %D IMM | Add sign-extended IMM to %A and store in %D | 0x05 |
AND %A %B %D | Logical AND %A with %B and store in %D | 0x06 |
OR %A %B %D | Logical OR %A with %B and store in %D | 0x07 |
XOR %A %B %D | Logical XOR %A with %B and store in %D | 0x08 |
NOT %A %D | Logical NOT %A and store in %D | 0x09 |
BSET %A %D IMM | Set bit IMM of %A and store in %D | 0x0a | IMM cannot be larger than 31
BCLR %A %D IMM | Clear bit IMM of %A and store in %D | 0x0b | IMM cannot be larger than 31
RSL %A %D IMM | Left shift %A by IMM bits and store in %D | 0x0c | IMM cannot be larger than 31 and must be positive; only a logical shift is performed
RSR %A %D IMM | Right shift %A by IMM bits and store in %D | 0x0d | IMM cannot be larger than 31 and must be positive; only a logical shift is performed
MUL %A %B %D | Multiply %A by %B and store the result in %D | 0x10 | Only implement a 16 bit multiply? For now, yes; the upper source bits are ignored.
CMP %A %B %D | Compare %A to %B and put the results in %D | 0x30 | Bit 0: equal; bit 1: %A > %B unsigned; bit 2: %A > %B signed; bit 3: %A < %B unsigned; bit 4: %A < %B signed
SRI %A IMM | Store Register %A to Immediate | 0x40 | Store register %A to the address held in the immediate word
SRR %A %B IMM | Store Register %A to the address generated by %B + sign-extended IMM | 0x41 | Store register %A to the address formed by the value in register %B plus the sign-extended IMM
LRI %D IMM | Load %D from the address in the Immediate | 0x42 | Load %D from the address held in the immediate word
LRR %B %D IMM | Load %D from the address formed by %B plus the sign-extended IMM | 0x43 | Not encoded using %A to save an adder in load-store.vhd
SEQZ %A | Skip next instruction if %A is zero | 0x50 |
SNEQZ %A | Skip next instruction if %A is not zero | 0x51 |
SBSET %A IMM | Skip next instruction if bit IMM of %A is set | 0x52 | IMM cannot be larger than 31
SBCLR %A IMM | Skip next instruction if bit IMM of %A is clear | 0x53 | IMM cannot be larger than 31
JI IMM | Jump to the address in the Immediate | 0x60 |
JR %A | Jump to the address in register %A | 0x61 | Perhaps add a variant with a sign-extended IMM added to the address in %A, but this seems too complex for now.
HLT | Halt | 0x70 |
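To make the CMP flag bits in the table above concrete, here is a small C model of the result word; this is a reading of the documented bit assignments, not code taken from the hardware source.

    #include <stdint.h>

    /* Model of the CMP result word written to %D:
       bit 0 equal, bits 1/2 greater-than (unsigned/signed),
       bits 3/4 less-than (unsigned/signed). */
    static uint32_t cmp_flags(uint32_t a, uint32_t b)
    {
        uint32_t r = 0;
        if (a == b)                   r |= 1u << 0;
        if (a > b)                    r |= 1u << 1;  /* unsigned > */
        if ((int32_t)a > (int32_t)b)  r |= 1u << 2;  /* signed   > */
        if (a < b)                    r |= 1u << 3;  /* unsigned < */
        if ((int32_t)a < (int32_t)b)  r |= 1u << 4;  /* signed   < */
        return r;
    }

A conditional branch is then a CMP followed by SBSET or SBCLR on the relevant bit and a JI to the target.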
Register 5 (%F) is used as a stack frame pointer (by convention only, see below).
Register 6 (%G) is an assembler temporary. As it is only used during the call and return assembler macros, it can normally be used outside of these calls, provided the programmer knows its value will not be preserved through a call or return.
Register 7 (%H) is treated as the stack pointer. The stack grows down. The stack pointer points to the next open slot; in other words, when writing to the stack the write occurs before the SP is decremented.
LI %R c
does a LUI and an LLI of the constant c into register %R
PUSH %R
Pushes register %R's contents to the stack and decrements the stack pointer by 1 (dword addressable)
PULL %R
Pulls the value from the stack, inserts it into register %R, and increments the stack pointer by 1 (see the sketch after this list)
CALL <addr>
Computes the proper return address and puts it in %G (AT), then jumps to addr. See Notes Section.
FNSETUP
The second half of a function call. FNSETUP pushes all register contents to the stack (except %H). Register %G (the CALL-calculated return address) is saved first, at stack offset '0'. It then updates the stack pointer to account for all values added; independent PUSHs are not performed. Expected to be called as the first 'instruction' of a function. See Notes Section.
RETURN
restores (PULLs) all register contents from the stack (except registers %G and %H)
PULLs the return address from the stack
jumps to that address
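A minimal C model of the PUSH/PULL discipline described above (SP in %H points at the next open slot, the stack grows down, and memory is dword addressed); the increment-before-read in pull() is an assumption that follows from the 'next open slot' convention.

    #include <stdint.h>

    /* mem is the shader data ram viewed as dwords; sp models %H. */
    static void push(uint32_t *mem, uint16_t *sp, uint32_t val)
    {
        mem[*sp] = val;   /* write into the open slot first...        */
        *sp -= 1;         /* ...then decrement (the stack grows down) */
    }

    static uint32_t pull(uint32_t *mem, uint16_t *sp)
    {
        *sp += 1;         /* step back up to the last written slot */
        return mem[*sp];
    }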
The assembler macro CALL computes the desired return address and leaves it in the %G (AT) register, then branches to the call address. The callee is expected to save processor register state on the stack via FNSETUP, using 7 stack locations. The compiler knows this and writes function arguments into the space below this -7 offset, so that after the CALL/FNSETUP pair the new function's frame pointer is set up pointing to these locations.
Procedure for a Function call:
The Caller:
considers the space required for all registers to be saved (space on the stack to be used by the CALL and FNSETUP macros)
considers the space required for the return value (1 more gap below the CALL/FNSETUP-modified stack pointer)
writes the function parameters into the space below the SP value calculated above.
performs the CALL (expected behavior – nothing tricky here).
the final instruction of the CALL macro jumps to the function address
The Callee:
Calls FNSETUP to save caller's register state. (expected behavior)
The first instructions of a function are compiler generated (not user code).
The stack pointer is copied into the %F register, forming the frame pointer for all of the function's variables. The compiler uses this frame pointer to access the return value, all named local variables, and compiler temporaries (the resulting layout is sketched after this procedure).
The stack pointer is decremented by the space required by all of the above, so the next PUSH does not stomp on anything.
Upon a compiled code return <value>
The <value> is copied to the reserved space in the stack frame.
restores the stack pointer from our 'frame' pointer
issues the RETURN assembler macro
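The frame layout implied by this procedure can be summarized as dword offsets from the %F frame pointer (the copy of SP taken right after FNSETUP). The C sketch below is my reading of the text, not values taken from the compiler source; in particular the exact slot of the return value relative to %F should be treated as an assumption.

    #include <stdint.h>

    /* One possible frame layout, as dword offsets from %F (assumption:
       derived from the prose above, not from the compiler source).
         fp + 7 .. fp + 1 : the 7 registers saved by FNSETUP
                            (%G, the return address, at fp + 7)
         fp + 0           : reserved return-value slot
         fp - 1, fp - 2,  : arguments written by the caller
         below that       : named locals and compiler temporaries */

    static uint16_t arg_slot(uint16_t fp, unsigned n)
    {
        return (uint16_t)(fp - 1u - n);  /* address of argument n (0-based) */
    }

    static uint16_t return_value_slot(uint16_t fp)
    {
        return fp;                       /* where `return <value>` is copied */
    }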
The shader core has an integrated 4 entry scatter gather DMA unit that allows the shader to transfer data between the shader memory and the framebuffer. The DMA unit operates independently of the execution of the shader core so the shader can save results and/or load new raw data while operating on some previously fetched data. This is the mechanism by which the shader core can hide memory latencies.
The DMA unit is interfaced by writing DMA instructions into special addresses at the end of the data memory address space. These memory locations allow the shader programs to activate or check the status of the DMA unit.
The DMA unit has slots for up to four DMA operations. Each operation is programmed by two dword values. The first dword is the DMA_CMD and the second dword is the DMA_FB_ADDR (the framebuffer address). The lowest word of the DMA_CMD dword is the 16 bit address for the DMA in the shader processor's data ram. The lowest 12 bits of the upper word in the DMA_CMD are the number of transfers. The highest bit in the DMA_CMD is the direction: if the highest bit is set, the transfer is into the shader; if it is not set, the transfer is out of the shader.
DMA commands are executed in order. The DMA is triggered by writing the number of DMA operations to perform (i.e. 1, 2, 3, or 4) to the address one past the end of the 8 configuration registers. Reading from this same location returns a non-zero value if the DMA unit is busy, or zero if the DMA unit is idle.
DMA_CMD_X | Lower 16 bits are the transfer address (A). The high bit (D) is the direction bit (set to 1 for reading into shader ram from the framebuffer). Bits (C) are the number of dwords transferred. Bits (X) are not examined. So: [D][XXX][CCCCCCCCCCCC][AAAAAAAAAAAAAAAA]
DMA_FB_ADDR_X | Starting address in the framebuffer
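A C sketch of how a slot's DMA_CMD dword could be assembled from these fields; the helper name is illustrative, and a shader program would build the same value with LUI/LLI and store it to the DMA configuration addresses.

    #include <stdint.h>

    /* Build a DMA_CMD dword: bit 31 = direction (1 = framebuffer into
       shader ram), bits 27..16 = dword count, bits 15..0 = address in
       the shader data ram.  Bits 30..28 are not examined. */
    static uint32_t dma_cmd(int into_shader, uint16_t count, uint16_t local_addr)
    {
        return ((into_shader ? 1u : 0u) << 31) |
               ((uint32_t)(count & 0x0FFFu) << 16) |
               (uint32_t)local_addr;
    }

    /* The slot's second dword, DMA_FB_ADDR, is just the starting
       framebuffer address and needs no packing. */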
See Illustration 1 for a diagram. The shader processor v2 has four stages:
Fetch
Buffer
Compute & Load/Store (to ram)
Retire (register operations)
Illustration 1: Pipeline Diagram v2a
As stated above, the instruction and data address spaces are 65536 dwords long. The implemented data space in each ram is expected to be significantly less than this value. The 256 addresses 0xFF00 – 0xFFFF in the data address space are considered to be external to the shader core. The shader's load_store unit traps reads and writes to this address range and performs reads/writes on a processor local bus that has 8 bit addresses and 32 bit data. It is through this local data bus that the DMA unit is activated by the shader core. This local data bus is very similar to the TVC's top level command bus. It is expected that specialized functional units will be designed and implemented first on the command bus, then moved from the top-level command bus to these processor local data buses. It is through this architectural feature that accelerated processing units will be added to the shader cores.
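A sketch of that address decode in C; the helper names are illustrative, only the 0xFF00 – 0xFFFF range and the 8 bit local bus address width come from the text above, and the use of the low 8 bits as the bus address is an assumption.

    #include <stdint.h>

    /* Data addresses 0xFF00 - 0xFFFF are trapped by load_store and
       forwarded to the 8 bit address / 32 bit data processor local bus
       (where the DMA unit and future functional units live). */
    static int is_local_bus(uint16_t data_addr)
    {
        return data_addr >= 0xFF00u;
    }

    /* Assumption: the low 8 bits of the trapped address select the
       device register on the local bus. */
    static uint8_t local_bus_addr(uint16_t data_addr)
    {
        return (uint8_t)(data_addr & 0x00FFu);
    }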
The buffer stage was added pre-release to double the clock speed from 40 to 80 MHz (Spartan IIIe).
Pre-release versions of the assembler combined FNSETUP into CALL, making the caller responsible for saving its state. Because there are normally many more function calls than function definitions in a program, it makes sense to keep most of the overhead of a function call in the callee, shrinking the emitted code size.
The first optimization was to have CALL only compute the return address and push it onto the stack, then jump to the callee, which would first call FNSETUP, thereby pushing the rest of the registers to the stack. The cost of this optimization is a single instruction: the stack pointer is modified once in CALL and once in FNSETUP.
The next optimization was to have CALL compute the return address and put it at the top of the stack without modifying the stack pointer; the matching FNSETUP assumes the return address is already there and that the stack pointer has not been modified. This adds no cost and reduces the binary size.
The final optimization is just to keep the return address in %G (Assembler temp) for CALL, a very conventional design.
These machinations were done to squeeze more functionality into the fixed size instruction ram.