TVC R5 “Simple Shader”

This release of the TVC project includes a newly designed shader processor, a single instance of which is instantiated in the RTL. This shader processor is supported by a compiler (that implements a C-like language), an assembler/linker, a graphical simulator/debugger, and a simulation framework for the RTL.

The shader processor is designed to have small, dedicated local program and data memories. It interacts with the framebuffer using an integrated DMA unit, or potentially indirectly through functional units mapped into its I/O address space. Clever use of the DMA unit should allow memory access latency to be hidden. Accessing the framebuffer only through the DMA unit effectively dispenses with the idea of a coherent view of framebuffer RAM. It is expected that future implementations will synchronize execution of shader programs and access to the framebuffer through explicit means. By dispensing with memory coherence, the enormous design task and operational cost of implementing it are avoided. This design is based on the assumption that graphics algorithms are sufficiently embarrassingly parallel, and have small enough working sets, that these shader cores will work mostly independently.

The primary purpose of this text is to document the overall design of this TVC release. Please see the Shader Processor v2a document for a further description of the shader processor, its instruction set and related ABI.

Design Overview

This release of the TVC consists of 34 VHDL modules. The overall organization of these modules may be seen in Illustration 1. The most important structural component of the TVC is the MCU. The MCU provides and controls a Memory User (MU) bus that runs throughout the TVC to all units that need to access or modify the frame buffer contents. Through its operation, the MCU schedules the interaction of all components within the TVC. A Pixel Pusher module passively reads framebuffer contents via the MU bus and displays them on an attached display. A set of Functional Units (FUs) connect to the MU bus and perform drawing calls on behalf of an external host. The external host talks to and gives high level commands to the set of FUs through the Command Bus.

Illustration 1: TVC Overall Structure

MU Bus

The Memory Users (MUs) are driven by the Memory Controller Unit (MCU) through the MU Bus. When idle, the MCU checks to see if any MUs have raised an interrupt. If a scanline cache MU has raised an interrupt, that interrupt is serviced ahead of any other MU. This helps guarantee that pixel values will be available to the pixel pusher when it needs them to output to the display. MUs other than the scanline caches are serviced in round-robin order. A MU may have to wait an arbitrarily long time before being serviced by the MCU. After selecting a given MU, the MCU will completely satisfy the request of that MU before it moves on to another MU. This means that the maximum memory request length must be kept short enough that a single request cannot starve the scanline caches.
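The selection policy described above can be sketched as a small behavioral model. This is only an illustration in Python and not the RTL: the class name McuArbiter, the MU names, and the choice to resume the round-robin scan just after the last serviced MU are assumptions. The text also does not say how several scanline caches are ordered among themselves, so the sketch simply uses list order.

```python
class McuArbiter:
    """Behavioral sketch of MU selection: scanline caches first, others round-robin."""

    def __init__(self, scanline_mus, other_mus):
        self.scanline_mus = list(scanline_mus)  # always serviced ahead of other MUs
        self.other_mus = list(other_mus)        # serviced in round-robin order
        self.next_index = 0                     # where the round-robin scan resumes

    def select(self, irqs):
        """Return the name of the MU to service, or None if no irq is raised.

        irqs is the set of MU names whose irq line is currently high.
        """
        for name in self.scanline_mus:
            if name in irqs:
                return name
        count = len(self.other_mus)
        for offset in range(count):
            index = (self.next_index + offset) % count
            name = self.other_mus[index]
            if name in irqs:
                # Assumption: the next scan starts just after the MU serviced now.
                self.next_index = (index + 1) % count
                return name
        return None
```

Because a selected MU is serviced to completion, a model like this only decides who goes next; the worst-case wait of a scanline cache is still set by the longest request any other MU may issue.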

After the MCU selects a MU, the MCU reads several special MU registers that contain the actions that the MU wants the MCU to perform. The MCU reads all of the registers it needs before it starts processing the request. The MCU will be capable of several operations:

Reading a sequential range of aligned memory locations from the framebuffer into the MU.
Writing a sequential range of aligned memory locations from the MU into the framebuffer.
(not implemented) Read/Modify/Write of framebuffer memory addresses for implementing locking.
?more?

Bus Details

Remember that the MCU is completely in control of the MU bus. MUs are purely reactive to the values driven on the bus by the MCU. Because all MUs see the same mu_bus, they only react to it when they are selected in response to an interrupt they raised.

mu_bus_data

Data width is 32 bits. Read/write by the MCU and MU. Only full dwords are written or read at a time.

mu_bus_address

Address bus width is 2 bits. See the MCU request specification for address value meanings.

mu_bus_strobe

When this line is high, the MU should perform the read or write operation at its internally generated address and then increment that internal address. For a read, the MU should present the new data on the mu_bus_data lines for the next clock cycle.

mu_bus_write

The MCU raises this line when the operation it is performing is a write to the MU.

In addition to the bus lines mentioned above, each MU has two lines connecting directly to the MCU.

<MU_name>_irq

The MU raises this line when it wants the MCU to perform an operation on its behalf.

<MU_name>_select

The MCU raises this line in response to the irq line raised by the MU when it is starting to service its request. The MCU lowers this line after it is finished servicing the MU's request.

MCU Request Specification

As stated above, the MU must define and communicate the operation that it wants the MCU to perform on its behalf. This is done by providing two registers in each MU that the MCU can only read. Just after selecting a MU, the MCU reads both of these registers (starting at mu_bus_address 1) to configure itself. After reading the MU request specification, the MCU sets mu_bus_address to 0 and reads or writes the requested number of dword values from or to the MU, in order, starting from address 0. After the MCU completes this operation on behalf of the MU, the MCU deselects the MU by driving its select line to zero.

mu_bus_address	Meaning
0	RW data values
1	Read cmd_def_reg_0 (when mu_bus_write is low)
2	Read cmd_def_reg_1 (when mu_bus_write is low)
others	undefined

cmd_def_reg_0

Bits 31 to 24: the MCU command type

Bits 11 to 0: the number of framebuffer dword addresses to read/write/modify

Right now the only two commands are 0x80 (write) and 0x00 (read), so only bit 31 of the command is used. To restate: 0x80 writes to the framebuffer from the MU, and 0x00 reads from the framebuffer into the MU. (Apparently the shader 'c' code has this reversed!)

cmd_def_reg_1

Bits 31 to 0 hold the 32 bit framebuffer starting address of the read/write/modify operation. Note that the bottom two bits are ignored, as only dword-aligned, dword-wide framebuffer reads and writes are allowed.
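As a rough illustration of the two register layouts above, the following Python sketch splits the register values into their fields. The names McuRequest and decode_request are hypothetical. Bits 23 to 12 of cmd_def_reg_0 are not described in this document, so the sketch ignores them.

```python
from typing import NamedTuple

CMD_READ = 0x00   # read from the framebuffer into the MU
CMD_WRITE = 0x80  # write to the framebuffer from the MU


class McuRequest(NamedTuple):
    command: int      # bits 31..24 of cmd_def_reg_0
    dword_count: int  # bits 11..0 of cmd_def_reg_0
    fb_address: int   # cmd_def_reg_1 with the bottom two bits cleared


def decode_request(cmd_def_reg_0: int, cmd_def_reg_1: int) -> McuRequest:
    """Split the two MU request registers into their documented fields."""
    command = (cmd_def_reg_0 >> 24) & 0xFF
    dword_count = cmd_def_reg_0 & 0xFFF
    fb_address = cmd_def_reg_1 & 0xFFFFFFFC  # only dword-aligned accesses are allowed
    return McuRequest(command, dword_count, fb_address)
```

For example, decode_request(0x80000010, 0x00001237) describes a 16 dword write to the framebuffer starting at address 0x1234.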

MCU Transfer

After reading the MU's cmd_def_reg(s), the MCU is expected to set the mu_bus_address to 0 and begin the main data transfer. The MU is expected to maintain a local counter representing which MU dependent address/location the value written/read should come from/go to. The MCU uses the mu_bus_strobe to indicate to the MU that it should increment its local address counter. A transfer can occur every clock if the strobe line is held high. Note that the strobe line is just used as an address increment, and should no longer be thought about as an async select.
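The MU side of the main transfer can be pictured with a small behavioral model. This Python sketch is an assumption-laden illustration, not the VHDL: the class name MuModel is hypothetical, the counter is assumed to restart at 0 on each selection, and cycle-level timing (such as read data appearing on the following clock) is abstracted away.

```python
class MuModel:
    """Behavioral sketch of an MU's side of the main data transfer."""

    def __init__(self, size):
        self.buffer = [0] * size  # MU-local storage, one entry per dword
        self.counter = 0          # local address counter

    def select(self):
        # Assumption: the local counter restarts at 0 each time the MU is selected.
        self.counter = 0

    def clock(self, strobe, write, data_in=0):
        """Advance one clock with mu_bus_address = 0.

        Returns the dword the MU drives on mu_bus_data for a read, else None.
        """
        if not strobe:
            return None  # no strobe: nothing transferred, counter unchanged
        result = None
        if write:
            self.buffer[self.counter] = data_in & 0xFFFFFFFF  # MCU writes into the MU
        else:
            result = self.buffer[self.counter]                # MCU reads from the MU
        self.counter += 1  # the strobe acts purely as an address increment
        return result
```

Holding strobe high on consecutive calls corresponds to one transfer per clock.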

Host Connection

The connection to the host is implemented by a USB bus provided by the Nexys2. This USB bus is interfaced to the Command Bus by the usb_io_async.vhd module. USB communication is implemented as a stream of fixed-size packets. In this implementation, messages are prefixed with 0xAA and terminated with 0xFF to assist in framing. The protocol was designed to interface the Command Bus to the Nexys2's USB chip as simply as possible.

TVC USB Write Packet
Offset	Value	Meaning
0	0xAA	Header
1	0xF0 for CB write, 0x0F for CB read	USB Command
2	<varies>	Command Bus Address
3	<varies> for CB write, 0 for CB read	Data payload byte 0
4	<varies> for CB write, 0 for CB read	Data payload byte 1
5	<varies> for CB write, 0 for CB read	Data payload byte 2
6	<varies> for CB write, 0 for CB read	Data payload byte 3
7	0xFF	Packet Termination
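A host-side helper for building these 8 byte packets might look like the following Python sketch. The function names are hypothetical, and the byte order of the payload is an assumption: the table does not say whether payload byte 0 is the least or most significant byte, and the sketch treats it as the least significant.

```python
USB_HEADER = 0xAA
USB_TERMINATOR = 0xFF
USB_CMD_CB_WRITE = 0xF0
USB_CMD_CB_READ = 0x0F


def build_cb_write_packet(cb_address: int, value: int) -> bytes:
    """Build the packet that requests a command bus write."""
    # Assumption: payload byte 0 is the least significant byte of the dword.
    payload = (value & 0xFFFFFFFF).to_bytes(4, "little")
    head = bytes([USB_HEADER, USB_CMD_CB_WRITE, cb_address & 0xFF])
    return head + payload + bytes([USB_TERMINATOR])


def build_cb_read_packet(cb_address: int) -> bytes:
    """Build the packet that requests a command bus read (payload bytes are 0)."""
    return bytes([USB_HEADER, USB_CMD_CB_READ, cb_address & 0xFF,
                  0, 0, 0, 0, USB_TERMINATOR])
```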

When the usb_io_async.vhd module receives a CB read command packet, it performs the command bus read and then writes the following packet to the USB bus.

TVC USB Read Packet
Offset	Value	Meaning
0	0xAA	Header
3	<varies>	CB read data payload byte 0
4	<varies>	CB read data payload byte 1
5	<varies>	CB read data payload byte 2
6	<varies>	CB read data payload byte 3
7	0xFF	Packet Termination
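On the host, the reply can be checked and unpacked along these lines. This is a Python sketch with a hypothetical function name; offsets 1 and 2 are not described in the table above, so they are deliberately left unchecked, and the payload byte order (byte 0 as least significant) is an assumption.

```python
def parse_cb_read_reply(packet: bytes) -> int:
    """Return the 32 bit value carried by a TVC USB read packet."""
    if len(packet) != 8:
        raise ValueError("expected an 8 byte packet")
    if packet[0] != 0xAA or packet[7] != 0xFF:
        raise ValueError("bad header or terminator byte")
    # Offsets 1 and 2 are not documented for this packet, so they are not checked.
    # Assumption: payload byte 0 (offset 3) is the least significant byte.
    return int.from_bytes(packet[3:7], "little")
```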

Command Bus

The command bus has a 32 bit data path and an 8 bit address, and is used only to write data into and read data from the drawing functional units. Each functional unit reserves an address range on the command bus that does not overlap with that of any other functional unit. Functional units are only expected to respond to the command bus when they are idle. The results of reads and writes to addresses that belong to functional units that are not idle are undefined (in most cases the access is simply ignored). There is no requirement for a functional unit to provide both read and write functionality at every address that belongs to it on the bus. The functional units are expected to implement only the command bus functionality necessary for their implementation-specific operation. Functional units are activated by writing to a special address within each unit's address space.

Functional Units

Block Write

Writes a single memory bus aligned 32 bit value to the frame buffer. Functional unit is initiated with a write to address 0x13 on command bus.

Note: The write will be complete when the functional unit becomes idle.

Base Address 0x10

Command Bus Address	Function/name	Note
0x11	Desired frame buffer address	Only highest 30 bits are used
0x12	Value to write
0x13	Start FU	Value on bus not used

Block Read

Reads a single memory bus aligned 32 bit value from the framebuffer. Functional unit is initiated with a write to address 0x23 on the command bus.

Base Address: 0x20

Command Bus Address	Function/name	Note
0x21	Desired frame buffer address	Only highest 30 bits are used
0x22	Address from which fb value can be read
0x23	Start FU	Value on bus not used
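The register sequences for the Block Write and Block Read units can be written out as lists of command bus operations. The Python sketch below only builds the lists; the function names and the ("write"/"read", address, data) tuple format are hypothetical, and how the host learns that a unit has become idle is not shown because this document does not describe it.

```python
def block_write_ops(fb_address, value):
    """Command bus operations for one Block Write."""
    return [
        ("write", 0x11, fb_address & 0xFFFFFFFF),  # FU uses only the highest 30 bits
        ("write", 0x12, value & 0xFFFFFFFF),       # value to write
        ("write", 0x13, 0),                        # start the FU; data value is not used
    ]


def block_read_ops(fb_address):
    """Command bus operations for one Block Read."""
    return [
        ("write", 0x21, fb_address & 0xFFFFFFFF),  # FU uses only the highest 30 bits
        ("write", 0x23, 0),                        # start the FU; data value is not used
        # Assumption: this read is only issued once the FU is idle again.
        ("read", 0x22, None),
    ]
```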

Block Set

This command sets a block of consecutive DRAM-aligned locations to a given value. This command is used for clearing the front and depth buffers.

Note: This command only operates on DRAM-aligned data values. This is implemented by treating the least significant two bits of the fb_address and size_in_bytes values as zero. As currently implemented, this command can hog the MU data bus. Its design must be changed to break its writes into several/many shorter bursts.

Command byte: 0x03

Base address is 0x00 on the command bus

Command Bus Address	Function/name	Note
0x01	Starting frame buffer address	Only highest 30 bits are used
0x02	Number of registers	Only bits 23 downto 2 are used (got to fix that...)
0x03	Value to write
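The alignment rules above amount to simple masking. The following Python sketch shows the arithmetic under one reading of the text: it assumes the value written to address 0x02 is the size_in_bytes mentioned in the note, even though the table labels it "Number of registers". The function name is hypothetical.

```python
def block_set_effective(fb_address, size_in_bytes):
    """Return (address, size in bytes, size in dwords) as the FU would see them."""
    effective_address = fb_address & 0xFFFFFFFC  # low two bits treated as zero
    effective_size = size_in_bytes & 0x00FFFFFC  # only bits 23 downto 2 are used
    return effective_address, effective_size, effective_size // 4
```

For example, a request for address 0x1003 and size 0x107 is treated as address 0x1000 and 0x104 bytes, which is 65 dwords.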

Poly Scanline

The polygon scan line functional unit implements the most computationally intensive part of OpenGL rendering: the innermost rendering loop, which deals with the individual pixels in a group that spans the width of a triangle, called a polygon scan line. If the texture_id is specified as zero, this functional unit assumes the polygon is flat shaded. If the texture_id is non-zero, the functional unit performs texture mapping based on nearest pixel selection.

In this TVC release, the PolyScanline unit is broken. Its future is uncertain, as a shader processor can perform the polyscanline operation in software. It is anticipated, though, that polyscanline units will be attached to the local I/O bus of the shader processors in order to accelerate rendering.

Base address is 0xF0 on the command bus

Command Bus Address	Function/name	Note
0xF0	scanline_fb_address
0xF1	depthbuffer_fb_address
0xF2	scanline_start_z_value
0xF3	scanline_z_increment
0xF4	scanline_length	Only lowest 16 bits are used
0xF5	pixel_value	Only lowest 8 bits are (presently) used
0xF6	Texture coord x & y	High 16 bits tcx Low 16 bits tcy
0xF7	Texture coord increment x & y	High 16 bits tc_xinc Low 16 bits tc_yinc
0xF8	Texture id	Only lowest 3 bits are used.
0xF9	Start FU	Value on data bus is not used
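The register table above can be exercised with a short Python sketch that packs the two 16 bit texture fields and lists the writes for one scanline. The function names are hypothetical, the numeric format of the texture coordinates is not specified here so they are treated as raw 16 bit fields, and since the unit is broken in this release the sketch only illustrates the register layout.

```python
def pack_pair(high16, low16):
    """Pack two 16 bit fields into one dword, first argument in the high half."""
    return ((high16 & 0xFFFF) << 16) | (low16 & 0xFFFF)


def poly_scanline_ops(fb_addr, depth_addr, start_z, z_inc, length, pixel,
                      tcx, tcy, tc_xinc, tc_yinc, texture_id):
    """Return (command bus address, value) writes for one polygon scan line."""
    return [
        (0xF0, fb_addr),
        (0xF1, depth_addr),
        (0xF2, start_z),
        (0xF3, z_inc),
        (0xF4, length & 0xFFFF),            # only lowest 16 bits are used
        (0xF5, pixel & 0xFF),               # only lowest 8 bits are used
        (0xF6, pack_pair(tcx, tcy)),        # high 16 bits tcx, low 16 bits tcy
        (0xF7, pack_pair(tc_xinc, tc_yinc)),
        (0xF8, texture_id & 0x7),           # only lowest 3 bits are used; 0 = flat shaded
        (0xF9, 0),                          # start the FU; data value is not used
    ]
```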

Shader Processor

In this release the Shader Processor has a 1024 dword data space, and an 8192 dword address space.

The shader control unit is driven by the Command Bus and can halt and reset the shader core as well as read and write to the shader core's local BRAM memories. This is how programs and data are loaded into the shader instruction and data RAMs for execution.

The shader CB interface has 9 reserved addresses (E0 through E8), all prefixed with 'E'.

Command Bus Address	Write Meaning	Read Meaning
E0	A 32 bit scratch data register	Same value as write
E1	A 16 bit address for a ram value to be written or read. 32 bit but upper 16 bits are ignored.	Same value as write
E2	Any value written here triggers a write of the value in E0 to instruction ram address E1	Not implemented
E3	Any value written here triggers a read from instruction ram address E1 into E0 register	Not implemented
E4	Any value written here triggers a write of the value in E0 to data ram address E1	Not implemented
E5	Any value written here triggers a read from data ram address E1 into E0 register	Not implemented
E6	Forces a halt of the processor	Status register: all zeros except bit 0, which is the <halt> status of the shader. Note/Hack: the upper 16 bits now hold the program counter.
E7	Clears the halt of the processor, no reset	Not implemented
E8	Resets the processor (and clears halt).	Not implemented

Three examples:

#1 To write to shader instruction ram

CB write E0 <value>
CB write E1 <addr>
CB write E2 <ignored value>

#2 To write to shader data ram

CB write E0 <value>
CB write E1 <addr>
CB write E4 <ignored value>

#3 To read shader data ram

CB write E1 <addr>
CB write E5 <ignored value>
CB read E0
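The three examples generalize to a pair of helpers, sketched here in Python. The function names and the tuple format are hypothetical; the addresses come from the table above, and the trigger writes carry 0 only because their value is ignored.

```python
def shader_write_ops(value, address, instruction_ram):
    """Command bus operations that store one dword in instruction or data RAM."""
    trigger = 0xE2 if instruction_ram else 0xE4
    return [
        ("write", 0xE0, value & 0xFFFFFFFF),  # scratch data register
        ("write", 0xE1, address & 0xFFFF),    # upper 16 bits are ignored
        ("write", trigger, 0),                # any value triggers the RAM write
    ]


def shader_read_ops(address, instruction_ram):
    """Command bus operations that fetch one dword from instruction or data RAM."""
    trigger = 0xE3 if instruction_ram else 0xE5
    return [
        ("write", 0xE1, address & 0xFFFF),
        ("write", trigger, 0),                # any value triggers the RAM read into E0
        ("read", 0xE0, None),                 # E0 now holds the fetched dword
    ]
```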

Interactive Simulator

Short shader programs may be executed in the simulator and debugged using breakpoints and memory and register viewers. See Illustration 2.

Illustration 2: TVC CPU interactive simulator/debugger screenshot

Non-Interactive Simulator

Illustration 3: Non-interactive X simulation output window

Much of the TVC may be simulated in a non-interactive simulator. Shown in Illustration 3 is a view of its X window. In this case the simulated shader has been loaded with a texture mapping program, and it is being driven by the same Driver_HL as actual hardware.

RTL Simulator

The shader core can also be simulated by GHDL and render into a simulated framebuffer. Illustration 4 is a sample of the RTL simulation output. This simulation reads commands from a text file to load the memories with input data and start the shader execution. The format of this RTL command text file is three 8-digit hex dwords per line, separated by whitespace.

Each 'command line' holds a command value, a Command Bus address, and a data value, in that order. The format is based heavily on the needs of the Command Bus.

The command values:

1 write <value> to <address>

2 ignore <value> read <address> and print result to std output (can diff a memory dump)

3 read <address> in a loop compare the result to <value>; don't continue until the read result = value

Illustration 4: RTL Simulation X Window Output

Command #3 is used to pause 'execution' of the RTL command text file while the shader core is busy executing a shader program. This is accomplished by the command line:

00000003 000000E6 00000001

Basically this reads Command Bus register E6 in a loop until the value equals 1 (meaning the processor has halted by hitting the HLT instruction). Note that the compiler inserts a HLT after the CALL of main().

The above is slightly complicated by the fact that the shader processor's program counter is copied into the upper 16 bits of the returned value. So to detect a processor halt state you must mask off the upper 16 bits and compare the lower 16 bits to the value 1. This PC hack lets you trace program execution while the simulation (and physical hardware) executes.
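The command file format and the masked halt test can be sketched in Python as follows. The function names are hypothetical. The field order (command, address, value) is inferred from the example line above, and upper-case hex digits are an arbitrary choice for output.

```python
def format_command(command, address, value):
    """Format one RTL command line as three 8-digit hex dwords."""
    return "%08X %08X %08X" % (command & 0xFFFFFFFF,
                               address & 0xFFFFFFFF,
                               value & 0xFFFFFFFF)


def parse_command(line):
    """Parse one RTL command line into (command, address, value)."""
    fields = line.split()
    if len(fields) != 3 or any(len(field) != 8 for field in fields):
        raise ValueError("expected three 8-digit hex dwords")
    command, address, value = (int(field, 16) for field in fields)
    return command, address, value


def is_halted(status):
    """Test an E6 status value for halt, ignoring the PC in the upper 16 bits."""
    return (status & 0xFFFF) == 1
```

With these helpers, format_command(3, 0xE6, 1) reproduces the wait-for-halt line shown above, and is_halted(0x01230001) is true even though the program counter field is non-zero.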

The version of GHDL used, 0.31 (20140108), takes about 24 hours to simulate 1 second of shader program execution. Waveform viewing was performed using GTKWave version 3.3.62. The combination of Emacs, GHDL and GTKWave formed a very pleasant development environment. Because GHDL does not presently implement mixed mode simulation, and the PSRAM memory model provided by Micron was implemented in Verilog, the full TVC was simulated using ISE's Isim in the final implementation stages.

The RTL output is rendered to the X Server by a dma_server program that listens to named pipes also opened by the GHDL compiled process and reads/writes dma values to a local memory buffer simulating a framebuffer. After a fixed number of DMA writes the dma_server updates the X display.

One additional tool created to debug the RTL was the log_compare tool. Both the VHDL and the instruction level simulator were modified to print out register contents when an instruction is retired and to print out the addresses and values of each data RAM read and write. The log_compare tool parses these traces in lockstep to find where the shader processor RTL simulation diverges from the software model of the CPU. This tool was critical in ferreting out small bugs in the RTL.
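The lockstep comparison idea can be sketched in a few lines of Python. This is not the log_compare tool itself: the trace format is not described in this document, so the sketch assumes both simulators print one textually identical line per event, and the function name is hypothetical.

```python
from itertools import zip_longest


def first_divergence(rtl_lines, model_lines):
    """Return (line_number, rtl_line, model_line) for the first mismatch, or None.

    A trace that ends early is reported with None in place of the missing line.
    """
    pairs = zip_longest(rtl_lines, model_lines)
    for number, (rtl, model) in enumerate(pairs, start=1):
        if rtl is None or model is None or rtl.strip() != model.strip():
            return number, rtl, model
    return None
```

Reporting only the first mismatch is a deliberate choice in this sketch: once the two models diverge, later differences are usually consequences of the first one.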

Implementation details

The TVC r5 was implemented on a Nexys2(tm) from Digilent Inc. This development board provides an 8 bit color output via a resistor network. The Nexys2(tm) provides an onboard 16 bit wide PSRAM from Micron that can operate at up to an 80 MHz clock.

The Nexys2(tm) provides a 50 MHz clock generator. This is scaled by DCMs to a 25 MHz pixel clock and an 80 MHz logic clock.

The USB interface provided a special challenge. Digilent provides a library and firmware for doing high speed data transfer through the Nexys2's USB port. I had little to no interest in building on top of their solutions. Fortunately previous developers have attacked this problem and I was able to leverage their work. The penultimate developer was Joe Rothweiler of sensicomm.com. He created a simple program for the 80C51 core within the CY7C68013A chip that overrides the original Digilent firmware. He also created a small VHDL library for reading and writing to the FIFOs configured by his 80C51 program. I initially had metastability issues with his VHDL code, and after carefully buffering clocks and signals I lost a lot of performance, so I re-created that part from scratch, using only two of the chip's FIFOs. Because his work is GPL3 you will have to download his 80C51 firmware separately from the TVC project.

Joe's work is documented here:

http://www.sensicomm.com/main/projects/fpga/digilent_nexys_usb.shtml

There is certainly room for performance improvements in the USB interface. However, after adding buffered writes to the driver and the ability to write 16 polygons at a time into the TVC's data RAM, the USB interface does not appear to be a performance limit at present.

Illustration 5: Digilent Nexys2(tm) Development Board

ISE was configured to optimize synthesis for timing performance, but without performing IOB packing. Relevant synthesis stats follow.

Feature	Number Used	Number Available	Percent Used
# of slice flip flops	2940	17344	16%
# of 4 input LUTs	5751	17344	33%
# of occupied slices	3944	8672	45%
# of RAMB16s	24	28	85%
# of DCMs	4	8	50%
# of MULT18X18SIOs	1	28	3%

Programming Toolchain

The assembler is a standard two pass design. The first pass computes addresses and the second pass emits code. The emitted 'binary' file has the extension .tbin and is an ASCII file with a single 32 bit hex encoded number per line. This generated file is intended to be loaded into the shader core's program memory starting at address 0. The assembler can optionally generate a .lst file, which is a text file showing the assembled output along with the computed addresses. If the -f flag is passed to the assembler, the floating point routine library is included in the output binary. This flag is required if floating point operations are performed in the assembled code.
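Since a .tbin file is plain text with one 32 bit hex number per line, reading and writing it takes only a few lines. The Python sketch below uses hypothetical function names and works on strings rather than files. Zero padding to 8 upper-case digits on output is an assumption; the exact formatting produced by the assembler is not specified here.

```python
def write_tbin(words):
    """Render a list of 32 bit words as .tbin text, one hex number per line."""
    # Assumption: 8 zero-padded upper-case hex digits per line.
    return "".join("%08X\n" % (word & 0xFFFFFFFF) for word in words)


def read_tbin(text):
    """Parse .tbin text into a list of words; index 0 is program address 0."""
    return [int(line, 16) for line in text.splitlines() if line.strip()]
```

The list index of each word returned by read_tbin is the program memory address it should be loaded to.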

The compiler is of standard design. It has a lexical analysis phase implemented in flex and a parser implemented in bison. The parse tree is next transformed into a straight line program, and finally assembly code is emitted. There are no optimizations implemented in this release. The GNU C preprocessor (cpp) is called to pre-process shader programs, providing #defines and conditional code inclusion. This allows compiling textured and non-textured shader programs from the same source file.

The compiled language allows this construct:

mem_block 65520
{
    uint dma0_sd_addr;
    uint dma0_fb_addr;
    uint dma1_sd_addr;
    uint dma1_fb_addr;
    uint dma2_sd_addr;
    uint dma2_fb_addr;
    uint dma3_sd_addr;
    uint dma3_fb_addr;
    uint dma_cmd;
};

This mem_block construct defines a set of global variables starting at a fixed address within the shader's data memory. This construct allows the TVC host to pre-load variables' values into a fixed and agreed upon location before program execution; or as in the above example, it allows defining special register addresses to shader programs.
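To make the address assignment concrete, the Python sketch below computes where each variable of the example block would live. It rests on two assumptions that this document does not state outright: that variables are laid out consecutively in declaration order, and that each 32 bit variable occupies exactly one data memory address. The function name is hypothetical.

```python
def mem_block_addresses(base, names):
    """Map each variable name to its assumed data memory address."""
    # Assumption: one address per variable, in declaration order.
    return {name: base + index for index, name in enumerate(names)}


DMA_REGISTER_NAMES = [
    "dma0_sd_addr", "dma0_fb_addr",
    "dma1_sd_addr", "dma1_fb_addr",
    "dma2_sd_addr", "dma2_fb_addr",
    "dma3_sd_addr", "dma3_fb_addr",
    "dma_cmd",
]

DMA_REGISTERS = mem_block_addresses(65520, DMA_REGISTER_NAMES)
```

Under those assumptions the block spans addresses 65520 (0xFFF0) through 65528 (0xFFF8), with dma_cmd at the last address.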

The compiler allows for the following primary data types: float: 32 bit floating point, uint: unsigned 32 bit integer, and sint: signed 32 bit integer.

To execute a shader program the host performs the following operations through the command bus:

Halts the shader core
Loads the instruction ram with an assembled tbin file
Loads any special variable values into the data ram memory
Clears the shader core halt status and resets the processor
Polls for completion of the shader program via the Halt bit going high
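The steps above can be sketched as an ordered list of command bus operations. This Python illustration uses the shader CB addresses from the Shader Processor section; the function name, the tuple format, and the "poll" pseudo-operation are hypothetical, and using E8 for step 4 is an assumption based on its description as resetting the processor and clearing halt.

```python
CB_SCRATCH, CB_ADDR = 0xE0, 0xE1
CB_WRITE_IRAM, CB_WRITE_DRAM = 0xE2, 0xE4
CB_HALT_STATUS, CB_RESET = 0xE6, 0xE8


def run_shader_ops(program_words, data_values):
    """Build the command bus operations that load and run one shader program.

    program_words: instruction words, loaded from instruction RAM address 0.
    data_values:   dict mapping data RAM address to the value to preload.
    """
    ops = [("write", CB_HALT_STATUS, 0)]                 # 1. halt the shader core
    for address, word in enumerate(program_words):       # 2. load instruction RAM
        ops += [("write", CB_SCRATCH, word),
                ("write", CB_ADDR, address),
                ("write", CB_WRITE_IRAM, 0)]
    for address, value in sorted(data_values.items()):   # 3. preload data RAM
        ops += [("write", CB_SCRATCH, value),
                ("write", CB_ADDR, address),
                ("write", CB_WRITE_DRAM, 0)]
    ops.append(("write", CB_RESET, 0))                   # 4. reset and clear halt
    # 5. poll E6 until (status & 0xFFFF) == 1; the upper 16 bits hold the PC
    ops.append(("poll", CB_HALT_STATUS, 1))
    return ops
```

Halting first matters in this ordering because the instruction RAM is rewritten while the core would otherwise still be fetching from it.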

Implementation Issues

In this release the polyscanline functional unit is largely nonfunctional. It will be fixed or replaced in future releases.

The compiler, alas, is full of bugs. It will require serious polish to be reliable. The assembler needs a more sophisticated mechanism to look up names / values.

Illustration 6: Photograph of Utah teapot with TVC rendering

Illustration 6 and Illustration 7 are the output of the TVC. The small black inclusions are likely due to bugs in the memory controller or the DMA unit. As these modules will require significant rework in future releases, these bugs were ignored.

Illustration 7: Photograph of texture mapping demo with TVC rendering

The RTL simulation shows significant black holes through the model. These are due to problems in the test bench memory controller tb_mc.vhd that forwards dma requests to and from an external program dma_server that renders the simulated frame buffer contents to the X display. This is an annoyance and will be fixed in future releases. Soon after the RTL began displaying pixels in the simulated framebuffer window, development moved to ISE's Isim and physical hardware, and as a result this simulation framework was neglected.