Designing a CPU in VHDL, Part 15: Introducing RPU

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and Iā€™d recommend they are read before continuing.

It’s been a while. Despite the length of time and lack of posts, rest assured a significant amount of progress has been made on my VHDL CPU over the last year. I’ve hinted at that fact multiple times on twitter, as various project milestones have been hit. But what’s changed?

First and foremost; the CPU now consumes RISC-V. It’s decoder, ALU and datapaths have been updated. With that, the data width is now 32 bit. The Decoder accepts the RV32I base integer instruction set.

I’d been putting off multiple side-projects with my existing TPU implementation for a while. It’s 16-bit nature really made integrating the 32MB SDRAM into the system a rather pointless affair. I’m all for reminiscing but I did not want to go down the rabbit hole of memory bank switching and the baggage that would entail. The toolchain for creating software was already at the limit – not the limit in terms of what could be done – but the limit in what I’m prepared to do in order to perform basic tasks. We all love bare metal assembly, but for the sake of my own free time, I wanted to just drop some C in, and I was not going to make my own compiler. I looked into creating a backend for LLVM, but it’s really just another distraction.

As a reminder, here is where we left off.

Block diagram of TPU CoreSo, what’s actually changed from the old 16-bit TPU?

  • Many more registers,
  • Datapath widened to 32-bit
  • Decoder and ALU updates for RV32I ISA
  • Glue logic/datapath updated for new functions

The register file is basically the exact same as before with the 16-bit TPU, extended for 32 entries of 32 bits names x0-x31. In RV32I, x0 is hardwired to 0x00000000. I’ve left it a real register entry, but just never allow a write to x0 to progress, keeping the entry always zero. The decoder and ALU is fairly standard really – there are a small number of instruction formats, and where immediates are concerned there is some bit swapping here and there, but the sign bits are always in the same place which can make things a bit easier in terms of making the decode logic easier to understand.

The last item about datapath changes follows on from how RISC-V branch instructions operate. TPU was always incredibly simple, which meant branching was quite an involved process when it came to function calling and attempting to get a standardized calling convention. There was no call instruction, and the amount of operations required for saving a calculated return from function address, setting up call arguments on the stack and eventually calling via an indirect register was rather irritating. With the limited amount of registers available on TPU, simplification via register parameter passing didn’t solve much as you would be required to save/restore register contents to the stack regardless.

RISC-V JAL instruction definition screenshot from ISARISC-V has several branch instructions with significant changes to dataflow from that of TPU. In addition to calculating the branch target – which is all TPU was able to do, RISC-V calculates a return address at PC+4, which is then written to a register. This means our new ALU needs two outputs, the standard result of a calculation, and the branch target. The branch target with the shouldBranch output status feeds into our existing PC unit, with the result feeding as normal the register file/memory address logic. The new connections are shown below in yellow.

RPU block diagramIn terms of old TPU systems, the old interrupt system needs significant updating and so is disabled. It’s still got block ram storage, UART, and relies on the same underlying memory subsystem, which in all honesty is super bloated and pretty poor. The memory system is currently my main focus – it’s still made up from a single large process at the CPU core clock. It reads like an imperative language function, which is not at all suitable for the CPU moving forward. The CPU needs to interface with various different components, from the block rams, UART, to SPI and SDRAM controllers. These can be running at different clocks and the signals all need to remain valid across these systems. With everything being accessed as memory-mapped IO, the memory system is super important, and I’ve already run into several gotchas when extending it with an SDRAM controller. More information on that later.


As I mentioned earlier, one of the main reasons for moving to RISC-V was toolchain considerations. I have been developing the software for my ‘soc’ with GCC.

With Windows 10, you can now use the Linux Subsystem for Windows to build linux tools within Windows. This makes compiling the RISC-V toolchain for your particular ISA variant a super simple process. has some details of how to build relevant toolchains, and Microsoft have details of how to install the Linux Subsystem.

Data Storage

With the RISC-V GCC toolchain built, targeting RV32I, I was able to write a decent amount of code in order to create a bootloader which existed inside the FPGA block ram. This bootloader looked for an SD card in the miniSpartan6+ SD card slot. It initialized the SD card into slow SPI transfer mode so we could get at it’s contents.

Actually getting to data stored on an SD card is incredibly simple, and has been written up nicely here. In terms of how the CPU accessed the SD card, I found a SPI master VHDL module and threw it into my project, and memory mapped it so you could use it from the CPU.

#define SPI_M1_CONFIG_REG_ADDR 0x00009300
#define SPI_M1_BUSY_REG_ADDR   0x00009304
#define SPI_M1_DATA_REG_ADDR   0x00009308

void spi_sd_reset()
	volatile uint32_t* spi_config_reg = (uint32_t*)SPI_M1_CONFIG_REG_ADDR;

	uint32_t current_reg = *spi_config_reg;
	uint32_t reset_bit = 1U << SPI_CONFIG_REG_BIT_RESET;
	uint32_t reset_bitmask = ~reset_bit;

	// Multiple writes gives the controller time to reset - it's clock
	// can be slower than this CPU
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;

	// return to original, but ensure the reset bit is set (active low), 
	// in case the register was previously clobbered by some other operation
	*spi_config_reg = current_reg | reset_bit;

uint8_t spi_sd_xchg(uint8_t dat)
	volatile uint32_t* spi_busy_reg = (uint32_t*)SPI_M1_BUSY_REG_ADDR;
	volatile uint32_t* spi_data_reg = (uint32_t*)SPI_M1_DATA_REG_ADDR;

	*spi_data_reg = (uint32_t)dat;

	while (*spi_busy_reg != 0);

	return *spi_data_reg;

A few helper functions later, the bootloader did a very simple FAT32 root search for a file called BOOT, and copied it into ram before jumping to it. It also copied a file called BIOS to memory location 0x00000000 – which had a table of I/O functions so I could fix/extend functionality without needing to recompile my “user” code.

typedef struct {
    FN_sys_console_put_stringn    sys_console_put_stringn;
    FN_sys_console_put_string     sys_console_put_string;
    FN_sys_console_setcursor      sys_console_setcursor;
    FN_sys_console_setpen         sys_console_setpen;
    FN_sys_console_clear          sys_console_clear;
    FN_sys_console_getcursor_col  sys_console_getcursor_col;
    FN_sys_console_getcursor_row  sys_console_getcursor_row;
    FN_sys_console_getpen         sys_console_getpen;
    FN_sys_clk_get_cyclecount     sys_clk_get_cyclecount;
    FN_spi_sd_set_clkdivider      spi_sd_set_clkdivider;
    FN_spi_sd_set_deviceaddr      spi_sd_set_deviceaddr;
    FN_spi_sd_xchg                spi_sd_xchg;
    FN_spi_sd_reset               spi_sd_reset;
    FN_uart_tx1_send              uart_tx1_send;
    FN_uart_rx1_recv              uart_rx1_recv;
    FN_uart_trx1_set_baud_divisor uart_trx1_set_baud_divisor;
    FN_uart_trx1_get_baud_divisor uart_trx1_get_baud_divisor;
// Above functions are basic bios and provided by BIOS.c/BIN
// Below are extensions, so null checks must be performed to check 
// initialization state
    FN_pf_open    pf_open;
    FN_pf_read    pf_read;
    FN_pf_opendir pf_opendir;
    FN_pf_readdir pf_readdir;

With the FPGA set to look for this BOOT file off an SD card, testing various items became a whole lot easier. Not having to change the block ram and rebuild the FPGA configuration to run a set of tests saved lots of time, and I had a decent little setup going, however I was severely limited by RAM – many block rams were now in use, and yet I still did not have enough RAM. I could have optimized for space here and there, but when I have a 32MB SDRam chip on the FPGA dev board it seemed rather pointless. It was finally time to get the SDRAM integrated.


Getting an SDRAM interface controller for the chip on the miniSpartan6+ was very easy. Mike Field had done this already and made his code available. The problem was that whatever I tried, the data out from the SDRAM was garbage.

Or, should I say, out by one. Everything to do with computers has an out by one at some point.

Anyway, data was always arriving delayed by one request. If I asked to read data at 0x0000100, I’d get 0. If I asked for data at 0x0000104, I’d get the data at 0x00000100. I could tell that writing seemed to be fine – but reading was broken. Thankfully discovering this didn’t take too much time, due to being able to boot code off the SD card. I wrote little memory checking binaries and executed them via a UART command line, the debug data being displayed on the HDMI out. It was pretty cool, despite me getting wound up at how broken reading from the SDRAM was.

memory testIt took a long time to track down the actual cause of this read corruption. For the most part, I thought it was timing based in the SDRAM controller. I edited the whole memory system multiple times, with no changes to the behavior. This should have been a red flag for me, as the issue was being caused by the constant through all of these tests – the RPU core!

In RPU, the signals that comprise the input and output of each pipeline stage get passed around and moved from unit to unit each core clock. This is where my issue was – not the “external” memory controller!

Despite my CPU being okay with waiting for the results of memory reads from a slower device, it was not set up for correctly forwarding that data on when the device eventually said the request had completed. A fairly basic error on my part, made harder to track down by the fact the old TPU CPU design and thus RPU had memory requests going through the ALU, then control unit, and some secondary smaller memory controller which didn’t really do anything apart from add further latency to the request.

I fixed this, which immediately made all the SDRAM memory space available. Hurrah.

It’s rather annoying this issue did not present itself sooner. There are already many “slow” memory mapped devices connected to the CPU core, but they generally work off of FIFOs – so despite them being slow, they are accessed via queries to lines to say whether there is data in the FIFO, and what the data is. These reads are actually pretty fast, the latency is eaten elsewhere – like spinning on a memory read. Basic SDRAM read latency however for this was in the 20 cycle range at 100MHz, way more than previously tested.

Having the additional memory was great. Now that we have a full GCC toolchain, I could grab the FatFs library, and have a fully usable filesystem mounted from the SD card. Listing directories and printing the contents of files seemed like such a milestone.

Next steps

So that’s where we are at. However, things be changing. I have switched over to Xilinx Spartan 7, using a Digilent Arty S7 board (the larger XC7S50 variant). I’m currently porting my work over to it.

If you saw my last blog post, you’d have seen that HDMI out at least works, and now I’m working on porting the CPU over, using block rams and SD card for memory/storage. Lastly will be a DDR3 memory controller. 256MB of 650MHz memory gives much more scope for messing up my timing even more. It should be great fun!

I hope to document progress more frequently than over the past year. These articles are my projects documentation, and provides an outlet to confirm I actually understand my solutions, so the lack of posts has made things all the more disjoint. The next post shall be soon!

Thanks for reading. Let me know of any queries on twitter @domipheus.

HDMI over Pmod using the Arty Spartan 7 FPGA board

tl;dr: This post shows that driving DVI-D over an HDMI cable, directly connected to the High Speed Pmod connector of Digilents Arty S7 board, is very much possible- even at high resolution.

I’ve been working away on my RISC-V FPGA based computer ‘kit’, which is based on my VHDL CPU: ported to RISC-V. I wanted to get a new development board with faster ram, and found it hard to find boards with DDR3 memory, a large enough FPGA, SD card interface, and HDMI out.

The SD card was not really a problem – it’s low speed, you can just connect it with slow SPI I/O. HDMI is certainly considered high speed – bandwidth across the 4 serial channels top 4.4Gbps – but is driving HDMI though basic I/O interfaces possible?

It most certainly is!

A caveat, though: There is no protection circuitry. Do this at your own risk šŸ™‚

The Digilent Arty S7 board can be had for sub Ā£100, and my XC7S50 variant cost Ā£119 shipped. It has DDR3, but no on-board HDMI. However, it turns out you can get perfectly acceptable (for my needs, anyway) output from the high-speed Pmod I/O ports. This will go very nicely with the small USB powered HDMI screens you can find, which are very handy.Ā Image showing development board conencted via breadboard wires to an HDMI connector breakout board, which is driving a small HDMI display. I have put a 720p output test on github. It should load in Xilinx Vivado out of the box. You need to connect an HDMI cable to the Pmod, by either splicing a cable, or getting something like the Adafruit breakout board (seen above). It connects to Pmod JA, circled.

Image showing what I/O port to use.In the constraints file for the project, the Pmod pins are specified to use the TMDS_33 standard. The pinout is defined as follows:

A diagram showing what Pmod pins are for what HDMI purpose.The VHDL code in the example uses a simplified 1280x720x60 pixel clock of 75MHz – not the required 74.25MHz as per the standard. Due to this, some TVs/monitors may not accept the signal. It runs fine on my Acer XF240H and Samsung LU28E590DS using a 1m cable. I have not tried a television – they tend to be more picky. You can get much closer to 74.25MHz by chaining clock generators, but I have not done that in this instance. The refresh rate reports as 61Hz with this clock, likely due to it being out of spec.

If you want to learn more about how my example project generates a HDMI/DVI-D signal, you can find details here, which is part of my Designing a CPU in VHDL blog series.

Picture showing a monitor displaying a 720p signal.A video of it in action, using timings applicable for my 800×480 module.

You can change the pixel clock and the vga timings (sync starts, ends, porches ) in the code to generate different resolutions. For 1080p60Hz the pixel rate is supposed to be 148.5MHz, but again my monitor will accept a rate of 150MHz, and show 61Hz.

Even 1440p30 is possible. 1440p60 did not work, but I didn’t try hard to get a more accurate pixel clock in that instance. When you get to higher clocks, you can start to get timing constraint issues. At 1080p and 1440p I had some timing failures listed in the implementation report, but they did run. If you were using this in a real system, you’d have to fix those timing issues. That’s out of the scope of this blog, though šŸ™‚

So there you have it. You really can bodge HDMI/DVI-D output directly through a Pmod!

Thanks for reading. Let me know what you think on twitter @domipheus.