Designing a CPU in VHDL, Part 15: Introducing RPU

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and Iā€™d recommend they are read before continuing.

It’s been a while. Despite the length of time and lack of posts, rest assured a significant amount of progress has been made on my VHDL CPU over the last year. I’ve hinted at that fact multiple times on twitter, as various project milestones have been hit. But what’s changed?

First and foremost; the CPU now consumes RISC-V. It’s decoder, ALU and datapaths have been updated. With that, the data width is now 32 bit. The Decoder accepts the RV32I base integer instruction set.

I’d been putting off multiple side-projects with my existing TPU implementation for a while. It’s 16-bit nature really made integrating the 32MB SDRAM into the system a rather pointless affair. I’m all for reminiscing but I did not want to go down the rabbit hole of memory bank switching and the baggage that would entail. The toolchain for creating software was already at the limit – not the limit in terms of what could be done – but the limit in what I’m prepared to do in order to perform basic tasks. We all love bare metal assembly, but for the sake of my own free time, I wanted to just drop some C in, and I was not going to make my own compiler. I looked into creating a backend for LLVM, but it’s really just another distraction.

As a reminder, here is where we left off.

Block diagram of TPU CoreSo, what’s actually changed from the old 16-bit TPU?

  • Many more registers,
  • Datapath widened to 32-bit
  • Decoder and ALU updates for RV32I ISA
  • Glue logic/datapath updated for new functions

The register file is basically the exact same as before with the 16-bit TPU, extended for 32 entries of 32 bits names x0-x31. In RV32I, x0 is hardwired to 0x00000000. I’ve left it a real register entry, but just never allow a write to x0 to progress, keeping the entry always zero. The decoder and ALU is fairly standard really – there are a small number of instruction formats, and where immediates are concerned there is some bit swapping here and there, but the sign bits are always in the same place which can make things a bit easier in terms of making the decode logic easier to understand.

The last item about datapath changes follows on from how RISC-V branch instructions operate. TPU was always incredibly simple, which meant branching was quite an involved process when it came to function calling and attempting to get a standardized calling convention. There was no call instruction, and the amount of operations required for saving a calculated return from function address, setting up call arguments on the stack and eventually calling via an indirect register was rather irritating. With the limited amount of registers available on TPU, simplification via register parameter passing didn’t solve much as you would be required to save/restore register contents to the stack regardless.

RISC-V JAL instruction definition screenshot from ISARISC-V has several branch instructions with significant changes to dataflow from that of TPU. In addition to calculating the branch target – which is all TPU was able to do, RISC-V calculates a return address at PC+4, which is then written to a register. This means our new ALU needs two outputs, the standard result of a calculation, and the branch target. The branch target with the shouldBranch output status feeds into our existing PC unit, with the result feeding as normal the register file/memory address logic. The new connections are shown below in yellow.

RPU block diagramIn terms of old TPU systems, the old interrupt system needs significant updating and so is disabled. It’s still got block ram storage, UART, and relies on the same underlying memory subsystem, which in all honesty is super bloated and pretty poor. The memory system is currently my main focus – it’s still made up from a single large process at the CPU core clock. It reads like an imperative language function, which is not at all suitable for the CPU moving forward. The CPU needs to interface with various different components, from the block rams, UART, to SPI and SDRAM controllers. These can be running at different clocks and the signals all need to remain valid across these systems. With everything being accessed as memory-mapped IO, the memory system is super important, and I’ve already run into several gotchas when extending it with an SDRAM controller. More information on that later.

Toolchain

As I mentioned earlier, one of the main reasons for moving to RISC-V was toolchain considerations. I have been developing the software for my ‘soc’ with GCC.

With Windows 10, you can now use the Linux Subsystem for Windows to build linux tools within Windows. This makes compiling the RISC-V toolchain for your particular ISA variant a super simple process.

https://riscv.org/software-tools/ has some details of how to build relevant toolchains, and Microsoft have details of how to install the Linux Subsystem.

Data Storage

With the RISC-V GCC toolchain built, targeting RV32I, I was able to write a decent amount of code in order to create a bootloader which existed inside the FPGA block ram. This bootloader looked for an SD card in the miniSpartan6+ SD card slot. It initialized the SD card into slow SPI transfer mode so we could get at it’s contents.

Actually getting to data stored on an SD card is incredibly simple, and has been written up nicely here. In terms of how the CPU accessed the SD card, I found a SPI master VHDL module and threw it into my project, and memory mapped it so you could use it from the CPU.

#define SPI_M1_CONFIG_REG_ADDR 0x00009300
#define SPI_M1_BUSY_REG_ADDR   0x00009304
#define SPI_M1_DATA_REG_ADDR   0x00009308

void spi_sd_reset()
{
	volatile uint32_t* spi_config_reg = (uint32_t*)SPI_M1_CONFIG_REG_ADDR;

	uint32_t current_reg = *spi_config_reg;
	uint32_t reset_bit = 1U << SPI_CONFIG_REG_BIT_RESET;
	uint32_t reset_bitmask = ~reset_bit;

	// Multiple writes gives the controller time to reset - it's clock
	// can be slower than this CPU
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;

	// return to original, but ensure the reset bit is set (active low), 
	// in case the register was previously clobbered by some other operation
	*spi_config_reg = current_reg | reset_bit;
}

uint8_t spi_sd_xchg(uint8_t dat)
{
	volatile uint32_t* spi_busy_reg = (uint32_t*)SPI_M1_BUSY_REG_ADDR;
	volatile uint32_t* spi_data_reg = (uint32_t*)SPI_M1_DATA_REG_ADDR;

	*spi_data_reg = (uint32_t)dat;

	while (*spi_busy_reg != 0);

	return *spi_data_reg;
}

A few helper functions later, the bootloader did a very simple FAT32 root search for a file called BOOT, and copied it into ram before jumping to it. It also copied a file called BIOS to memory location 0x00000000 – which had a table of I/O functions so I could fix/extend functionality without needing to recompile my “user” code.

typedef struct {
    FN_sys_console_put_stringn    sys_console_put_stringn;
    FN_sys_console_put_string     sys_console_put_string;
    FN_sys_console_setcursor      sys_console_setcursor;
    FN_sys_console_setpen         sys_console_setpen;
    FN_sys_console_clear          sys_console_clear;
    FN_sys_console_getcursor_col  sys_console_getcursor_col;
    FN_sys_console_getcursor_row  sys_console_getcursor_row;
    FN_sys_console_getpen         sys_console_getpen;
    FN_sys_clk_get_cyclecount     sys_clk_get_cyclecount;
    
    FN_spi_sd_set_clkdivider      spi_sd_set_clkdivider;
    FN_spi_sd_set_deviceaddr      spi_sd_set_deviceaddr;
    FN_spi_sd_xchg                spi_sd_xchg;
    FN_spi_sd_reset               spi_sd_reset;
    
    FN_uart_tx1_send              uart_tx1_send;
    FN_uart_rx1_recv              uart_rx1_recv;
    FN_uart_trx1_set_baud_divisor uart_trx1_set_baud_divisor;
    FN_uart_trx1_get_baud_divisor uart_trx1_get_baud_divisor;
    
//...
    
// Above functions are basic bios and provided by BIOS.c/BIN
// Below are extensions, so null checks must be performed to check 
// initialization state
    FN_pf_open    pf_open;
    FN_pf_read    pf_read;
    FN_pf_opendir pf_opendir;
    FN_pf_readdir pf_readdir;
} SYSCALL_TABLE;

With the FPGA set to look for this BOOT file off an SD card, testing various items became a whole lot easier. Not having to change the block ram and rebuild the FPGA configuration to run a set of tests saved lots of time, and I had a decent little setup going, however I was severely limited by RAM – many block rams were now in use, and yet I still did not have enough RAM. I could have optimized for space here and there, but when I have a 32MB SDRam chip on the FPGA dev board it seemed rather pointless. It was finally time to get the SDRAM integrated.

SDRAM

Getting an SDRAM interface controller for the chip on the miniSpartan6+ was very easy. Mike Field had done this already and made his code available. The problem was that whatever I tried, the data out from the SDRAM was garbage.

Or, should I say, out by one. Everything to do with computers has an out by one at some point.

Anyway, data was always arriving delayed by one request. If I asked to read data at 0x0000100, I’d get 0. If I asked for data at 0x0000104, I’d get the data at 0x00000100. I could tell that writing seemed to be fine – but reading was broken. Thankfully discovering this didn’t take too much time, due to being able to boot code off the SD card. I wrote little memory checking binaries and executed them via a UART command line, the debug data being displayed on the HDMI out. It was pretty cool, despite me getting wound up at how broken reading from the SDRAM was.

memory testIt took a long time to track down the actual cause of this read corruption. For the most part, I thought it was timing based in the SDRAM controller. I edited the whole memory system multiple times, with no changes to the behavior. This should have been a red flag for me, as the issue was being caused by the constant through all of these tests – the RPU core!

In RPU, the signals that comprise the input and output of each pipeline stage get passed around and moved from unit to unit each core clock. This is where my issue was – not the “external” memory controller!

Despite my CPU being okay with waiting for the results of memory reads from a slower device, it was not set up for correctly forwarding that data on when the device eventually said the request had completed. A fairly basic error on my part, made harder to track down by the fact the old TPU CPU design and thus RPU had memory requests going through the ALU, then control unit, and some secondary smaller memory controller which didn’t really do anything apart from add further latency to the request.

I fixed this, which immediately made all the SDRAM memory space available. Hurrah.

It’s rather annoying this issue did not present itself sooner. There are already many “slow” memory mapped devices connected to the CPU core, but they generally work off of FIFOs – so despite them being slow, they are accessed via queries to lines to say whether there is data in the FIFO, and what the data is. These reads are actually pretty fast, the latency is eaten elsewhere – like spinning on a memory read. Basic SDRAM read latency however for this was in the 20 cycle range at 100MHz, way more than previously tested.

Having the additional memory was great. Now that we have a full GCC toolchain, I could grab the FatFs library, and have a fully usable filesystem mounted from the SD card. Listing directories and printing the contents of files seemed like such a milestone.

Next steps

So that’s where we are at. However, things be changing. I have switched over to Xilinx Spartan 7, using a Digilent Arty S7 board (the larger XC7S50 variant). I’m currently porting my work over to it.

If you saw my last blog post, you’d have seen that HDMI out at least works, and now I’m working on porting the CPU over, using block rams and SD card for memory/storage. Lastly will be a DDR3 memory controller. 256MB of 650MHz memory gives much more scope for messing up my timing even more. It should be great fun!

I hope to document progress more frequently than over the past year. These articles are my projects documentation, and provides an outlet to confirm I actually understand my solutions, so the lack of posts has made things all the more disjoint. The next post shall be soon!

Thanks for reading. Let me know of any queries on twitter @domipheus.