Designing a RISC-V CPU in VHDL, Part 18: Control and Status Register Unit

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

You may remember after the switch from my own TPU ISA to RISC-V was made, I stated that interrupts were disabled. This was due to requirements for RISC-V style interrupt mechanisms not being compatible with what I’d done on TPU. In order to get Interrupts back into RPU, we need to go and implement the correct Control and Status Registers(CSRs) required for management of interrupts – enable bits, interrupt vectors, interrupt causes – they are all communicated via CSRs. As I want my core to execute 3rd-party code, this needs to be to spec!

In the time it’s taken me to start writing this article after doing the implementation, the latest draft RISC-V spec has moved the CSR instructions and definitions into it’s own Zicsr ISA extension. The extension defines 6 additional instructions for manipulating and querying the contents of these special registers, registers which define things such as privilege levels, interrupt causes, and timers. Timers are a bit of an oddity, which we’ll get back to later.

The CSRs within a RISC-V hardware thread (known as a hart) can influence the execution pipeline at many stages. If interrupts are enabled via enable bits contained in CSRs, additional checks are required after instruction writeback/retire in order to follow any pending interrupt or trap. One very important fact to take away from this is that the current values of particular CSRs are required to be known in many locations at multiple points in the instruction pipeline. This is a very real issue, complicated by differing privilege levels having their own local versions of these registers, so throughout designing the implementation of the RPU CSR unit, I wanted a very simple solution which worked well on my unpipelined core – not necessarily the best or most efficient design. This unit has been rather difficult to create, and the solution may be far from ideal. Of all the units I’ve designed for RPU, this is the one which I am continually questioning and looking for redesign possibilities.

The CSRs are accessed via a flat address space of 4096 entries. On our RV32I machine, the entries will be 32 bits wide. Great; lets use a block ram I hear you scream. Well, it’s not that simple. For RPU, the amount of CSRs we require is significantly less than the 4096 entries – and implementing all of them is not required. Whilst I’m sure some people will want 16KB of super fast storage for some evil code optimizations, we are very far from looking at optimizing RPU for speed – so, the initial implementation will look as follows:

  • The CSR unit will take operations as input, with data and CSR address.
  • A signal exists internally for each CSR we support.
  • The CSR unit will output current values of internal CSRs as required for CPU execution at the current privilege level.
  • The CSR unit will be multi-cycle in nature.
  • The CSR unit can raise exceptions for illegal access.

So, our unit will have signals for use when executing user CSR instructions, signals for notifying of events which auto-update registers, output signals for various CSR values needed mid-pipeline for normal CPU operation, and finally – some signals for raising exceptions – which will be needed later – when we check for situations like write operations to read-only CSR addresses.

The operations and CSR address will be provided by the decode stage of the pipeline, and take the 6 different CSR manipulation instructions and encode that into specific CSR unit control operations.

You can see from the spec that the CSR operations are listed as atomic. This does not really effect RPU with the current in-order pipeline, but may do in the future. The CSRRS and CSRRC (Atomic Read and [Set/Clear] Bits in CSR) instructions are read-modify-write, which instantly makes our CSR unit capable of multi-cycle operations. This will be the first pipeline stage that can take more than one cycle to execute, so is a bit of a milestone in complexity. Whilst there are only six instructions, specific values of rd and r1 can expand them to both read and write in the same operation.

All of the CSR instructions fall under the wide SYSTEM instruction type, so we add another set of checks in our decoder, and output translated CSR unit opcodes. The CSR Opcodes are generally just the funct3 instruction bits, with additional read/write flags. There are no real changes to the ALU – our CSR unit takes over any work for these operations.

The RISC-V privileged specification is where the real CSR listings exist, and the descriptions of their use. A few CSRs hold basic, read-only machine information such as vendor, architecture, implementation and hardware thread (hard) IDs. These CSRs are generally accessed only via the csr instructions, however some exist that need accessed at points in the pipeline. The Machine Status Register (mstatus) contains privilege mode and interrupt enable flags which the control unit of the CPU needs constant access to in order to make progress through the instruction pipe.

The CSR Unit

With the input and outputs requirements for the unit pretty clear, we can get started defining the basis for how our unit will be implemented. For RPU, we’ll have an internal signal for each relevant CSR, or manually hard-wire to values as required. Individual process blocks will be used for the various tasks required:

  • A “main” process block which handles any requested CSR operation.
  • A series of “update” processes which write automatically-updated CSRs, such as the instruction retired and cycle counters.
  • A “protection” process which will flag unauthorized access to CSR addresses.

At this point, we have not discussed unauthorized access to CSRs. The interrupt mechanism is not in place to handle them at this point in the blog, however, we will still check for them. Handily, the CSR address actually encodes read/write/privilege access requirements, defined by the following table:

So detecting and raising access exception signals when the operation is at odds with the current privilege level or address type is very simple. If it’s a write operation, and the address bits [11:10] == ’11’ then it’s an invalid access. Same with the privilege checks against the address bits in [9:8].

The “update” processes are, again, fairly simple.

The main process is defined as the following state machine.

As you can see, reads are quicker as they do not fully go into the modify/write phase. A full read-modify-write CSR operation will take 3 cycles. The result of the CSR read will be available 1 cycle later – the same latency as our current ALU operations. As the current latency of our pipeline means it’s over 2 cycles before the next CSR read could possibly occur, so we can have our unit work concurrently and not take up any more than the normal single cycle of our pipeline.

The diagram above shows the tightest possible pipeline timing for a CSR ReadModifyWrite, followed by a CSR read – and in reality the FETCH stage takes multiple cycles due to our memory latency, so currently it’s very safe to have the CSR unit run operations alongside the pipeline. As a quick aside, interrupts from here would only be raised on the first cycle when the CSR Operation was first seen, so they would be handled at the correct stages and not leak into subsequent instructions due to the unit running past the current set of pipeline stages.

A run in our simulation shows the expected behavior, with the write to CSR mtvec present after the 3 cycle write latency.

And the set bit operations also work:

Branching towards confusion

An issue arose after implementing the CSR unit into the existing pipeline, one which caused erroneous branching and forward progress when a CSR instruction immediately followed a branch. It took a while to track this down, although in hindsight with a better simulator testbench program it would have been obvious – which I will show:

Above is a shot of the simulator waveform, centered on a section of code with a branch instruction targeted at a CSR instruction. Instead of the instruction following the cycle CSR read being the cycleh read, there is a second execution of cycle CSR read, and then a branch to 0x00000038. The last signal in the waveform – shouldBranch – is the key here. The shouldBranch signal is controlled by the ALU, and in the first implementation of the CSR unit, the ALU was disabled completely if an CSR instruction was found by the decoder. This meant the shouldBranch signal was not reset after the previous branch (0x0480006f – j 7c) executed. What really needs to happen, is the ALU remain active, but understand that it has no side effects when a CSR operation is in place. Doing this change, results in the following – correct execution of the instruction sequence.


I’ve not really explained all of the CSR issues present in RISC-V during this part of the series. There is a whole section in the spec on Field Specifications, whereby bits within CSRs can behave as though there are no external side effects, however, bits can be hard wired to 0 on read. This is so that field edits which are currently reserved in an implementation will be able to run in future without unexpected consequences. Thankfully, while I did not go into the specifics here, my implementation allows for these operations to be implemented to spec. It’s just a matter of time.

This turned into quite a long one to write, on what I thought would be a fairly simple addition to the CPU. It was anything but simple! You can see the current VHDL code for the unit over on github.

Exceptions/Interrupts are coming next time, now that the relevant status registers are implemented. We’ll also touch on those pesky timer ‘CSRs’.

Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.

Designing a RISC-V CPU in VHDL, Part 17: DDR3 Memory Controller, Clock domain crossing

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

In the last part we got to the point where RISC-V code, built with GCC, could run and display text over HDMI and blink LEDs. However, this could only run from the 192KB of Block RAM we initialized within the Spartan7 FPGA on our Digilent Arty S7-50 board. Whilst 192KB is a nice amount of on-FPGA fast storage, we have a 256Mbyte DDR3 chip sitting next to the FPGA which is crying out for use. This post follows on from part 16 and integrates a DDR3 memory controller provided by Xilinx, and using an SD card Pmod adapter, load code from the SD card into that memory for use.

The DDR3 memory chip on the ArtyS7 seems to vary from board to board, but the timing specifications should be compatible for all. For clarity, the chip on my development board is a PMF511816EBR-KADN.

It connects with address and data signals direct to the FPGA. There are many other control signals, but we are not going to get into how DDR3 memory works in this post, as Xilinx Vivado comes with a wizard for generating memory interfaces. The Memory Interface Generator (MIG) seems to be an older utility and I had a few problems with it. However, in the end, I did get a working interface generated.

Memory Interface Generator

Digilents documentation for the ArtyS7 board includes various files for importing into MIG to assist with generation of the memory interface. In MIG itself, you will not see an option for importing settings explicitly – you need to select that you are verifying an existing design. Before the following section walking through the wizard, ensure your Vivado project is located in a local drive – I generally worked on a mapped network drive, and the build will fail to implement the controller at synthesis time if the project is on a network drive. Another item to note – you must go through the IP Catalog. When I tried to I initiate the MIG through the Block Design view that I used for generating BRAM in the previous article, VHDL solutions could not be generated.

We go through the wizard, using the Digilent files to configure the memory controller.Generally it is a case of hitting next though each screen, except from a few pages at the start where you need to input the file paths, and validate the pins – but the validation should succeed with a few warnings.

The next screen was quite something, with the wizard saying the browse buttons for file paths were not supported on Windows 10.

Validating the pins above will only generate a few warnings. Click next through the rest of the wizard and generate the design.

When the implementation of the memory controller has completed, which took minutes on a Ryzen 2700X+SSD, we can see it in the IP pane. At this point, you can open an example project by right clicking the IP in the hierarchy.

The example project can run in the simulator very slowly (initialization took many seconds in the waveform before real simulation begins). You can see the VHDL component definition to use for the controller, and I copied that into the RPU project.

The generated design uses the User Interface(UI) implementation to the memory controller, which preserves things like request ordering. There is a Xilinx document UG586 which details the signals used in this interface, including example waveforms for various read and write request states.

The UI used to utilize the memory controller is good for the use cases that I want: request ordering is preserved, burst reads/writes, writes with byte enable. I originally wanted to use the Native interface – which you can read about in document UG586 if you want to know more – but what I wanted was very basic. The native interface is significantly more involved to use as things like out of order request completion can occur.

In addition to the main data and control busses you’d expect, there are a few other signals which need connected in VHDL for the memory controller to operate. On ArtyS7, we need to feed it both 100 and 200MHz clocks. Additionally, we need to ensure the controller is not used after a reset until the initialization complete signal is asserted from the memory controller internals. One other major item of importance – the controller must be driven off of a ¼ memory interface clock, which for us is 81.25MHz. This is a good bit down on our current 200MHz CPU clock, but for now I just ran everything at this clock frequency, which is provided by the controller itself as an output.

The documentation from Xilinx shows how read and write transactions operate. The interface to the controller supports multiple back to back reads and writes for highest performance, but we only want basic single requests.

Writes follow this pattern:

Reads follow this:

We can implement these easily in our VHDL using the already existing MEM_proc memory request process. We will just add some new states to implement the various stages of a new state machine, and we should be good. The writes can be implemented in one of three ways, as explained in the above waveform diagrams. We are using method 1, asserting the *wdf signals for the write command data at the same time as the initiation of the commands using app_cmd.

As briefly mentioned earlier, we need to handle resetting of the memory controller. We activate the reset signals for the memory controller and CPU core for the same length of time. This is implemented with a decreasing 100MHz counter, initialized to some value in the thousands, with the resets being deactivated at 0. The DDR3 physical interface always initiates a calibration sequence after reset. When this operation is complete, a controller output signal is asserted which we should check for before sending memory requests.

Whilst making the requests and dealing with those states was straightforward, addressing and the resultant data output ordering was far from easy. I knew from MIG documentation that this controller would address 16 bits, so we just need to lop off the least significant bit from our memory request before passing it to the memory controller UI. However, after writing a small test program – which just wrote a known, incrementing pattern to contiguous memory addresses – it was obvious that my understanding of something else was awry. The data seemed to write 32 bits okay, but reading at different address offsets was quite the challenge. The Xilinx documentation does not explain the data organisation for the 128-bit data signals which come out of the controller. I assumed a simple burst read of [addr, addr+1, addr+2] data would present itself, but this didn’t seem to be the case. I memory mapped the internal signals in and out of the Memory Controller, and wrote another test which read and wrote to DDR3 memory, and dumped the 128-bit read buffer as well. As you can see from below, there were some really odd things going on:

I managed to find patterns for all the different read modes we needed, and the naturally aligned addresses for 4, 2 and 1 byte reads and writes. For writes, I used some lookup tables to assist with generating the byte enable signal – the 128bit write data input was byte masked, so that writing values which are not 128 buts in length do not appear to require a read-modify-write operation.

As it stands, I have still not found out why the read operations result in odd looking data from the controller. All things considered, it’s likely I have made an error early on in the implementation of the data swizzling for writes using the controller, which then dominos into the read operations. It’s something I’ll be returning to. This is fine for us just now, but I was intending on using the burst data to fill caches in later work on the CPU, and for that to work efficiently we will need to fully understand how this burst data organization works.

So. 256MB of ram unlocked, and passing fairly basic read/write tests. But the latency is very slow! Much slower than expected. I added some counters to the read state machine, and it seems to take around 22 cycles for a read. At 82MHz this is a long time. However, when searching online, it seems others have also seen this kind of latency, so I did not look into it – and instead tried to decouple the DDR memory clocking from that of the CPU and the rest of the SoC, like block rams.

Clock Domains

A clock domain can be imagined by separating out all aspects of your design into blocks. For each block which runs off a certain clock, it is on that clocks domain. Generally, blocks which synchronize off the same clock, and therefore are in the same domain, can pass signals between them relatively easily.

Currently, we have many clock domains in the SoC, but only 1 in the CPU side of things. The other clock domains are for HDMI output and due to the Block Rams being dual ported with dual clock inputs, we do not need to pass any raw signals across clock domains. For our DDR3 and CPU clocks to be different however, multiple signals would need to cross clock domains.

There are numerous issues that can arise when trying to pass synchronized signals across different clock domains. The simplest issue is when trying to read a fast clock signal from a slow clock – and the slow clock completely missing a strobe of the faster signal.

There are various methods of crossing signals from one clock to another, with some basic techniques starting with using multiple flip flops to latch signals into other domains. In VHDL, this just looks like a process which reads and latches the value from one signal into another on the destination clock.

As we are dealing with single read and write transactions, I decided to cross the CPU to Memory Controller clock domain using state handshaking. We will use two integer signals, which have multiple versions for stability in each clock domain, to implement a state machine process which crosses the clock safely. This can introduce several cycles of additional latency, however at the moment we are trying for a working solution – not a very efficient one. We can give up a few cycles of latency in the CPU clock domain if this means we can run the CPU much faster than the memory controller. The dataflow for a read looks like the following.

And writes:

The way this is implemented is with two processes, one clocked off the Memory Controllers 81.25MHz, the other off the higher CPU clock. The CPU process is actually just the MEM_proc one from before, to make interfacing with the CPUs own memory request logic easier. Each process reads the state of the other using stable signals which are latched into their own clock domain in an attempt to avoid metastability.
When both differently clocked processes are in a known state, I read the memory data required across the clock domain. I am not sure this is strictly safe – some say things such as data buses should be passed through a FIFO.

It took a few attempts to get this right, mainly due to realising that now I could not run the CPU at previously attainable speeds. It seems that just having the DDR3 controller synthesized into the SoC meant that the CPU core could not get over 166MHz. I have settled on 142.8MHz for now (1GHz/7 – easily attainable from my existing clocking system) as the CPU clock – fast enough for most needs but low enough that any timing issues do not arise. Many additional timing warnings appear on synthesis when the DDR3 controller is included, which was unexpected since this controller was generated by the internal tools. Like I mentioned last part, I do intend to look into these warnings and understand them with an aim of resolving them to unlock higher clocks. For now, 142MHz will do fine – and the additional memory space of DDR3 is very welcome!

With the DDR3 memory integrated, I also made quick microSD PMOD adaptor out of a level-shifting microSD adaptor intended for 5v Arduino projects using SPI. The level shifter is not needed for our purposes, as the FPGA already runs on 3.3v logic. No modification of the part was required – just soldering the pins to the 0.1” PMOD headers was enough to work.

The SPI port is exposed externally on Pmod JD. The way the SPI port is accessed is the same as discussed in the previous posts – just different memory mapped addresses. I created a new bootloader for the internal BRAM to perform a small memory test (which validates the endian byte swizzling), then initialize SPI and mount an SD card if present. It will examine a BOOT elf file and copy it into the required physical memory location, which can be in DDR3 or BRAM. It will then jump to that elf entry point to continue execution.

FPGA Utilization

With this all coming together to form a SoC design, I looked into the utilization report, which tells you how many resources your design is using. The report looks as follows:

The DDR3 memory controller takes up the largest amount of resources – not unexpected, they are incredibly complex! I’m quite happy with the current utilization of the RPU core itself. I’ve not been looking into optimization for resources, so having the CPU take up less than 6% of the Spartan S7-50 seems good. Looking at the breakdown of where the utilization of resources comes from, it seems the internal management of memory requests is taking a larger amount of resourced than I’d expect. It will be interesting to see how this changes when the endian swizzle logic is brought into the CPU core, and eventual cache logic added.

That bring this part to a close. I am now looking into implementing the required RISC-V Control and Status Registers (CSRs) into RPU. Currently I have memory-mapped data that should be available though CSR instructions. Adding them will be interesting as it will require changes to the CPU interface. Read about that next time!

As mentioned last time, the code for the DDR3 controller is already up on github. I still need to check that I’ve implemented the read/write data swizzling to the DDR3 correctly – I’ve a feeling there is a mistake in the writes somewhere, which then impacts the reads.
Thanks for reading, and as always, any questions can be directed at myself on twitter @domipheus.

Designing a RISC-V CPU in VHDL, Part 16: Arty S7 RPU SoC, Block Rams, 720p HDMI

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

It’s finally time – the big deploy onto Digilent’s Arty S7 board.

In my previous part, I went over at a high level the changes made to my TPU cpu core in order to make it consume RISC-V. The CPU itself is still very simple, and I removed some of the more interesting features from TPU such as interrupts. Interrupts as implemented on TPU would not comply to the RISC-V spec, so best they were stripped out for the time being.

After I got my RISC-V SoC up and running on MiniSpartan6+, I was looking to develop my own Spartan 7 FPGA board to use as a programmable computer kit – FPGA for Soft CPU, another for Soft GPU, a microcontroller for system management – maybe even a 3rd FPGA for chipset I/O. However I quickly came to my senses and realised just how much of an endeavour that is. Spinning my own PCBs, soldering those big BGA chips (and ultimately failing to solder those £40-a-piece chips) would be a very costly affair. It is still a long term goal, but in the meantime I wanted to get an off-the-shelf Spartan 7 FPGA board, essentially to bringup the FPGA side of an eventual move to my own development board. When I saw the Digilent Arty S7 announced last year, I kept an eye on it knowing it would be a contender for my upgrade path to the 7-series chips. The Arty S7 I ended up purchasing is the S7-50 variant, sporting the XC7S50 Spartan 7 FPGA with significantly more resources than my previous Spartan 6 board. It also had 256MB DDR3 RAM, but lacked HDMI connectivity. Thankfully, the HDMI/DVI-D output issue has been solved and you can read about that in a previous article.

I’m based in the UK, and was able to purchase the Arty S7-50 from for £119 delivered. Checking just now, it looks like the price has increased and you’d now be paying in the region of £125. It’s still a very nice board for that! So, this post is dealing with porting my existing RISC-v “SoC” to this new FPGA board.

The SoC consists of my RPU CPU, fast internal FPGA Block RAM storage, external (and slow!) DDR3 memory, my HDMI output with legacy text mode HDMI output, and finally, access to SD card storage via SPI. First, we have to tackle a new development environment.

Xilinx Vivado

With the new 7-series FPGAs comes a new set of design tools for authoring HDL and deploying to devices. As with ISE, which we used for Spartan 6 FPGAs, Vivado has a free “webpack” edition which is compatible with Arty S7. You can grab the download from Xilinx here. For clarity, I use the 2018.1 version. 2018.2 is the latest version as this is written.

The UI has changed quite a bit from ISE. In my opinion, Vivado is harder to navigate and parts of the interface seem very clunky. The general areas of interest remain; a Project Manager “flow” holds the source hierarchy, and the Simulation, Synthesis, and Implementation flows hold the respective details and commands for those aspects of the project.

To create projects for the board, we will need the board definition files which are provided by Digilent. There is a small guide on how to download and install them, available here. If you already have Vivado installed, you want to skip to section 3. Digilent also have a demo project available with your usual flashy LED hello world functionality, if you want to start very simple. For the rest of this article however, I’m jumping straight into the project.

Vivado base Arty S7 project

With our RPU core interface already defined, we need to just import the VHDL source files into our project and begin a new top level design which incorporates the core. As a reminder, the RPU core entity is shown below, and it still heavily resembles the old TPU core.

The top level component will have various input and output definitions. We require the following:

  • Clock input
  • Switch input
  • LED output
  • TMDS DVI-D HDMI output
  • SPI input/output
  • DDR3 memory input/output

We will leave out SPI and DDR3 definitions for now. With this, our definition is as follows:

entity rpu_top is
Port (
-- Input 100MHz clock

-- Input switches from board
sw : in STD_LOGIC_VECTOR (3 downto 0);

-- Output leds to board
led : out STD_LOGIC_VECTOR (3 downto 0);

-- HDMI (DVI-D) video output
hdmi_out_p : out STD_LOGIC_VECTOR(3 downto 0);
hdmi_out_n : out STD_LOGIC_VECTOR(3 downto 0)
end rpu_top;

We need to ensure these signals are mapped to the pin constraints of the Arty S7. My board is a Rev. B board, so I made a copy of the respective .xdc file provided with the Digilent board files and exited it to point to my named signals. If you’re familiar with Xilinx ISE, this is like the old .ucf constraints file from my TPU CPU project. An example is below, with LEDs, Clock and the HDMI output defined. With HDMI we need to ensure the signal standard is TMDS_33. This is the definition required to map to my simple HDMI Pmod connectors.

Block RAM

The next thing we need to do is figure out some block ram memory. The block ram primitive objects have changed in the Spartan 7 series FPGAs and are larger. Unlike the MiniSpartan6 implementation, where I manually initialized the block rams and added additional switching logic, I am now using the Xilinx IP Block memory generator. This is available in the free Webpack version of vivado and allows for generation of a block ram object of your own data width and memory size. Internally, multiple block ram primitives will be combined into a single interface. I want 64Kbyte block rams, which are made up of 16 smaller 4Kbyte hardware block rams. With a 32 bit data interface, this Block Ram Generator saves us a lot of work. It also allows for a initialization file to be provided, so the block ram can have a defined contents at reset. This allows us to have our bootloader present in memory for bootstrapping the SoC.

Another option in the block memory generator which we must ensure is unselected is “common clock”. We want our block ram to be a true dual-port ram, with separate clocks for each. This allows the rams to be connected both to the CPU core for read/write, and also another system, such as our HDMI character generator for use as text console storage – running at a pixel clock rate, instead of the CPU clock rate.

With this, you can create the interface via the GUI and generate an HDL wrapper.

I created a wrapper so I could edit the source and ensure the data out ports were tri-stated when the block ram was disabled, for easier plumbing of multiple block rams together.

With our block ram wrapper available, we can connect this to memory interface of the RPU core. This is fairly simple, and we can attack multiple block rams by enabling them and muxing data lines depending on address bits. Because we tri-state the data output bus when the block ram is not selected we should be able to use a single output for all block rams, but for now I assign a signal to each output and explicitly select the one we need.

With the above, we should have 196Kbyte of total addressable RAM. However, there is a significant issue which needs attention before any code will run from these BRAMs. RPU currently expects data presented to it to be formatted into the expected data format. The memory interface does not have a byte enable or such like, as you would typically expect. So memory requests need swizzled for endianness and size prior to being given to RPU. This system will be getting a significant overhaul soon, which will change all of this so RPU is presented with simply a raw view of memory – but for now, we need to swizzle data from the BRAM before it enters the CPU. To do this, there are two additional processes, and the signals assigned by these processes are what is read from or written to the CPU interface.

There is one more process associated with memory, and that handles the state machine for handling the requests from the CPU. It also will assign various I/O data. It simply maps addresses to signals. I think this process can be implemented better, but for ease of editing for tinkering it works well for now.

Getting some code running

At this point, you should be able to write your “chasing LED” program and have it as the BRAM initial contents, so when the Arty S7 board is flashed with the FPGA bitcode you will see the onboard LEDs flash.

The code for this is just as simple as you’d expect.

    unsigned int i = 0u;
    volatile unsigned int* ui_addr_leds = (unsigned int *)IO_ADDR_LEDS;

    while (1)
        *ui_addr_leds = (i++)>>18;

IO_ADDR_LEDS above is defined to be 0xf0009000, so the MEM_proc process you can see earlier picks up this memory write and redirects the lowest significant 4 bits to the external LED I/O pins.

I previously mentioned that I’d build my own RISC-V toolchain using Windows Subsystem for Linux. I attempted to build the latest version of riscv-tools, however I kept running into build issues this time around. I instead have switched to using the GNU MCU Eclipse RISC-V Embedded GCC toolchain which is very handily released as a full windows binary package. It can be obtained from the github releases project page. A basic main() function with the above code is compiled with a linker script to place it at location 0x00000000 with no standard libraries or start files. The resulting elf binary is tiny, and you can use objdump with the -s argument to get part of a hex dump output which you can then transform into the .coe file required by the Xilinx block ram generator to use as the initial BRAM contents. The .coe file format is simple, and consists of two declarations – the radix of the data to follow, and a comma separated vector containing the data itself.

37810000, 1301c1ff, ef005074, 6f000000,
13060500, 13050000, 93f61500, 63840600,

I use this method to create the real bootstrap firmware which initializes and ends up copying code from SD card into the DDR3 ram for execution – but more on that in the next post! I was also able to use the new simulator in Vivado to check internal signals while some code ran.

One thing that caught me out is that changing the .coe initial BRAM contents file and rebuilding the project will not bring the changes from that file into the new BRAM IP. You need to right click the BRAM in the designer, select reset output products, and then generate them again for the updated .coe to be integrated. A rather annoying, slow and unnecessary step in my opinion – but maybe there is a reason for this that I do not understand as yet.

HDMI output

Flashing LEDs are cool, but we have a character generator to port! My previous miniSpartan6 design ran HDMI out at 640×480, using a widely available USB powered HDMI panel targeted at Raspberry Pi use. With ArtyS7, I wanted more resolution, and have defaulted to outputting 720p60. The changes to allow this on the DVI-D side of things are minimal – pixel clock updated, and the VGA timing signals for 1280×720 at 60Hz are used.

The pixel clock for 720p should be 74.25MHz, but the much-easier to obtain 75MHz will generally still work. The previous character generator (discussed in this blog post) works off of a 5x pixel clock – which in this case would be 375MHz. This is too high – the Spartan7 block rams of the Arty S7 are rated for around 350MHz – so we need to architect the character generator to run off of the raw pixel clock of 75MHz. This is actually fairly simple to do. The character generator works by feeding in an X and Y position of the current VGA timing location as it’s scanned out. If we offset the timing locations, by putting the various sync signals through an 8-deep FIFO, we can feed the character generator pixel values which are 8 pixels early – allowing the generator 8 cycles of latency in order to perform the necessary text and font lookups from BRAM. As the font glyphs are 8 pixels wide, we can prefetch the next glyph data. By the time the sync signals are at the end of the FIFO, the character generator will be providing the correct pixel colour values for the glyph row required.

The new pipeline for the character generator is as follows:

This provides a 160×45 text console display with 8×16 glyphs. To connect the FPGA dev board to an HDMI monitor, I took what I learned from my previous HDMI pmod post and made my first PCB using PCBWay. Then I soldered an surface mount HDMI connector and the PMOD 0.1″ angled header terminals. I may post a video of this in the future. I tinned the surface pads with a soldering iron and used a hot air gun with lots of flux to reflow the connector to the pads.

It’s a very useful converter! And for my first attempt at soldering 0.5mm pins, a successful first PCB 🙂

I am using an Atrix Lapdock as my HDMI sink – it’s a laptop form factor with HDMI input for the screen, and a USB hub with integrated keyboard, trackpad and battery. Again, this is usually used to make raspberry pi laptops, as the Lapdock itself can provide 5v power to devices as well as powering the screen. So with this, I have a RISC-V Laptop!

That is it for now – I intended to discuss the DDR3 memory in this post, but it just got too long. That post will follow shortly.

I have put the RPU CPU Core HDL on Github, as well as the Arty SoC project. This code is in front of these blog posts and includes the DDR3 implementation, so if you are impatient you can go and look now. There are timing constraint violations introduced with the HDMI output and DDR3 IP implementation, but I have yet to look into them – and the built FPGA bitfile flashes to my board and runs at a lower speed.

Thanks for reading! As always, I am available on twitter @domipheus for any queries. If you try out the SoC from github, let me know!

Designing a CPU in VHDL, Part 15: Introducing RPU

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

It’s been a while. Despite the length of time and lack of posts, rest assured a significant amount of progress has been made on my VHDL CPU over the last year. I’ve hinted at that fact multiple times on twitter, as various project milestones have been hit. But what’s changed?

First and foremost; the CPU now consumes RISC-V. It’s decoder, ALU and datapaths have been updated. With that, the data width is now 32 bit. The Decoder accepts the RV32I base integer instruction set.

I’d been putting off multiple side-projects with my existing TPU implementation for a while. It’s 16-bit nature really made integrating the 32MB SDRAM into the system a rather pointless affair. I’m all for reminiscing but I did not want to go down the rabbit hole of memory bank switching and the baggage that would entail. The toolchain for creating software was already at the limit – not the limit in terms of what could be done – but the limit in what I’m prepared to do in order to perform basic tasks. We all love bare metal assembly, but for the sake of my own free time, I wanted to just drop some C in, and I was not going to make my own compiler. I looked into creating a backend for LLVM, but it’s really just another distraction.

As a reminder, here is where we left off.

Block diagram of TPU CoreSo, what’s actually changed from the old 16-bit TPU?

  • Many more registers,
  • Datapath widened to 32-bit
  • Decoder and ALU updates for RV32I ISA
  • Glue logic/datapath updated for new functions

The register file is basically the exact same as before with the 16-bit TPU, extended for 32 entries of 32 bits names x0-x31. In RV32I, x0 is hardwired to 0x00000000. I’ve left it a real register entry, but just never allow a write to x0 to progress, keeping the entry always zero. The decoder and ALU is fairly standard really – there are a small number of instruction formats, and where immediates are concerned there is some bit swapping here and there, but the sign bits are always in the same place which can make things a bit easier in terms of making the decode logic easier to understand.

The last item about datapath changes follows on from how RISC-V branch instructions operate. TPU was always incredibly simple, which meant branching was quite an involved process when it came to function calling and attempting to get a standardized calling convention. There was no call instruction, and the amount of operations required for saving a calculated return from function address, setting up call arguments on the stack and eventually calling via an indirect register was rather irritating. With the limited amount of registers available on TPU, simplification via register parameter passing didn’t solve much as you would be required to save/restore register contents to the stack regardless.

RISC-V JAL instruction definition screenshot from ISARISC-V has several branch instructions with significant changes to dataflow from that of TPU. In addition to calculating the branch target – which is all TPU was able to do, RISC-V calculates a return address at PC+4, which is then written to a register. This means our new ALU needs two outputs, the standard result of a calculation, and the branch target. The branch target with the shouldBranch output status feeds into our existing PC unit, with the result feeding as normal the register file/memory address logic. The new connections are shown below in yellow.

RPU block diagramIn terms of old TPU systems, the old interrupt system needs significant updating and so is disabled. It’s still got block ram storage, UART, and relies on the same underlying memory subsystem, which in all honesty is super bloated and pretty poor. The memory system is currently my main focus – it’s still made up from a single large process at the CPU core clock. It reads like an imperative language function, which is not at all suitable for the CPU moving forward. The CPU needs to interface with various different components, from the block rams, UART, to SPI and SDRAM controllers. These can be running at different clocks and the signals all need to remain valid across these systems. With everything being accessed as memory-mapped IO, the memory system is super important, and I’ve already run into several gotchas when extending it with an SDRAM controller. More information on that later.


As I mentioned earlier, one of the main reasons for moving to RISC-V was toolchain considerations. I have been developing the software for my ‘soc’ with GCC.

With Windows 10, you can now use the Linux Subsystem for Windows to build linux tools within Windows. This makes compiling the RISC-V toolchain for your particular ISA variant a super simple process. has some details of how to build relevant toolchains, and Microsoft have details of how to install the Linux Subsystem.

Data Storage

With the RISC-V GCC toolchain built, targeting RV32I, I was able to write a decent amount of code in order to create a bootloader which existed inside the FPGA block ram. This bootloader looked for an SD card in the miniSpartan6+ SD card slot. It initialized the SD card into slow SPI transfer mode so we could get at it’s contents.

Actually getting to data stored on an SD card is incredibly simple, and has been written up nicely here. In terms of how the CPU accessed the SD card, I found a SPI master VHDL module and threw it into my project, and memory mapped it so you could use it from the CPU.

#define SPI_M1_CONFIG_REG_ADDR 0x00009300
#define SPI_M1_BUSY_REG_ADDR   0x00009304
#define SPI_M1_DATA_REG_ADDR   0x00009308

void spi_sd_reset()
	volatile uint32_t* spi_config_reg = (uint32_t*)SPI_M1_CONFIG_REG_ADDR;

	uint32_t current_reg = *spi_config_reg;
	uint32_t reset_bit = 1U << SPI_CONFIG_REG_BIT_RESET;
	uint32_t reset_bitmask = ~reset_bit;

	// Multiple writes gives the controller time to reset - it's clock
	// can be slower than this CPU
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;

	// return to original, but ensure the reset bit is set (active low), 
	// in case the register was previously clobbered by some other operation
	*spi_config_reg = current_reg | reset_bit;

uint8_t spi_sd_xchg(uint8_t dat)
	volatile uint32_t* spi_busy_reg = (uint32_t*)SPI_M1_BUSY_REG_ADDR;
	volatile uint32_t* spi_data_reg = (uint32_t*)SPI_M1_DATA_REG_ADDR;

	*spi_data_reg = (uint32_t)dat;

	while (*spi_busy_reg != 0);

	return *spi_data_reg;

A few helper functions later, the bootloader did a very simple FAT32 root search for a file called BOOT, and copied it into ram before jumping to it. It also copied a file called BIOS to memory location 0x00000000 – which had a table of I/O functions so I could fix/extend functionality without needing to recompile my “user” code.

typedef struct {
    FN_sys_console_put_stringn    sys_console_put_stringn;
    FN_sys_console_put_string     sys_console_put_string;
    FN_sys_console_setcursor      sys_console_setcursor;
    FN_sys_console_setpen         sys_console_setpen;
    FN_sys_console_clear          sys_console_clear;
    FN_sys_console_getcursor_col  sys_console_getcursor_col;
    FN_sys_console_getcursor_row  sys_console_getcursor_row;
    FN_sys_console_getpen         sys_console_getpen;
    FN_sys_clk_get_cyclecount     sys_clk_get_cyclecount;
    FN_spi_sd_set_clkdivider      spi_sd_set_clkdivider;
    FN_spi_sd_set_deviceaddr      spi_sd_set_deviceaddr;
    FN_spi_sd_xchg                spi_sd_xchg;
    FN_spi_sd_reset               spi_sd_reset;
    FN_uart_tx1_send              uart_tx1_send;
    FN_uart_rx1_recv              uart_rx1_recv;
    FN_uart_trx1_set_baud_divisor uart_trx1_set_baud_divisor;
    FN_uart_trx1_get_baud_divisor uart_trx1_get_baud_divisor;
// Above functions are basic bios and provided by BIOS.c/BIN
// Below are extensions, so null checks must be performed to check 
// initialization state
    FN_pf_open    pf_open;
    FN_pf_read    pf_read;
    FN_pf_opendir pf_opendir;
    FN_pf_readdir pf_readdir;

With the FPGA set to look for this BOOT file off an SD card, testing various items became a whole lot easier. Not having to change the block ram and rebuild the FPGA configuration to run a set of tests saved lots of time, and I had a decent little setup going, however I was severely limited by RAM – many block rams were now in use, and yet I still did not have enough RAM. I could have optimized for space here and there, but when I have a 32MB SDRam chip on the FPGA dev board it seemed rather pointless. It was finally time to get the SDRAM integrated.


Getting an SDRAM interface controller for the chip on the miniSpartan6+ was very easy. Mike Field had done this already and made his code available. The problem was that whatever I tried, the data out from the SDRAM was garbage.

Or, should I say, out by one. Everything to do with computers has an out by one at some point.

Anyway, data was always arriving delayed by one request. If I asked to read data at 0x0000100, I’d get 0. If I asked for data at 0x0000104, I’d get the data at 0x00000100. I could tell that writing seemed to be fine – but reading was broken. Thankfully discovering this didn’t take too much time, due to being able to boot code off the SD card. I wrote little memory checking binaries and executed them via a UART command line, the debug data being displayed on the HDMI out. It was pretty cool, despite me getting wound up at how broken reading from the SDRAM was.

memory testIt took a long time to track down the actual cause of this read corruption. For the most part, I thought it was timing based in the SDRAM controller. I edited the whole memory system multiple times, with no changes to the behavior. This should have been a red flag for me, as the issue was being caused by the constant through all of these tests – the RPU core!

In RPU, the signals that comprise the input and output of each pipeline stage get passed around and moved from unit to unit each core clock. This is where my issue was – not the “external” memory controller!

Despite my CPU being okay with waiting for the results of memory reads from a slower device, it was not set up for correctly forwarding that data on when the device eventually said the request had completed. A fairly basic error on my part, made harder to track down by the fact the old TPU CPU design and thus RPU had memory requests going through the ALU, then control unit, and some secondary smaller memory controller which didn’t really do anything apart from add further latency to the request.

I fixed this, which immediately made all the SDRAM memory space available. Hurrah.

It’s rather annoying this issue did not present itself sooner. There are already many “slow” memory mapped devices connected to the CPU core, but they generally work off of FIFOs – so despite them being slow, they are accessed via queries to lines to say whether there is data in the FIFO, and what the data is. These reads are actually pretty fast, the latency is eaten elsewhere – like spinning on a memory read. Basic SDRAM read latency however for this was in the 20 cycle range at 100MHz, way more than previously tested.

Having the additional memory was great. Now that we have a full GCC toolchain, I could grab the FatFs library, and have a fully usable filesystem mounted from the SD card. Listing directories and printing the contents of files seemed like such a milestone.

Next steps

So that’s where we are at. However, things be changing. I have switched over to Xilinx Spartan 7, using a Digilent Arty S7 board (the larger XC7S50 variant). I’m currently porting my work over to it.

If you saw my last blog post, you’d have seen that HDMI out at least works, and now I’m working on porting the CPU over, using block rams and SD card for memory/storage. Lastly will be a DDR3 memory controller. 256MB of 650MHz memory gives much more scope for messing up my timing even more. It should be great fun!

I hope to document progress more frequently than over the past year. These articles are my projects documentation, and provides an outlet to confirm I actually understand my solutions, so the lack of posts has made things all the more disjoint. The next post shall be soon!

Thanks for reading. Let me know of any queries on twitter @domipheus.