Designing a RISC-V CPU in VHDL, Part 21: Multi-cycle execute for multiply and divide

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

One of the things which RPU has done from the start is keep the CPU pipeline very simple. It’s a Fetch, Decode, Execute, [Memory], Writeback pipeline, but it does not run pipelined. That is, at any cycle only one of these stages is active, and all state within the CPU corresponds to a single instruction from fetch to writeback. Because of this our instructions-per-cycle (IPC) count is very low, but the implementation is simpler to understand in terms of dataflow.

The control unit decides which stage is active, drawing on state outputs from previous stages to make those decisions. Each of the decode, execute and writeback pipeline stages takes a single cycle to execute. Memory-accessing stages, like Fetch and Memory, can take a variable number of cycles. The CSR unit actually requires multiple cycles to operate – a write or read-modify-write operation like csrrci takes 3 cycles to fully complete – however this is done asynchronously with the traditional pipeline, so does not impact the one-cycle-per-stage limitation. The execute stage for a csrrci instruction remains one cycle.

Some operations really do require multiple cycles to execute – without asynchronous operation – and the execute stage will need to wait for them to complete. An example of this is integer multiply and divide, operations defined in the RISC-V M extension. RPU did not support these operations, but now does – along with an execution phase which can span an arbitrary number of clock cycles. This is how it’s implemented.

Let’s start – Multiply

Currently, the RPU decode stage requests an illegal instruction exception be raised if an opcode relating to the M extension is found. Previously, I have used this mechanism to implement software multiply and divide in the exception handler, which allowed Zephyr RTOS to boot while building binaries which targeted rv32m – slow, but effective.

The first thing we need to change is this behaviour: allow the M-extension instructions to pass the decoder, and set a multi-cycle bit in the decoder output for the control unit to consume.

The multi-cycle bit

The control unit of RPU is basic, and the main CPU stages simply advance to the next stage every cycle – as long as no interrupts are requested. We need to modify this behaviour when we are in the execute stage to enable multi-cycle use. We do this with a pair of new signals: one from the decoder, “O_multicyAlu”, and one from the ALU, “O_wait”.

The output from the decoder, “O_multicyAlu”, is really an optimization; we could ignore it and rely solely on the O_wait from the ALU, but that would add a cycle of latency to every execute stage – something we do not want. Instead, the decoder knows that the multiply and divide instructions take multiple cycles, and tells the control unit so. The control unit then keeps the execute state active until the ALU O_wait output is no longer asserted. Simple, and it works well for our use case. This is where multiply and divide diverge in how they are implemented on RPU, so we’ll jump right into how multiply is handled in the ALU.

From the spec, we can see that there are 4 different multiply operations we need to implement. MULW only applies to rv64, so is ignored in our rv32 implementation.

  • MUL – write bottom 32-bit result of rs1 * rs2 into rD.
  • MULH – write upper 32-bit result of (signed(rs1)) * (signed(rs2)) into rD.
  • MULHU – write upper 32-bit result of (unsigned(rs1)) * (unsigned(rs2)) into rD.
  • MULHSU – write upper 32-bit result of (signed(rs1)) * (unsigned(rs2)) into rD.

For my implementation, I perform three multiplies for the various signed/unsigned options, and then in the next cycle I select what is written to the destination register.

I rely on the VHDL typecasts for unsigned/signed conversion, apart from the MULHSU case, where I manually sign-extend one operand by prepending its MSB, while the other operand gets a ‘0’ prepended to force it unsigned. I use a two-state machine to keep track of when to write the result – but as you can see from the comment, we immediately deassert O_wait on the first encounter, as there is always an additional cycle available to complete the operation.
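Since the VHDL itself isn’t shown inline here, the result selection can be sketched as a small Python behavioural model – an illustration of the semantics, not RPU’s actual code:

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

def rv32m_multiply(rs1, rs2):
    """Model of the three sign-combination products and the four result picks."""
    ss = to_signed32(rs1) * to_signed32(rs2)  # signed * signed
    uu = rs1 * rs2                            # unsigned * unsigned
    su = to_signed32(rs1) * rs2               # signed * unsigned (MULHSU)
    return {
        "MUL":    uu & MASK32,                # low half is identical in all modes
        "MULH":   (ss >> 32) & MASK32,
        "MULHU":  (uu >> 32) & MASK32,
        "MULHSU": (su >> 32) & MASK32,
    }
```

Note that the low 32 bits of the product are the same regardless of sign interpretation, which is why MUL needs no sign handling at all.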

FPGA Multiply

“But Colin – you’re just… multiplying, on an FPGA, and it’s working?!” – Yup. My target FPGA, the Xilinx Spartan 7-50 on a Digilent Arty S7 board synthesizes a 32-bit integer multiply operation in the VHDL into a set of cascaded DSP primitive slices – neat huh?

Just like the blocks I use for fast RAM on the FPGA, there are fixed-function hardware blocks embedded into the FPGA which can be utilized by user designs. In this case, it’s the DSP48 block, which has a 25×18-bit multiplier. The synthesis tools chain multiple of these blocks together to generate our 32×32 results.
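The idea behind that chaining can be modelled in software: a wide product decomposed into narrower partial products. This sketch uses 16-bit halves for simplicity rather than the DSP48’s actual 25×18 shape:

```python
def mul32_via_partials(a, b):
    """32x32 -> 64-bit product built from four 16-bit partial products."""
    al, ah = a & 0xFFFF, a >> 16  # low and high halves of each operand
    bl, bh = b & 0xFFFF, b >> 16
    # sum the partial products, shifted into their weighted positions
    return (al * bl) + ((al * bh + ah * bl) << 16) + ((ah * bh) << 32)
```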


For division, it’s not so easy. I need to implement it manually. Thankfully, integer binary long division is an algorithm which can be easily understood and transformed into VHDL for use in my ALU.

I decided to create a new division entity which encapsulates this algorithm and exposes signals for the ALU to use. I took the algorithm explained on Wikipedia, and then made it also work for the various operations required:

As with multiply, some operations only apply to rv64, which we can ignore. We need to implement the following:

  • DIV – signed integer division
  • DIVU – unsigned integer division
  • REM – signed integer remainder
  • REMU – unsigned integer remainder

The long division algorithm assumes unsigned integers, so for signed use we can test the sign bits when we start a divide operation. We then feed the negated values, if negative, through the unsigned operation loop – before restoring the sign at the end of the operation to get the correct result.

There are some edge cases that are handled separately – notably division by zero. In the RISC-V spec, division by zero specifically does not raise an exception.
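The sign handling and those edge cases can be modelled as follows – a Python sketch, not the actual VHDL. Per the spec, divide-by-zero returns all ones for DIV/DIVU and the dividend for REM/REMU, and the signed overflow case (most-negative value divided by -1) returns the most-negative value with remainder 0:

```python
MASK32 = 0xFFFFFFFF
MIN32 = 0x80000000  # bit pattern of -2**31

def to_signed32(x):
    return x - (1 << 32) if x & 0x80000000 else x

def rv32m_divide(rs1, rs2):
    """Model of DIV/DIVU/REM/REMU result selection, including edge cases."""
    if rs2 == 0:
        # division by zero does not trap: quotient all ones, remainder = dividend
        return {"DIV": MASK32, "DIVU": MASK32, "REM": rs1, "REMU": rs1}
    a, b = to_signed32(rs1), to_signed32(rs2)
    if rs1 == MIN32 and b == -1:
        # signed overflow: quotient wraps back to MIN32, remainder 0
        q, r = MIN32, 0
    else:
        # divide the magnitudes through the unsigned path, then restore signs
        q, r = abs(a) // abs(b), abs(a) % abs(b)
        if (a < 0) != (b < 0):
            q = -q
        if a < 0:
            r = -r
        q, r = q & MASK32, r & MASK32
    return {"DIV": q, "REM": r,
            "DIVU": (rs1 // rs2) & MASK32, "REMU": rs1 % rs2}
```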

The division unit runs in three states; IDLE, INFLIGHT and COMPLETE.

STATE_IDLE: Waits for an execute request and, if one arrives, sets up the inputs for the required operation and passes to inflight. Division by zero or by one are edge cases handled here, short-circuiting straight to complete.

STATE_INFLIGHT: Runs for 32 cycles, performing the unsigned binary long division algorithm loop.

STATE_COMPLETE: Selects the required output values, and adjusts for signed/unsigned operation.

The inflight state performs the loop from the algorithm, and the VHDL can be seen to match the Wikipedia pseudocode:

Note that whenever R is referenced, we always need to use its full representation of s_R(30 downto 0) & s_N(s_i), as the initial assignment in the Wikipedia code would not be visible until the next cycle in the VHDL.
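For reference, here is the loop as a software model – one iteration per INFLIGHT cycle, MSB first; the shift-in of N(i) corresponds to the s_R(30 downto 0) & s_N(s_i) concatenation noted above:

```python
def long_divide_u32(n, d):
    """Unsigned binary long division: 32 iterations, one quotient bit each."""
    q, r = 0, 0
    for i in range(31, -1, -1):
        r = ((r << 1) | ((n >> i) & 1)) & 0xFFFFFFFF  # R := R << 1 | N(i)
        if r >= d:         # subtract the divisor when it fits
            r -= d
            q |= 1 << i    # and set the corresponding quotient bit
    return q, r
```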

The interface for the division unit is as follows.

The operations passed in I_op correspond to two bits of the funct3 part of the opcode for divide operations. This means they pass straight through from the decoder to this divide unit. You’ll notice there is an interrupt output on this unit, but since the RISC-V spec defines division by zero not to trap, it is currently unused.

Now that we have a facility for performing the 32-bit integer divide operations we require – it’s time to put it all together! We place the divide unit component in our ALU, and control it when we receive a divide operation. We drive the O_wait output of the ALU, keeping it active until the divide unit is complete, before forwarding the divide result to the ALU’s own result output. The control unit then transfers us to writeback, for the register write of the result.

And that’s it. I used the riscv-compliance runner for mul and div to test my implementation, and fixed a few issues with signed handling in divide – but other than that, things were generally straightforward.

In terms of performance, the biggest win was for multiply – two cycles is pretty quick compared to the super-slow emulation via invalid instruction interrupts, or soft multiply libraries. Divide wasn’t as big a win, but this is mainly due to my use cases not making much use of divide. In Doom, there was a gain with the hardware multiply – I hope to cover that in more depth in a post specifically on optimizing the SoC for Doom. Using hardware multiply support instead of relying on the software multiply I was using (based on llvm compiler-rt) resulted in a 28% increase in FPS on timedemo demo3.

The performance is as follows:

Multiply: execute: 2 cycles
Divide/Remainder: execute: 34 cycles

Possible Optimizations

Reading the spec on division operations, you can see the note above on operation fusing. This is an optimization whereby, if we need both the quotient and remainder of an operation, the CPU does the slow division operation only once, and then uses its results to complete both DIV and REM requests.

The way the division unit has been fixed to the ALU means that at the moment, for an instruction sequence of, say, div r4, r1, r2; rem r5, r1, r2, we will take at least 68 cycles to complete within the execute stage. Given that the inputs to the instructions (r1, r2) remain constant in this sequence, we should be able to reuse the intermediate results of the division instruction to complete the remainder operation quicker.

This would look somewhat like the following:

  • Within the alu_int32_div entity, when we get an I_exec command, test the input operands against those of a valid result we have previously computed.
    • If the new inputs match the previously computed, valid operation:
      • Perform any state modification which depends on the operation – as the operation can change, despite the data inputs being constant.
      • Move to STATE_COMPLETE, completely skipping STATE_INFLIGHT.
  • Write the output as before, which will draw on the currently requested operation, and use the existing values for the s_Q and s_R intermediate results.

With this set-up, skipping the STATE_INFLIGHT section will save 32 cycles – meaning our div+rem sequence will now only take 36 cycles, a considerable speedup if this is a hot path in the code.

As the comparisons for testing the input operands compare actual data values, rather than register numbers, we don’t need to worry about hazards. A register write of a different value to r1 between the operations would mean the inflight short-circuit would not be taken. Additionally, it means that the DIV and REM instructions do not need to follow each other directly for this to work.
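A behavioural sketch of that operand cache follows – illustrative Python with made-up names, unsigned only, and with the cycle counts following the 34-cycle figure above (2 cycles of setup/complete plus 32 INFLIGHT cycles):

```python
class DivUnitModel:
    """Sketch of DIV/REM fusion: cache the last operands and reuse Q/R."""
    def __init__(self):
        self.cached = None     # (rs1, rs2) of the last completed divide
        self.q = self.r = 0

    def execute(self, op, rs1, rs2):
        cycles = 2             # IDLE setup + COMPLETE
        if self.cached != (rs1, rs2):
            # cache miss: run the full divide (stands in for 32 INFLIGHT cycles)
            self.q, self.r = divmod(rs1, rs2)
            self.cached = (rs1, rs2)
            cycles += 32
        result = self.q if op in ("DIV", "DIVU") else self.r
        return result, cycles
```

Because the cache key is the operand values themselves, the model exhibits exactly the hazard-free behaviour described: a write of a different value to a source register simply misses the cache.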

I’ve not implemented this optimization in RPU yet, but it is on my to-do list!


We’ve complicated the pipeline picture a little, but added some nice functionality in the process.

For those of you who are also making RV32 cores – you may be interested to know you can pass a GCC toolchain an option which emits multiply instructions, but not division. This would have been great for me during development of RPU 1.0! I learned this from Luke Wren over on twitter.

Additionally, it seems the RISC-V spec is getting a multiply-only extension, which in my view is a good move.

That’s it for this part in the series! Thank you for reading. It’s been a long time since the last part; my excuse is that it’s been a fairly crazy year. The multiply and divide functionality is already on github, and has been for quite a while as part of the RPU 1.0 release. If you have any further questions, please do not hesitate to send me a message on twitter @domipheus!

Designing a RISC-V CPU in VHDL, Part 20: Interrupts and Exceptions

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Interrupts and exceptions are important events that any CPU needs to handle. The usual definition is that interrupts happen outside of the CPU – timer events, for example. Exceptions occur within the CPU, like trying to execute an invalid instruction. These events are handled all the time within a system, and whilst some signify faults and error conditions – most are just handling system functionality.

I mentioned earlier in the series how my previous CPU (TPU) had interrupt handling. However, the way it was implemented needed significant modification to work in a RISC-V environment. RPU now supports many more types of exception/interrupt, and as such is more complex.

Before we go further: in the RPU code I use the term interrupt to refer to both interrupts and exceptions. Unless I explicitly mention exceptions, assume I mean both types.

The Local Interrupt Unit

RPU will implement the timer interrupts as external, similar to how TPU did it. It will also support invalid instruction, system call, breakpoint, invalid CSR access (and ALU), and misaligned jump/memory exceptions. These generally fit into 4 categories:

  • Decoder Exceptions
  • CSR/ALU unit exceptions
  • Memory Exceptions, and
  • External interrupts

There are more subcategories to these, defined by an additional 32-bit data value describing the cause further, but these 4 categories fit nicely into 4 interrupt lines. The CPU can only handle one at a time, so with this in mind I created a Local Interrupt unit – the LINT unit – which takes all the various interrupt request and associated data lines, and decides which one actually makes its way into the control unit for handling. Internally, it is implemented as a simple conditional check of the different input categories, forwarding the data to the control unit and waiting for an acknowledge reset signal before going on to the next interrupt, if multiple were requested at once. The LINT also handles ack/reset forwarding to the original input units.
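That selection logic amounts to a fixed priority encoder over the pending request lines. A sketch – the exact ordering here is my illustration, not necessarily RPU’s:

```python
# Highest priority first; the order is illustrative.
LINT_PRIORITY = ("decoder", "csr_alu", "memory", "external")

def lint_select(pending):
    """Return the highest-priority source with a pending request, or None."""
    for source in LINT_PRIORITY:
        if pending.get(source):
            return source
    return None
```

Lower-priority requests simply stay pending until the selected one is acknowledged and the selection runs again.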

With this unit complete, we can add an O_int and O_intData output, as well as an acknowledge input for reset, to our decoder unit. This will attempt to raise an exception and set the intData output to be the cause as defined by the RISC-V standard, which will let any interrupt handler know which kind of request – invalid instruction, ecall/system call, breakpoint – caused the exception.

The CSR unit from the previous part already has a facility to raise an exception – it can check the CSR op and address to ensure the operation is valid. For instance, attempting to write a read-only CSR would raise an access exception. Whilst this is all implemented and connected up, the compliance suite of tests does not test access interrupts, so it’s not extensively tested. We will need to reinvestigate that once RPU is extended to fully support the different runtime privilege levels.

Memory exceptions from misaligned instruction fetch are found by testing the branch targets for having bit 1 set. Bit 0 is, by the specification, always masked off, and we don’t support the compressed instruction set, so a simple check is all we need for this.

The load and store misaligned interrupts are found by testing memory addresses depending on request size and type. RPU will raise exceptions for any load or store on a non-naturally aligned address.
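Both checks are cheap comparisons. A software model of the two (sizes in bytes; bit 0 of branch targets is always masked off per the spec):

```python
def load_store_misaligned(addr, size_bytes):
    """A load/store traps when the address isn't naturally aligned."""
    return addr % size_bytes != 0

def fetch_misaligned(target):
    """Without the C extension, a branch target with bit 1 set is misaligned."""
    return (target & 0b10) != 0
```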

Lastly, external interrupts have the signal lines directly routed outside of the CPU core, so the SoC implementation can handle those. In the ArtyS7-RPU-SoC, timer interrupts are implemented via a 12MHz clock timer and compare register manipulated via MMIO. We could also implement things like UART receive data waiting interrupts through this.

Execution Control

Now, we know what can trigger interrupts in the CPU, but we need to lay down exactly the steps and dataflow required both when we enter an interrupt handler, and exit from it. The control unit handles this.

As a reminder – here is how a traditional external interrupt was handled when ported simply through to RPU from my old TPU project. You can see the interrupt had to wait until a point in the pipeline which was suitable, which is okay in this instance. However, exceptions require significant changes to the control unit flow.

Interrupt entry / exit

On a decision being made to branch to the interrupt vector – the location of which is stored in a CSR – several other CSR contents need to be modified:

  1. The previous interrupt enable bit is set to the current interrupt enable value.
  2. The interrupt enable bit is set to 0.
  3. The previous privilege mode is set to the current privilege mode.
  4. The privilege mode is set to 0b11 (machine mode).
  5. The mcause CSR is set to the interrupt data value.
  6. The mepc CSR is set to the PC where the interrupt was encountered.
  7. The mtval CSR is set to the location of any exception-specific data, like the address of a misaligned load.

On exit from an interrupt via mret, the previous enable and privilege values are restored. These CSR manipulations occur internal to the CSR unit, using int_exit and int_entry signals provided to it by the control unit.
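The numbered steps and the mret restore can be modelled directly. Field names follow the privileged spec; this is a toy model, not the RPU CSR unit:

```python
class TrapCsrModel:
    """Toy model of the CSR updates on interrupt entry and mret exit."""
    def __init__(self):
        self.mie, self.mpie = 1, 0        # interrupt enable and its saved copy
        self.priv, self.mpp = 0b11, 0b11  # current and saved privilege mode
        self.mepc = self.mcause = self.mtval = 0

    def trap_entry(self, pc, cause, tval=0):
        self.mpie = self.mie              # 1. save current interrupt enable
        self.mie = 0                      # 2. disable interrupts
        self.mpp = self.priv              # 3. save current privilege
        self.priv = 0b11                  # 4. enter machine mode
        self.mcause = cause               # 5. record the cause
        self.mepc = pc                    # 6. PC to return to
        self.mtval = tval                 # 7. exception-specific data

    def mret(self):
        self.mie = self.mpie              # restore interrupt enable
        self.priv = self.mpp              # restore privilege
```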

The control unit

The previous TPU work implemented interrupts by checking for assertion of the interrupt signal at the end of the CPU pipeline, just before the fetch stage. This works fine for external interrupts, and it keeps complexity low, due to not having to pause the pipeline mid-execution. However, we now have different classes of interrupt which need mid-pipeline handling:

  • Decoder interrupts
  • CSR/ALU interrupts
  • Misalignment/Memory interrupts

For decoder interrupts, we can check for an INT signal from the decoder, and ensure any memory/register writes don’t occur a few cycles later. The misalignment interrupts can be triggered at fetch and memory pipeline stages and are more complex.

In the previous part of this series, where I added CPU trace support, I discussed some of the logic flow that a decoder interrupt takes. It followed from that discussion that different types of interrupt have different priorities, and a priority system is needed. The LINT was supposed to handle this priority system – and did in general – in that decoder exceptions are higher priority than external exceptions. However, the LINT has no concept of where execution is in the pipeline, and certain execution stages require that certain exceptions be handled immediately, regardless of how many cycles earlier another interrupt was requested. This required another rather clunky set of conditions to be provided to the LINT unit as a set of enable bits for the various types, some of which were only active during certain pipeline stages. My decision not to separate out Interrupts (like external timers) and Exceptions (CPU-internal faults that must be immediately handled) has bitten me here by requiring specific enable overrides, and some more workarounds.

Memory Misalignment Exceptions

I thought the memory misalignment exceptions would be a fairly simple addition, however they presented an interesting challenge, due to the fixed timing that is inherent within the memory/fetch parts of the core.

Discovering whether a misalignment exception should assert is fairly simple, we can have a process which checks addresses and asserts the relevant INT line with associated mcause value to indicate the type of exception:

  • load
  • store
  • branch

The LINT unit has a cycle of latency, and when it’s a memory exception we are talking about, that cycle of latency means it’s too late – the memory operation or fetch will have already been issued. The latency is acceptable for a decoder interrupt, because the writeback phase is still 2 cycles away and the interrupt will be handled by that point, avoiding register file side-effects.

The side effects of an invalid memory operation are hard to track, so instead we forward a hint to the control unit that a misalignment exception is likely to occur shortly. It’s rather clumsy, but this hint gets around the LINT latency and allows the control unit to stall any memory operations. This stall is just long enough for the operations to not issue, and the exception be successfully handled once the LINT responds.

These stalls are implemented by a cycle counter in the control unit, counting down a fixed number of stall cycles if a misalignment hint is seen. During each of these cycles, the interrupt signal is checked in case we need to jump to the handler. The control unit is definitely complicated significantly by these checks. There is a lot of copy and paste nonsense going on here.

Lastly, to further complicate things; my memory-mapped IO range of addresses (currently 0xF0000000 – 0xFFFFFFFF) has various addresses which I write from the bootloader firmware in an unaligned way. To fix this I’ve excluded this memory range from the misalignment checks. I’ll fix it another time.

So now we have this align_hint shortcut which bypasses the LINT and allows for correct handling of the various memory exceptions.

Further Bugs

An issue was discovered during simulation of the decoder interrupts, specifically for ecall and ebreak. The method for acknowledging interrupt requests was that the LINT would assert an ACK signal for the unit it selected, and that unit would then de-assert its INT signal. The problem is that I’d placed this de-assertion inside the rest of the decoder handling process, which was only active when the pipeline decoder enable was active. By the time the LINT wanted to acknowledge the decoder interrupt, the enable was no longer active. This resulted in an infinite loop of interrupt requests from the decoder, which was not much use!

Another bug was that I was latching the wrong PC value to the mepc register. This was not caught until I started debugging actual code, but would have shown up pretty obviously in the simulator. The fix was to not grab the current PC value for mepc but to latch the correct value at the time the interrupt was fired.

Lastly, as I was testing the riscv-compliance misalignment exception test, I realised an exception was being raised when it shouldn’t have been. It turns out I had missed a point in the ISA spec, whereby jump branch targets always have bit 0 masked off. An easy thing to fix, but annoying that I had this bug for so long.

But, RPU now passes risc-v compliance 🙂

Summing up

With the interrupt support now in the CPU design, I have realised just how many mistakes I have made in this area. Not separating out the different types – interrupts versus exceptions (which need to be handled immediately in the pipeline) – meant the LINT needed these ugly overriding hints in order to operate correctly. It all seems a bit messy.

However, it does work. I can fix its design later, now I have a working implementation to base changes off of.

Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.

Designing a RISC-V CPU in VHDL, Part 19: Adding Trace Dump Functionality

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

For those who follow me on twitter, you’ll have seen my recent tweets regarding Zephyr OS running on RPU. This was a huge amount of work to get running, most of it debugging on the FPGA itself. For those new to FPGA development, trying to debug on-chip can be a very difficult and frustrating experience. Generally, you want to debug in the simulator – but when potential issues are influenced by external devices such as SD cards, timer interrupts, and hundreds of millions of cycles into the boot process of an operating system – simulators may not be feasible.

Blog posts on the features I added to RPU to enable Zephyr booting, such as proper interrupts, exceptions and timers are coming – but it would not have been possible without a feature of the RPU SoC I have not yet discussed.

CPU Tracing

Most real processors will have hardware features built in, and one of the most useful low-level tools is tracing. This is when at an arbitrary time slice, low level details on the inner operation of the core are captured into some buffer, before being streamed elsewhere for analysis and state reconstruction later.

Note that this is a one-way flow of data. It is not interactive, like the debugging most developers know. It is mostly used for performance profiling but for RPU would be an ideal debugging aid.


For the avoidance of doubt: I’m defining “a trace” to be one block of valid data which is dumped to a host PC for analysis. For us, dumping means streaming the data out via UART to a development PC. Multiple traces can be taken, but when the data transfer is initiated, the data needs to be a real representation of what occurred immediately preceding the request to dump the trace. The data contained in a trace is always being captured on the device so that, if a request is made, the data is available.

These requirements call for a circular buffer which is continually recording the state. I’ll define exactly what the data is later – but for now, it is defined as 64 bits per cycle. Plenty for a significant amount of state to be recorded, which will be required in order to perform meaningful analysis. We have a good number of block RAMs on our Spartan 7-50 FPGA, so we can dedicate 32KB to this circular buffer quite easily. 64 bits per entry in 32KB gives us 4,096 cycles of data. Not that much, you’d think, for a CPU running at over 100MHz – but you’d be surprised how quickly RPU falls over when it gets into an invalid state!

It goes without saying that our implementation needs to be non-intrusive. I’m not currently using the UART connected to the FTDI USB controller, as our logging output is displayed graphically via a text-mode display over HDMI. We can use this without impacting existing code. Our CPU core will expose a debug trace bus signal, which will be the data captured.

We’ve mentioned the buffer will be in a block RAM; but one aspect of this is that we must be wary of the observer effect. This is very much an issue for performance profiling, as streaming out data from various devices usually goes through memory subsystems, which increases bandwidth requirements and leads to more latency in the memory operations you are trying to trace. Our trace system should not affect the execution characteristics of the core at all. As we are using a development PC to receive the streamed data, we can completely segregate all data paths for the trace system, and remove the block RAM from the memory-mapped area which is currently used for code and data. With this block RAM separate, we can ensure it’s set up as a true dual-port RAM with the native 64-bit data width. One port will be for writing data from the CPU, on the CPU clock domain. The second port will be used for reading the data out at a rate dictated by the UART serial baud – much, much slower. Doing this will ensure tracing will not impact execution of the core at any point, meaning our dumped data is much more valuable.

Lastly, we want to trigger these dumps at a point in time when we think an issue has occurred. Two immediate trigger types come to mind in addition to a manual button.

  1. Memory address
  2. Comparison with the data which is to be dumped; i.e, pipeline status flags combined with instruction types.


The implementation is very simple. I’ve added a debug signal output to the CPU core entity. It’s 64 bits of data consisting of 32 bits of status bits, and a 32-bit data value as defined below.

This data is always being output by the core, changing every cycle. The data value can be various things; the PC when in a STAGE_FETCH state, the ALU result, the value we’re writing to rD in WRITEBACK, or a memory location during a load/store.

We only need two new processes for the system:

  • trace_streamout: manages the streaming out of bytes from the trace block ram
  • trace_en_check: inspects trigger conditions in order to initiate a trace dump which trace_streamout will handle

The BRAM used as the circular trace buffer is configured as 64-bits word length, with 4096 addresses. It was created using the Block Memory Generator, and has a read latency of 2 cycles.

We will use a clock cycle counter which already exists to dictate write locations into the BRAM. As it’s used as a circular buffer, we simply take the lower 12 bits of the clock counter as the address into the BRAM.

Port A of the BRAM is the write port, with its address line tied to the bits noted above. It is enabled by a signal only when the trace_streamout process is idle. This is so that when we do stream out the data we want, it’s not polluted with new data while our slow streamout to UART is active. That new data is effectively lost. As this port captures the CPU core O_DBG output, it’s clocked at the CPU core clock.

Port B is the read port. It’s clocked using the 100MHz reference clock (which also drives the UART – albeit then subsampled via a baud tick). It’s enabled when a streamout state is requested, and reads an address dictated by the trace_streamout process.

The trace_streamout process, when the current streamout state is idle, checks for a dump_enable signal. Upon seeing it, the last write address is latched from the lower 12 bits of the cycle counter. We also set a streamout location to be that last write + 1. This location is what is fed into port B of the BRAM circular trace buffer. When we change the read address on port B, we wait some cycles for the value to properly propagate out. During this preload stall, we also wait for the UART TX to become ready for more data. The transmission is performed significantly slower than the clock that trace_streamout runs at, and we cannot write to the TX buffer if it’s full.

The UART I’m using is provided by Xilinx and has an internal 16-byte buffer. We wait for a ready signal as then we know that writing our 8 bytes of debug data (remember, 64-bit) quickly into the UART TX will succeed. In addition to the 8 bytes of data, I also send 2 bytes of magic number data at the start of every 64-bit packet as an aid to the receiving logic; we can check the first two bytes for these values to ensure we’re synced correctly in order to parse the data eventually.
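Each packet on the wire is therefore 10 bytes: 2 magic bytes plus the 8-byte debug word. A Python sketch of the framing and the receiver-side resync – the magic value here is a placeholder, as the post doesn’t give the real bytes:

```python
import struct

MAGIC = b"\xAA\x55"  # placeholder magic bytes, not the actual RPU value

def frame_packet(dbg64):
    """2 magic bytes + 8 bytes of the 64-bit debug word, big-endian."""
    return MAGIC + struct.pack(">Q", dbg64)

def parse_stream(data):
    """Slide forward until the magic matches, then pull out 64-bit trace words."""
    words, i = [], 0
    while i + 10 <= len(data):
        if data[i:i + 2] == MAGIC:
            words.append(struct.unpack(">Q", data[i + 2:i + 10])[0])
            i += 10
        else:
            i += 1  # not synced yet: advance one byte and try again
    return words
```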

After the last byte is written, we increment our streamout location address. If it’s not equal to the last write address we latched previously, we move to the preload stall and move the next 8 bytes of trace data out. Otherwise, we are finished transmitting the entire trace buffer, so we set our state back to idle and re-enable new trace data writes.
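The resulting read order can be modelled as a wrap-around walk of the 4,096-entry buffer, starting one past the latched last-write address so the oldest entry comes out first and the newest last:

```python
BUF_ENTRIES = 4096  # 32KB / 8 bytes per 64-bit entry

def streamout_addresses(last_write):
    """Port B addresses for a full dump: oldest entry first, newest last."""
    return [(last_write + 1 + i) % BUF_ENTRIES for i in range(BUF_ENTRIES)]
```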

Triggering streamout

Triggering a dump using dump_enable can be done in a variety of ways. I have a physical push-button on my Arty S7 board set to always enable a dump, which is useful for finding where execution currently is in a program. I also have a trigger on reading a certain memory address. This is good if there is an issue triggering an error which you can reliably track to a branch of code execution. Having a memory address in that code branch used as a trigger will dump the cycles leading up to that branch being taken. There is one other type of trigger – relying on the CPU O_DBG signal itself, for example triggering a dump when we encounter a decoder interrupt for an invalid instruction.

I hard-code these triggers in the VHDL currently, but it’s feasible that these can be configurable programmatically. The dump itself could also be triggered via a write to a specific MMIO location.

Parsing the data on the Debug PC

The UART TX on the FPGA is connected to the FTDI USB-UART bridge, which means when the FPGA design is active and the board is connected via USB, we can just open the COM port exposed via the USB device.

I made a simple C# command line utility which just dumps the packets in a readable form. It looks like this:

[22:54:19.6133781]Trace Packet, 00000054,  0xC3 40 ,   OPCODE_BRANCH ,     STAGE_FETCH , 0x000008EC INT_EN , :
[22:54:19.6143787]Trace Packet, 00000055,  0xD1 40 ,   OPCODE_BRANCH ,    STAGE_DECODE , 0x04C12083 INT_EN , :
[22:54:19.6153795]Trace Packet, 00000056,  0xE1 40 ,     OPCODE_LOAD ,       STAGE_ALU , 0x00000001 INT_EN , :
[22:54:19.6163794]Trace Packet, 00000057,  0xF1 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6183798]Trace Packet, 00000058,  0x01 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6183798]Trace Packet, 00000059,  0x11 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6193799]Trace Packet, 00000060,  0x20 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6203802]Trace Packet, 00000061,  0x31 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000062,  0x43 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000063,  0x51 C0 ,     OPCODE_LOAD , STAGE_WRITEBACK , 0x00001CDC REG_WR  INT_EN , :

You can see some data given by the utility such as timestamps and a packet ID. Everything else is derived from flags in the trace data for that cycle.

Later I added some additional functionality, like parsing register destinations and outputting known register/memory values to aid in going over the output.

[22:54:19.6213808]Trace Packet, 00000062,  0x43 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000063,  0x51 C0 ,     OPCODE_LOAD , STAGE_WRITEBACK , 0x00001CDC REG_WR  INT_EN , :
  MEMORY 0x0000476C = 0x00001CDC
  REGISTER ra = 0x00001CDC

I have also been working on a Rust-based GUI debugger for these trace files, where you can inspect known memory (usually the stack) and register file contents at a given packet by walking the packets up to the point you’re interested in. It was an excuse to get to know Rust a bit more, but it’s not completely functional yet and I use the command-line C# version more.

The easiest use for this is the physical button for dumping the traces. When bringing up some new software on the SoC, it rarely works first time and usually ends up in an infinite loop of some sort. Using the STAGE_FETCH packets, which contain the PC, I can look at an objdump and see immediately where we are executing, without impacting the execution of the code itself.

Using the data to debug issues

Now I’ll spoil a bit of the upcoming RPU Interrupts/Zephyr post with an example of how these traces have helped me – I think an example of a real problem the trace dumps helped solve is required.

After implementing external timer interrupts, invalid instruction interrupts and system calls – and fixing a ton of issues – I had the Zephyr Dining Philosophers sample running on RPU in all its threaded, synchronized glory.

Why do I need invalid instruction interrupts? Because RPU does not implement the RISC-V M extension, so multiply and divide hardware does not exist. Sadly, somewhere in the Zephyr build system, there is assembly with mul and div instructions. I needed invalid instruction interrupts in order to trap into an exception handler which could software-emulate the instruction and write the result back into the context, so that when we returned from the interrupt to PC+4 the new value for the destination register would be in place.

It’s pretty funny to think that, for me, implementing that was easier than trying to fix a build system to compile for the intended architecture.

Anyway, I was performing long-running tests of dining philosophers when I hit the fatal error exception handler: we were trying to emulate an instruction the handler didn’t understand. I was able to replicate it, but it could take hours of running before it happened. The biggest issue? The instruction we were trying to emulate was at PC 0x00000010 – the start of the exception handler!

So, I set up the CPU trace trigger to activate on the instruction that branches to print that “FATAL: Reg is bad” message, started the FPGA running, and left the C# app to capture any trace dumps. After a few hours the issue occurred, and we had our CPU trace of the 4096 cycles leading up to the fatal error. Some hundreds of cycles before the dump initiated, we have the following output.

Packet 00,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN   , :
Packet 01,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  EXT_INT   , :
Packet 02,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  LINT_INT  EXT_INT  EXT_INT_ACK   , :
Packet 03,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  LINT_INT  EXT_INT_ACK   , :
Packet 04,   STAGE_DECODE , 0x02A5D5B3 REG_WR  INT_EN  LINT_INT  EXT_INT_ACK   , :
Packet 09,    STAGE_STALL , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 10,    STAGE_FETCH , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 11,    STAGE_FETCH , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 12,    STAGE_FETCH , 0x00000010 DECODE_INT  IS_ILLEGAL_INST , :
Packet 17,      STAGE_ALU , 0x00000002 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 25,    STAGE_FETCH , 0x00000078, :

What on earth is happening here? This is a lesson as to why interrupts have priorities 🙂

Packet 00,    FETCH, 0x00000328 REG_WR 
Packet 01,    FETCH, 0x00000328 REG_WR                     EXT_INT 
Packet 02,    FETCH, 0x00000328 REG_WR           LINT_INT  EXT_INT  EXT_INT_ACK 
Packet 03,    FETCH, 0x00000328 REG_WR           LINT_INT           EXT_INT_ACK 
Packet 05,      ALU, 0x0000000B REG_WR           LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 06,WRITEBACK, 0xCDC1FEF1                  LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 07,    STALL, 0x0000032C        LINT_RST  LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 08,    STALL, 0x0000032C        LINT_RST  LINT_INT                        DECODE_INT                  IS_ILLEGAL_INST
Packet 09,    STALL, 0x00000010                  LINT_INT                        DECODE_INT                  IS_ILLEGAL_INST
Packet 12,    FETCH, 0x00000010                                                  DECODE_INT                  IS_ILLEGAL_INST
Packet 13,    FETCH, 0x00000010                  LINT_INT                        DECODE_INT  DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 14,    FETCH, 0x00000010                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST 
Packet 16,   DECODE, 0xFB010113                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 17,      ALU, 0x00000002                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 18,WRITEBACK, 0x00004F60 REG_WR           LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 19,    STALL, 0x00000014 REG_WR LINT_RST  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 20,    STALL, 0x00000014 REG_WR LINT_RST  LINT_INT                                                    IS_ILLEGAL_INST
Packet 21,    STALL, 0xABCDEF01 REG_WR           LINT_INT                                                    IS_ILLEGAL_INST
Packet 24,    FETCH, 0xABCDEF01 REG_WR                                                                       IS_ILLEGAL_INST
Packet 25,    FETCH, 0x00000078

I’ve tried to reduce the trace down to a minimum and lay it out so it makes sense. There are a few things you need to know about the RPU exception system which have yet to be discussed:

Each core has a Local Interrupt Controller (LINT) which can accept interrupts at any stage of execution, provide the ACK signal to let the requester know the interrupt has been accepted, and then, at a safe point, pass it on to the Control Unit to initiate transfer of execution to the exception vector. This transfer can only happen after a writeback, hence the STALL stages as it’s set up before fetching the first instruction of the exception vector at 0x00000010. If the LINT sees external interrupt requests (EXT_INT – timer interrupts) at the same time as decoder interrupts for an invalid instruction, it will always choose the decoder above anything else – as that needs to be handled immediately.
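That priority rule is the crux of the bug below, so here it is as a trivial sketch (signal names taken from the trace output; this models the selection logic, not the actual VHDL):

```python
def lint_select(pending: set):
    """Pick the highest-priority pending interrupt source.
    Decoder (illegal instruction) interrupts always beat external/timer
    interrupts -- they need to be handled immediately."""
    for source in ("DECODE_INT", "EXT_INT"):
        if source in pending:
            return source
    return None
```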

And here is what happens above:

  1. We are fetching PC 0x00000328, which happens to be an unsupported instruction which will be emulated by our invalid instruction handler.
  2. As we are fetching, an external timer interrupt fires (Packet 01).
  3. The LINT acknowledges the external interrupt, as there is no higher-priority request pending, and signals to the control unit that an interrupt is pending via LINT_INT (Packet 02).
  4. As we wait for the WRITEBACK phase, when the control unit can transfer to the exception vector, PC 0x00000328 decodes as an illegal instruction and DECODE_INT is requested (Packet 05).
  5. The LINT cannot acknowledge the decoder interrupt, as the control unit can only handle a single interrupt at a time, and it’s waiting to handle the external interrupt.
  6. The control unit accepts the external LINT_INT, stalls for transfer to the exception vector, and resets the LINT so it can accept new requests (Packet 07).
  7. We start fetching the interrupt vector at 0x00000010 (Packet 12).
  8. The LINT sees the DECODE_INT and immediately accepts and acknowledges it (Packet 13).
  9. The control unit accepts the LINT_INT and stalls for transfer to the exception vector, with the PC of the exception being set to 0x00000010 (Packet 20).
  10. Everything breaks: the PC gets set to a value in flux, which just so happened to be in the exception vector (Packet 25).

In short, if an external interrupt fires during the fetch stage of an illegal instruction, the illegal instruction will not be handled correctly and state is corrupted.

This was easily fixed with some further enable logic, so that external interrupts are only accepted after the fetch and decode stages. But it was one hell of an issue to find without the CPU trace dumps!
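The fix, sketched: gate external interrupt acceptance on the pipeline stage, so an in-flight instruction can never be preempted between its fetch and its decoder interrupt being raised. (Stage names as in the trace output; this is a model of the gating logic, not the actual VHDL.)

```python
def lint_accepts(source: str, stage: str) -> bool:
    """Decoder interrupts are always accepted; external interrupts are
    held off until the current instruction has cleared fetch and decode."""
    if source == "DECODE_INT":
        return True
    return stage not in ("STAGE_FETCH", "STAGE_DECODE")
```

With this in place, the EXT_INT at Packet 01 above would simply stay pending until the illegal instruction had been dealt with.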

Finishing up

So, as you can see, trace dumps are a great feature to have in RPU. A very simple implementation can yield enough information to work on problems where the simulator just is not viable. With different trigger options, and the ability to customize the O_DBG signal to further narrow down issues under investigation, it’s invaluable. In fact, I’ll probably end up putting this system into any similarly complex FPGA project in the future. The HDL will shortly be submitted to the SoC github repo along with the updated core which supports interrupts.

Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.

The posts about how the interrupts were fully integrated into RPU are on their way!

Designing a RISC-V CPU in VHDL, Part 18: Control and Status Register Unit

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

You may remember that after the switch from my own TPU ISA to RISC-V was made, I stated that interrupts were disabled. This was due to requirements for RISC-V style interrupt mechanisms not being compatible with what I’d done on TPU. In order to get interrupts back into RPU, we need to go and implement the Control and Status Registers (CSRs) required for management of interrupts – enable bits, interrupt vectors, interrupt causes – they are all communicated via CSRs. As I want my core to execute 3rd-party code, this needs to be to spec!

In the time it’s taken me to start writing this article after doing the implementation, the latest draft RISC-V spec has moved the CSR instructions and definitions into their own Zicsr ISA extension. The extension defines 6 additional instructions for manipulating and querying the contents of these special registers, registers which define things such as privilege levels, interrupt causes, and timers. Timers are a bit of an oddity, which we’ll get back to later.

The CSRs within a RISC-V hardware thread (known as a hart) can influence the execution pipeline at many stages. If interrupts are enabled via enable bits contained in CSRs, additional checks are required after instruction writeback/retire in order to follow any pending interrupt or trap. One very important fact to take away from this is that the current values of particular CSRs are required to be known in many locations at multiple points in the instruction pipeline. This is a very real issue, complicated by differing privilege levels having their own local versions of these registers, so throughout designing the implementation of the RPU CSR unit, I wanted a very simple solution which worked well on my unpipelined core – not necessarily the best or most efficient design. This unit has been rather difficult to create, and the solution may be far from ideal. Of all the units I’ve designed for RPU, this is the one which I am continually questioning and looking for redesign possibilities.

The CSRs are accessed via a flat address space of 4096 entries. On our RV32I machine, the entries will be 32 bits wide. Great; let’s use a block RAM, I hear you scream. Well, it’s not that simple. For RPU, the number of CSRs we require is significantly smaller than the 4096 entries – and implementing all of them is not required. Whilst I’m sure some people will want 16KB of super fast storage for some evil code optimizations, we are very far from looking at optimizing RPU for speed – so, the initial implementation will look as follows:

  • The CSR unit will take operations as input, with data and CSR address.
  • A signal exists internally for each CSR we support.
  • The CSR unit will output current values of internal CSRs as required for CPU execution at the current privilege level.
  • The CSR unit will be multi-cycle in nature.
  • The CSR unit can raise exceptions for illegal access.

So, our unit will have signals for use when executing user CSR instructions, signals for notifying of events which auto-update registers, output signals for various CSR values needed mid-pipeline for normal CPU operation, and finally – some signals for raising exceptions – which will be needed later – when we check for situations like write operations to read-only CSR addresses.

The operations and CSR address will be provided by the decode stage of the pipeline, which takes the 6 different CSR manipulation instructions and encodes them into specific CSR unit control operations.

You can see from the spec that the CSR operations are listed as atomic. This does not really affect RPU with the current in-order pipeline, but may do in the future. The CSRRS and CSRRC (Atomic Read and [Set/Clear] Bits in CSR) instructions are read-modify-write, which instantly makes our CSR unit capable of multi-cycle operations. This will be the first pipeline stage that can take more than one cycle to execute, so it is a bit of a milestone in complexity. Whilst there are only six instructions, specific values of rd and rs1 can expand them to both read and write in the same operation.
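The read-modify-write semantics come straight from the spec: rd always receives the old CSR value, and the new CSR value depends on the operation. A model (not the RPU implementation itself), for the three register-operand forms on a 32-bit machine:

```python
def csr_rmw(op: str, csr_old: int, rs1_val: int):
    """Return (value written to rd, new CSR value) for one CSR instruction."""
    if op == "CSRRW":                 # write: CSR <- rs1
        csr_new = rs1_val
    elif op == "CSRRS":               # set bits: CSR <- CSR | rs1
        csr_new = csr_old | rs1_val
    elif op == "CSRRC":               # clear bits: CSR <- CSR & ~rs1
        csr_new = csr_old & ~rs1_val & 0xFFFFFFFF
    else:
        raise ValueError(op)
    return csr_old, csr_new           # rd gets the old value in every case
```

Per the spec, CSRRS/CSRRC with rs1=x0 perform no write at all – which is exactly how a pure CSR read is encoded, and why those specific register values matter to the unit.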

All of the CSR instructions fall under the wide SYSTEM instruction type, so we add another set of checks in our decoder, and output translated CSR unit opcodes. The CSR Opcodes are generally just the funct3 instruction bits, with additional read/write flags. There are no real changes to the ALU – our CSR unit takes over any work for these operations.

The RISC-V privileged specification is where the real CSR listings exist, and the descriptions of their use. A few CSRs hold basic, read-only machine information such as vendor, architecture, implementation and hardware thread (hart) IDs. These CSRs are generally accessed only via the csr instructions, however some exist that need to be accessed at points in the pipeline. The Machine Status Register (mstatus) contains privilege mode and interrupt enable flags which the control unit of the CPU needs constant access to in order to make progress through the instruction pipe.

The CSR Unit

With the input and outputs requirements for the unit pretty clear, we can get started defining the basis for how our unit will be implemented. For RPU, we’ll have an internal signal for each relevant CSR, or manually hard-wire to values as required. Individual process blocks will be used for the various tasks required:

  • A “main” process block which handles any requested CSR operation.
  • A series of “update” processes which write automatically-updated CSRs, such as the instruction retired and cycle counters.
  • A “protection” process which will flag unauthorized access to CSR addresses.

At this point, we have not discussed unauthorized access to CSRs. The interrupt mechanism is not in place to handle them at this point in the blog, however, we will still check for them. Handily, the CSR address actually encodes read/write/privilege access requirements, defined by the following table:

So detecting and raising access exception signals when the operation is at odds with the current privilege level or address type is very simple. If it’s a write operation and address bits [11:10] == ’11’, then it’s an invalid access. The same goes for the privilege checks against the address bits in [9:8].
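That check is mechanical enough to write down directly. Per the privileged spec, address bits [11:10] encode read/write accessibility (’11’ means read-only) and bits [9:8] encode the lowest privilege level that may access the register. A sketch, with privilege encoded as 0 = user through 3 = machine:

```python
def csr_access_fault(addr: int, is_write: bool, priv: int) -> bool:
    """True if this CSR access should raise an illegal-access exception."""
    read_only = ((addr >> 10) & 0b11) == 0b11  # bits [11:10] == '11'
    min_priv = (addr >> 8) & 0b11              # bits [9:8]: minimum privilege
    return (is_write and read_only) or (priv < min_priv)
```

For example, cycle (0xC00) is read-only and user-accessible, while mcycle (0xB00) is writable but machine-only.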

The “update” processes are, again, fairly simple.

The main process is defined as the following state machine.

As you can see, reads are quicker, as they do not go into the modify/write phase. A full read-modify-write CSR operation takes 3 cycles, with the result of the CSR read available 1 cycle later – the same latency as our current ALU operations. As the current latency of our pipeline means it’s over 2 cycles before the next CSR read could possibly occur, we can have our unit work concurrently and not take up any more than the normal single cycle of our pipeline.
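The timing of the state machine reduces to a simple rule, which can be stated as a one-line model (this captures the latencies described above, not the actual VHDL process):

```python
def csr_op_cycles(writes_csr: bool) -> int:
    """Cycles from operation issue until the CSR unit is idle again.
    A plain read finishes after the 1-cycle READ phase; anything that
    changes the CSR continues through MODIFY and WRITE for 2 more."""
    return 1 if not writes_csr else 3
```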

The diagram above shows the tightest possible pipeline timing for a CSR ReadModifyWrite, followed by a CSR read – and in reality the FETCH stage takes multiple cycles due to our memory latency, so currently it’s very safe to have the CSR unit run operations alongside the pipeline. As a quick aside, interrupts from here would only be raised on the first cycle when the CSR Operation was first seen, so they would be handled at the correct stages and not leak into subsequent instructions due to the unit running past the current set of pipeline stages.

A run in our simulation shows the expected behavior, with the write to CSR mtvec present after the 3 cycle write latency.

And the set bit operations also work:

Branching towards confusion

An issue arose after implementing the CSR unit into the existing pipeline, one which caused erroneous branching when a CSR instruction immediately followed a branch. It took a while to track down, although in hindsight, with a better simulator testbench program, it would have been obvious – as I will show:

Above is a shot of the simulator waveform, centered on a section of code with a branch instruction targeting a CSR instruction. Instead of the instruction following the cycle CSR read being the cycleh read, there is a second execution of the cycle CSR read, and then a branch to 0x00000038. The last signal in the waveform – shouldBranch – is the key here. The shouldBranch signal is controlled by the ALU, and in the first implementation of the CSR unit, the ALU was disabled completely if a CSR instruction was found by the decoder. This meant the shouldBranch signal was not reset after the previous branch (0x0480006f – j 7c) executed. What really needs to happen is for the ALU to remain active, but understand that it has no side effects when a CSR operation is in progress. Making this change results in the correct execution of the instruction sequence:


I’ve not really explained all of the CSR issues present in RISC-V during this part of the series. There is a whole section in the spec on field specifications, whereby bits within CSRs can behave as though writes have no external side effects – for instance, bits can be hard-wired to 0 on read. This is so that code which modifies fields that are currently reserved in an implementation will still run in future implementations without unexpected consequences. Thankfully, while I did not go into the specifics here, my implementation allows for these behaviours to be implemented to spec. It’s just a matter of time.

This turned into quite a long one to write, on what I thought would be a fairly simple addition to the CPU. It was anything but simple! You can see the current VHDL code for the unit over on github.

Exceptions/Interrupts are coming next time, now that the relevant status registers are implemented. We’ll also touch on those pesky timer ‘CSRs’.

Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.