Designing a RISC-V CPU in VHDL, Part 20: Interrupts and Exceptions

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Interrupts and exceptions are important events that any CPU needs to handle. The usual definition is that interrupts happen outside of the CPU – timer events, for example – whereas exceptions occur within the CPU, like trying to execute an invalid instruction. These events are handled all the time within a system, and whilst some signify faults and error conditions, most are just part of normal system functionality.

I mentioned earlier in the series how my previous CPU (TPU) had interrupt handling. However, the way it was implemented needed significant modification to work in a RISC-V environment. RPU now supports many more types of exception/interrupt, and as such is more complex.

Before we go further, note that in the RPU code I use the term interrupt to refer to both interrupts and exceptions. Unless I explicitly mention exceptions, assume I mean both types.

The Local Interrupt Unit

RPU will implement timer interrupts as external, similar to how TPU did it. It will also support invalid instruction, system call, breakpoint, invalid CSR access (and ALU) and misaligned jump/memory exceptions. These generally fit into 4 categories:

  • Decoder Exceptions
  • CSR/ALU unit exceptions
  • Memory Exceptions, and
  • External interrupts

There are more subcategories to these, defined by an additional 32-bit data value describing the cause further, but these 4 categories fit nicely into 4 interrupt lines. The CPU can only handle one at a time, so with this in mind I created a Local Interrupt unit, the LINT unit, which takes all the various interrupt request and associated data lines and decides which one actually makes its way into the control unit for handling. Internally, it is implemented as a simple conditional check of the different input categories, forwarding the winning request's data to the control unit and waiting for an acknowledge/reset signal before moving on to the next interrupt, if multiple were requested at once. The LINT also forwards the acknowledge/reset back to the original requesting unit.
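
To give a flavour of what the LINT boils down to, here is a minimal sketch of the priority pick – the signal names below are illustrative placeholders rather than the actual RPU ones, and the real unit carries more data and acknowledge plumbing.

-- Hypothetical sketch of a LINT-style priority pick; names are illustrative only.
-- Assumes int_active is an internal signal mirrored to O_int elsewhere.
process(I_clk)
begin
  if rising_edge(I_clk) then
    if I_reset = '1' then
      int_active <= '0';
    elsif int_active = '0' then
      -- Nothing pending towards the control unit: pick the highest-priority requester.
      if I_int_decoder = '1' then                -- decoder exceptions first
        int_active     <= '1';
        O_intData      <= I_intData_decoder;
        O_ack_decoder  <= '1';
      elsif I_int_csr = '1' then                 -- CSR/ALU exceptions
        int_active     <= '1';
        O_intData      <= I_intData_csr;
        O_ack_csr      <= '1';
      elsif I_int_mem = '1' then                 -- memory/misalignment exceptions
        int_active     <= '1';
        O_intData      <= I_intData_mem;
        O_ack_mem      <= '1';
      elsif I_int_external = '1' then            -- external interrupts last
        int_active     <= '1';
        O_intData      <= I_intData_external;
        O_ack_external <= '1';
      end if;
    elsif I_ctrl_ack = '1' then
      -- Control unit has taken the interrupt; clear and wait for the next request.
      int_active     <= '0';
      O_ack_decoder  <= '0';
      O_ack_csr      <= '0';
      O_ack_mem      <= '0';
      O_ack_external <= '0';
    end if;
  end if;
end process;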

With this unit complete, we can add O_int and O_intData outputs, as well as an acknowledge input for reset, to our decoder unit. The decoder will raise an exception and set the intData output to the cause value defined by the RISC-V standard, which lets any interrupt handler know which kind of request – invalid instruction, ecall/system call, breakpoint – caused the exception.
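
For illustration, the decoder side could drive those outputs with something like the sketch below. The names are assumptions rather than RPU's exact identifiers; the cause values are the standard RISC-V ones (2 = illegal instruction, 3 = breakpoint, 11 = environment call from M-mode).

-- Sketch only: how a decoder might drive the new outputs; not RPU's exact code.
constant MCAUSE_ILLEGAL_INST : std_logic_vector(31 downto 0) := X"00000002";
constant MCAUSE_BREAKPOINT   : std_logic_vector(31 downto 0) := X"00000003";
constant MCAUSE_ECALL_M      : std_logic_vector(31 downto 0) := X"0000000B";

-- Inside the decode process, when the pipeline decode enable is active:
case opcode is
  when OPCODE_SYSTEM =>
    if funct12 = X"000" then          -- ecall
      O_int     <= '1';
      O_intData <= MCAUSE_ECALL_M;
    elsif funct12 = X"001" then       -- ebreak
      O_int     <= '1';
      O_intData <= MCAUSE_BREAKPOINT;
    end if;
  when OPCODE_OP | OPCODE_OPIMM | OPCODE_LOAD | OPCODE_STORE | OPCODE_BRANCH =>
    null;                             -- normal decode continues elsewhere
  when others =>
    O_int     <= '1';                 -- nothing matched: illegal instruction
    O_intData <= MCAUSE_ILLEGAL_INST;
end case;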

The CSR unit from the previous part already has a facility to raise an exception – it can check the CSR op and address to ensure the operation is valid. For instance, attempting to write a read-only CSR would raise an access exception. Whilst this is all implemented and connected up, the compliance suite of tests does not test access exceptions, so it's not extensively tested. We will need to revisit that once RPU is extended to fully support the different runtime privilege levels.

Memory exceptions from misaligned instruction fetch are found by testing branch targets for having bit 1 set. Bit 0 is always cleared per the specification, and we don't support the compressed instruction set, so a simple check is all we need for this.

The load and store misaligned interrupts are found by testing memory addresses depending on request size and type. RPU will raise exceptions for any load or store on a non-naturally aligned address.

Lastly, external interrupts have the signal lines directly routed outside of the CPU core, so the SoC implementation can handle those. In the ArtyS7-RPU-SoC, timer interrupts are implemented via a 12MHz clock timer and compare register manipulated via MMIO. We could also implement things like UART receive data waiting interrupts through this.

Execution Control

Now, we know what can trigger interrupts in the CPU, but we need to lay down exactly the steps and dataflow required both when we enter an interrupt handler, and exit from it. The control unit handles this.

As a reminder – here is how a traditional external interrupt was handled when ported simply through to RPU from my old TPU project. You can see the interrupt had to wait until a point in the pipeline which was suitable, which is okay in this instance. However, exceptions require significant changes to the control unit flow.

Interrupt entry / exit

On a decision being made to branch to the interrupt vector – the location of which is stored in a CSR – several other CSR contents need to be modified:

  1. The previous interrupt enable bit is set to the current interrupt enable value.
  2. The interrupt enable bit is set to 0.
  3. The previous privilege mode is set to the current privilege mode.
  4. The privilege mode is set to 11 (machine mode).
  5. The mcause CSR is set to the interrupt data value.
  6. The mepc CSR is set to the PC where the interrupt was encountered.
  7. The mtval CSR is set to exception-specific data, like the address of a misaligned load.

On exit from an interrupt via mret, the previous enable and privilege values are restored. These CSR manipulations occur internally to the CSR unit, using int_exit and int_entry signals provided to it by the control unit.
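
Inside the CSR unit, that entry/exit handling can be a small clocked block along these lines – a sketch only, with assumed signal names (the mstatus fields are the standard MIE/MPIE/MPP bits):

-- Illustrative sketch of interrupt entry/exit CSR updates; signal names assumed.
process(I_clk)
begin
  if rising_edge(I_clk) then
    if I_int_entry = '1' then
      mstatus_mpie <= mstatus_mie;     -- 1. save current enable
      mstatus_mie  <= '0';             -- 2. disable interrupts
      mstatus_mpp  <= current_priv;    -- 3. save current privilege
      current_priv <= "11";            -- 4. enter machine mode
      csr_mcause   <= I_intData;       -- 5. record the cause
      csr_mepc     <= I_intPC;         -- 6. record the interrupted PC
      csr_mtval    <= I_intVal;        -- 7. record exception-specific data
    elsif I_int_exit = '1' then
      mstatus_mie  <= mstatus_mpie;    -- restore enable on mret
      current_priv <= mstatus_mpp;     -- restore privilege on mret
    end if;
  end if;
end process;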

The control unit

The previous TPU work implemented interrupts by checking for assertion of the interrupt signal at the end of the CPU pipeline, just before the fetch stage. This works fine for external interrupts, and it keeps complexity low, due to not having to pause the pipeline mid-execution. However, we now have different classes of interrupt which need mid-pipeline handling:

  • Decoder interrupts
  • CSR/ALU interrupts
  • Misalignment/Memory interrupts

For decoder interrupts, we can check for an INT signal from the decoder, and ensure any memory/register writes don’t occur a few cycles later. The misalignment interrupts can be triggered at fetch and memory pipeline stages and are more complex.

In the previous part of this series, where I added CPU trace support, I discussed some of the logic flow that a decoder interrupt takes. It follows that different types of interrupt have different priorities, and a priority system is needed. The LINT was supposed to handle this priority system – and in general it did – in that decoder exceptions are higher priority than external interrupts. However, the LINT has no concept of where execution is in the pipeline, and certain execution stages require certain exceptions to be handled immediately, regardless of how many cycles earlier another interrupt was requested. This required another rather clunky set of conditions be provided to the LINT unit, as a set of enable bits for the various types; some enables are only active during certain pipeline stages. My decision not to separate Interrupts (like external timers) from Exceptions (CPU-internal faults that must be handled immediately) has bitten me here, requiring specific enable overrides and some more workarounds.

Memory Misalignment Exceptions

I thought the memory misalignment exceptions would be a fairly simple addition; however, they presented an interesting challenge due to the fixed timing inherent in the memory/fetch parts of the core.

Discovering whether a misalignment exception should assert is fairly simple: we can have a process which checks addresses and asserts the relevant INT line with an associated mcause value to indicate the type of exception (a sketch of such a process follows the list below):

  • load
  • store
  • branch
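
A minimal version of that check, assuming a 32-bit data bus and placeholder names for the request signals, might look like this. The mcause values are the standard ones: 0 for instruction address misaligned, 4 for load, 6 for store.

-- Sketch of a misalignment check; signal names and size constants are assumed.
process(I_memAddress, I_memSize, I_memWrite, I_branchTarget, I_branchTaken)
begin
  O_int_misalign     <= '0';
  O_intData_misalign <= (others => '0');

  -- Loads/stores must be naturally aligned: half-words on 2-byte, words on 4-byte.
  if (I_memSize = SIZE_WORD and I_memAddress(1 downto 0) /= "00")
     or (I_memSize = SIZE_HALF and I_memAddress(0) /= '0') then
    O_int_misalign <= '1';
    if I_memWrite = '1' then
      O_intData_misalign <= X"00000006";  -- store address misaligned
    else
      O_intData_misalign <= X"00000004";  -- load address misaligned
    end if;
  end if;

  -- Branch/jump targets: without the C extension, bit 1 must be clear.
  if I_branchTaken = '1' and I_branchTarget(1) = '1' then
    O_int_misalign     <= '1';
    O_intData_misalign <= X"00000000";    -- instruction address misaligned
  end if;
end process;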

The LINT unit has a cycle of latency, and for a memory exception that cycle of latency means it's too late – the memory operation or fetch will already have been issued. The latency is acceptable for a decoder interrupt, because the writeback phase is still 2 cycles away and the interrupt will be handled by that point, avoiding register file side effects.

The side effects of an invalid memory operation are hard to track, so instead we forward a hint to the control unit that a misalignment exception is likely to occur shortly. It’s rather clumsy, but this hint gets around the LINT latency and allows the control unit to stall any memory operations. This stall is just long enough for the operations to not issue, and the exception be successfully handled once the LINT responds.

These stalls are implemented by a cycle counter in the control unit, counting down a fixed number of stall cycles when a misalignment hint is seen. During each of these cycles, the interrupt signal is checked in case we need to jump to the handler. The control unit is significantly complicated by these checks, and there is a lot of copy-and-paste nonsense going on here.
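
As a rough sketch of that idea, with assumed signal names and an arbitrary stall length:

-- Sketch only: countdown stall on a misalignment hint; names and lengths assumed.
process(I_clk)
begin
  if rising_edge(I_clk) then
    if I_align_hint = '1' and stall_counter = 0 then
      stall_counter <= 4;               -- hold off issue for a few cycles
    elsif stall_counter /= 0 then
      stall_counter <= stall_counter - 1;
      O_mem_enable  <= '0';             -- do not issue the questionable operation
      if I_lint_int = '1' then
        state <= STATE_INT_ENTRY;       -- the exception arrived: go handle it
      end if;
    end if;
  end if;
end process;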

Lastly, to further complicate things; my memory-mapped IO range of addresses (currently 0xF0000000 – 0xFFFFFFFF) has various addresses which I write from the bootloader firmware in an unaligned way. To fix this I’ve excluded this memory range from the misalignment checks. I’ll fix it another time.

So now we have this align_hint shortcut which bypasses the LINT and allows for correct handling of the various memory exceptions.

Further Bugs

An issue was discovered during simulation of the decoder interrupts, specifically for ecall and ebreak. The method for acknowledging interrupt requests was that the LINT would assert an ACK signal for the unit it selected, and that unit would then de-assert its INT signal. The problem is that I'd placed this de-assertion inside the rest of the decoder handling process, which was only active when the pipeline decoder enable was active. By the time the LINT wanted to acknowledge the decoder interrupt, the enable was no longer active. This resulted in an infinite loop of interrupt requests from the decoder, which was not much use!

Another bug was that I was latching the wrong PC value into the mepc register. This was not caught until I started debugging actual code, but it would have shown up pretty obviously in the simulator. The fix was to not grab the current PC value for mepc, but to latch the correct value at the time the interrupt was fired.

Lastly, as I was testing the riscv-compliance misalignment exception test, I realised an exception was being raised when it shouldn't have been. It turns out I had missed a point in the ISA spec, whereby jump/branch targets always have bit 0 masked off. An easy thing to fix, but annoying that I had this bug for so long.

But, RPU now passes risc-v compliance 🙂

Summing up

With interrupt support now in the CPU design, I have realised just how many mistakes I have made in this area. Not separating the different types of interrupts from exceptions (which need to be handled immediately in the pipeline) meant the LINT needed these ugly overriding hints for the control unit in order to operate correctly. It all seems a bit messy.

However, it does work. I can fix its design later, now I have a working implementation to base changes off of.

Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.

Designing a RISC-V CPU in VHDL, Part 19: Adding Trace Dump Functionality

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

For those who follow me on twitter, you’ll have seen my recent tweets regarding Zephyr OS running on RPU. This was a huge amount of work to get running, most of it debugging on the FPGA itself. For those new to FPGA development, trying to debug on-chip can be a very difficult and frustrating experience. Generally, you want to debug in the simulator – but when potential issues are influenced by external devices such as SD cards, timer interrupts, and hundreds of millions of cycles into the boot process of an operating system – simulators may not be feasible.

Blog posts on the features I added to RPU to enable Zephyr booting, such as proper interrupts, exceptions and timers are coming – but it would not have been possible without a feature of the RPU SoC I have not yet discussed.

CPU Tracing

Most real processors will have hardware features built in, and one of the most useful low-level tools is tracing. This is when at an arbitrary time slice, low level details on the inner operation of the core are captured into some buffer, before being streamed elsewhere for analysis and state reconstruction later.

Note that this is a one-way flow of data. It is not interactive, like the debugging most developers know. It is mostly used for performance profiling but for RPU would be an ideal debugging aid.

Requirements

For the avoidance of doubt, I’m defining “A Trace” to be one block of valid data which is dumped to a host PC for analysis. For us, dumping will mean streaming the data out via UART to a development PC. Multiple traces can be taken, but when the data transfer is initiated, the data needs to be a real representation of what occurred immediately preceding the request to dump the trace. The data contained in a trace is always being captured on the device, so that if a request is made, the data is available.

These requirements call for a circular buffer which is continually recording state. I’ll define exactly what the data is later – but for now, the data is defined as 64 bits per cycle; plenty for a significant amount of state to be recorded, which will be required in order to perform meaningful analysis. We have a good amount of block RAMs on our Spartan 7-50 FPGA, so we can dedicate 32KB to this circular buffer quite easily. 64 bits per entry in 32KB gives us 4,096 cycles of data. Not much, you’d think, for a CPU running at over 100MHz, but you’d be surprised how quickly RPU falls over when it gets into an invalid state!

It goes without saying that our implementation needs to be non-intrusive. I’m not currently using the UART connected to the FTDI USB controller, as our logging output is displayed graphically via a text-mode display over HDMI. We can use this without impacting existing code. Our CPU core will expose a debug trace bus signal, which will be the data captured.

We’ve mentioned the buffer will be in a block RAM; but one aspect of this is that we must be wary of the observer effect. This is very much an issue for performance profiling, as streaming out data from various devices usually goes through memory subsystems, which increases bandwidth requirements and leads to more latency in the memory operations you are trying to trace. Our trace system should not affect the execution characteristics of the core at all. As we are using a development PC to receive the streamed data, we can completely segregate all data paths for the trace system, and remove the block RAM from the memory-mapped area which is currently used for code and data. With this block RAM separate, we can ensure it’s set up as a true dual-port RAM with a native 64-bit data width. One port will be for writing data from the CPU, on the CPU clock domain. The second port will be used for reading the data out at a rate dictated by the UART serial baud – much, much slower. Doing this ensures tracing will not impact execution of the core at any point, meaning our dumped data is much more valuable.

Lastly, we want to trigger these dumps at a point in time when we think an issue has occurred. Two immediate trigger types come to mind in addition to a manual button.

  1. Memory address
  2. Comparison with the data which is to be dumped; i.e, pipeline status flags combined with instruction types.

Implementation

The implementation is very simple. I’ve added a debug signal output to the CPU core entity. It’s 64 bits of data consisting of 32 bits of status bits, and a 32-bit data value as defined below.

This data is always being output by the core, changing every cycle. The data value can be various things; the PC when in a STAGE_FETCH state, the ALU result, the value we’re writing to rD in WRITEBACK, or a memory location during a load/store.
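
As an illustration of how such a bus can be assembled, the sketch below concatenates some status flags with the stage-dependent data value. The flag layout and signal names here are assumptions, not RPU's actual encoding.

-- Sketch: assembling a 64-bit debug bus; the flag layout is an assumption.
-- Assumes dbg_status and dbg_data are 32-bit signals in the core architecture.
O_DBG <= dbg_status & dbg_data;

dbg_status(31 downto 9) <= (others => '0');
dbg_status(8 downto 0)  <= is_illegal_inst & decode_int_ack & decode_int &
                           ext_int_ack & ext_int & lint_int & lint_rst &
                           int_en & reg_wr;

with state select
  dbg_data <= PC             when STAGE_FETCH,
              instruction    when STAGE_DECODE,
              alu_result     when STAGE_ALU,
              mem_address    when STAGE_MEMORY,
              writeback_data when STAGE_WRITEBACK,
              PC             when others;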

We only need two new processes for the system:

  • trace_streamout: manages the streaming out of bytes from the trace block ram
  • trace_en_check: inspects trigger conditions in order to initiate a trace dump which trace_streamout will handle

The BRAM used as the circular trace buffer is configured as 64-bits word length, with 4096 addresses. It was created using the Block Memory Generator, and has a read latency of 2 cycles.

We will use a clock cycle counter which already exists for dictating write locations into the BRAM. As it’s used as a circular buffer, we simply take the lower 12 bits of the clock counter as address into the BRAM.

Port A of the BRAM is the write port, with its address line tied to the bits noted above. It is enabled by a signal only when the trace_streamout process is idle, so that when we do stream out the data we want, it's not polluted with new data while our slow streamout to UART is active. That new data is effectively lost. As this port captures the CPU core O_DBG output, it's clocked at the CPU core clock.
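
Wiring the write port up is then just a matter of connecting the cycle counter and the debug bus – something like the sketch below, with assumed signal names for the generated BRAM instance.

-- Sketch: trace buffer write port fed by the cycle counter; names assumed.
trace_bram_addra <= std_logic_vector(cycle_counter(11 downto 0));   -- circular addressing
trace_bram_dina  <= cpu_dbg;                                        -- 64-bit O_DBG bus
trace_bram_wea   <= "1" when streamout_state = STATE_IDLE else "0"; -- freeze during dump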

Port B is the read port. It’s clocked using the 100MHz reference clock (which also drives the UART – albeit then subsampled via a baud tick). It’s enabled when a streamout state is requested, and reads an address dictated by the trace_streamout process.

The trace_streamout process, when the current streamout state is idle, checks for a dump_enable signal. Upon seeing this signal, the last write address is latched from the lower 12 bits of the cycle counter. We also set a streamout location to be that last write address + 1. This location is what is fed into Port B of the BRAM/circular trace buffer. When we change the read address on Port B, we wait some cycles for the value to properly propagate out. During this preload stall, we also wait for the UART TX to become ready for more data. The transmission is performed significantly slower than the clock that trace_streamout runs at, and we cannot write to the TX buffer if it's full.

The UART I’m using is provided by Xilinx and has an internal 16-byte buffer. We wait for a ready signal as then we know that writing our 8 bytes of debug data (remember, 64-bit) quickly into the UART TX will succeed. In addition to the 8 bytes of data, I also send 2 bytes of magic number data at the start of every 64-bit packet as an aid to the receiving logic; we can check the first two bytes for these values to ensure we’re synced correctly in order to parse the data eventually.

After the last byte is written, we increment our streamout location address. If it's not equal to the last write address we latched previously, we move to the preload stall and shift the next 8 bytes of trace data out. Otherwise, we have finished transmitting the entire trace buffer, so we set our state back to idle and re-enable new trace data writes.
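
Putting those steps together, the streamout state machine can be sketched roughly as below. The state names, the two magic byte values and the UART interface signals are all assumptions here, not the actual RPU or Xilinx identifiers.

-- Rough sketch of trace_streamout; state, magic and UART signal names are assumed.
-- Assumes readout_addr/last_write_addr are unsigned(11 downto 0), shift_reg is
-- std_logic_vector(63 downto 0), and byte_index/preload_wait are integers.
-- trace_bram_addrb is driven from readout_addr outside this process.
process(I_clk_100)
begin
  if rising_edge(I_clk_100) then
    uart_tx_we <= '0';                         -- default: no byte written this cycle
    case streamout_state is
      when STATE_IDLE =>
        if dump_enable = '1' then
          last_write_addr <= unsigned(cycle_counter(11 downto 0));
          readout_addr    <= unsigned(cycle_counter(11 downto 0)) + 1; -- oldest entry
          preload_wait    <= 3;
          streamout_state <= STATE_PRELOAD;
        end if;

      when STATE_PRELOAD =>                    -- BRAM read latency + UART ready
        if preload_wait /= 0 then
          preload_wait <= preload_wait - 1;
        elsif uart_tx_ready = '1' then
          shift_reg       <= trace_bram_doutb; -- latch the 64-bit entry
          byte_index      <= 0;
          streamout_state <= STATE_SEND;
        end if;

      when STATE_SEND =>                       -- 2 magic bytes, then 8 data bytes
        uart_tx_we <= '1';
        if byte_index = 0 then
          uart_tx_data <= X"AA";               -- assumed magic byte 0
        elsif byte_index = 1 then
          uart_tx_data <= X"55";               -- assumed magic byte 1
        else
          uart_tx_data <= shift_reg(63 downto 56);
          shift_reg    <= shift_reg(55 downto 0) & X"00";
        end if;
        if byte_index = 9 then
          readout_addr <= readout_addr + 1;
          if readout_addr + 1 = last_write_addr then
            streamout_state <= STATE_IDLE;     -- wrapped all the way round: done
          else
            preload_wait    <= 3;
            streamout_state <= STATE_PRELOAD;
          end if;
        else
          byte_index <= byte_index + 1;
        end if;

      when others =>
        streamout_state <= STATE_IDLE;
    end case;
  end if;
end process;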

Triggering streamout

Triggering a dump using dump_enable can be done in a variety of ways. I have a physical push-button on my Arty S7 board set to always enable a dump, which is useful for finding where execution currently is in a program. I also have a trigger on reading a certain memory address. This is good if there is an issue triggering an error which you can reliably track to a branch of code execution: having a memory address in that code branch used as the trigger will dump the cycles leading up to that branch being taken. There is one other type of trigger – relying on the cpu O_DBG signal itself, for example triggering a dump when we encounter a decoder interrupt for an invalid instruction.

I hard-code these triggers in the VHDL currently, but it’s feasible that these can be configurable programmatically. The dump itself could also be triggered via a write to a specific MMIO location.
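
For illustration, a trace_en_check process along these lines is all that is needed – the button, address and flag-bit choices below are placeholders for whatever happens to be hard-coded at the time.

-- Sketch of trace_en_check; trigger conditions shown are placeholders.
process(I_clk)
begin
  if rising_edge(I_clk) then
    dump_enable <= '0';
    if I_button = '1' then
      dump_enable <= '1';                            -- manual push-button trigger
    elsif mem_read_strobe = '1' and mem_address = X"00001F00" then
      dump_enable <= '1';                            -- hypothetical watched address
    elsif cpu_dbg(40) = '1' then
      dump_enable <= '1';                            -- e.g. an assumed illegal-instruction flag bit
    end if;
  end if;
end process;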

Parsing the data on the Debug PC

The UART TX on the FPGA is connected to the FTDI USB-UART bridge, which means when the FPGA design is active and the board is connected via USB, we can just open the COM port exposed via the USB device.

I made a simple C# command line utility which just dumps the packets in a readable form. It looks like this:

[22:54:19.6133781]Trace Packet, 00000054,  0xC3 40 ,   OPCODE_BRANCH ,     STAGE_FETCH , 0x000008EC INT_EN , :
[22:54:19.6143787]Trace Packet, 00000055,  0xD1 40 ,   OPCODE_BRANCH ,    STAGE_DECODE , 0x04C12083 INT_EN , :
[22:54:19.6153795]Trace Packet, 00000056,  0xE1 40 ,     OPCODE_LOAD ,       STAGE_ALU , 0x00000001 INT_EN , :
[22:54:19.6163794]Trace Packet, 00000057,  0xF1 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6183798]Trace Packet, 00000058,  0x01 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6183798]Trace Packet, 00000059,  0x11 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6193799]Trace Packet, 00000060,  0x20 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6203802]Trace Packet, 00000061,  0x31 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000062,  0x43 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000063,  0x51 C0 ,     OPCODE_LOAD , STAGE_WRITEBACK , 0x00001CDC REG_WR  INT_EN , :

You can see some data given by the utility such as timestamps and a packet ID. Everything else is derived from flags in the trace data for that cycle.

Later I added some additional functionality, like parsing register destinations and outputting known register/memory values to aid when going over the output.

[22:54:19.6213808]Trace Packet, 00000062,  0x43 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000063,  0x51 C0 ,     OPCODE_LOAD , STAGE_WRITEBACK , 0x00001CDC REG_WR  INT_EN , :
  MEMORY 0x0000476C = 0x00001CDC
  REGISTER ra = 0x00001CDC

I have also been working on a rust-based GUI debugger for these trace files, where you can look at known memory (usually the stack) and register file contents at a given packet by walking the packets up until the point you’re interested in. It was an excuse to get to know Rust a bit more, but it’s not completely functional and I use the command line C# version more.

The easiest use for this is the physical button for dumping traces. When bringing up new software on the SoC, it rarely works first time and ends up in an infinite loop of some sort. Using the STAGE_FETCH packets, which contain the PC, I can look at an objdump and see immediately where we are executing, without impacting the execution of the code itself.

Using the data to debug issues

Now to spoil a bit of the upcoming RPU Interrupts/Zephyr post: I think an example of a real problem the trace dumps helped solve is required.

After implementing external timer interrupts, invalid instruction interrupts, system calls – and fixing a ton of issues – I had the Zephyr Dining Philosophers sample running on RPU in all its threaded, synchronized glory.

Why do I need invalid instruction interrupts? Because RPU does not implement the M RISC-V extension. So multiply and divide hardware does not exist. Sadly, somewhere in the Zephyr build system, there is assembly with mul and div instructions. I needed invalid instruction interrupts in order to trap into an exception handler which could software emulate the instruction, write the result back into the context, so that when we returned from the interrupt to PC+4 the new value for the destination register would be written back.

It’s pretty funny to think that for me, implementing that was easier than trying to fix a build system to compile for the architecture intended.

Anyway, I was performing long-running tests of dining philosophers, when I hit the fatal error exception handler for trying to emulate an instruction it didn’t understand. I was able to replicate it, but it could take hours of running before it happened. The biggest issue? The instruction we were trying to emulate was at PC 0x00000010 – the start of the exception handler!

So, I set up the CPU trace trigger to activate on the instruction that branches to print that “FATAL: Reg is bad” message, started the FPGA running, and left the C# app to capture any trace dumps. After a few hours the issue occurred, and we had our CPU trace of the 4096 cycles leading up to the fatal error. Some hundreds of cycles before the dump initiated, we have the following output.

Packet 00,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN   , :
Packet 01,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  EXT_INT   , :
Packet 02,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  LINT_INT  EXT_INT  EXT_INT_ACK   , :
Packet 03,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  LINT_INT  EXT_INT_ACK   , :
Packet 04,   STAGE_DECODE , 0x02A5D5B3 REG_WR  INT_EN  LINT_INT  EXT_INT_ACK   , :
Packet 05,      STAGE_ALU , 0x0000000B REG_WR  INT_EN  LINT_INT  EXT_INT_ACK  DECODE_INT  IS_ILLEGAL_INST , :
Packet 06,STAGE_WRITEBACK , 0xCDC1FEF1 INT_EN  LINT_INT  EXT_INT_ACK  DECODE_INT  IS_ILLEGAL_INST , :
Packet 07,    STAGE_STALL , 0x0000032C INT_EN  LINT_RST  LINT_INT  EXT_INT_ACK  DECODE_INT  IS_ILLEGAL_INST , :
Packet 08,    STAGE_STALL , 0x0000032C INT_EN  LINT_RST  LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 09,    STAGE_STALL , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 10,    STAGE_FETCH , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 11,    STAGE_FETCH , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 12,    STAGE_FETCH , 0x00000010 DECODE_INT  IS_ILLEGAL_INST , :
Packet 13,    STAGE_FETCH , 0x00000010 INT_EN  LINT_INT  DECODE_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 14,    STAGE_FETCH , 0x00000010 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 15,    STAGE_FETCH , 0x00000010 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 16,   STAGE_DECODE , 0xFB010113 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 17,      STAGE_ALU , 0x00000002 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 18,STAGE_WRITEBACK , 0x00004F60 REG_WR  INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 19,    STAGE_STALL , 0x00000014 REG_WR  INT_EN  LINT_RST  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 20,    STAGE_STALL , 0x00000014 REG_WR  INT_EN  LINT_RST  LINT_INT  IS_ILLEGAL_INST , :
Packet 21,    STAGE_STALL , 0xABCDEF01 REG_WR  INT_EN  LINT_INT  IS_ILLEGAL_INST , :
Packet 22,    STAGE_FETCH , 0xABCDEF01 REG_WR  INT_EN  LINT_INT  IS_ILLEGAL_INST , :
Packet 23,    STAGE_FETCH , 0xABCDEF01 REG_WR  INT_EN  LINT_INT  IS_ILLEGAL_INST , :
Packet 24,    STAGE_FETCH , 0xABCDEF01 REG_WR  IS_ILLEGAL_INST , :
Packet 25,    STAGE_FETCH , 0x00000078, :

What on earth is happening here? This is a lesson as to why interrupts have priorities 🙂

Packet 00,    FETCH, 0x00000328 REG_WR 
Packet 01,    FETCH, 0x00000328 REG_WR                     EXT_INT 
Packet 02,    FETCH, 0x00000328 REG_WR           LINT_INT  EXT_INT  EXT_INT_ACK 
Packet 03,    FETCH, 0x00000328 REG_WR           LINT_INT           EXT_INT_ACK 
Packet 05,      ALU, 0x0000000B REG_WR           LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 06,WRITEBACK, 0xCDC1FEF1                  LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 07,    STALL, 0x0000032C        LINT_RST  LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 08,    STALL, 0x0000032C        LINT_RST  LINT_INT                        DECODE_INT                  IS_ILLEGAL_INST
Packet 09,    STALL, 0x00000010                  LINT_INT                        DECODE_INT                  IS_ILLEGAL_INST
Packet 12,    FETCH, 0x00000010                                                  DECODE_INT                  IS_ILLEGAL_INST
Packet 13,    FETCH, 0x00000010                  LINT_INT                        DECODE_INT  DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 14,    FETCH, 0x00000010                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST 
Packet 16,   DECODE, 0xFB010113                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 17,      ALU, 0x00000002                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 18,WRITEBACK, 0x00004F60 REG_WR           LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 19,    STALL, 0x00000014 REG_WR LINT_RST  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 20,    STALL, 0x00000014 REG_WR LINT_RST  LINT_INT                                                    IS_ILLEGAL_INST
Packet 21,    STALL, 0xABCDEF01 REG_WR           LINT_INT                                                    IS_ILLEGAL_INST
Packet 24,    FETCH, 0xABCDEF01 REG_WR                                                                       IS_ILLEGAL_INST
Packet 25,    FETCH, 0x00000078

I’ve tried to reduce the trace down to minimum and lay it out so it makes sense. There are a few things you need to know about the RPU exception system which have yet to be discussed:

Each core has a Local Interrupt Controller (LINT) which can accept interrupts at any stage of execution, provide the ACK signal to let the requester know it's been accepted, and then at a safe point pass it on to the Control Unit to initiate transfer of execution to the exception vector. This transfer can only happen after a writeback, hence the STALL stages as it's set up before fetching the first instruction of the exception vector at 0x00000010. If the LINT sees an external interrupt request (EXT_INT – timer interrupts) at the same time as a decoder interrupt for an invalid instruction, it will always choose the decoder above anything else – as that needs to be handled immediately.

And here is what happens above:

  1. We are fetching PC 0x00000328, which happens to be an unsupported instruction which will be emulated by our invalid instruction handler.
  2. As we are fetching, an external timer interrupt fires (Packet 01).
  3. The LINT acknowledges the external interrupt, as there is no higher priority request pending, and signals to the control unit that an interrupt is pending via LINT_INT (Packet 02).
  4. As we wait for the WRITEBACK phase for the control unit to transfer to the exception vector, PC 0x00000328 decodes as an illegal instruction and DECODE_INT is requested (Packet 05).
  5. The LINT cannot acknowledge the decoder int, as the control unit can only handle a single interrupt at a time, and it's waiting to handle the external interrupt.
  6. The control unit accepts the external LINT_INT, stalls for transfer to the exception vector, and resets the LINT so it can accept new requests (Packet 07).
  7. We start fetching the interrupt vector 0x00000010 (Packet 12).
  8. The LINT sees the DECODE_INT and immediately accepts and acknowledges it.
  9. The control unit accepts the LINT_INT and stalls for transfer to the exception vector, with the PC of the exception being set to 0x00000010 (Packet 20).
  10. Everything breaks: the PC gets set to a value in flux, which just so happened to be in the exception vector (Packet 25).

In short, if an external interrupt fires during the fetch stage of an illegal instruction, the illegal instruction will not be handled correctly and state is corrupted.

Easily fixed with some further enable logic for external interrupts, so they are only accepted after fetch and decode. But one hell of an issue to find without the CPU trace dumps!

Finishing up

So, as you can see, trace dumps are a great feature to have in RPU. A very simple implementation can yield enough information to work on problems where the simulator just is not viable. With different trigger options, and the ability to customize the O_DBG signal to further narrow down issues under investigation, it's invaluable. In fact, I'll probably end up putting this system into any similarly complex FPGA project in the future. The HDL will shortly be submitted to the SoC github repo along with the updated core which supports interrupts.

Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.

The posts about how the interrupts were fully integrated into RPU are on their way!

Designing a RISC-V CPU in VHDL, Part 18: Control and Status Register Unit

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

You may remember that after the switch from my own TPU ISA to RISC-V was made, I stated that interrupts were disabled. This was due to the requirements of RISC-V style interrupt mechanisms not being compatible with what I'd done on TPU. In order to get interrupts back into RPU, we need to go and implement the correct Control and Status Registers (CSRs) required for management of interrupts – enable bits, interrupt vectors, interrupt causes – they are all communicated via CSRs. As I want my core to execute 3rd-party code, this needs to be to spec!

In the time it's taken me to start writing this article after doing the implementation, the latest draft RISC-V spec has moved the CSR instructions and definitions into their own Zicsr ISA extension. The extension defines 6 additional instructions for manipulating and querying the contents of these special registers – registers which define things such as privilege levels, interrupt causes, and timers. Timers are a bit of an oddity, which we'll get back to later.

The CSRs within a RISC-V hardware thread (known as a hart) can influence the execution pipeline at many stages. If interrupts are enabled via enable bits contained in CSRs, additional checks are required after instruction writeback/retire in order to follow any pending interrupt or trap. One very important fact to take away from this is that the current values of particular CSRs are required to be known in many locations at multiple points in the instruction pipeline. This is a very real issue, complicated by differing privilege levels having their own local versions of these registers, so throughout designing the implementation of the RPU CSR unit, I wanted a very simple solution which worked well on my unpipelined core – not necessarily the best or most efficient design. This unit has been rather difficult to create, and the solution may be far from ideal. Of all the units I’ve designed for RPU, this is the one which I am continually questioning and looking for redesign possibilities.

The CSRs are accessed via a flat address space of 4096 entries. On our RV32I machine, the entries will be 32 bits wide. Great; let's use a block RAM, I hear you scream. Well, it's not that simple. For RPU, the number of CSRs we require is significantly less than the 4096 entries – and implementing all of them is not required. Whilst I'm sure some people would want 16KB of super-fast storage for some evil code optimizations, we are very far from looking at optimizing RPU for speed – so the initial implementation will look as follows:

  • The CSR unit will take operations as input, with data and CSR address.
  • A signal exists internally for each CSR we support.
  • The CSR unit will output current values of internal CSRs as required for CPU execution at the current privilege level.
  • The CSR unit will be multi-cycle in nature.
  • The CSR unit can raise exceptions for illegal access.

So, our unit will have signals for use when executing user CSR instructions, signals for notifying of events which auto-update registers, output signals for various CSR values needed mid-pipeline for normal CPU operation, and finally – some signals for raising exceptions – which will be needed later – when we check for situations like write operations to read-only CSR addresses.

The operations and CSR address will be provided by the decode stage of the pipeline, and take the 6 different CSR manipulation instructions and encode that into specific CSR unit control operations.

You can see from the spec that the CSR operations are listed as atomic. This does not really affect RPU with the current in-order pipeline, but may do in the future. The CSRRS and CSRRC (Atomic Read and [Set/Clear] Bits in CSR) instructions are read-modify-write, which instantly makes our CSR unit capable of multi-cycle operations. This will be the first pipeline stage that can take more than one cycle to execute, so it is a bit of a milestone in complexity. Whilst there are only six instructions, specific values of rd and rs1 can expand them to both read and write in the same operation.

All of the CSR instructions fall under the wide SYSTEM instruction type, so we add another set of checks in our decoder, and output translated CSR unit opcodes. The CSR Opcodes are generally just the funct3 instruction bits, with additional read/write flags. There are no real changes to the ALU – our CSR unit takes over any work for these operations.
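
As a sketch of that decode step (the constants and flag wiring are assumptions, not the exact RPU encodings): funct3 distinguishes CSRRW/CSRRS/CSRRC and their immediate forms, and rd = x0 or rs1 = x0 indicate that the read or write half can be skipped.

-- Sketch of SYSTEM-opcode CSR decode, inside the decoder process.
-- Assumes rd = instruction(11 downto 7) and rs1 = instruction(19 downto 15).
if opcode = OPCODE_SYSTEM and funct3 /= "000" then
  O_csr_op   <= funct3;                      -- CSRRW/S/C and immediate variants
  O_csr_addr <= instruction(31 downto 20);   -- 12-bit CSR address
  -- CSRRW with rd = x0 need not read; CSRRS/CSRRC with rs1 = x0 need not write.
  if funct3(1 downto 0) = "01" then          -- CSRRW / CSRRWI
    O_csr_write <= '1';
    if rd = "00000" then O_csr_read <= '0'; else O_csr_read <= '1'; end if;
  else                                       -- CSRRS/C and immediate forms
    O_csr_read <= '1';
    if rs1 = "00000" then O_csr_write <= '0'; else O_csr_write <= '1'; end if;
  end if;
end if;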

The RISC-V privileged specification is where the real CSR listings exist, along with descriptions of their use. A few CSRs hold basic, read-only machine information such as vendor, architecture, implementation and hardware thread (hart) IDs. These CSRs are generally accessed only via the csr instructions, however some exist that need to be accessed at points in the pipeline. The Machine Status Register (mstatus) contains privilege mode and interrupt enable flags which the control unit of the CPU needs constant access to in order to make progress through the instruction pipe.

The CSR Unit

With the input and outputs requirements for the unit pretty clear, we can get started defining the basis for how our unit will be implemented. For RPU, we’ll have an internal signal for each relevant CSR, or manually hard-wire to values as required. Individual process blocks will be used for the various tasks required:

  • A “main” process block which handles any requested CSR operation.
  • A series of “update” processes which write automatically-updated CSRs, such as the instruction retired and cycle counters.
  • A “protection” process which will flag unauthorized access to CSR addresses.

At this point, we have not discussed unauthorized access to CSRs. The interrupt mechanism is not in place to handle them at this point in the blog, however, we will still check for them. Handily, the CSR address actually encodes read/write/privilege access requirements, defined by the following table:

So detecting and raising access exception signals when the operation is at odds with the current privilege level or address type is very simple. If it's a write operation, and address bits [11:10] == '11', then it's an invalid access. The same goes for the privilege checks against address bits [9:8].
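
A sketch of that protection check, with assumed signal names (and numeric_std for the privilege comparison), can be as small as this:

-- Sketch of the CSR access protection check; signal names are assumed.
process(I_csr_en, I_csr_write, I_csr_addr, current_priv)
begin
  O_csr_access_fault <= '0';
  if I_csr_en = '1' then
    -- Address bits [11:10] = "11" marks a read-only CSR: writes are illegal.
    if I_csr_write = '1' and I_csr_addr(11 downto 10) = "11" then
      O_csr_access_fault <= '1';
    end if;
    -- Address bits [9:8] encode the minimum privilege level required.
    if unsigned(current_priv) < unsigned(I_csr_addr(9 downto 8)) then
      O_csr_access_fault <= '1';
    end if;
  end if;
end process;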

The “update” processes are, again, fairly simple.

The main process is defined as the following state machine.

As you can see, reads are quicker as they do not go into the modify/write phase. A full read-modify-write CSR operation will take 3 cycles. The result of a CSR read will be available 1 cycle later – the same latency as our current ALU operations. As the current latency of our pipeline means it's over 2 cycles before the next CSR read could possibly occur, we can have our unit work concurrently and not take up any more than the normal single cycle of our pipeline.
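
A minimal sketch of that main state machine, with placeholder state and signal names and only a couple of CSRs shown, is below.

-- Sketch of the CSR unit main state machine; names and encodings assumed.
-- csr_selected is assumed to be a read mux over the internal CSR signals.
process(I_clk)
begin
  if rising_edge(I_clk) then
    case csr_state is
      when CSR_IDLE =>
        if I_csr_en = '1' then
          csr_read_value <= csr_selected;        -- read result, available next cycle
          if I_csr_write = '1' then
            csr_state <= CSR_MODIFY;
          end if;
        end if;

      when CSR_MODIFY =>                          -- apply write/set/clear to the read value
        case I_csr_op is
          when CSR_OP_RW => csr_new_value <= I_csr_wdata;
          when CSR_OP_RS => csr_new_value <= csr_read_value or I_csr_wdata;
          when CSR_OP_RC => csr_new_value <= csr_read_value and not I_csr_wdata;
          when others    => csr_new_value <= csr_read_value;
        end case;
        csr_state <= CSR_WRITE;

      when CSR_WRITE =>                           -- commit to the addressed CSR signal
        case I_csr_addr is
          when CSR_ADDR_MTVEC    => csr_mtvec    <= csr_new_value;
          when CSR_ADDR_MSCRATCH => csr_mscratch <= csr_new_value;
          when others            => null;         -- read-only or unimplemented
        end case;
        csr_state <= CSR_IDLE;

      when others =>
        csr_state <= CSR_IDLE;
    end case;
  end if;
end process;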

The diagram above shows the tightest possible pipeline timing for a CSR ReadModifyWrite, followed by a CSR read – and in reality the FETCH stage takes multiple cycles due to our memory latency, so currently it’s very safe to have the CSR unit run operations alongside the pipeline. As a quick aside, interrupts from here would only be raised on the first cycle when the CSR Operation was first seen, so they would be handled at the correct stages and not leak into subsequent instructions due to the unit running past the current set of pipeline stages.

A run in our simulation shows the expected behavior, with the write to CSR mtvec present after the 3 cycle write latency.

And the set bit operations also work:

Branching towards confusion

An issue arose after implementing the CSR unit into the existing pipeline, one which caused erroneous branching and forward progress when a CSR instruction immediately followed a branch. It took a while to track this down, although in hindsight with a better simulator testbench program it would have been obvious – which I will show:

Above is a shot of the simulator waveform, centered on a section of code with a branch instruction targeted at a CSR instruction. Instead of the instruction following the cycle CSR read being the cycleh read, there is a second execution of the cycle CSR read, and then a branch to 0x00000038. The last signal in the waveform – shouldBranch – is the key here. The shouldBranch signal is controlled by the ALU, and in the first implementation of the CSR unit the ALU was disabled completely if a CSR instruction was found by the decoder. This meant the shouldBranch signal was not reset after the previous branch (0x0480006f – j 7c) executed. What really needs to happen is for the ALU to remain active, but understand that it has no side effects when a CSR operation is in place. Making this change results in the following – correct – execution of the instruction sequence.

Conclusion

I've not really explained all of the CSR issues present in RISC-V during this part of the series. There is a whole section in the spec on Field Specifications, whereby fields within CSRs can behave as though writes have no external side effects, or be hard-wired to 0 on read. This is so that code touching fields which are currently reserved in an implementation will be able to run in future without unexpected consequences. Thankfully, while I did not go into the specifics here, my implementation allows for these behaviours to be implemented to spec. It's just a matter of time.

This turned into quite a long one to write, on what I thought would be a fairly simple addition to the CPU. It was anything but simple! You can see the current VHDL code for the unit over on github.

Exceptions/Interrupts are coming next time, now that the relevant status registers are implemented. We’ll also touch on those pesky timer ‘CSRs’.

Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.

Designing a RISC-V CPU in VHDL, Part 17: DDR3 Memory Controller, Clock domain crossing

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

In the last part we got to the point where RISC-V code, built with GCC, could run and display text over HDMI and blink LEDs. However, this could only run from the 192KB of Block RAM we initialized within the Spartan7 FPGA on our Digilent Arty S7-50 board. Whilst 192KB is a nice amount of on-FPGA fast storage, we have a 256MByte DDR3 chip sitting next to the FPGA which is crying out for use. This post follows on from part 16, integrates a DDR3 memory controller provided by Xilinx and, using an SD card Pmod adapter, loads code from the SD card into that memory for use.

The DDR3 memory chip on the ArtyS7 seems to vary from board to board, but the timing specifications should be compatible for all. For clarity, the chip on my development board is a PMF511816EBR-KADN.

It connects with address and data signals direct to the FPGA. There are many other control signals, but we are not going to get into how DDR3 memory works in this post, as Xilinx Vivado comes with a wizard for generating memory interfaces. The Memory Interface Generator (MIG) seems to be an older utility and I had a few problems with it. However, in the end, I did get a working interface generated.

Memory Interface Generator

Digilent's documentation for the ArtyS7 board includes various files for importing into MIG to assist with generation of the memory interface. In MIG itself, you will not see an option for importing settings explicitly – you need to select that you are verifying an existing design. Before the following section walking through the wizard, ensure your Vivado project is located on a local drive – I generally worked on a mapped network drive, and the build will fail to implement the controller at synthesis time if the project is on a network drive. Another item to note – you must go through the IP Catalog. When I tried to initiate the MIG through the Block Design view that I used for generating BRAM in the previous article, VHDL solutions could not be generated.

We go through the wizard, using the Digilent files to configure the memory controller. Generally it is a case of hitting next through each screen, except for a few pages at the start where you need to input the file paths and validate the pins – the validation should succeed with a few warnings.

The next screen was quite something, with the wizard saying the browse buttons for file paths were not supported on Windows 10.

Validating the pins above will only generate a few warnings. Click next through the rest of the wizard and generate the design.

When the implementation of the memory controller has completed, which took minutes on a Ryzen 2700X+SSD, we can see it in the IP pane. At this point, you can open an example project by right clicking the IP in the hierarchy.

The example project can run in the simulator very slowly (initialization took many seconds in the waveform before real simulation begins). You can see the VHDL component definition to use for the controller, and I copied that into the RPU project.

The generated design uses the User Interface(UI) implementation to the memory controller, which preserves things like request ordering. There is a Xilinx document UG586 which details the signals used in this interface, including example waveforms for various read and write request states.

The UI to the memory controller suits the use cases that I want: request ordering is preserved, burst reads/writes, and writes with byte enables. I originally wanted to use the Native interface – which you can read about in document UG586 if you want to know more – but what I wanted was very basic, and the Native interface is significantly more involved to use, as things like out-of-order request completion can occur.

In addition to the main data and control buses you'd expect, there are a few other signals which need to be connected in VHDL for the memory controller to operate. On the ArtyS7, we need to feed it both 100 and 200MHz clocks. Additionally, we need to ensure the controller is not used after a reset until the initialization complete signal is asserted from the memory controller internals. One other major item of importance – the controller must be driven off a ¼ memory interface clock, which for us is 81.25MHz. This is a good bit down on our current 200MHz CPU clock, but for now I just ran everything at this clock frequency, which is provided by the controller itself as an output.

The documentation from Xilinx shows how read and write transactions operate. The interface to the controller supports multiple back to back reads and writes for highest performance, but we only want basic single requests.

Writes follow this pattern:

Reads follow this:

We can implement these easily in our VHDL using the already existing MEM_proc memory request process. We will just add some new states to implement the various stages of a new state machine, and we should be good. The writes can be implemented in one of three ways, as explained in the above waveform diagrams. We are using method 1: asserting the *wdf* signals for the write data at the same time as initiating the command using app_cmd.
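
The added states inside MEM_proc end up looking something like the sketch below. The state names and data staging signals are assumptions; the app_* ports are the UI signals described in UG586.

-- Sketch of added MEM_proc states for a single UI write (method 1) and a read.
case state is
  -- ... existing MEM_proc states ...
  when MEM_DDR3_WRITE =>
    app_en       <= '1';
    app_cmd      <= "000";                    -- UI write command
    app_addr     <= ddr_addr;                 -- 16-bit addressed: byte address >> 1
    app_wdf_wren <= '1';
    app_wdf_end  <= '1';
    app_wdf_data <= write_data_128;           -- swizzled 128-bit write data
    app_wdf_mask <= write_byte_mask;          -- a '1' bit masks that byte off
    if app_rdy = '1' and app_wdf_rdy = '1' then
      app_en       <= '0';                    -- command and data both accepted
      app_wdf_wren <= '0';
      state        <= MEM_COMPLETE;
    end if;

  when MEM_DDR3_READ =>
    app_en   <= '1';
    app_cmd  <= "001";                        -- UI read command
    app_addr <= ddr_addr;
    if app_rdy = '1' then
      app_en <= '0';
      state  <= MEM_DDR3_READ_WAIT;
    end if;

  when MEM_DDR3_READ_WAIT =>
    if app_rd_data_valid = '1' then
      read_data_128 <= app_rd_data;           -- un-swizzle/select bytes from here
      state         <= MEM_COMPLETE;
    end if;

  when others => null;
end case;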

As briefly mentioned earlier, we need to handle resetting of the memory controller. We activate the reset signals for the memory controller and CPU core for the same length of time. This is implemented with a decreasing 100MHz counter, initialized to some value in the thousands, with the resets being deactivated at 0. The DDR3 physical interface always initiates a calibration sequence after reset. When this operation is complete, a controller output signal is asserted which we should check for before sending memory requests.

Whilst making the requests and dealing with those states was straightforward, addressing and the resultant data output ordering was far from easy. I knew from MIG documentation that this controller would address 16 bits, so we just need to lop off the least significant bit from our memory request before passing it to the memory controller UI. However, after writing a small test program – which just wrote a known, incrementing pattern to contiguous memory addresses – it was obvious that my understanding of something else was awry. The data seemed to write 32 bits okay, but reading at different address offsets was quite the challenge. The Xilinx documentation does not explain the data organisation for the 128-bit data signals which come out of the controller. I assumed a simple burst read of [addr, addr+1, addr+2] data would present itself, but this didn’t seem to be the case. I memory mapped the internal signals in and out of the Memory Controller, and wrote another test which read and wrote to DDR3 memory, and dumped the 128-bit read buffer as well. As you can see from below, there were some really odd things going on:

I managed to find patterns for all the different read modes we needed, and the naturally aligned addresses for 4-, 2- and 1-byte reads and writes. For writes, I used some lookup tables to assist with generating the byte enable signal – the 128-bit write data input is byte-masked, so writing values which are not 128 bits in length does not appear to require a read-modify-write operation.

As it stands, I have still not found out why the read operations result in odd-looking data from the controller. All things considered, it's likely I have made an error early on in the implementation of the data swizzling for writes using the controller, which then dominoes into the read operations. It's something I'll be returning to. This is fine for us just now, but I was intending on using the burst data to fill caches in later work on the CPU, and for that to work efficiently we will need to fully understand how this burst data organization works.

So: 256MB of RAM unlocked, and passing fairly basic read/write tests. But the latency is very high! Much higher than expected. I added some counters to the read state machine, and it seems to take around 22 cycles for a read. At 82MHz this is a long time. However, when searching online, it seems others have also seen this kind of latency, so I did not look into it further – and instead tried to decouple the DDR memory clocking from that of the CPU and the rest of the SoC, like the block RAMs.

Clock Domains

A clock domain can be imagined by separating out all aspects of your design into blocks: each block which runs off a certain clock is in that clock's domain. Generally, blocks which synchronize off the same clock, and are therefore in the same domain, can pass signals between them relatively easily.

Currently, we have many clock domains in the SoC, but only one on the CPU side of things. The other clock domains are for HDMI output, and because the Block RAMs are dual-ported with dual clock inputs, we do not need to pass any raw signals across clock domains. For our DDR3 and CPU clocks to be different, however, multiple signals would need to cross clock domains.

There are numerous issues that can arise when trying to pass synchronized signals across different clock domains. The simplest issue is when trying to read a fast clock signal from a slow clock – and the slow clock completely missing a strobe of the faster signal.

There are various methods of crossing signals from one clock to another, with some basic techniques starting with using multiple flip flops to latch signals into other domains. In VHDL, this just looks like a process which reads and latches the value from one signal into another on the destination clock.
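
As an illustration, a basic two-flip-flop synchronizer for a single-bit signal looks something like this generic sketch (not RPU's exact code):

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- A minimal two-flip-flop synchronizer for a single-bit signal crossing into
-- the destination clock domain. Illustrative only.
entity flag_sync is
  Port ( I_clk_dst : in  STD_LOGIC;   -- destination domain clock
         I_flag    : in  STD_LOGIC;   -- signal from the source domain
         O_flag    : out STD_LOGIC);  -- synchronized version
end flag_sync;

architecture Behavioral of flag_sync is
  signal meta, stable : STD_LOGIC := '0';
begin
  process(I_clk_dst)
  begin
    if rising_edge(I_clk_dst) then
      meta   <= I_flag;   -- first flop may go metastable
      stable <= meta;     -- second flop gives it a cycle to resolve
    end if;
  end process;
  O_flag <= stable;
end Behavioral;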

As we are dealing with single read and write transactions, I decided to cross the CPU to Memory Controller clock domain using state handshaking. We will use two integer signals, which have multiple versions for stability in each clock domain, to implement a state machine process which crosses the clock safely. This can introduce several cycles of additional latency, however at the moment we are trying for a working solution – not a very efficient one. We can give up a few cycles of latency in the CPU clock domain if this means we can run the CPU much faster than the memory controller. The dataflow for a read looks like the following.

And writes:

The way this is implemented is with two processes, one clocked off the Memory Controller's 81.25MHz clock, the other off the higher CPU clock. The CPU process is actually just the MEM_proc one from before, to make interfacing with the CPU's own memory request logic easier. Each process reads the state of the other using stable signals which are latched into its own clock domain in an attempt to avoid metastability. When both differently clocked processes are in a known state, I read the memory data required across the clock domain. I am not sure this is strictly safe – some say things such as data buses should be passed through a FIFO.
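
A heavily simplified sketch of the handshake idea follows, showing only the CPU-domain side. The state encodings and signal names are assumptions; the real implementation latches more signals and has more states.

-- Sketch of a state-handshake clock crossing; all names and encodings are assumed.
-- CPU-domain side (part of MEM_proc): raise a request, wait for the DDR side,
-- then acknowledge. ddr_state is an integer signal owned by the DDR-domain process.
process(I_clk_cpu)
begin
  if rising_edge(I_clk_cpu) then
    ddr_state_stable <= ddr_state_sync;        -- second flop
    ddr_state_sync   <= ddr_state;             -- first flop, crossing domains
    case cpu_state is
      when 0 =>                                 -- idle: raise a request
        if mem_request = '1' then
          cpu_state <= 1;
        end if;
      when 1 =>                                 -- wait for the DDR domain to finish
        if ddr_state_stable = 2 then
          mem_read_data <= ddr_read_data;       -- data is held stable while ddr_state = 2
          cpu_state     <= 2;
        end if;
      when 2 =>                                 -- acknowledge; wait for DDR to return idle
        if ddr_state_stable = 0 then
          cpu_state <= 0;
        end if;
      when others => cpu_state <= 0;
    end case;
  end if;
end process;

-- The memory-controller-domain process (81.25MHz) is the mirror image: it performs
-- the actual UI transaction when the synchronized cpu_state = 1, then holds
-- ddr_state = 2 until it sees cpu_state = 2, and finally returns to 0.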

It took a few attempts to get this right, mainly because I realised that I could no longer run the CPU at previously attainable speeds. It seems that just having the DDR3 controller synthesized into the SoC means the CPU core cannot get over 166MHz. I have settled on 142.8MHz for now (1GHz/7 – easily attainable from my existing clocking system) as the CPU clock – fast enough for most needs but low enough that timing issues do not arise. Many additional timing warnings appear at synthesis when the DDR3 controller is included, which was unexpected since this controller was generated by the internal tools. Like I mentioned last part, I do intend to look into these warnings and understand them, with an aim of resolving them to unlock higher clocks. For now, 142MHz will do fine – and the additional memory space of DDR3 is very welcome!

With the DDR3 memory integrated, I also made a quick microSD PMOD adaptor out of a level-shifting microSD adaptor intended for 5V Arduino projects using SPI. The level shifter is not needed for our purposes, as the FPGA already runs on 3.3V logic. No modification of the part was required – just soldering the pins to the 0.1” PMOD headers was enough to get it working.

The SPI port is exposed externally on Pmod JD. The way the SPI port is accessed is the same as discussed in the previous posts – just different memory mapped addresses. I created a new bootloader for the internal BRAM to perform a small memory test (which validates the endian byte swizzling), then initialize SPI and mount an SD card if present. It will examine a BOOT elf file and copy it into the required physical memory location, which can be in DDR3 or BRAM. It will then jump to that elf entry point to continue execution.

FPGA Utilization

With this all coming together to form a SoC design, I looked into the utilization report, which tells you how many resources your design is using. The report looks as follows:

The DDR3 memory controller takes up the largest amount of resources – not unexpected, as it is incredibly complex! I'm quite happy with the current utilization of the RPU core itself. I've not been looking into optimizing for resources, so having the CPU take up less than 6% of the Spartan S7-50 seems good. Looking at the breakdown of where the utilization of resources comes from, it seems the internal management of memory requests is taking a larger amount of resources than I'd expect. It will be interesting to see how this changes when the endian swizzle logic is brought into the CPU core, and eventual cache logic added.

That brings this part to a close. I am now looking into implementing the required RISC-V Control and Status Registers (CSRs) in RPU. Currently I have memory-mapped data that should be available through CSR instructions. Adding them will be interesting, as it will require changes to the CPU interface. Read about that next time!

As mentioned last time, the code for the DDR3 controller is already up on github. I still need to check that I’ve implemented the read/write data swizzling to the DDR3 correctly – I’ve a feeling there is a mistake in the writes somewhere, which then impacts the reads.
Thanks for reading, and as always, any questions can be directed at myself on twitter @domipheus.