This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.
You may remember that after the switch from my own TPU ISA to RISC-V was made, I stated that interrupts were disabled. This was due to the requirements for RISC-V style interrupt mechanisms not being compatible with what I’d done on TPU. In order to get interrupts back into RPU, we need to implement the Control and Status Registers (CSRs) required for management of interrupts – enable bits, interrupt vectors, interrupt causes – they are all communicated via CSRs. As I want my core to execute 3rd-party code, this needs to be to spec!
In the time it’s taken me to start writing this article after doing the implementation, the latest draft RISC-V spec has moved the CSR instructions and definitions into their own Zicsr ISA extension. The extension defines 6 additional instructions for manipulating and querying the contents of these special registers, registers which define things such as privilege levels, interrupt causes, and timers. Timers are a bit of an oddity, which we’ll get back to later.
The CSRs within a RISC-V hardware thread (known as a hart) can influence the execution pipeline at many stages. If interrupts are enabled via enable bits contained in CSRs, additional checks are required after instruction writeback/retire in order to follow any pending interrupt or trap. One very important fact to take away from this is that the current values of particular CSRs are required to be known in many locations at multiple points in the instruction pipeline. This is a very real issue, complicated by differing privilege levels having their own local versions of these registers, so throughout designing the implementation of the RPU CSR unit, I wanted a very simple solution which worked well on my unpipelined core – not necessarily the best or most efficient design. This unit has been rather difficult to create, and the solution may be far from ideal. Of all the units I’ve designed for RPU, this is the one which I am continually questioning and looking for redesign possibilities.
The CSRs are accessed via a flat address space of 4096 entries. On our RV32I machine, the entries will be 32 bits wide. Great; let’s use a block RAM, I hear you scream. Well, it’s not that simple. For RPU, the number of CSRs we require is significantly fewer than the 4096 entries – and implementing all of them is not required. Whilst I’m sure some people will want 16KB of super fast storage for some evil code optimizations, we are very far from looking at optimizing RPU for speed – so, the initial implementation will look as follows:
- The CSR unit will take operations as input, with data and CSR address.
- A signal exists internally for each CSR we support.
- The CSR unit will output current values of internal CSRs as required for CPU execution at the current privilege level.
- The CSR unit will be multi-cycle in nature.
- The CSR unit can raise exceptions for illegal access.
So, our unit will have signals for use when executing user CSR instructions, signals for notifying of events which auto-update registers, output signals for various CSR values needed mid-pipeline for normal CPU operation, and finally – some signals for raising exceptions – which will be needed later – when we check for situations like write operations to read-only CSR addresses.
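The four groups of signals described above can be sketched as an entity declaration. This is only an illustration of the shape of the interface: every port name and width below is an assumption made for this example, not the actual RPU port list.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical CSR unit interface. All names and widths are illustrative
-- assumptions, not the real RPU entity.
entity csr_unit_sketch is
  port (
    clk            : in  std_logic;
    en             : in  std_logic;
    -- Signals driven when executing user CSR instructions
    csr_op         : in  std_logic_vector(4 downto 0);   -- translated operation from decode
    csr_addr       : in  std_logic_vector(11 downto 0);  -- index into the 4096-entry space
    data_in        : in  std_logic_vector(31 downto 0);  -- rs1 value or zero-extended immediate
    data_out       : out std_logic_vector(31 downto 0);  -- value read from the CSR, destined for rd
    -- Events which auto-update registers
    instr_retired  : in  std_logic;
    -- CSR values needed mid-pipeline for normal CPU operation
    mstatus_out    : out std_logic_vector(31 downto 0);
    mtvec_out      : out std_logic_vector(31 downto 0);
    -- Exception signalling, e.g. a write to a read-only CSR address
    illegal_access : out std_logic
  );
end entity csr_unit_sketch;
```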
The operations and CSR address will be provided by the decode stage of the pipeline, which takes the 6 different CSR manipulation instructions and encodes them into specific CSR unit control operations.
You can see from the spec that the CSR operations are listed as atomic. This does not really affect RPU with the current in-order pipeline, but may do in the future. The CSRRS and CSRRC (Atomic Read and [Set/Clear] Bits in CSR) instructions are read-modify-write, which instantly requires our CSR unit to support multi-cycle operations. This will be the first pipeline stage that can take more than one cycle to execute, so is a bit of a milestone in complexity. Whilst there are only six instructions, specific values of rd and rs1 change their behavior: CSRRW with rd set to x0 does not read the CSR, and the set/clear instructions with rs1 set to x0 do not write it, so each instruction may read, write, or do both in the same operation.
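The modify step of the write, set and clear variants comes down to three bitwise expressions. Below is a minimal sketch as a VHDL function; the package and function names are hypothetical, and the `mode` argument is assumed to be the low two bits of funct3, which the Zicsr encoding defines as 01 for write, 10 for set and 11 for clear.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper package; names are illustrative only.
package csr_rmw_sketch is
  function csr_next_value (
    mode    : std_logic_vector(1 downto 0);   -- funct3(1 downto 0)
    current : std_logic_vector(31 downto 0);  -- value read from the CSR
    operand : std_logic_vector(31 downto 0)   -- rs1 value or zero-extended immediate
  ) return std_logic_vector;
end package csr_rmw_sketch;

package body csr_rmw_sketch is
  function csr_next_value (
    mode    : std_logic_vector(1 downto 0);
    current : std_logic_vector(31 downto 0);
    operand : std_logic_vector(31 downto 0)
  ) return std_logic_vector is
  begin
    case mode is
      when "01"   => return operand;                      -- CSRRW / CSRRWI: replace
      when "10"   => return current or operand;           -- CSRRS / CSRRSI: set bits
      when "11"   => return current and (not operand);    -- CSRRC / CSRRCI: clear bits
      when others => return current;                      -- not a CSR operation: unchanged
    end case;
  end function csr_next_value;
end package body csr_rmw_sketch;
```

The old value is what gets returned to rd, which is why the read has to complete before the modify and write can happen.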
All of the CSR instructions fall under the wide SYSTEM instruction type, so we add another set of checks in our decoder, and output translated CSR unit opcodes. The CSR Opcodes are generally just the funct3 instruction bits, with additional read/write flags. There are no real changes to the ALU – our CSR unit takes over any work for these operations.
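As a rough sketch of that decoder check, the block below recognises a CSR instruction and builds an opcode from funct3 plus read/write flags. The entity name, port names and the 5-bit `csr_op` layout are assumptions for illustration; the field positions, the SYSTEM opcode and the x0 rules come from the RISC-V spec.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical decoder fragment; names and csr_op layout are assumptions.
entity csr_decode_sketch is
  port (
    instr    : in  std_logic_vector(31 downto 0);
    is_csr   : out std_logic;
    csr_addr : out std_logic_vector(11 downto 0);
    csr_op   : out std_logic_vector(4 downto 0)   -- write flag & read flag & funct3
  );
end entity csr_decode_sketch;

architecture rtl of csr_decode_sketch is
  constant OPCODE_SYSTEM : std_logic_vector(6 downto 0) := "1110011";
  constant ZERO5         : std_logic_vector(4 downto 0) := "00000";
begin
  process (instr)
    variable funct3            : std_logic_vector(2 downto 0);
    variable do_read, do_write : std_logic;
  begin
    funct3   := instr(14 downto 12);
    do_read  := '1';
    do_write := '1';

    if funct3(1 downto 0) = "01" then
      -- CSRRW / CSRRWI: the read is skipped when rd is x0
      if instr(11 downto 7) = ZERO5 then
        do_read := '0';
      end if;
    else
      -- Set/clear variants: the write is skipped when rs1 is x0
      -- (or the 5-bit immediate in the same field is zero)
      if instr(19 downto 15) = ZERO5 then
        do_write := '0';
      end if;
    end if;

    -- funct3 values 000 and 100 under SYSTEM are not CSR instructions
    if instr(6 downto 0) = OPCODE_SYSTEM and funct3(1 downto 0) /= "00" then
      is_csr <= '1';
      csr_op <= do_write & do_read & funct3;
    else
      is_csr <= '0';
      csr_op <= (others => '0');
    end if;

    csr_addr <= instr(31 downto 20);
  end process;
end architecture rtl;
```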
The RISC-V privileged specification is where the real CSR listings exist, and the descriptions of their use. A few CSRs hold basic, read-only machine information such as vendor, architecture, implementation and hardware thread (hart) IDs. These CSRs are generally accessed only via the CSR instructions, however some exist that need to be accessed at points in the pipeline. The Machine Status Register (mstatus) contains privilege mode and interrupt enable flags which the control unit of the CPU needs constant access to in order to make progress through the instruction pipe.
The CSR Unit
With the input and output requirements for the unit pretty clear, we can get started defining the basis for how our unit will be implemented. For RPU, we’ll have an internal signal for each relevant CSR, or manually hard-wire to values as required. Individual process blocks will be used for the various tasks required:
- A “main” process block which handles any requested CSR operation.
- A series of “update” processes which write automatically-updated CSRs, such as the instruction retired and cycle counters.
- A “protection” process which will flag unauthorized access to CSR addresses.
So far, we have not discussed unauthorized access to CSRs. The interrupt mechanism needed to handle them is not in place at this point in the series, however, we will still check for them. Handily, the CSR address actually encodes read/write/privilege access requirements, defined by the following table:
So detecting and raising access exception signals when the operation is at odds with the current privilege level or address type is very simple. If it’s a write operation, and the address bits [11:10] == ’11’ then it’s an invalid access. Same with the privilege checks against the address bits in [9:8].
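A minimal sketch of such a protection check is shown below, written as combinational logic. The entity and port names are hypothetical. The privilege encoding (00 user, 01 supervisor, 11 machine) is the one the privileged spec uses, and address bits [9:8] are treated as the lowest privilege level allowed to access the CSR.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical protection check; names are illustrative only.
entity csr_protect_sketch is
  port (
    csr_addr       : in  std_logic_vector(11 downto 0);
    is_write       : in  std_logic;                     -- operation will write the CSR
    priv_level     : in  std_logic_vector(1 downto 0);  -- current privilege mode
    illegal_access : out std_logic
  );
end entity csr_protect_sketch;

architecture rtl of csr_protect_sketch is
begin
  process (csr_addr, is_write, priv_level)
  begin
    illegal_access <= '0';

    -- Addresses with bits [11:10] = "11" are read-only
    if is_write = '1' and csr_addr(11 downto 10) = "11" then
      illegal_access <= '1';
    end if;

    -- Bits [9:8] hold the lowest privilege level allowed to access the CSR
    if unsigned(priv_level) < unsigned(csr_addr(9 downto 8)) then
      illegal_access <= '1';
    end if;
  end process;
end architecture rtl;
```

In a real unit this output would also be qualified by a "CSR operation valid" signal, so that stale address bits cannot raise an exception.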
The “update” processes are, again, fairly simple.
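As an illustration, a pair of update processes for the cycle and instructions-retired counters might look like the following. This is a sketch under assumed names (`instr_retired`, `reset` and the entity itself are hypothetical); it keeps each counter as a 64-bit value and exposes it as the low and high 32-bit halves that RV32 reads separately.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical counter block; names are illustrative only.
entity csr_counters_sketch is
  port (
    clk           : in  std_logic;
    reset         : in  std_logic;
    instr_retired : in  std_logic;  -- pulsed once per completed instruction
    cycle         : out std_logic_vector(31 downto 0);
    cycleh        : out std_logic_vector(31 downto 0);
    instret       : out std_logic_vector(31 downto 0);
    instreth      : out std_logic_vector(31 downto 0)
  );
end entity csr_counters_sketch;

architecture rtl of csr_counters_sketch is
  signal cycle_count   : unsigned(63 downto 0) := (others => '0');
  signal instret_count : unsigned(63 downto 0) := (others => '0');
begin
  -- Counts every clock cycle
  update_cycle : process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        cycle_count <= (others => '0');
      else
        cycle_count <= cycle_count + 1;
      end if;
    end if;
  end process;

  -- Counts only when the pipeline reports a retired instruction
  update_instret : process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        instret_count <= (others => '0');
      elsif instr_retired = '1' then
        instret_count <= instret_count + 1;
      end if;
    end if;
  end process;

  cycle    <= std_logic_vector(cycle_count(31 downto 0));
  cycleh   <= std_logic_vector(cycle_count(63 downto 32));
  instret  <= std_logic_vector(instret_count(31 downto 0));
  instreth <= std_logic_vector(instret_count(63 downto 32));
end architecture rtl;
```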
The main process is defined as the following state machine.
As you can see, reads are quicker as they do not fully go into the modify/write phase. A full read-modify-write CSR operation will take 3 cycles. The result of the CSR read will be available 1 cycle later – the same latency as our current ALU operations. As the current latency of our pipeline means it’s over 2 cycles before the next CSR read could possibly occur, we can have our unit work concurrently and not take up any more than the normal single cycle of our pipeline.
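To make the timing concrete, here is a cut-down sketch of a state machine with that behavior. It is not the RPU implementation: it models a single CSR (`csr_reg`, standing in for something like mtvec), all names are hypothetical, and it assumes `start` is a single-cycle pulse. The read value is registered on the first clock edge, and a write becomes visible on the third.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical three-state CSR operation sequencer; illustrative only.
entity csr_fsm_sketch is
  port (
    clk      : in  std_logic;
    start    : in  std_logic;                      -- single-cycle pulse (assumption)
    do_write : in  std_logic;                      -- operation modifies the CSR
    mode     : in  std_logic_vector(1 downto 0);   -- 01 write, 10 set, 11 clear
    data_in  : in  std_logic_vector(31 downto 0);
    data_out : out std_logic_vector(31 downto 0);
    busy     : out std_logic
  );
end entity csr_fsm_sketch;

architecture rtl of csr_fsm_sketch is
  type state_t is (S_IDLE, S_MODIFY, S_WRITE);
  signal state   : state_t := S_IDLE;
  signal csr_reg : std_logic_vector(31 downto 0) := (others => '0');
  signal old_val : std_logic_vector(31 downto 0) := (others => '0');
  signal operand : std_logic_vector(31 downto 0) := (others => '0');
  signal new_val : std_logic_vector(31 downto 0) := (others => '0');
  signal mode_r  : std_logic_vector(1 downto 0)  := "00";
begin
  busy <= '0' when state = S_IDLE else '1';

  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when S_IDLE =>
          if start = '1' then
            -- Cycle 1: read. Pure reads finish here.
            data_out <= csr_reg;
            old_val  <= csr_reg;
            operand  <= data_in;
            mode_r   <= mode;
            if do_write = '1' then
              state <= S_MODIFY;
            end if;
          end if;

        when S_MODIFY =>
          -- Cycle 2: compute the new value from the latched inputs
          case mode_r is
            when "01"   => new_val <= operand;
            when "10"   => new_val <= old_val or operand;
            when "11"   => new_val <= old_val and (not operand);
            when others => new_val <= old_val;
          end case;
          state <= S_WRITE;

        when S_WRITE =>
          -- Cycle 3: commit the result to the CSR
          csr_reg <= new_val;
          state   <= S_IDLE;
      end case;
    end if;
  end process;
end architecture rtl;
```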
The diagram above shows the tightest possible pipeline timing for a CSR ReadModifyWrite, followed by a CSR read – and in reality the FETCH stage takes multiple cycles due to our memory latency, so currently it’s very safe to have the CSR unit run operations alongside the pipeline. As a quick aside, interrupts from here would only be raised on the first cycle when the CSR Operation was first seen, so they would be handled at the correct stages and not leak into subsequent instructions due to the unit running past the current set of pipeline stages.
A run in our simulation shows the expected behavior, with the write to CSR mtvec present after the 3 cycle write latency.
And the set bit operations also work:
Branching towards confusion
An issue arose after implementing the CSR unit into the existing pipeline, one which caused erroneous branching and forward progress when a CSR instruction immediately followed a branch. It took a while to track this down, although in hindsight with a better simulator testbench program it would have been obvious – which I will show:
Above is a shot of the simulator waveform, centered on a section of code with a branch instruction targeted at a CSR instruction. Instead of the instruction following the cycle CSR read being the cycleh read, there is a second execution of the cycle CSR read, and then a branch to 0x00000038. The last signal in the waveform – shouldBranch – is the key here. The shouldBranch signal is controlled by the ALU, and in the first implementation of the CSR unit, the ALU was disabled completely if a CSR instruction was found by the decoder. This meant the shouldBranch signal was not reset after the previous branch (0x0480006f – j 7c) executed. What really needs to happen is for the ALU to remain active, but understand that it has no side effects when a CSR operation is in place. Making this change results in the following – correct execution of the instruction sequence.
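The shape of the fix can be sketched as follows. This is not the RPU ALU: apart from shouldBranch, every name here is a hypothetical stand-in. The point is that the enable is asserted for every instruction, CSR operations included, so a non-branch instruction always clears the flag left behind by the previous branch.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical fragment of an ALU's branch flag logic; illustrative only.
entity alu_branch_sketch is
  port (
    clk          : in  std_logic;
    en           : in  std_logic;  -- asserted for every instruction, CSR ops included
    is_branch    : in  std_logic;  -- decoder says this is a branch or jump
    branch_taken : in  std_logic;  -- result of the branch condition
    shouldBranch : out std_logic
  );
end entity alu_branch_sketch;

architecture rtl of alu_branch_sketch is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if en = '1' then
        if is_branch = '1' then
          shouldBranch <= branch_taken;
        else
          -- Any other instruction, including a CSR op, resets the flag.
          -- If en were held low for CSR ops, the old value would persist.
          shouldBranch <= '0';
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```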
I’ve not really explained all of the CSR issues present in RISC-V during this part of the series. There is a whole section in the spec on Field Specifications, whereby bits within CSRs can behave as though there are no external side effects, however, bits can be hard wired to 0 on read. This is so that field edits which are currently reserved in an implementation will be able to run in future without unexpected consequences. Thankfully, while I did not go into the specifics here, my implementation allows for these operations to be implemented to spec. It’s just a matter of time.
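One common way to get hard-wired-to-zero fields is a per-CSR write mask, sketched below. The package, function and constant names are hypothetical, and the mask value is only an example: it marks the MIE (bit 3) and MPIE (bit 7) fields of mstatus as writable and leaves everything else fixed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical write-mask helper; names and mask value are illustrative.
package csr_mask_sketch is
  -- Example only: bits 3 (MIE) and 7 (MPIE) writable, all others fixed
  constant MSTATUS_WRITE_MASK : std_logic_vector(31 downto 0) := x"00000088";

  function masked_write (
    current : std_logic_vector(31 downto 0);  -- value currently held
    data    : std_logic_vector(31 downto 0);  -- value software tried to write
    mask    : std_logic_vector(31 downto 0)   -- '1' marks a writable bit
  ) return std_logic_vector;
end package csr_mask_sketch;

package body csr_mask_sketch is
  function masked_write (
    current : std_logic_vector(31 downto 0);
    data    : std_logic_vector(31 downto 0);
    mask    : std_logic_vector(31 downto 0)
  ) return std_logic_vector is
  begin
    -- Writable bits take the new data, all other bits keep their old value
    return (current and (not mask)) or (data and mask);
  end function masked_write;
end package body csr_mask_sketch;
```

If the register resets to zero, the masked-off bits can never be set and so always read back as 0, while software that writes them sees no side effects.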
This turned into quite a long one to write, on what I thought would be a fairly simple addition to the CPU. It was anything but simple! You can see the current VHDL code for the unit over on GitHub.
Exceptions/Interrupts are coming next time, now that the relevant status registers are implemented. We’ll also touch on those pesky timer ‘CSRs’.
Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.