Designing a RISC-V CPU in VHDL, Part 21: Multi-cycle execute for multiply and divide

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

One of the things which RPU has done from the start is keep the CPU pipeline very simple. It’s a Fetch, Decode, Execute, [Memory], Writeback pipeline, but it does not run pipelined. That is, in any given cycle only one of these stages is active, and all state within the CPU corresponds to a single instruction from fetch to writeback. Because of this our instructions-per-cycle (IPC) count is very low, but the implementation is simpler to understand in terms of dataflow.

The control unit decides what stage is active, drawing on state outputs from previous stages to make those decisions. Each of the decode, execute and writeback pipeline stages takes a single cycle to execute. Memory stages, like Fetch and Memory, can take a variable number of cycles. The CSR unit actually requires multiple cycles to operate – a write or read-modify-write operation like csrrci takes 3 cycles to fully complete – however this is done asynchronously with the traditional pipeline, so does not impact upon the 1 cycle per stage limitation. The execute stage for a csrrci instruction remains one cycle.

Some operations really do require multiple cycles to execute – without asynchronous operation – and the execute stage will need to wait for them to complete. An example of this is integer multiply and divide, operations defined in the RISC-V M extension. RPU did not support these operations, but now does – along with an execution phase which can span an arbitrary number of clock cycles. This is how it’s implemented.

Let’s start – Multiply

Currently, the RPU decode stage requests an illegal instruction exception be raised if an opcode relating to the M extension is found. Previously, I have used this mechanism to implement software multiply and divide in the exception handler, which allowed Zephyr RTOS to boot while building binaries which targeted rv32m – slow, but effective.

The first thing we need to change is this behaviour. Allow the M-extension functions to pass the decoder, and set a multicycle bit in the decoder output for the control unit to consume.

The multi-cycle bit

The control unit of RPU is basic, and the main CPU stages simply advance to the next stage every cycle – as long as no interrupts are requested. We need to modify this behaviour when we are in the execution stage to enable multi-cycle use. We do this with a pair of new signals, one from the decoder, “O_multicyAlu”, and one from the ALU, “O_wait”.

The output from the decoder “O_multicyAlu” is really an optimization; we could ignore this and simply rely on the O_wait from the ALU, but this would add a cycle of latency to every execute stage – something we do not want. The decoder knows that the multiply and divide instructions take multiple cycles, so it tells the control unit this. The control unit then keeps the execute state active until the ALU O_wait output is no longer asserted. Simple, and works well for our use case. This is where the differences come in for how multiply and divide will be implemented on RPU, so we’ll jump right back into how multiply is handled now in the ALU.
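To make the handshake concrete, here is a small behavioural model in Python. It is not the RPU VHDL. Only the two signal names come from the design; the function name and the exact cycle on which O_wait is sampled are assumptions, chosen so that the counts line up with the 2-cycle multiply and 34-cycle divide figures quoted later in this post.

```python
# Behavioural sketch of how long the execute stage stays active.
# Not the RPU control unit: the sampling point of O_wait is an assumption.

def execute_cycles(multicy_alu, o_wait_samples):
    """Return the number of cycles spent in the execute stage.

    multicy_alu    models the decoder's O_multicyAlu output.
    o_wait_samples models the ALU's O_wait as seen on the second and
                   subsequent execute cycles of a multi-cycle instruction.
    """
    if not multicy_alu:
        # Single-cycle instructions never look at O_wait, so they pay
        # no extra latency for the multi-cycle mechanism.
        return 1

    cycles = 1  # first cycle: the operation is handed to the ALU
    for o_wait in o_wait_samples:
        cycles += 1          # stay in execute for this cycle
        if not o_wait:
            break            # ALU is done, move on to writeback
    return cycles


if __name__ == "__main__":
    print(execute_cycles(False, []))                     # ordinary ALU op
    print(execute_cycles(True, [False]))                 # O_wait dropped at once
    print(execute_cycles(True, [True] * 32 + [False]))   # long-running op
```

The point of the decoder hint is visible in the first branch: without it, every instruction would need at least one extra cycle just to find out that O_wait was never going to be asserted.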

From the spec, we can see that there are 4 different multiply operations we need to implement. MULW only applies to rv64, so is ignored in our rv32 implementation.

  • MUL – write bottom 32-bit result of rs1 * rs2 into rD.
  • MULH – write upper 32-bit result of (signed(rs1)) * (signed(rs2)) into rD.
  • MULHU – write upper 32-bit result of (unsigned(rs1)) * (unsigned(rs2)) into rD.
  • MULHSU – write upper 32-bit result of (signed(rs1)) * (unsigned(rs2)) into rD.

For my implementation, I perform three multiplies for the various signed/unsigned options, and then in the next cycle I select what is written to the destination register.

I rely on the VHDL integer typecasts for unsigned/signed, apart from the MULHSU case, whereby I manually sign-extend one operand by appending the MSB, and then the other operand gets ‘0’ appended for forced unsigned. I use a 2-state machine to keep track of when to write the result – but as you can see from the comment, we immediately deassert O_wait on the first encounter, as there is always an additional cycle available to complete the operation.
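The four operations can be checked against a simple software model. The Python below is a sketch of the arithmetic only, not the ALU code; the helper names are made up for illustration. The MULHSU case mirrors the trick described above: both operands are widened to 33 bits, one by duplicating its MSB and the other by prepending a zero, and the product is then treated as signed.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mul(rs1, rs2):
    # Low 32 bits are the same whether the operands are signed or unsigned.
    return (rs1 * rs2) & MASK32


def mulh(rs1, rs2):
    return ((to_signed(rs1, 32) * to_signed(rs2, 32)) >> 32) & MASK32


def mulhu(rs1, rs2):
    return (((rs1 & MASK32) * (rs2 & MASK32)) >> 32) & MASK32


def mulhsu(rs1, rs2):
    # rs1: append a copy of the MSB (sign extension to 33 bits).
    a33 = to_signed((((rs1 >> 31) & 1) << 32) | (rs1 & MASK32), 33)
    # rs2: append a '0' so the 33-bit value is always non-negative.
    b33 = to_signed(rs2 & MASK32, 33)
    return ((a33 * b33) >> 32) & MASK32


if __name__ == "__main__":
    for fn in (mul, mulh, mulhu, mulhsu):
        print(fn.__name__, hex(fn(0xFFFFFFFF, 0xFFFFFFFF)))
```

Python integers are arbitrary precision and `>>` on a negative number is an arithmetic shift, so masking after the shift yields bits 63:32 of the two's complement product.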

FPGA Multiply

“But Colin – you’re just… multiplying, on an FPGA, and it’s working?!” – Yup. My target FPGA, the Xilinx Spartan 7-50 on a Digilent Arty S7 board synthesizes a 32-bit integer multiply operation in the VHDL into a set of cascaded DSP primitive slices – neat huh?

Just like the blocks I use for fast ram on the FPGA, there are fixed function hardware blocks embedded into the FPGA which can be utilized by user designs. In this case, it’s the DSP48 block, which has a 25x18bit multiplier. The synthesis tools chain multiple of these blocks together to generate our 32×32 results.

Divide

For division, it’s not so easy. I need to implement it manually. Thankfully, integer binary long division is an algorithm which can be easily understood and transformed into VHDL for use in my ALU.

I decided to create a new division entity which encapsulated this algorithm and exposed signals for the ALU to use. I took the algorithm explained on wikipedia as above, and then made it also work for the various operations required:

As with multiply, some operations only apply to rv64 which we can ignore. We need to implement the following:

  • DIV – signed integer division
  • DIVU – unsigned integer division
  • REM – signed integer remainder
  • REMU – unsigned integer remainder

The long division algorithm assumes unsigned integers, so for signed use we can test the sign bits when we start a divide operation. We then feed the negated values, if negative, through the unsigned operation loop – before restoring the sign at the end of the operation to get the correct result.

There are some edge cases that are handled separately – notably division by zero. In the RISC-V spec, division by zero specifically does not raise an exception.
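The sign handling and the edge cases can be sketched as a software model. This is Python rather than the division entity's VHDL, the function name is hypothetical, and `divmod` stands in for the unsigned long-division loop. The results for division by zero (quotient of all ones, remainder equal to the dividend) and for signed overflow (the most negative value divided by -1 gives the most negative value, remainder 0) are the ones the M extension defines.

```python
MASK32 = 0xFFFFFFFF


def div_rem(op, rs1, rs2):
    """Return the 32-bit result of 'DIV', 'DIVU', 'REM' or 'REMU'."""
    signed = op in ("DIV", "REM")
    a, b = rs1 & MASK32, rs2 & MASK32

    if b == 0:
        # No exception is raised: quotient is all ones, remainder is rs1.
        q, r = MASK32, a
    else:
        neg_a = signed and (a >> 31) == 1
        neg_b = signed and (b >> 31) == 1
        # Feed magnitudes through the unsigned divider.
        ua = (-a) & MASK32 if neg_a else a
        ub = (-b) & MASK32 if neg_b else b
        uq, ur = divmod(ua, ub)  # stand-in for the long-division loop
        # Restore signs: quotient is negative if the signs differ,
        # remainder takes the sign of the dividend.
        q = (-uq) & MASK32 if neg_a != neg_b else uq
        r = (-ur) & MASK32 if neg_a else ur

    return q if op in ("DIV", "DIVU") else r


if __name__ == "__main__":
    minus_seven = (-7) & MASK32
    print(hex(div_rem("DIV", minus_seven, 2)), hex(div_rem("REM", minus_seven, 2)))
    print(hex(div_rem("DIV", 0x80000000, 0xFFFFFFFF)))  # signed overflow
```

Note that the overflow case needs no special handling here: the magnitude of the most negative value wraps back to 0x80000000, and dividing that by 1 gives the required result.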

The division unit runs in three states; IDLE, INFLIGHT and COMPLETE.

STATE_IDLE: Waits for the execution instruction, and if found, sets up the inputs for the required operation and passes to inflight. For division by zero or one, these edge cases are handled and we short circuit to complete.

STATE_INFLIGHTU: Runs for 32 cycles, performing the unsigned binary long division algorithm loop.

STATE_COMPLETE: Selects the required output values, and adjusts for signed/unsigned operation.

The inflight state performs the loop from the algorithm, and the VHDL can be seen to match the wikipedia pseudocode:

Note that whenever R is referred to, we always need to use its full representation, s_R(30 downto 0) & s_N(s_i), as that initial assignment in the wikipedia code would not be available until the next cycle in the VHDL.
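For reference, the same loop as a Python sketch, with one iteration per clock cycle of the inflight state. This is a software model of the textbook algorithm, not the entity's code. The variable `r_shifted` plays the role of s_R(30 downto 0) & s_N(s_i): the shifted-in value is computed once and used for both the comparison and the subtraction.

```python
def long_divide_u32(n, d):
    """Unsigned binary long division of two 32-bit values.

    Returns (quotient, remainder). Division by zero is expected to be
    caught before the loop is entered, so it is rejected here.
    """
    if d == 0:
        raise ValueError("divide by zero is handled outside the loop")

    q = 0
    r = 0
    for i in range(31, -1, -1):          # 32 iterations, MSB first
        # Shift R left by one and bring in bit i of the dividend.
        r_shifted = (r << 1) | ((n >> i) & 1)
        if r_shifted >= d:
            r = r_shifted - d
            q |= 1 << i                  # this quotient bit is a 1
        else:
            r = r_shifted
    return q, r


if __name__ == "__main__":
    print(long_divide_u32(100, 7))
```

Before iteration i, R holds the remainder of only the top 31 - i bits of the dividend, so the shifted value always fits in 32 bits and dropping the top bit of R in the hardware concatenation loses nothing.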

The interface for the division unit is as follows.

The operations passed in I_op correspond to two bits of the funct3 part of the opcode for divide operations. This means they pass straight through from the decoder to this divide unit. You’ll notice there is an interrupt output for this unit, but currently due to the RISC-V spec this is not used.
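In the M extension, DIV, DIVU, REM and REMU use funct3 values 100, 101, 110 and 111. Assuming it is the lower two bits that are forwarded as I_op (the text above only says "two bits"), the selection could be modelled like this in Python. The function name is hypothetical.

```python
def decode_div_op(funct3):
    """Split a divide-class funct3 into (is_remainder, is_unsigned)."""
    if not funct3 & 0b100:
        # funct3 values 000 to 011 are MUL, MULH, MULHSU and MULHU.
        raise ValueError("not a divide or remainder instruction")
    op = funct3 & 0b11                  # assumed to be what I_op carries
    is_remainder = bool(op & 0b10)      # bit 1: REM/REMU rather than DIV/DIVU
    is_unsigned = bool(op & 0b01)       # bit 0: the U variants
    return is_remainder, is_unsigned


if __name__ == "__main__":
    for name, f3 in (("DIV", 0b100), ("DIVU", 0b101), ("REM", 0b110), ("REMU", 0b111)):
        print(name, decode_div_op(f3))
```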

Now that we have a facility for performing the 32-bit integer divide operations we require – it’s time to put it all together! We place the divide unit component in our ALU, and control it when we receive a divide operation. We control the O_wait output of the ALU, keeping it active until the divide unit is complete, before forwarding the result to the ALU’s own result. The control unit then transfers us to writeback, for the register write of the result.

And that’s it. I used the riscv-compliance runner for mul and div to test my implementation, and fixed a few issues with signed handling in divide – but other than that, things were generally straightforward.

In terms of performance, the biggest win was for multiply – two cycles is pretty quick compared to the super slow emulation with invalid instruction interrupts, or soft multiply libraries. Divide wasn’t as big a win, but this is mainly due to my use cases not making much use of divide. In Doom, there was a gain with the hardware multiply – I hope to cover that in more depth in a post specifically on optimizing the SoC for Doom. Using hardware multiply support instead of relying on the software multiply I was using (based on llvm compiler-rt) resulted in a 28% increase in FPS on timedemo demo3.

The performance is as follows:

Multiply: execute: 2 cycles
Divide/Remainder: execute: 34 cycles

Possible Optimizations

Reading the spec on division operations you can see the above note on operation fusing. This is an optimization whereby if we need both the quotient and remainder from an operation, the cpu will only do the slow division operation once, and then use the results from that to complete both DIV and REM requests.

The way the division unit has been fixed to the ALU means at the moment, for an instruction sequence of say div r4, r1, r2; rem r5, r1, r2, we will take at least 68 cycles to complete within the execution stage. Given that the inputs to the instructions (r1, r2) remain constant in this sequence, we should be able to reuse the intermediate results of the division instruction to complete the remainder operation quicker.

This would look somewhat like the following:

  • Within the alu_int32_div entity; when we get an I_exec command, test the input operands against those for a valid result we have previously computed.
    • If the new inputs match the previously computed, valid, operation;
    • Perform any state modification which depends on operation – as the operation can change, despite the data inputs being constant
    • Move to STATE_COMPLETE, completely skipping STATE_INFLIGHT
  • Write the output as before, which will draw on the currently requested operation, and use the existing values for the s_Q and s_R intermediate results.

With this set-up, skipping the STATE_INFLIGHT section will save 32 cycles – meaning our div+rem operation will now only take 36 cycles, a considerable speedup if this is a hot path in the code.

As the comparisons for testing the input operands will compare actual data values, rather than register numbers, we don’t need to worry about hazards. A register write to r1 between the operations of a different value would mean the inflight short-circuit would not be taken. Additionally, it means that the DIV+REM instructions do not need to follow each other for this to work.
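A software sketch of this idea, in Python, is below. It is a model of the proposed optimization, not something implemented in RPU. The class and field names are hypothetical, `divmod` stands in for the 32-cycle loop, and one design choice is my own: the cache is keyed on the unsigned magnitudes fed into the loop rather than on the raw register values, so that a change of operation between signed and unsigned cannot reuse a stale result.

```python
MASK32 = 0xFFFFFFFF


class FusedDivider:
    """Divide unit model that remembers its last unsigned division."""

    INFLIGHT_CYCLES = 32

    def __init__(self):
        self.valid = False   # is there a cached result at all?
        self.key = None      # (dividend, divisor) magnitudes of that result
        self.uq = 0          # cached unsigned quotient (s_Q in the text)
        self.ur = 0          # cached unsigned remainder (s_R in the text)

    def execute(self, op, rs1, rs2):
        """Return (32-bit result, cycles spent in the inflight loop)."""
        signed = op in ("DIV", "REM")
        a, b = rs1 & MASK32, rs2 & MASK32
        neg_a = signed and (a >> 31) == 1
        neg_b = signed and (b >> 31) == 1
        ua = (-a) & MASK32 if neg_a else a
        ub = (-b) & MASK32 if neg_b else b

        if ub == 0:
            # Division by zero short-circuits and never touches the cache.
            return (MASK32 if op in ("DIV", "DIVU") else a), 0

        if self.valid and self.key == (ua, ub):
            cycles = 0                       # skip the inflight state
        else:
            self.uq, self.ur = divmod(ua, ub)
            self.key, self.valid = (ua, ub), True
            cycles = self.INFLIGHT_CYCLES

        # Output selection depends on the operation requested now,
        # not on the one that filled the cache.
        if op in ("DIV", "DIVU"):
            result = (-self.uq) & MASK32 if neg_a != neg_b else self.uq
        else:
            result = (-self.ur) & MASK32 if neg_a else self.ur
        return result, cycles


if __name__ == "__main__":
    unit = FusedDivider()
    print(unit.execute("DIV", 100, 7))   # fills the cache
    print(unit.execute("REM", 100, 7))   # served from the cache
```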

I’ve not implemented this optimization in RPU yet, but it is on my to-do list!

Conclusion

We’ve complicated the pipeline picture a little, but added some nice functionality in the process.

For those of you who are also making RV32 cores – you may be interested to know you can pass a GCC toolchain an option which emits multiply instructions, but not division. This would have been great for me during development of RPU 1.0! I learned this from Luke Wren over on twitter.

Additionally, it seems the RISC-V spec is getting a multiply-only extension, which in my view is a good move.

That’s it for this part in the series! Thank you for reading. It’s been a long time since the last part; my excuse is that it’s been a fairly crazy year. The multiply and divide functionality is already on github, and has been for quite a while as part of the RPU 1.0 release. If you have any further questions, please do not hesitate to send me a message on twitter @domipheus!

Designing a RISC-V CPU in VHDL, Part 20: Interrupts and Exceptions

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Interrupts and exceptions are important events that any CPU needs to handle. The usual definition is that interrupts happen outside of the CPU – timer events, for example. Exceptions occur within the CPU, like trying to execute an invalid instruction. These events are handled all the time within a system, and whilst some signify faults and error conditions – most are just handling system functionality.

I mentioned earlier in the series how my previous CPU (TPU) had interrupt handling. However the way it was implemented needed significant modification to work in a RISC-V environment. RPU now supports many more types of exception/interrupt, and as such is more complex.

Before we go further, in the RPU code I use the term interrupt to refer to both interrupts and exceptions. Unless I explicitly mention exceptions, assume I mean both types.

The Local Interrupt Unit

RPU will implement the timer interrupts as external, similar to how TPU did it. It will also support invalid instruction, system call, breakpoint, invalid CSR access (and ALU), and misaligned jump/memory exceptions. These generally fit into 4 categories:

  • Decoder Exceptions
  • CSR/ALU unit exceptions
  • Memory Exceptions, and
  • External interrupts

There are more subcategories to these, defined by an additional 32-bit data value describing the cause further, but these 4 categories can fit nicely into 4 interrupt lines. The CPU can only handle one at a time, so with this in mind I created a Local Interrupt unit, the LINT unit, which will take all the various interrupt requests and associated data lines, and decide which one actually makes its way into the control unit for handling. Internally, it is implemented as a simple conditional check of the different input categories, forwarding the data to the control unit and then waiting for an acknowledge reset signal before going on to the next interrupt, if multiple were being requested at once. The LINT also handles ack/reset forwarding to the original input units.
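The selection logic can be sketched as a small Python model. This is not the LINT VHDL and it ignores timing. The class, method and source names are hypothetical, and the priority order is an assumption: the only ordering this post states is that decoder exceptions rank above external interrupts.

```python
# Assumed priority, highest first. Only "decoder before external" is
# stated in the text; the position of the other two is a guess.
PRIORITY = ("decoder", "csr_alu", "memory", "external")


class Lint:
    """Pick one pending request and hold it until it is acknowledged."""

    def __init__(self):
        self.active = None   # source currently forwarded to the control unit
        self.data = 0        # its 32-bit cause value

    def step(self, requests, ack):
        """Advance one step.

        requests maps a source name to its cause value; a source being
        present means its INT line is asserted. ack is the control unit's
        acknowledge. Returns (int_out, int_data, source_to_ack).
        """
        if self.active is not None:
            if ack:
                done, self.active = self.active, None
                return False, 0, done      # forward the ack to the source
            return True, self.data, None   # keep presenting the request

        for source in PRIORITY:
            if source in requests:
                self.active, self.data = source, requests[source]
                return True, self.data, None
        return False, 0, None


if __name__ == "__main__":
    lint = Lint()
    pending = {"external": 0x80000007, "decoder": 2}
    print(lint.step(pending, ack=False))   # decoder wins
    print(lint.step(pending, ack=True))    # decoder is told to clear its INT
```

The cause values in the usage example are the standard mcause encodings for an illegal instruction (2) and a machine timer interrupt (interrupt bit set, code 7).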

With this unit complete, we can add an O_int and O_intData output, as well as an acknowledge input for reset, to our decoder unit. This will attempt to raise an exception and set the intData output to be the cause as defined by the RISC-V standard, which will let any interrupt handler know which kind of request – invalid instruction, ecall/system call, breakpoint – caused the exception.

The CSR unit from the previous part already has a facility to raise an exception – it can check the CSR op and address to ensure the operation is valid. For instance, attempting to write a read-only CSR would raise an access exception. Whilst this is all implemented and connected up, the compliance suite of tests does not test access interrupts, so it’s not extensively tested. We will need to reinvestigate that once RPU is extended to fully support the different runtime privilege levels.

Memory exceptions from misaligned instruction fetch are found by testing the branch targets for having the 1st bit set. The 0th bit by the specification is always cleared, and we don’t support the compressed instruction set, so a simple check is all we need for this.

The load and store misaligned interrupts are found by testing memory addresses depending on request size and type. RPU will raise exceptions for any load or store on a non-naturally aligned address.
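Both checks are simple bit tests. The Python below is a sketch of the conditions only; the function names are made up for illustration. The constants are the standard mcause codes for the three misalignment exceptions.

```python
# Standard mcause exception codes for misaligned accesses.
CAUSE_FETCH_MISALIGNED = 0
CAUSE_LOAD_MISALIGNED = 4
CAUSE_STORE_MISALIGNED = 6


def is_misaligned_access(addr, size_bytes):
    """True if a load/store of 1, 2 or 4 bytes is not naturally aligned."""
    return (addr & (size_bytes - 1)) != 0


def is_misaligned_target(target):
    """Instruction fetch check for a core without compressed instructions.

    Bit 0 is ignored because it is always cleared, so only bit 1 can
    make a 4-byte instruction fetch misaligned.
    """
    return (target & 0b10) != 0


if __name__ == "__main__":
    print(is_misaligned_access(0x1002, 4))   # word access on a halfword boundary
    print(is_misaligned_access(0x1002, 2))   # halfword access, aligned
    print(is_misaligned_target(0x8EE))       # jump to a 2-byte aligned address
```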

Lastly, external interrupts have the signal lines directly routed outside of the CPU core, so the SoC implementation can handle those. In the ArtyS7-RPU-SoC, timer interrupts are implemented via a 12MHz clock timer and compare register manipulated via MMIO. We could also implement things like UART receive data waiting interrupts through this.

Execution Control

Now, we know what can trigger interrupts in the CPU, but we need to lay down exactly the steps and dataflow required both when we enter an interrupt handler, and exit from it. The control unit handles this.

As a reminder – here is how a traditional external interrupt was handled when ported simply through to RPU from my old TPU project. You can see the interrupt had to wait until a point in the pipeline which was suitable, which is okay in this instance. However, exceptions require significant changes to the control unit flow.

Interrupt entry / exit

On a decision being made to branch to the interrupt vector – the location of which is stored in a CSR – several other CSR contents need modified:

  1. The previous interrupt enable bit is set to the current interrupt enable value.
  2. The interrupt enable bit is set to 0.
  3. The previous privilege mode is set to the current privilege.
  4. The privilege mode is set to 11 (machine mode).
  5. The mcause CSR is set to the interrupt data value.
  6. The mepc CSR is set to the PC where the interrupt was encountered.
  7. The mtval CSR is set to any exception-specific data, like the address of a misaligned load.

On exit from an interrupt via mret, the previous enable and privilege values are restored. These csr manipulations will occur internal to the CSR unit, using int_exit and int_entry signals provided to it by the control unit.
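As a sketch, the entry and exit steps map onto the mstatus fields as follows. This is a Python model, not the CSR unit's VHDL. The class and method names are hypothetical (they borrow the int_entry and int_exit signal names), while the bit positions for MIE, MPIE and MPP are the ones defined by the privileged specification. One detail comes from the specification rather than this post: mret also sets MPIE to 1.

```python
MIE_BIT = 3       # mstatus.MIE: machine interrupt enable
MPIE_BIT = 7      # mstatus.MPIE: previous interrupt enable
MPP_SHIFT = 11    # mstatus.MPP: previous privilege mode, bits 12:11
MACHINE = 0b11


class TrapCsrs:
    def __init__(self):
        self.mstatus = 1 << MIE_BIT   # start with interrupts enabled
        self.priv = MACHINE
        self.mcause = self.mepc = self.mtval = 0

    def int_entry(self, pc, cause, tval=0):
        """Apply steps 1 to 7 of the list above."""
        mie = (self.mstatus >> MIE_BIT) & 1
        self.mstatus &= ~((1 << MPIE_BIT) | (1 << MIE_BIT) | (0b11 << MPP_SHIFT))
        self.mstatus |= (mie << MPIE_BIT) | (self.priv << MPP_SHIFT)
        self.priv = MACHINE
        self.mcause, self.mepc, self.mtval = cause, pc, tval

    def int_exit(self):
        """mret: restore enable and privilege, return the PC to resume at."""
        mpie = (self.mstatus >> MPIE_BIT) & 1
        self.priv = (self.mstatus >> MPP_SHIFT) & 0b11
        self.mstatus &= ~(1 << MIE_BIT)
        self.mstatus |= mpie << MIE_BIT
        self.mstatus |= 1 << MPIE_BIT   # per the spec, MPIE becomes 1
        return self.mepc


if __name__ == "__main__":
    csrs = TrapCsrs()
    csrs.int_entry(pc=0x8EC, cause=2)      # illegal instruction at 0x8EC
    print(hex(csrs.mstatus), hex(csrs.mepc))
    print(hex(csrs.int_exit()), hex(csrs.mstatus))
```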

The control unit

The previous TPU work implemented interrupts by checking for assertion of the interrupt signal at the end of the CPU pipeline, just before the fetch stage. This works fine for external interrupts, and it keeps complexity low, due to not having to pause the pipeline mid-execution. However, we now have different classes of interrupt which need mid-pipeline handling:

  • Decoder interrupts
  • CSR/ALU interrupts
  • Misalignment/Memory interrupts

For decoder interrupts, we can check for an INT signal from the decoder, and ensure any memory/register writes don’t occur a few cycles later. The misalignment interrupts can be triggered at fetch and memory pipeline stages and are more complex.

In the previous part of this series, where I added CPU trace support, I discussed some of the logic flow that a decoder interrupt takes. It followed from that discussion that different types of interrupt have different priorities, and a priority system is needed. The LINT was supposed to handle this priority system – and did in general – in that decoder exceptions are higher priority than external interrupts. However, the LINT has no concept of where execution is in the pipeline, or that certain execution stages require certain exceptions to be handled immediately, regardless of how many cycles previously another interrupt was requested. This required another rather clunky set of conditions to be provided to the LINT unit as a set of enable bits for the various types. Some enables were only active during certain pipeline stages. My decision to not separate out how Interrupts (like external timers) and Exceptions (CPU internal faults that must be immediately handled) are handled has bitten me here by requiring specific enable overrides, and some more workarounds.

Memory Misalignment Exceptions

I thought the memory misalignment exceptions would be a fairly simple addition, however they presented an interesting challenge, due to the fixed timing that is inherent within the memory/fetch parts of the core.

Discovering whether a misalignment exception should assert is fairly simple: we can have a process which checks addresses and asserts the relevant INT line with the associated mcause value to indicate the type of exception:

  • load
  • store
  • branch

The LINT unit has a cycle of latency, and when it’s a memory exception we are talking about, that cycle of latency means it’s too late – the memory operation or fetch will have already been issued. The latency is acceptable for a decoder interrupt, because the writeback phase is still 2 cycles away and the interrupt will be handled by that point, avoiding register file side-effects.

The side effects of an invalid memory operation are hard to track, so instead we forward a hint to the control unit that a misalignment exception is likely to occur shortly. It’s rather clumsy, but this hint gets around the LINT latency and allows the control unit to stall any memory operations. This stall is just long enough for the operations to not issue, and the exception be successfully handled once the LINT responds.

These stalls are implemented by a cycle counter in the control unit, counting down a fixed number of stall cycles if a misalignment hint is seen. During each of these cycles, the interrupt signal is checked in case we need to jump to the handler. The control unit is definitely complicated significantly by these checks. There is a lot of copy and paste nonsense going on here.

Lastly, to further complicate things; my memory-mapped IO range of addresses (currently 0xF0000000 – 0xFFFFFFFF) has various addresses which I write from the bootloader firmware in an unaligned way. To fix this I’ve excluded this memory range from the misalignment checks. I’ll fix it another time.

So now we have this align_hint shortcut which bypasses the LINT and allows for correct handling of the various memory exceptions.

Further Bugs

One issue was discovered during a simulation of the decoder interrupts, specifically for ecall and ebreak. The method for acknowledging interrupt requests was that the LINT would assert an ACK signal for the unit it selected, and that unit would then de-assert its INT signal. The problem is that I’d placed this de-assertion inside the rest of the decoder handling process, which was only active when the pipeline decoder enable was active. By the time the LINT wanted to acknowledge the decoder interrupt, the enable was not active. This resulted in an infinite loop of interrupt requests from the decoder, which was not much use!

Another bug was that I was latching the wrong PC value to the mepc register. This was not caught until I started debugging actual code, but would have shown up pretty obviously in the simulator. The fix was to not grab the current PC value for mepc but to latch the correct value at the time the interrupt was fired.

Lastly, as I was testing the riscv-compliance misalignment exception test, I realised an exception was being raised when it shouldn’t. Turns out I had missed a point in the ISA spec, whereby jump branch targets always have bit 0 masked off. An easy thing to fix, but annoying I had this bug for so long.

But, RPU now passes risc-v compliance 🙂

Summing up

With the interrupt support now in the CPU design, I have realised just how many mistakes I have made in this area. Not separating out different types for interrupts and exceptions (which need handled immediately in the pipeline) meant the LINT needed these ugly overriding hints for the control unit in order to operate correctly. It all seems a bit messy.

However, it does work. I can fix its design later, now I have a working implementation to base changes off of.

Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.

Designing a RISC-V CPU in VHDL, Part 19: Adding Trace Dump Functionality

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

For those who follow me on twitter, you’ll have seen my recent tweets regarding Zephyr OS running on RPU. This was a huge amount of work to get running, most of it debugging on the FPGA itself. For those new to FPGA development, trying to debug on-chip can be a very difficult and frustrating experience. Generally, you want to debug in the simulator – but when potential issues are influenced by external devices such as SD cards, timer interrupts, and hundreds of millions of cycles into the boot process of an operating system – simulators may not be feasible.

Blog posts on the features I added to RPU to enable Zephyr booting, such as proper interrupts, exceptions and timers are coming – but it would not have been possible without a feature of the RPU SoC I have not yet discussed.

CPU Tracing

Most real processors will have hardware features built in, and one of the most useful low-level tools is tracing. This is when at an arbitrary time slice, low level details on the inner operation of the core are captured into some buffer, before being streamed elsewhere for analysis and state reconstruction later.

Note that this is a one-way flow of data. It is not interactive, like the debugging most developers know. It is mostly used for performance profiling but for RPU would be an ideal debugging aid.

Requirements

For the avoidance of doubt; I’m defining “A Trace” to be one block of valid data which is dumped to a host PC for analysis. For us, dumping will be streaming the data out via UART to a development PC. Multiple traces can be taken, but when the data transfer is initiated, the data needs to be a real representation of what occurred immediately preceding the request to dump the trace. The data contained in a trace is always being captured on the device in order that if a request is made, the data is available.

These requirements require a circular buffer which is continually recording the state. I’ll define exactly what the data is later – but for now, the data is defined as 64-bits per cycle. Plenty for a significant amount of state to be recorded, which will be required in order to perform meaningful analysis. We have a good amount of block rams on our Spartan 7-50 FPGA, so we can dedicate 32KB to this circular buffer quite easily. 64-bits into 32KB gives us 4,096 cycles of data. Not that much you’d think for a CPU running at over 100MHz, but you’d be surprised how quickly RPU falls over when it gets into an invalid state!

It goes without saying that our implementation needs to be non-intrusive. I’m not currently using the UART connected to the FTDI USB controller, as our logging output is displayed graphically via a text-mode display over HDMI. We can use this without impacting existing code. Our CPU core will expose a debug trace bus signal, which will be the data captured.

We’ve mentioned the buffer will be in a block ram; but one aspect of this is that we must be wary of the observer effect. This is very much an issue for performance profiling, as streaming out data from various devices usually goes through memory subsystems, which will increase bandwidth requirements and lead to more latency in the memory operations you are trying to trace. Our trace system should not affect the execution characteristics of the core at all. As we are using a development PC to receive the streamed data, we can completely segregate all data paths for the trace system, and remove the block ram from the memory mapped area which is currently used for code and data. With this block ram separate, we can ensure it’s set up as a true dual port ram with a data width of the native 64 bits. One port will be for writing data from the CPU, on the CPU clock domain. The second port will be used for reading the data out at a rate which is dictated by the UART serial baud – much, much, slower. Doing this will ensure tracing will not impact execution of the core at any point, meaning our dumped data is much more valuable.

Lastly, we want to trigger these dumps at a point in time when we think an issue has occurred. Two immediate trigger types come to mind in addition to a manual button.

  1. Memory address
  2. Comparison with the data which is to be dumped; i.e, pipeline status flags combined with instruction types.

Implementation

The implementation is very simple. I’ve added a debug signal output to the CPU core entity. It’s 64 bits of data consisting of 32 bits of status bits, and a 32-bit data value as defined below.

This data is always being output by the core, changing every cycle. The data value can be various things; the PC when in a STAGE_FETCH state, the ALU result, the value we’re writing to rD in WRITEBACK, or a memory location during a load/store.

We only need two new processes for the system:

  • trace_streamout: manages the streaming out of bytes from the trace block ram
  • trace_en_check: inspects trigger conditions in order to initiate a trace dump which trace_streamout will handle

The BRAM used as the circular trace buffer is configured as 64-bits word length, with 4096 addresses. It was created using the Block Memory Generator, and has a read latency of 2 cycles.

We will use a clock cycle counter which already exists for dictating write locations into the BRAM. As it’s used as a circular buffer, we simply take the lower 12 bits of the clock counter as address into the BRAM.
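The addressing can be modelled in a few lines of Python. This is a sketch of the write side only, not the SoC's VHDL; the class name and the `write_enable` flag are assumptions for illustration.

```python
DEPTH = 4096              # 32 KB of block ram / 8 bytes per 64-bit entry
ADDR_MASK = DEPTH - 1     # the lower 12 bits of the cycle counter
WORD_MASK = (1 << 64) - 1


class TraceBuffer:
    """Circular buffer indexed by the free-running cycle counter."""

    def __init__(self):
        self.ram = [0] * DEPTH
        self.write_enable = True   # assumed: cleared while a dump is in progress

    def capture(self, cycle_counter, dbg_word):
        """Store this cycle's 64-bit debug word, overwriting the oldest entry."""
        if self.write_enable:
            self.ram[cycle_counter & ADDR_MASK] = dbg_word & WORD_MASK


if __name__ == "__main__":
    buf = TraceBuffer()
    buf.capture(4096 + 5, 0xABCD)   # counter has wrapped once
    print(hex(buf.ram[5]))
```

Because the address is derived from the counter rather than from a separate write pointer, the entry for any recent cycle can be located from its cycle number alone.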

Port A of the BRAM is the write port, with its address line tied to the bits noted above. It is enabled by a signal only when the trace_streamout process is idle. This is so when we do stream out the data we want, it’s not polluted with new data while our slow streamout to UART is active. That new data is effectively lost. As this port captures the cpu core O_DBG output, it’s clocked at the CPU core clock.

Port B is the read port. It’s clocked using the 100MHz reference clock (which also drives the UART – albeit then subsampled via a baud tick). It’s enabled when a streamout state is requested, and reads an address dictated by the trace_streamout process.

The trace_streamout process, when the current streamout state is idle, checks for a dump_enable signal. Upon seeing this signal, the last write address is latched from the lower cycle counter 12 bits. We also set a streamout location to be that last write +1. This location is what is fed into Port B of the BRAM/ circular trace buffer. When we change the read address on port B, we wait some cycles for the value to properly propagate out. During this preload stall, we also wait for the UART TX to become ready for more data. The transmission is performed significantly slower than the clock that trace_streamout runs at, and we cannot write to the TX buffer if it’s full.

The UART I’m using is provided by Xilinx and has an internal 16-byte buffer. We wait for a ready signal as then we know that writing our 8 bytes of debug data (remember, 64-bit) quickly into the UART TX will succeed. In addition to the 8 bytes of data, I also send 2 bytes of magic number data at the start of every 64-bit packet as an aid to the receiving logic; we can check the first two bytes for these values to ensure we’re synced correctly in order to parse the data eventually.

After the last byte is written, we increment our streamout location address. If it’s not equal to the last write address we latched previously, we move to the preload stall and move the next 8 bytes of trace data out. Otherwise, we are finished transmitting the entire trace buffer, so we set our state back to idle and re-enable new trace data writes.
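The read-out order and framing can be sketched in Python as below. This models only the sequencing described above, not the trace_streamout process or its UART handshaking. The magic byte values and the byte order of the 64-bit word are assumptions, as neither is given here. The loop follows the text literally, so it stops when the location wraps round to the latched address without transmitting that entry.

```python
DEPTH = 4096
MAGIC = bytes([0xAA, 0x55])   # hypothetical sync bytes, real values not given


def streamout(ram, last_write):
    """Yield 10-byte packets (2 magic + 8 data), oldest entry first."""
    location = (last_write + 1) % DEPTH       # oldest entry in the ring
    while location != last_write:
        # Byte order is an assumption for illustration.
        yield MAGIC + ram[location].to_bytes(8, "big")
        location = (location + 1) % DEPTH     # wraps like a 12-bit counter


if __name__ == "__main__":
    ram = list(range(DEPTH))                  # dummy trace contents
    packets = list(streamout(ram, last_write=10))
    print(len(packets), packets[0].hex())
```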

Triggering streamout

Triggering a dump using dump_enable can be done in a variety of ways. I have a physical push-button on my Arty S7 board set to always enable a dump, which is useful to know where execution currently is in a program. I have also got a trigger on reading a certain memory address. This is good if there is an issue triggering an error which you can reliably track to a branch of code execution. Having a memory address in that code branch used as a trigger will dump the cycles leading up to that branch being taken. There is one other type of trigger – relying on the cpu O_DBG signal itself, for example, triggering a dump when we encounter a decoder interrupt for an invalid instruction.

I hard-code these triggers in the VHDL currently, but it’s feasible that these can be configurable programmatically. The dump itself could also be triggered via a write to a specific MMIO location.

Parsing the data on the Debug PC

The UART TX on the FPGA is connected to the FTDI USB-UART bridge, which means when the FPGA design is active and the board is connected via USB, we can just open the COM port exposed via the USB device.

I made a simple C# command line utility which just dumps the packets in a readable form. It looks like this:

[22:54:19.6133781]Trace Packet, 00000054,  0xC3 40 ,   OPCODE_BRANCH ,     STAGE_FETCH , 0x000008EC INT_EN , :
[22:54:19.6143787]Trace Packet, 00000055,  0xD1 40 ,   OPCODE_BRANCH ,    STAGE_DECODE , 0x04C12083 INT_EN , :
[22:54:19.6153795]Trace Packet, 00000056,  0xE1 40 ,     OPCODE_LOAD ,       STAGE_ALU , 0x00000001 INT_EN , :
[22:54:19.6163794]Trace Packet, 00000057,  0xF1 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6183798]Trace Packet, 00000058,  0x01 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6183798]Trace Packet, 00000059,  0x11 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6193799]Trace Packet, 00000060,  0x20 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6203802]Trace Packet, 00000061,  0x31 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000062,  0x43 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000063,  0x51 C0 ,     OPCODE_LOAD , STAGE_WRITEBACK , 0x00001CDC REG_WR  INT_EN , :

You can see some data given by the utility such as timestamps and a packet ID. Everything else is derived from flags in the trace data for that cycle.

Later I added some additional functionality, like parsing register destinations and outputting known register/memory values to aid when going over the output.

[22:54:19.6213808]Trace Packet, 00000062,  0x43 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000063,  0x51 C0 ,     OPCODE_LOAD , STAGE_WRITEBACK , 0x00001CDC REG_WR  INT_EN , :
  MEMORY 0x0000476C = 0x00001CDC
  REGISTER ra = 0x00001CDC

I have also been working on a Rust-based GUI debugger for these trace files, where you can look at known memory (usually the stack) and register file contents at a given packet by walking the packets up until the point you’re interested in. It was an excuse to get to know Rust a bit more, but it’s not completely functional and I use the command line C# version more.

The easiest use for this is the physical button for dumping the traces. When bringing up some new software on the SoC it rarely works first time and ends up in an infinite loop of some sort. Using the STAGE_FETCH packets, which contain the PC, I can look to an objdump and see immediately where we are executing without impacting upon the execution of the code itself.
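A minimal sketch of that lookup step, written in Python rather than the C# of the actual utility. It assumes the log lines keep the comma-separated layout shown above; the function name `fetch_pcs` is made up for this example.

```python
def fetch_pcs(log_lines):
    """Return the distinct PCs seen in STAGE_FETCH packets, in first-seen order."""
    pcs = []
    for line in log_lines:
        fields = [f.strip() for f in line.split(",")]
        # Assumed layout: prefix, packet id, raw flag bytes, opcode, stage,
        # "0x<data> FLAG FLAG ...", ":"
        if len(fields) < 6 or fields[4] != "STAGE_FETCH":
            continue
        data = fields[5].split()
        if not data:
            continue
        pc = int(data[0], 16)  # for fetch packets the data word is the PC
        if pc not in pcs:
            pcs.append(pc)
    return pcs
```

Printing each result with `f"{pc:08x}"` gives addresses in a form that can be searched for directly in an objdump listing.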

Using the data to debug issues

Now to spoil a bit of the upcoming RPU Interrupts/Zephyr post, because I think an example of a real problem the trace dumps helped solve is required.

After implementing external timer interrupts, invalid instruction interrupts, system calls – and fixing a ton of issues – I had the Zephyr Dining Philosophers sample running on RPU in all its threaded, synchronized glory.

Why do I need invalid instruction interrupts? Because RPU does not implement the RISC-V M extension, so multiply and divide hardware does not exist. Sadly, somewhere in the Zephyr build system, there is assembly with mul and div instructions. I needed invalid instruction interrupts in order to trap into an exception handler which could emulate the instruction in software and write the result back into the context, so that when we returned from the interrupt to PC+4 the new value for the destination register would be written back.
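To make the emulation step concrete, here is a sketch in Python of what such a handler has to do: decode the trapped word, compute the RV32M result, and write it into the saved register context. It is a model for illustration, not the handler running on RPU, and the function names are hypothetical. The field positions and the divide-by-zero and overflow results follow the RISC-V M extension definition.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit value as two's-complement signed."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def decode_m(inst):
    """Return (funct3, rd, rs1, rs2) for an RV32M instruction, else None."""
    if inst & 0x7F != 0x33 or (inst >> 25) & 0x7F != 0x01:
        return None
    return (inst >> 12) & 0x7, (inst >> 7) & 0x1F, (inst >> 15) & 0x1F, (inst >> 20) & 0x1F


def trunc_div(a, b):
    """Signed division rounding toward zero, as RISC-V requires."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q


def execute_m(funct3, a, b):
    """Compute the 32-bit result for register values a (rs1) and b (rs2)."""
    sa, sb = to_signed(a), to_signed(b)
    ua, ub = a & MASK32, b & MASK32
    if funct3 == 0:      # MUL: low 32 bits of the product
        r = ua * ub
    elif funct3 == 1:    # MULH: high bits, signed x signed
        r = (sa * sb) >> 32
    elif funct3 == 2:    # MULHSU: high bits, signed x unsigned
        r = (sa * ub) >> 32
    elif funct3 == 3:    # MULHU: high bits, unsigned x unsigned
        r = (ua * ub) >> 32
    elif funct3 == 4:    # DIV: divide by zero yields all ones
        r = -1 if sb == 0 else trunc_div(sa, sb)
    elif funct3 == 5:    # DIVU
        r = MASK32 if ub == 0 else ua // ub
    elif funct3 == 6:    # REM: divide by zero yields the dividend
        r = sa if sb == 0 else sa - sb * trunc_div(sa, sb)
    else:                # REMU
        r = ua if ub == 0 else ua % ub
    return r & MASK32


def emulate(inst, regs):
    """Apply one M-extension instruction to a saved context (list of 32 ints)."""
    decoded = decode_m(inst)
    if decoded is None:
        raise ValueError("not an RV32M instruction")
    funct3, rd, rs1, rs2 = decoded
    if rd != 0:          # x0 stays hard-wired to zero
        regs[rd] = execute_m(funct3, regs[rs1], regs[rs2])
```

As a check, `decode_m(0x02A5D5B3)` returns `(5, 11, 11, 10)`, which is `divu a1, a1, a0`; that word is the one that shows up in the DECODE packet of the trace later in this post. A real handler also has to advance the saved return address by 4, which this sketch leaves out.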

It’s pretty funny to think that for me, implementing that was easier than trying to fix a build system to compile for the architecture intended.

Anyway, I was performing long-running tests of dining philosophers, when I hit the fatal error exception handler for trying to emulate an instruction it didn’t understand. I was able to replicate it, but it could take hours of running before it happened. The biggest issue? The instruction we were trying to emulate was at PC 0x00000010 – the start of the exception handler!

So, I set up the CPU trace trigger to activate on the instruction that branches to print that “FATAL: Reg is bad” message, started the FPGA running, and left the C# app to capture any trace dumps. After a few hours the issue occurred, and we had our CPU trace of the 4096 cycles leading up to the fatal error. Some hundreds of cycles before the dump initiated, we have the following output.

Packet 00,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN   , :
Packet 01,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  EXT_INT   , :
Packet 02,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  LINT_INT  EXT_INT  EXT_INT_ACK   , :
Packet 03,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  LINT_INT  EXT_INT_ACK   , :
Packet 04,   STAGE_DECODE , 0x02A5D5B3 REG_WR  INT_EN  LINT_INT  EXT_INT_ACK   , :
Packet 05,      STAGE_ALU , 0x0000000B REG_WR  INT_EN  LINT_INT  EXT_INT_ACK  DECODE_INT  IS_ILLEGAL_INST , :
Packet 06,STAGE_WRITEBACK , 0xCDC1FEF1 INT_EN  LINT_INT  EXT_INT_ACK  DECODE_INT  IS_ILLEGAL_INST , :
Packet 07,    STAGE_STALL , 0x0000032C INT_EN  LINT_RST  LINT_INT  EXT_INT_ACK  DECODE_INT  IS_ILLEGAL_INST , :
Packet 08,    STAGE_STALL , 0x0000032C INT_EN  LINT_RST  LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 09,    STAGE_STALL , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 10,    STAGE_FETCH , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 11,    STAGE_FETCH , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 12,    STAGE_FETCH , 0x00000010 DECODE_INT  IS_ILLEGAL_INST , :
Packet 13,    STAGE_FETCH , 0x00000010 INT_EN  LINT_INT  DECODE_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 14,    STAGE_FETCH , 0x00000010 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 15,    STAGE_FETCH , 0x00000010 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 16,   STAGE_DECODE , 0xFB010113 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 17,      STAGE_ALU , 0x00000002 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 18,STAGE_WRITEBACK , 0x00004F60 REG_WR  INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 19,    STAGE_STALL , 0x00000014 REG_WR  INT_EN  LINT_RST  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 20,    STAGE_STALL , 0x00000014 REG_WR  INT_EN  LINT_RST  LINT_INT  IS_ILLEGAL_INST , :
Packet 21,    STAGE_STALL , 0xABCDEF01 REG_WR  INT_EN  LINT_INT  IS_ILLEGAL_INST , :
Packet 22,    STAGE_FETCH , 0xABCDEF01 REG_WR  INT_EN  LINT_INT  IS_ILLEGAL_INST , :
Packet 23,    STAGE_FETCH , 0xABCDEF01 REG_WR  INT_EN  LINT_INT  IS_ILLEGAL_INST , :
Packet 24,    STAGE_FETCH , 0xABCDEF01 REG_WR  IS_ILLEGAL_INST , :
Packet 25,    STAGE_FETCH , 0x00000078, :

What on earth is happening here? This is a lesson as to why interrupts have priorities 🙂

Packet 00,    FETCH, 0x00000328 REG_WR 
Packet 01,    FETCH, 0x00000328 REG_WR                     EXT_INT 
Packet 02,    FETCH, 0x00000328 REG_WR           LINT_INT  EXT_INT  EXT_INT_ACK 
Packet 03,    FETCH, 0x00000328 REG_WR           LINT_INT           EXT_INT_ACK 
Packet 05,      ALU, 0x0000000B REG_WR           LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 06,WRITEBACK, 0xCDC1FEF1                  LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 07,    STALL, 0x0000032C        LINT_RST  LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 08,    STALL, 0x0000032C        LINT_RST  LINT_INT                        DECODE_INT                  IS_ILLEGAL_INST
Packet 09,    STALL, 0x00000010                  LINT_INT                        DECODE_INT                  IS_ILLEGAL_INST
Packet 12,    FETCH, 0x00000010                                                  DECODE_INT                  IS_ILLEGAL_INST
Packet 13,    FETCH, 0x00000010                  LINT_INT                        DECODE_INT  DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 14,    FETCH, 0x00000010                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST 
Packet 16,   DECODE, 0xFB010113                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 17,      ALU, 0x00000002                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 18,WRITEBACK, 0x00004F60 REG_WR           LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 19,    STALL, 0x00000014 REG_WR LINT_RST  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 20,    STALL, 0x00000014 REG_WR LINT_RST  LINT_INT                                                    IS_ILLEGAL_INST
Packet 21,    STALL, 0xABCDEF01 REG_WR           LINT_INT                                                    IS_ILLEGAL_INST
Packet 24,    FETCH, 0xABCDEF01 REG_WR                                                                       IS_ILLEGAL_INST
Packet 25,    FETCH, 0x00000078

I’ve tried to reduce the trace down to the minimum and lay it out so it makes sense. There are a few things you need to know about the RPU exception system which have yet to be discussed:

Each core has a Local Interrupt Controller (LINT) which can accept interrupts at any stage of execution, provide the ACK signal to let the requester know it’s been accepted, and then at a safe point pass it on to the Control Unit to initiate transfer of execution to the exception vector. This transfer can only happen after a writeback, hence the STALL stages as it’s set up before fetching the first instruction of the exception vector at 0x00000010. If the LINT sees external interrupt requests (EXT_INT – timer interrupts) at the same time as decoder interrupts for an invalid instruction, it will always choose the decoder above anything else – as that needs to be handled immediately.

And here is what happens above:

  1. We are fetching PC 0x00000328, which happens to be an unsupported instruction which will be emulated by our invalid instruction handler.
  2. As we are fetching, an external timer interrupt fires (Packet 01)
  3. The LINT acknowledges the external interrupt as there is no higher priority request pending, and signals to the control unit that an int is pending via LINT_INT (Packet 02)
  4. As we wait for the WRITEBACK phase for the control unit to transfer to the exception vector, PC 0x00000328 decodes as an illegal instruction and DECODE_INT is requested (Packet 05)
  5. LINT cannot acknowledge the decoder int as the control unit can only handle a single interrupt at a time, and it’s waiting to handle the external interrupt.
  6. The control unit accepts the external LINT_INT, and stalls for transfer to exception vector, and resets LINT so it can accept new requests (Packet 7).
  7. We start fetching the interrupt vector 0x00000010 (Packet 12)
  8. The LINT sees the DECODE_INT and immediately accepts and acknowledges.
  9. The control unit accepts the LINT_INT, stalls for transfer to exception vector, with the PC of the exception being set to 0x00000010 (Packet 20).
  10. Everything breaks: the PC gets set to a value in flux, which just so happened to be in the exception vector (Packet 25).

In short, if an external interrupt fires during the fetch stage of an illegal instruction, the illegal instruction will not be handled correctly and state is corrupted.

Easily fixed with some further enable logic for external interrupts to only be accepted after fetch and decode. But one hell of an issue to find without the CPU trace dumps!
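The arbitration rule and the fix can be modelled in a few lines. This is a behavioural sketch in Python, not the VHDL in RPU: the stage names follow the trace output, while the function name `lint_select` and the `gate_external` switch are made up so the before and after behaviour can be compared.

```python
def lint_select(stage, ext_req, decode_req, busy, gate_external=True):
    """Pick which pending request the interrupt controller acknowledges.

    Returns "decode", "external" or None for the current cycle.
    """
    if busy:
        return None            # only one interrupt can be in flight
    if decode_req:
        return "decode"        # decoder interrupts always win
    if ext_req:
        if gate_external and stage in ("FETCH", "DECODE"):
            return None        # the fix: hold external requests for now
        return "external"
    return None
```

With `gate_external=False` the model acknowledges a timer interrupt during FETCH, which is the situation in Packet 02 of the trace. With the gate enabled the request stays pending until the instruction is past decode, by which point an illegal instruction has already raised its own, higher priority, request.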

Finishing up

So, as you can see, trace dumps are a great feature to have in RPU. A very simple implementation can yield enough information to work with on problems where the simulator just is not viable. With different trigger options, and the ability to customize the O_DBG signal to further narrow down issues under investigation, it’s invaluable. In fact, I’ll probably end up putting this system into any similarly complex FPGA project in the future. The HDL will shortly be submitted to the SoC GitHub repo along with the updated core which supports interrupts.

Thanks for reading, as always – feel free to send me any follow-up questions on twitter @domipheus.

The posts about how the interrupts were fully integrated into RPU are on their way!

Raspberry Pi 4 PCI Express: It actually works! USB3, SATA… GPUs?


Recently, Tomasz Mloduchowski posted a popular article on his blog detailing the steps he undertook to get access to the hidden PCIe interface of Raspberry Pi 4: the first Raspberry Pi to include PCIe in its design. After seeing his post, and realizing I was meaning to go buy a Raspberry Pi 4, it just seemed natural to try and replicate his results in the hope of taking it a bit further. I am known for Raspberry Pi Butchery, after all.

Before I tried desoldering anything, I set up my Pi for remote use; enabling SSH, WiFi, serial UART+ boot messages. The USB ports on the Pi board will not function after this modification, so this is super important.

Desoldering the USB3 chipset

As Tomasz lays out in his article, we need to remove the VL805 USB3 chip in order to access the PCIe interface. I used a hot air soldering station at low volume and medium-high temperature, with a small nozzle head in order to not disturb the components nearby. I used flux along the edges and after a while the chip came away.

I tried to remove the solder from the large pad by mixing some low-temperature solder paste into what’s there, but it’s not needed. Just cover it with Kapton or, failing that, electrical tape. I used electrical tape just to make the small wires I’d be soldering above it easier to see.

The pins to solder

The VL805 datasheet is confidential which makes posting parts of it here tricky. However, an image search for “VL805-Q6 QFN68” may yield interesting results for those interested in finding out more. The pins we are interested in, are as follows (note differing polarity from Tomasz’s work):

I used 0.1mm enameled wire, with each differential pair cut to nearly the same length. Tinning the end of the wire by scraping with a knife and dipping in a molten solder ball makes soldering to the pads we need easier. Holding the wires down with Kapton tape, and using flux with the smallest iron tip I had, made the job just bearable under a microscope.

The rather untidy looking result:

First attempts

I was using a cheap PCIe riser as my “first interface” which was then to be connected to a PCIe switch card, which would then enable another 4 PCIe slots. This first socket, as Tomasz mentioned in his article, needs the PCIe Reset pin pulled to 3v3, and both Reset and Wake signal traces to the USB socket cut. Note that whilst we now know where the PCIe Reset line is on the Pi, I have not needed to connect this as yet.

The first attempt to boot with this setup resulted in the Pi not managing to boot at all. After some wiggling of the PCIe slot, the Raspberry Pi booted, but no devices were shown when running lspci (lspci can be installed via apt-get). The third attempt, however, after some professional wiggling of the PCIe slot, resulted in success! A booted Pi, with a PCIe switch!

[email protected]:~ $ sudo lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 10)
01:00.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port

However, no devices were detected beyond the ASM1184e switch. Even a USB3 PCIe card using the same VL805 chip that I removed refused to be detected. Running dmesg on the Pi to get some driver details, I saw that whilst the PCIe link was active, and some busses were being assigned to the switch – it said that devices behind the bridge would not be usable due to bus IDs.

pci 0000:01:00.0: [1b21:1184] type 01 class 0x060400
pci 0000:01:00.0: enabling Extended Tags
pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
PCI: bus1: Fast back to back transfers disabled
pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci_bus 0000:02: busn_res: can not insert [bus 02-01] under [bus 01] (conflicts with (null) [bus 01])
PCI: bus2: Fast back to back transfers enabled
pci_bus 0000:02: busn_res: [bus 02-01] end is updated to 02
pci_bus 0000:02: busn_res: can not insert [bus 02] under [bus 01] (conflicts with (null) [bus 01])
pci 0000:01:00.0: devices behind bridge are unusable because [bus 02] cannot be assigned for them
pci_bus 0000:01: busn_res: [bus 01] end can not be updated to 02

The great thing about Linux is that the source code is just there to dig into, and after finding where that “devices behind bridge are unusable” warning is printed I further discovered that the range of assignable busses can be limited by the Device Tree that Linux uses.

Device Trees

Device trees are simply a description of the hardware which is passed to the Linux kernel on boot. A device tree has all the devices listed: their driver compatibilities, memory mappings, and configuration. It is particularly useful for describing peripherals which may not be discoverable via conventional means. In the root directory of your Raspbian Raspberry Pi boot SD volume, you will find bcm2711-rpi-4-b.dtb – which is the compiled Device Tree binary. This binary blob is not user readable, but thankfully we can use the Device Tree Compiler to decompile it into a readable form. I did all of this within Windows Subsystem for Linux.

dtc -I dtb -O dts -o bcm2711-rpi-4-b_edit.dts bcm2711-rpi-4-b.dtb

That command will decompile into bcm2711-rpi-4-b_edit.dts. Searching this file for “pci” we find an entry for the PCIe – and it has a definition for bus-range which limits the bus IDs from 0 to 1. I changed the entry from “<0x0 0x1>” to “<0x0 0xff>”, recompiled to the binary form, and placed that on the Raspbian SD card, overwriting the default.
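I made the edit by hand, but it could also be scripted. The following is a sketch under the assumption that the property appears in the decompiled file in the form quoted above; `widen_bus_range` is a hypothetical helper, and it rewrites every bus-range property it finds, so the returned count is worth checking before recompiling.

```python
import re

# Matches `bus-range = <first last>` and captures everything except `last`.
BUS_RANGE = re.compile(r"(bus-range\s*=\s*<\s*[^\s>]+\s+)[^\s>]+(\s*>)")


def widen_bus_range(dts_text, last_bus=0xFF):
    """Return (new_text, count) with the upper bus limit set to last_bus."""
    return BUS_RANGE.subn(
        lambda m: f"{m.group(1)}{last_bus:#x}{m.group(2)}", dts_text
    )
```

Run over the text of bcm2711-rpi-4-b_edit.dts, this should report a count of 1 and leave the rest of the file untouched.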

Command to recompile:

dtc -I dts -O dtb -o bcm2711-rpi-4-b.dtb bcm2711-rpi-4-b_edit.dts 

Success!

4 More PCIe busses!

And then I plugged some VL805-based USB3 cards in (one with a chained Network Interface). The setup looks as follows:

I have my Motorola LapDock USB hub connected to the Pi via the PCIe USB controller. The keyboard and trackpad work great!

Other Devices

I have a SATA controller based on a JMicron JMB363, which is detected correctly but there is no driver to load. This will require some Linux driver/kernel fiddling to get a driver loaded correctly – but it’s very promising!

[email protected]:~$ lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 10)
01:00.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:01.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:03.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:05.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:07.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
03:00.0 SATA controller: JMicron Technology Corp. JMB363 SATA/IDE Controller (rev 03)
05:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)
06:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)
[email protected]:~$ lspci -t
-[0000:00]---00.0-[01-06]----00.0-[02-06]--+-01.0-[03]----00.0
                                           +-03.0-[04]--
                                           +-05.0-[05]----00.0
                                           \-07.0-[06]----00.0
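The tree above also shows why the original limit of two bus numbers could never work with a switch attached. As a worked count (the function name is just for illustration): the root bridge sits on bus 00, the link to the switch upstream port is bus 01, the switch has an internal bus 02, and each downstream port gets a bus of its own whether or not a card is plugged in.

```python
def buses_needed(downstream_ports):
    """Bus numbers consumed by a root port with one PCIe switch behind it."""
    root_bus = 1         # bus 00: where the root bridge lives
    upstream_link = 1    # bus 01: root bridge to switch upstream port
    switch_internal = 1  # bus 02: virtual bus inside the switch
    return root_bus + upstream_link + switch_internal + downstream_ports
```

For the four downstream ports of the ASM1184e this gives 7 bus numbers, 00 to 06, matching the `[01-06]` range in the tree, whereas a bus-range of 0 to 1 only allows two.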

I also have tried some other fairly hilarious setups, including the following with a Radeon HD 7990 GPU, and another with a GTX 1060.

I’ll leave this here for now, whilst I read up on the Linux driver stack and how to build kernels for the Raspberry Pi 🙂

[email protected]:/home/pi# lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 10)
01:00.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:01.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:03.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:05.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:07.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
04:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
04:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
05:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)
06:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)

Thanks for reading! You can find me, as always, over on twitter @domipheus. Additionally, thanks to Tomasz Mloduchowski for his previous blogs which spurred my interest! I don’t fully understand why Tomasz had kernel panics using a VL805 board, but maybe it’s something to do with the Device Tree, and the fact I also had a PCIe switch.

Updates: