Ads have been removed from these pages.
Instead, please consider these charities for donation:

  • CLIC Sargent is the UK's leading cancer charity for children, young people and their families. Donate.
  • Blood Cancer UK is a UK based charity dedicated to funding research into all blood cancers including leukaemia, lymphoma and myeloma. Donate (Formerly Bloodwise).
  • Children’s Cancer and Leukaemia Group are a leading children’s cancer charity and the UK and Ireland’s professional association for those involved in the treatment and care of children with cancer. Donate.

Designing a RISC-V CPU in VHDL, Part 19: Adding Trace Dump Functionality

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

For those who follow me on twitter, you’ll have seen my recent tweets regarding Zephyr OS running on RPU. This was a huge amount of work to get running, most of it debugging on the FPGA itself. For those new to FPGA development, trying to debug on-chip can be a very difficult and frustrating experience. Generally, you want to debug in the simulator – but when potential issues are influenced by external devices such as SD cards, timer interrupts, and hundreds of millions of cycles into the boot process of an operating system – simulators may not be feasible.

Blog posts on the features I added to RPU to enable Zephyr booting, such as proper interrupts, exceptions and timers are coming – but it would not have been possible without a feature of the RPU SoC I have not yet discussed.

CPU Tracing

Most real processors will have hardware features built in, and one of the most useful low-level tools is tracing. This is when at an arbitrary time slice, low level details on the inner operation of the core are captured into some buffer, before being streamed elsewhere for analysis and state reconstruction later.

Note that this is a one-way flow of data. It is not interactive, like the debugging most developers know. It is mostly used for performance profiling but for RPU would be an ideal debugging aid.

Requirements

For the avoidance of doubt; I’m defining “A Trace” to be one block of valid data which is dumped to a host PC for analysis. For us, dumping will be streaming the data out via UART to a development PC. Multiple traces can be taken, but when the data transfer is initiated, the data needs to be a real representation of what occurred immediately preceding the request to dump the trace. The data contained in a trace is always being captured on the device in order that if a request is made, the data is available.

These requirements require a circular buffer which is continually recording the state. I’ll define exactly what the data is later – but for now, the data is defined as 64-bits per cycle. Plenty for a significant amount of state to be recorded, which will be required in order to perform meaningful analysis. We have a good amount of block rams on our Spartan 7-50 FPGA, so we can dedicate 32KB to this circular buffer quite easily. 64-bits into 32KB gives us 4,096 cycles of data. Not that much you’d think for a CPU running at over 100MHz, but you’d be surprised how quickly RPU falls over when it gets into an invalid state!

It goes without saying that our implementation needs to be non-intrusive. I’m not currently using the UART connected to the FTDI USB controller, as our logging output is displayed graphically via a text-mode display over HDMI. We can use this without impacting existing code. Our CPU core will expose a debug trace bus signal, which will be the data captured.

We’ve mentioned the buffer will be in a block ram; but one aspect of this is that we must be wary of the observer effect. This issue is very much an issue for performance profiling, as streaming out data from various devices usually goes through memory subsystems which will increase bandwidth requirements, and lead to more latency in the memory operations you are trying to trace. Our trace system should not effect the execution characteristics of the core at all. As we are using a development PC to receive the streamed data, we can completely segregate all data paths for the trace system, and remove the block ram from the memory mapped area which is currently used for code and data. With this block ram separate, we can ensure it’s set up as a true dual port ram with data width the native 64bit. One port will be for writing data from the CPU, on the CPU clock domain. The second port will be used for reading the data out at a rate which is dictated by the UART serial baud – much, much, slower. Doing this will ensure tracing will not impact execution of the core at any point, meaning our dumped data is much more valuable.

Lastly, we want to trigger these dumps at a point in time when we think an issue has occurred. Two immediate trigger types come to mind in addition to a manual button.

  1. Memory address
  2. Comparison with the data which is to be dumped; i.e, pipeline status flags combined with instruction types.

Implementation

The implementation is very simple. I’ve added a debug signal output to the CPU core entity. It’s 64 bits of data consisting of 32 bits of status bits, and a 32-bit data value as defined below.

This data is always being output by the core, changing every cycle. The data value can be various things; the PC when in a STAGE_FETCH state, the ALU result, the value we’re writing to rD in WRITEBACK, or a memory location during a load/store.

We only need two new processes for the system:

  • trace_streamout: manages the streaming out of bytes from the trace block ram
  • trace_en_check: inspects trigger conditions in order to initiate a trace dump which trace_streamout will handle

The BRAM used as the circular trace buffer is configured as 64-bits word length, with 4096 addresses. It was created using the Block Memory Generator, and has a read latency of 2 cycles.

We will use a clock cycle counter which already exists for dictating write locations into the BRAM. As it’s used as a circular buffer, we simply take the lower 12 bits of the clock counter as address into the BRAM.

Port A of the BRAM is the write port, with it’s address line tied to the bits noted above. It is enabled by a signal only when the trace_streamout process is idle. This is so when we do stream out the data we want, it’s not polluted with new data while our slow streamout to UART is active. That new data is effectively lost. As this port captures the cpu core O_DBG output, it’s clocked at the CPU core clock.

Port B is the read port. It’s clocked using the 100MHz reference clock (which also drives the UART – albeit then subsampled via a baud tick). It’s enabled when a streamout state is requested, and reads an address dictated by the trace_streamout process.

The trace_streamout process, when the current streamout state is idle, checks for a dump_enable signal. Upon seeing this signal, the last write address is latched from the lower cycle counter 12 bits. We also set a streamout location to be that last write +1. This location is what is fed into Port B of the BRAM/ circular trace buffer. When we change the read address on port B, we wait some cycles for the value to properly propagate out. During this preload stall, we also wait for the UART TX to become ready for more data. The transmission is performed significantly slower than the clock that trace_streamout runs at, and we cannot write to the TX buffer if it’s full.

The UART I’m using is provided by Xilinx and has an internal 16-byte buffer. We wait for a ready signal as then we know that writing our 8 bytes of debug data (remember, 64-bit) quickly into the UART TX will succeed. In addition to the 8 bytes of data, I also send 2 bytes of magic number data at the start of every 64-bit packet as an aid to the receiving logic; we can check the first two bytes for these values to ensure we’re synced correctly in order to parse the data eventually.

After the last byte is written, we increment our streamout location address. If it’s not equal to the last write address we latched previously, we move to the preload stall and move the next 8 bytes of trace data out. Otherwise, we are finished transmitting the entire trace buffer, so set out state back to idle and re-enable new trace data writes.

Triggering streamout

Triggering a dump using dump_enable can be done a variety of ways. I have a physical push-button on my Arty S7 board set to always enable a dump, which is useful to know where execution currently is in a program. I have also got a trigger on reading a certain memory address. This is good if there is an issue triggering an error which you can reliably track to a branch of code execution. Having a memory address in that code branch used as trigger will dump the cycles leading up to that branch being taken. There are one other types of trigger – relying on the cpu O_DBG signal itself, for example, triggering a dump when we encounter an decoder interrupt for an invalid instruction.

I hard-code these triggers in the VHDL currently, but it’s feasible that these can be configurable programmatically. The dump itself could also be triggered via a write to a specific MMIO location.

Parsing the data on the Debug PC

The UART TX on the FPGA is connected to the FTDI USB-UART bridge, which means when the FPGA design is active and the board is connected via USB, we can just open the COM port exposed via the USB device.

I made a simple C# command line utility which just dumps the packets in a readable form. It looks like this:

[22:54:19.6133781]Trace Packet, 00000054,  0xC3 40 ,   OPCODE_BRANCH ,     STAGE_FETCH , 0x000008EC INT_EN , :
[22:54:19.6143787]Trace Packet, 00000055,  0xD1 40 ,   OPCODE_BRANCH ,    STAGE_DECODE , 0x04C12083 INT_EN , :
[22:54:19.6153795]Trace Packet, 00000056,  0xE1 40 ,     OPCODE_LOAD ,       STAGE_ALU , 0x00000001 INT_EN , :
[22:54:19.6163794]Trace Packet, 00000057,  0xF1 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6183798]Trace Packet, 00000058,  0x01 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6183798]Trace Packet, 00000059,  0x11 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6193799]Trace Packet, 00000060,  0x20 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6203802]Trace Packet, 00000061,  0x31 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000062,  0x43 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000063,  0x51 C0 ,     OPCODE_LOAD , STAGE_WRITEBACK , 0x00001CDC REG_WR  INT_EN , :

You can see some data given by the utility such as timestamps and a packet ID. Everything else is derived from flags in the trace data for that cycle.

Later I added some additional functionality, like parsing register destinations and outputting known register/memory values to aid when going over the output.

[22:54:19.6213808]Trace Packet, 00000062,  0x43 C0 ,     OPCODE_LOAD ,    STAGE_MEMORY , 0x0000476C REG_WR  INT_EN , :
[22:54:19.6213808]Trace Packet, 00000063,  0x51 C0 ,     OPCODE_LOAD , STAGE_WRITEBACK , 0x00001CDC REG_WR  INT_EN , :
  MEMORY 0x0000476C = 0x00001CDC
  REGISTER ra = 0x00001CDC

I have also been working on a rust-based GUI debugger for these trace files, where you can look at known memory (usually the stack) and register file contents at a given packet by walking the packets up until the point you’re interested in. It was an excuse to get to know Rust a bit more, but it’s not completely functional and I use the command line C# version more.

The easiest use for this is the physical button for dumping the traces. When bringing up some new software on the SoC it rarely works first time and end up in an infinite loop of some sort. Using the STAGE_FETCH packets which contain the PC I can look to an objdump and see immediately where we are executing without impacting upon the execution of the code itself.

Using the data to debug issues

Now to spoil a bit of the upcoming RPU Interrupts/Zephyr post with an example of how these traces have helped me. But I think an example of a real problem the trace dumps helped solve is required.

After implementing external timer interrupts, invalid instruction interrupts, system calls – and fixed a ton of issues – I had the Zephyr Dining Philosophers sample running on RPU in all it’s threaded, synchronized, glory.

Why do I need invalid instruction interrupts? Because RPU does not implement the M RISC-V extension. So multiply and divide hardware does not exist. Sadly, somewhere in the Zephyr build system, there is assembly with mul and div instructions. I needed invalid instruction interrupts in order to trap into an exception handler which could software emulate the instruction, write the result back into the context, so that when we returned from the interrupt to PC+4 the new value for the destination register would be written back.

It’s pretty funny to think that for me, implementing that was easier than trying to fix a build system to compile for the architecture intended.

Anyway, I was performing long-running tests of dining philosophers, when I hit the fatal error exception handler for trying to emulate an instruction it didn’t understand. I was able to replicate it, but it could take hours of running before it happened. The biggest issue? The instruction we were trying to emulate was at PC 0x00000010 – the start of the exception handler!

So, I set up the CPU trace trigger to activate on the instruction that branches to print that “FATAL: Reg is bad” message, started the FPGA running, and left the C# app to capture any trace dumps. After a few hours the issue occurred, and we had our CPU trace of the 4096 cycles leading up to the fatal error. Some hundreds of cycles before the dump initiated, we have the following output.

Packet 00,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN   , :
Packet 01,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  EXT_INT   , :
Packet 02,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  LINT_INT  EXT_INT  EXT_INT_ACK   , :
Packet 03,    STAGE_FETCH , 0x00000328 REG_WR  INT_EN  LINT_INT  EXT_INT_ACK   , :
Packet 04,   STAGE_DECODE , 0x02A5D5B3 REG_WR  INT_EN  LINT_INT  EXT_INT_ACK   , :
Packet 05,      STAGE_ALU , 0x0000000B REG_WR  INT_EN  LINT_INT  EXT_INT_ACK  DECODE_INT  IS_ILLEGAL_INST , :
Packet 06,STAGE_WRITEBACK , 0xCDC1FEF1 INT_EN  LINT_INT  EXT_INT_ACK  DECODE_INT  IS_ILLEGAL_INST , :
Packet 07,    STAGE_STALL , 0x0000032C INT_EN  LINT_RST  LINT_INT  EXT_INT_ACK  DECODE_INT  IS_ILLEGAL_INST , :
Packet 08,    STAGE_STALL , 0x0000032C INT_EN  LINT_RST  LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 09,    STAGE_STALL , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 10,    STAGE_FETCH , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 11,    STAGE_FETCH , 0x00000010 LINT_INT  DECODE_INT  IS_ILLEGAL_INST , :
Packet 12,    STAGE_FETCH , 0x00000010 DECODE_INT  IS_ILLEGAL_INST , :
Packet 13,    STAGE_FETCH , 0x00000010 INT_EN  LINT_INT  DECODE_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 14,    STAGE_FETCH , 0x00000010 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 15,    STAGE_FETCH , 0x00000010 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 16,   STAGE_DECODE , 0xFB010113 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 17,      STAGE_ALU , 0x00000002 INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 18,STAGE_WRITEBACK , 0x00004F60 REG_WR  INT_EN  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 19,    STAGE_STALL , 0x00000014 REG_WR  INT_EN  LINT_RST  LINT_INT  DECODE_INT_ACK  IS_ILLEGAL_INST , :
Packet 20,    STAGE_STALL , 0x00000014 REG_WR  INT_EN  LINT_RST  LINT_INT  IS_ILLEGAL_INST , :
Packet 21,    STAGE_STALL , 0xABCDEF01 REG_WR  INT_EN  LINT_INT  IS_ILLEGAL_INST , :
Packet 22,    STAGE_FETCH , 0xABCDEF01 REG_WR  INT_EN  LINT_INT  IS_ILLEGAL_INST , :
Packet 23,    STAGE_FETCH , 0xABCDEF01 REG_WR  INT_EN  LINT_INT  IS_ILLEGAL_INST , :
Packet 24,    STAGE_FETCH , 0xABCDEF01 REG_WR  IS_ILLEGAL_INST , :
Packet 25,    STAGE_FETCH , 0x00000078, :

What on earth is happening here? This is a lesson as to why interrupts have priorities 🙂

Packet 00,    FETCH, 0x00000328 REG_WR 
Packet 01,    FETCH, 0x00000328 REG_WR                     EXT_INT 
Packet 02,    FETCH, 0x00000328 REG_WR           LINT_INT  EXT_INT  EXT_INT_ACK 
Packet 03,    FETCH, 0x00000328 REG_WR           LINT_INT           EXT_INT_ACK 
Packet 05,      ALU, 0x0000000B REG_WR           LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 06,WRITEBACK, 0xCDC1FEF1                  LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 07,    STALL, 0x0000032C        LINT_RST  LINT_INT           EXT_INT_ACK  DECODE_INT                  IS_ILLEGAL_INST
Packet 08,    STALL, 0x0000032C        LINT_RST  LINT_INT                        DECODE_INT                  IS_ILLEGAL_INST
Packet 09,    STALL, 0x00000010                  LINT_INT                        DECODE_INT                  IS_ILLEGAL_INST
Packet 12,    FETCH, 0x00000010                                                  DECODE_INT                  IS_ILLEGAL_INST
Packet 13,    FETCH, 0x00000010                  LINT_INT                        DECODE_INT  DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 14,    FETCH, 0x00000010                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST 
Packet 16,   DECODE, 0xFB010113                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 17,      ALU, 0x00000002                  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 18,WRITEBACK, 0x00004F60 REG_WR           LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 19,    STALL, 0x00000014 REG_WR LINT_RST  LINT_INT                                    DECODE_INT_ACK  IS_ILLEGAL_INST
Packet 20,    STALL, 0x00000014 REG_WR LINT_RST  LINT_INT                                                    IS_ILLEGAL_INST
Packet 21,    STALL, 0xABCDEF01 REG_WR           LINT_INT                                                    IS_ILLEGAL_INST
Packet 24,    FETCH, 0xABCDEF01 REG_WR                                                                       IS_ILLEGAL_INST
Packet 25,    FETCH, 0x00000078

I’ve tried to reduce the trace down to minimum and lay it out so it makes sense. There are a few things you need to know about the RPU exception system which have yet to be discussed:

Each Core has a Local Interrupt Controller (LINT) which can accept interrupts at any stage of execution, provide the ACK signal to let the requester know it’s been accepted, and then at a safe point pass it on to the Control Unit to initiate transfer of execution to the exception vector. This transfer can only happen after a writeback, hence the STALL stages as it’s set up before fetching the first instruction of the exception vector at 0x00000010. If the LINT sees external interrupts requests (EXT_INT – timer interrupts) at the same time as decoder interrupts for invalid instruction, it will always choose the decoder above anything – as that needs immediately handled.

And here is what happens above:

  1. We are fetching PC 0x00000328, which happens to be an unsupported instruction which will be emulated by our invalid instruction handler.
  2. As we are fetching, and external timer interrupt fires (Packet 01)
  3. The LINT acknoledges the external interrupt as there is no higher priority request pending, and signals to the control unit an int is pending LINT_INT (Packet 2)
  4. As we wait for the WRITEBACK phase for the control unit to transfer to exception vector, PC 0x00000328 decodes as an illegal instruction and DECODER_INT is requested (Packet 5)
  5. LINT cannot acknowledge the decoder int as the control unit can only handle a single interrupt at a time, and its waiting to handle the external interrupt.
  6. The control unit accepts the external LINT_INT, and stalls for transfer to exception vector, and resets LINT so it can accept new requests (Packet 7).
  7. We start fetching the interrupt vector 0x00000010 (Packet 12)
  8. The LINT sees the DECODE_INT and immediately accepts and acknowledges.
  9. The control unit accepts the LINT_INT, stalls for transfer to exception vector, with the PC of the exception being set to 0x00000010 (Packet 20).
  10. Everything breaks, the PC get set to a value in flux, which just so happened to be in the exception vector (Packet 25).

In short, if an external interrupt fires during the fetch stage of an illegal instruction, the illegal instruction will not be handled correctly and state is corrupted.

Easily fixed with some further enable logic for external interrupts to only be accepted after fetch and decode. But one hell is an issue to find without the CPU trace dumps!

Finishing up

So, as you can see, trace dumps are an great feature to have in RPU. A very simple implementation can yield enough information to work with on problems where the simulator just is not viable. With different trigger options, and the ability to customize the O_DBG signal to further narrow down issues under investigation, it’s invaluable. In fact, I’ll probably end up putting this system into any similarly complex FPGA project in the future. The HDL will shortly be submitted to the SoC github repo along with the updated core which supports interrupts.

Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.

The posts about how the interrupts were fully integrated into RPU are on their way!

Raspberry Pi 4 PCI Express: It actually works! USB3, SATA… GPUs?


Recently, Tomasz Mloduchowski posted a popular article on his blog detailing the steps he undertook to get access to the hidden PCIe interface of Raspberry Pi 4: the first Raspberry Pi to include PCIe in its design. After seeing his post, and realizing I was meaning to go buy a Raspberry Pi 4, it just seemed natural to try and replicate his results in the hope of taking it a bit further. I am known for Raspberry Pi Butchery, after all.

Before I tried desoldering anything, I set up my Pi for remote use; enabling SSH, WiFi, serial UART+ boot messages. The USB ports on the Pi board will not function after this modification, so this is super important.

Desoldering the USB3 chipset

As Tomasz lays out in his article, we need to remove the VL805 USB3 chip in order to access the PCIe interface. I used a hot air soldering station at low volume and medium-high temperature, with small nozzle head in order to not disturb the components nearby. I used flux along the edges and after a while the chip came away.

I tried to remove the solder from the large pad by mixing some low-temperature solder paste into whats there, but it’s not needed. Just cover it with capton or poor electrical tape. I used electrical tape just to make seeing the small wires I’d be soldering above it easier to see.

The pins to solder

The VL805 datasheet is confidential which makes posting parts of it here tricky. However, an image search for “VL805-Q6 QFN68” may yield interesting results for those interested in finding out more. The pins we are interested in, are as follows (note differing polarity from Tomasz’s work):

I used 0.1mm enameled wire, with each differential pair cut to nearly the same length. Tinning the end of the wire by scraping with a knife and dipping in a molten solder ball makes soldering to the pads we need easier. Holding the wires down with kapton tape, and using flux, with the smallest iron tip I had made the job just bearable under a microscope.

The rather untidy looking result:

First attempts

I was using a cheap PCIe riser as my “first interface” which was then to be connected to a PCIe switch card, which would then enable another 4 PCie slots. This first socket, as Tomasz mentioned in his article, needs the PCIe Reset pin pulled to 3v3, and both Reset and Wake signal traces to the USB socket cut. Note that whilst we now know where the PCIE Reset line is on the Pi, I have not needed to connect this as yet.

The first attempt to boot with this setup resulted in the Pi not managing to boot at all. After some wiggling of the PCIe slot, the raspberry Pi booted, but no devices were shown when running lspci (lspci can be installed via apt-get). The third attempt, however, after some professional wiggling of the PCIe slot, resulted in success! A booted Pi, with a PCIe switch!

[email protected]:~ $ sudo lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 10)
01:00.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port

However, no devices were detected beyond the ASM1184e switch. Even a USB3 PCIe card using the same VL805 chip that I removed refused to detect. Running dmesg on the pi to get some driver details, I saw that whilst the PCIe link was active, and some busses were being assigned to the switch – it said that devices behind the bridge would not be usable due to bus IDs.

pci 0000:01:00.0: [1b21:1184] type 01 class 0x060400
pci 0000:01:00.0: enabling Extended Tags
pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
PCI: bus1: Fast back to back transfers disabled
pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci_bus 0000:02: busn_res: can not insert [bus 02-01] under [bus 01] (conflicts with (null) [bus 01])
PCI: bus2: Fast back to back transfers enabled
pci_bus 0000:02: busn_res: [bus 02-01] end is updated to 02
pci_bus 0000:02: busn_res: can not insert [bus 02] under [bus 01] (conflicts with (null) [bus 01])
pci 0000:01:00.0: devices behind bridge are unusable because [bus 02] cannot be assigned for them
pci_bus 0000:01: busn_res: [bus 01] end can not be updated to 02

The great thing about linux is that the source code is just there to dig into, and after finding where that “devices behind bridge are unusable” warning is printed I further discovered that the range of assignable busses can be limited by the Device Tree linux uses.

Device Trees

Device trees are simply a description of the hardware which is passed to the Linux kernel on boot. It has all the devices listed; their driver compatabilities, memory mappings, and configuration. It is particularly useful for describing peripherals which may not be discoverable via conventional means. On the root directory of your Raspbian Raspberry Pi boot SD volume, you will find bcm2711-rpi-4-b.dtb – which is the Compiled Device Tree binary. This binary blob is not user readable, but thankfully we can use the Device Tree Compiler to decompile it into a readable form. I did all of this within Windows Subsystem for Linux.

dtc -I dtb -O dts -o bcm2711-rpi-4-b_edit.dts bcm2711-rpi-4-b.dtb

That command will decompile into bcm2711-rpi-4-b_edit.dts. Searching this file for “pci” we find an entry for the PCIe – and it has a definition for bus-range which limits the bus IDs from 0 to 1. I change the entry from “<0x0 0x1>” to “<0x0 0xff>” and recompile to the binary form, and place that on the Raspbian SD card overwriting the default.

Command to recompile:

dtc -I dts -O dtb -o bcm2711-rpi-4-b.dtb bcm2711-rpi-4-b_edit.dts 

Success!

4 More PCIe busses!

And then I plugged some VL805-based USB3 cards in (one with a chained Network Interface). The setup looks as follows:

I have my Motorola LapDock USB hub connected to the Pi via the PCIe USB controller. The keyboard and trackpad work great!

Other Devices

I have a SATA controller based on a JMicron JMB363, which is detected correctly but there is no driver to load. This will require some linux driver/kernel fiddling to get a driver loaded correctly – but it’s very promising!

[email protected]:~$ lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 10)
01:00.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:01.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:03.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:05.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:07.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
03:00.0 SATA controller: JMicron Technology Corp. JMB363 SATA/IDE Controller (rev 03)
05:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)
06:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)
[email protected]:~$ lspci -t
-[0000:00]---00.0-[01-06]----00.0-[02-06]--+-01.0-[03]----00.0
                                           +-03.0-[04]--
                                           +-05.0-[05]----00.0
                                           \-07.0-[06]----00.0

I also have tried some other fairly hilarious setups, including the following with a Radeon HD 7990 GPU, and another with a GTX 1060.

I’ll leave this here for now, whilst I read up on the Linux driver stack and how to build kernels for the raspberry pi 🙂

[email protected]:/home/pi# lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 10)
01:00.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:01.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:03.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:05.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:07.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
04:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
04:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
05:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)
06:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)

Thanks for reading! You can find me, as always, over on twitter @domipheus. Additionally, thanks to Tomasz Mloduchowski for his previous blogs which spurred my interest! I don’t fully understand why Tomasz had kernel panics using a VL805 board, but maybe it’s something to do with the Device Tree, and the fact I also had a PCIe switch.

Updates:

Designing a RISC-V CPU in VHDL, Part 18: Control and Status Register Unit

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

You may remember after the switch from my own TPU ISA to RISC-V was made, I stated that interrupts were disabled. This was due to requirements for RISC-V style interrupt mechanisms not being compatible with what I’d done on TPU. In order to get Interrupts back into RPU, we need to go and implement the correct Control and Status Registers(CSRs) required for management of interrupts – enable bits, interrupt vectors, interrupt causes – they are all communicated via CSRs. As I want my core to execute 3rd-party code, this needs to be to spec!

In the time it’s taken me to start writing this article after doing the implementation, the latest draft RISC-V spec has moved the CSR instructions and definitions into it’s own Zicsr ISA extension. The extension defines 6 additional instructions for manipulating and querying the contents of these special registers, registers which define things such as privilege levels, interrupt causes, and timers. Timers are a bit of an oddity, which we’ll get back to later.

The CSRs within a RISC-V hardware thread (known as a hart) can influence the execution pipeline at many stages. If interrupts are enabled via enable bits contained in CSRs, additional checks are required after instruction writeback/retire in order to follow any pending interrupt or trap. One very important fact to take away from this is that the current values of particular CSRs are required to be known in many locations at multiple points in the instruction pipeline. This is a very real issue, complicated by differing privilege levels having their own local versions of these registers, so throughout designing the implementation of the RPU CSR unit, I wanted a very simple solution which worked well on my unpipelined core – not necessarily the best or most efficient design. This unit has been rather difficult to create, and the solution may be far from ideal. Of all the units I’ve designed for RPU, this is the one which I am continually questioning and looking for redesign possibilities.

The CSRs are accessed via a flat address space of 4096 entries. On our RV32I machine, the entries will be 32 bits wide. Great; lets use a block ram I hear you scream. Well, it’s not that simple. For RPU, the amount of CSRs we require is significantly less than the 4096 entries – and implementing all of them is not required. Whilst I’m sure some people will want 16KB of super fast storage for some evil code optimizations, we are very far from looking at optimizing RPU for speed – so, the initial implementation will look as follows:

  • The CSR unit will take operations as input, with data and CSR address.
  • A signal exists internally for each CSR we support.
  • The CSR unit will output current values of internal CSRs as required for CPU execution at the current privilege level.
  • The CSR unit will be multi-cycle in nature.
  • The CSR unit can raise exceptions for illegal access.

So, our unit will have signals for use when executing user CSR instructions, signals for notifying of events which auto-update registers, output signals for various CSR values needed mid-pipeline for normal CPU operation, and finally – some signals for raising exceptions – which will be needed later – when we check for situations like write operations to read-only CSR addresses.

The operations and CSR address will be provided by the decode stage of the pipeline, and take the 6 different CSR manipulation instructions and encode that into specific CSR unit control operations.

You can see from the spec that the CSR operations are listed as atomic. This does not really effect RPU with the current in-order pipeline, but may do in the future. The CSRRS and CSRRC (Atomic Read and [Set/Clear] Bits in CSR) instructions are read-modify-write, which instantly makes our CSR unit capable of multi-cycle operations. This will be the first pipeline stage that can take more than one cycle to execute, so is a bit of a milestone in complexity. Whilst there are only six instructions, specific values of rd and r1 can expand them to both read and write in the same operation.

All of the CSR instructions fall under the wide SYSTEM instruction type, so we add another set of checks in our decoder, and output translated CSR unit opcodes. The CSR Opcodes are generally just the funct3 instruction bits, with additional read/write flags. There are no real changes to the ALU – our CSR unit takes over any work for these operations.

The RISC-V privileged specification is where the real CSR listings exist, and the descriptions of their use. A few CSRs hold basic, read-only machine information such as vendor, architecture, implementation and hardware thread (hard) IDs. These CSRs are generally accessed only via the csr instructions, however some exist that need accessed at points in the pipeline. The Machine Status Register (mstatus) contains privilege mode and interrupt enable flags which the control unit of the CPU needs constant access to in order to make progress through the instruction pipe.

The CSR Unit

With the input and outputs requirements for the unit pretty clear, we can get started defining the basis for how our unit will be implemented. For RPU, we’ll have an internal signal for each relevant CSR, or manually hard-wire to values as required. Individual process blocks will be used for the various tasks required:

  • A “main” process block which handles any requested CSR operation.
  • A series of “update” processes which write automatically-updated CSRs, such as the instruction retired and cycle counters.
  • A “protection” process which will flag unauthorized access to CSR addresses.

At this point, we have not discussed unauthorized access to CSRs. The interrupt mechanism is not in place to handle them at this point in the blog, however, we will still check for them. Handily, the CSR address actually encodes read/write/privilege access requirements, defined by the following table:

So detecting and raising access exception signals when the operation is at odds with the current privilege level or address type is very simple. If it’s a write operation, and the address bits [11:10] == ’11’ then it’s an invalid access. Same with the privilege checks against the address bits in [9:8].

The “update” processes are, again, fairly simple.

The main process is defined as the following state machine.

As you can see, reads are quicker as they do not fully go into the modify/write phase. A full read-modify-write CSR operation will take 3 cycles. The result of the CSR read will be available 1 cycle later – the same latency as our current ALU operations. As the current latency of our pipeline means it’s over 2 cycles before the next CSR read could possibly occur, so we can have our unit work concurrently and not take up any more than the normal single cycle of our pipeline.

The diagram above shows the tightest possible pipeline timing for a CSR ReadModifyWrite, followed by a CSR read – and in reality the FETCH stage takes multiple cycles due to our memory latency, so currently it’s very safe to have the CSR unit run operations alongside the pipeline. As a quick aside, interrupts from here would only be raised on the first cycle when the CSR Operation was first seen, so they would be handled at the correct stages and not leak into subsequent instructions due to the unit running past the current set of pipeline stages.

A run in our simulation shows the expected behavior, with the write to CSR mtvec present after the 3 cycle write latency.

And the set bit operations also work:

Branching towards confusion

An issue arose after implementing the CSR unit into the existing pipeline, one which caused erroneous branching and forward progress when a CSR instruction immediately followed a branch. It took a while to track this down, although in hindsight with a better simulator testbench program it would have been obvious – which I will show:

Above is a shot of the simulator waveform, centered on a section of code with a branch instruction targeted at a CSR instruction. Instead of the instruction following the cycle CSR read being the cycleh read, there is a second execution of cycle CSR read, and then a branch to 0x00000038. The last signal in the waveform – shouldBranch – is the key here. The shouldBranch signal is controlled by the ALU, and in the first implementation of the CSR unit, the ALU was disabled completely if an CSR instruction was found by the decoder. This meant the shouldBranch signal was not reset after the previous branch (0x0480006f – j 7c) executed. What really needs to happen, is the ALU remain active, but understand that it has no side effects when a CSR operation is in place. Doing this change, results in the following – correct execution of the instruction sequence.

Conclusion

I’ve not really explained all of the CSR issues present in RISC-V during this part of the series. There is a whole section in the spec on Field Specifications, whereby bits within CSRs can behave as though there are no external side effects, however, bits can be hard wired to 0 on read. This is so that field edits which are currently reserved in an implementation will be able to run in future without unexpected consequences. Thankfully, while I did not go into the specifics here, my implementation allows for these operations to be implemented to spec. It’s just a matter of time.

This turned into quite a long one to write, on what I thought would be a fairly simple addition to the CPU. It was anything but simple! You can see the current VHDL code for the unit over on github.

Exceptions/Interrupts are coming next time, now that the relevant status registers are implemented. We’ll also touch on those pesky timer ‘CSRs’.

Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.

Flashback: Quake 3/RTCW Level Design

I recently discovered an old Level Design portfolio I had made whilst in school circa 2000-2002. This was before I started university and any foray into C/C++ coding. They all have a name watermark in a particularly awful font!

I had actually coded windows utility apps before this – in Visual Basic! Maybe I will document those in a future post.

For today, here is a list of some maps I made/released. The Quake 3 mod these were designed for was Quake 3 Fortress(Q3F), and the Return to Castle Wolfenstein(RTCW) mod was Wolftactics. I was not on the Q3F dev team – those maps are unreleased. However, I was part of the Wolftactics team – so those maps were released with the mod in 2003. For that mod, I did level design and most of the user interface scripting.

Finding these screenshots brought back great memories of the old modding communities I was part of. I feel this may be something that isn’t as widespread nowadays – but back then, modding was perfect for dipping toes into game development.

q3f_halls

2000 – Quake 3 Fortress, unreleased, but match tested

q3f_garden

2000 – Quake 3 Fortress, unreleased

q3f_generator

2001 – Quake 3 Fortress, unreleased

wt_mach

2002 – Wolftactics RTCW Mod – released

wt_forts

2002 – Wolftactics RTCW Mod – released

wt_rct

2002 – Wolftactics RTCW Mod – released

And some bonus items for those who scrolled to the end: my first taste of something I helped create being reviewed in the games press.

Ahh, the memories.