Raspberry Pi 4 PCI Express: It actually works! USB3, SATA… GPUs?


Recently, Tomasz Mloduchowski posted a popular article on his blog detailing the steps he undertook to get access to the hidden PCIe interface of Raspberry Pi 4: the first Raspberry Pi to include PCIe in its design. After seeing his post, and realizing I was meaning to go buy a Raspberry Pi 4, it just seemed natural to try and replicate his results in the hope of taking it a bit further. I am known for Raspberry Pi Butchery, after all.

Before I tried desoldering anything, I set up my Pi for remote use; enabling SSH, WiFi, serial UART+ boot messages. The USB ports on the Pi board will not function after this modification, so this is super important.

Desoldering the USB3 chipset

As Tomasz lays out in his article, we need to remove the VL805 USB3 chip in order to access the PCIe interface. I used a hot air soldering station at low volume and medium-high temperature, with small nozzle head in order to not disturb the components nearby. I used flux along the edges and after a while the chip came away.

I tried to remove the solder from the large pad by mixing some low-temperature solder paste into whats there, but it’s not needed. Just cover it with capton or poor electrical tape. I used electrical tape just to make seeing the small wires I’d be soldering above it easier to see.

The pins to solder

The VL805 datasheet is confidential which makes posting parts of it here tricky. However, an image search for “VL805-Q6 QFN68” may yield interesting results for those interested in finding out more. The pins we are interested in, are as follows (note differing polarity from Tomasz’s work):

I used 0.1mm enameled wire, with each differential pair cut to nearly the same length. Tinning the end of the wire by scraping with a knife and dipping in a molten solder ball makes soldering to the pads we need easier. Holding the wires down with kapton tape, and using flux, with the smallest iron tip I had made the job just bearable under a microscope.

The rather untidy looking result:

First attempts

I was using a cheap PCIe riser as my “first interface” which was then to be connected to a PCIe switch card, which would then enable another 4 PCie slots. This first socket, as Tomasz mentioned in his article, needs the PCIe Reset pin pulled to 3v3, and both Reset and Wake signal traces to the USB socket cut. Note that whilst we now know where the PCIE Reset line is on the Pi, I have not needed to connect this as yet.

The first attempt to boot with this setup resulted in the Pi not managing to boot at all. After some wiggling of the PCIe slot, the raspberry Pi booted, but no devices were shown when running lspci (lspci can be installed via apt-get). The third attempt, however, after some professional wiggling of the PCIe slot, resulted in success! A booted Pi, with a PCIe switch!

[email protected]:~ $ sudo lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 10)
01:00.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port

However, no devices were detected beyond the ASM1184e switch. Even a USB3 PCIe card using the same VL805 chip that I removed refused to detect. Running dmesg on the pi to get some driver details, I saw that whilst the PCIe link was active, and some busses were being assigned to the switch – it said that devices behind the bridge would not be usable due to bus IDs.

pci 0000:01:00.0: [1b21:1184] type 01 class 0x060400
pci 0000:01:00.0: enabling Extended Tags
pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
PCI: bus1: Fast back to back transfers disabled
pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci_bus 0000:02: busn_res: can not insert [bus 02-01] under [bus 01] (conflicts with (null) [bus 01])
PCI: bus2: Fast back to back transfers enabled
pci_bus 0000:02: busn_res: [bus 02-01] end is updated to 02
pci_bus 0000:02: busn_res: can not insert [bus 02] under [bus 01] (conflicts with (null) [bus 01])
pci 0000:01:00.0: devices behind bridge are unusable because [bus 02] cannot be assigned for them
pci_bus 0000:01: busn_res: [bus 01] end can not be updated to 02

The great thing about linux is that the source code is just there to dig into, and after finding where that “devices behind bridge are unusable” warning is printed I further discovered that the range of assignable busses can be limited by the Device Tree linux uses.

Device Trees

Device trees are simply a description of the hardware which is passed to the Linux kernel on boot. It has all the devices listed; their driver compatabilities, memory mappings, and configuration. It is particularly useful for describing peripherals which may not be discoverable via conventional means. On the root directory of your Raspbian Raspberry Pi boot SD volume, you will find bcm2711-rpi-4-b.dtb – which is the Compiled Device Tree binary. This binary blob is not user readable, but thankfully we can use the Device Tree Compiler to decompile it into a readable form. I did all of this within Windows Subsystem for Linux.

dtc -I dtb -O dts -o bcm2711-rpi-4-b_edit.dts bcm2711-rpi-4-b.dtb

That command will decompile into bcm2711-rpi-4-b_edit.dts. Searching this file for “pci” we find an entry for the PCIe – and it has a definition for bus-range which limits the bus IDs from 0 to 1. I change the entry from “<0x0 0x1>” to “<0x0 0xff>” and recompile to the binary form, and place that on the Raspbian SD card overwriting the default.

Command to recompile:

dtc -I dts -O dtb -o bcm2711-rpi-4-b.dtb bcm2711-rpi-4-b_edit.dts 

Success!

4 More PCIe busses!

And then I plugged some VL805-based USB3 cards in (one with a chained Network Interface). The setup looks as follows:

I have my Motorola LapDock USB hub connected to the Pi via the PCIe USB controller. The keyboard and trackpad work great!

Other Devices

I have a SATA controller based on a JMicron JMB363, which is detected correctly but there is no driver to load. This will require some linux driver/kernel fiddling to get a driver loaded correctly – but it’s very promising!

[email protected]:~$ lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 10)
01:00.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:01.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:03.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:05.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:07.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
03:00.0 SATA controller: JMicron Technology Corp. JMB363 SATA/IDE Controller (rev 03)
05:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)
06:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)
[email protected]:~$ lspci -t
-[0000:00]---00.0-[01-06]----00.0-[02-06]--+-01.0-[03]----00.0
                                           +-03.0-[04]--
                                           +-05.0-[05]----00.0
                                           \-07.0-[06]----00.0

I also have tried some other fairly hilarious setups, including the following with a Radeon HD 7990 GPU, and another with a GTX 1060.

I’ll leave this here for now, whilst I read up on the Linux driver stack and how to build kernels for the raspberry pi ๐Ÿ™‚

[email protected]:/home/pi# lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 10)
01:00.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:01.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:03.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:05.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
02:07.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
04:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
04:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
05:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)
06:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)

Thanks for reading! You can find me, as always, over on twitter @domipheus. Additionally, thanks to Tomasz Mloduchowski for his previous blogs which spurred my interest! I don’t fully understand why Tomasz had kernel panics using a VL805 board, but maybe it’s something to do with the Device Tree, and the fact I also had a PCIe switch.

Updates:

Designing a RISC-V CPU in VHDL, Part 18: Control and Status Register Unit

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and Iโ€™d recommend they are read before continuing.

You may remember after the switch from my own TPU ISA to RISC-V was made, I stated that interrupts were disabled. This was due to requirements for RISC-V style interrupt mechanisms not being compatible with what I’d done on TPU. In order to get Interrupts back into RPU, we need to go and implement the correct Control and Status Registers(CSRs) required for management of interrupts – enable bits, interrupt vectors, interrupt causes – they are all communicated via CSRs. As I want my core to execute 3rd-party code, this needs to be to spec!

In the time it’s taken me to start writing this article after doing the implementation, the latest draft RISC-V spec has moved the CSR instructions and definitions into it’s own Zicsr ISA extension. The extension defines 6 additional instructions for manipulating and querying the contents of these special registers, registers which define things such as privilege levels, interrupt causes, and timers. Timers are a bit of an oddity, which we’ll get back to later.

The CSRs within a RISC-V hardware thread (known as a hart) can influence the execution pipeline at many stages. If interrupts are enabled via enable bits contained in CSRs, additional checks are required after instruction writeback/retire in order to follow any pending interrupt or trap. One very important fact to take away from this is that the current values of particular CSRs are required to be known in many locations at multiple points in the instruction pipeline. This is a very real issue, complicated by differing privilege levels having their own local versions of these registers, so throughout designing the implementation of the RPU CSR unit, I wanted a very simple solution which worked well on my unpipelined core – not necessarily the best or most efficient design. This unit has been rather difficult to create, and the solution may be far from ideal. Of all the units I’ve designed for RPU, this is the one which I am continually questioning and looking for redesign possibilities.

The CSRs are accessed via a flat address space of 4096 entries. On our RV32I machine, the entries will be 32 bits wide. Great; lets use a block ram I hear you scream. Well, it’s not that simple. For RPU, the amount of CSRs we require is significantly less than the 4096 entries – and implementing all of them is not required. Whilst I’m sure some people will want 16KB of super fast storage for some evil code optimizations, we are very far from looking at optimizing RPU for speed – so, the initial implementation will look as follows:

  • The CSR unit will take operations as input, with data and CSR address.
  • A signal exists internally for each CSR we support.
  • The CSR unit will output current values of internal CSRs as required for CPU execution at the current privilege level.
  • The CSR unit will be multi-cycle in nature.
  • The CSR unit can raise exceptions for illegal access.

So, our unit will have signals for use when executing user CSR instructions, signals for notifying of events which auto-update registers, output signals for various CSR values needed mid-pipeline for normal CPU operation, and finally – some signals for raising exceptions – which will be needed later – when we check for situations like write operations to read-only CSR addresses.

The operations and CSR address will be provided by the decode stage of the pipeline, and take the 6 different CSR manipulation instructions and encode that into specific CSR unit control operations.

You can see from the spec that the CSR operations are listed as atomic. This does not really effect RPU with the current in-order pipeline, but may do in the future. The CSRRS and CSRRC (Atomic Read and [Set/Clear] Bits in CSR) instructions are read-modify-write, which instantly makes our CSR unit capable of multi-cycle operations. This will be the first pipeline stage that can take more than one cycle to execute, so is a bit of a milestone in complexity. Whilst there are only six instructions, specific values of rd and r1 can expand them to both read and write in the same operation.

All of the CSR instructions fall under the wide SYSTEM instruction type, so we add another set of checks in our decoder, and output translated CSR unit opcodes. The CSR Opcodes are generally just the funct3 instruction bits, with additional read/write flags. There are no real changes to the ALU – our CSR unit takes over any work for these operations.

The RISC-V privileged specification is where the real CSR listings exist, and the descriptions of their use. A few CSRs hold basic, read-only machine information such as vendor, architecture, implementation and hardware thread (hard) IDs. These CSRs are generally accessed only via the csr instructions, however some exist that need accessed at points in the pipeline. The Machine Status Register (mstatus) contains privilege mode and interrupt enable flags which the control unit of the CPU needs constant access to in order to make progress through the instruction pipe.

The CSR Unit

With the input and outputs requirements for the unit pretty clear, we can get started defining the basis for how our unit will be implemented. For RPU, we’ll have an internal signal for each relevant CSR, or manually hard-wire to values as required. Individual process blocks will be used for the various tasks required:

  • A “main” process block which handles any requested CSR operation.
  • A series of “update” processes which write automatically-updated CSRs, such as the instruction retired and cycle counters.
  • A “protection” process which will flag unauthorized access to CSR addresses.

At this point, we have not discussed unauthorized access to CSRs. The interrupt mechanism is not in place to handle them at this point in the blog, however, we will still check for them. Handily, the CSR address actually encodes read/write/privilege access requirements, defined by the following table:

So detecting and raising access exception signals when the operation is at odds with the current privilege level or address type is very simple. If it’s a write operation, and the address bits [11:10] == ’11’ then it’s an invalid access. Same with the privilege checks against the address bits in [9:8].

The “update” processes are, again, fairly simple.

The main process is defined as the following state machine.

As you can see, reads are quicker as they do not fully go into the modify/write phase. A full read-modify-write CSR operation will take 3 cycles. The result of the CSR read will be available 1 cycle later – the same latency as our current ALU operations. As the current latency of our pipeline means it’s over 2 cycles before the next CSR read could possibly occur, so we can have our unit work concurrently and not take up any more than the normal single cycle of our pipeline.

The diagram above shows the tightest possible pipeline timing for a CSR ReadModifyWrite, followed by a CSR read – and in reality the FETCH stage takes multiple cycles due to our memory latency, so currently it’s very safe to have the CSR unit run operations alongside the pipeline. As a quick aside, interrupts from here would only be raised on the first cycle when the CSR Operation was first seen, so they would be handled at the correct stages and not leak into subsequent instructions due to the unit running past the current set of pipeline stages.

A run in our simulation shows the expected behavior, with the write to CSR mtvec present after the 3 cycle write latency.

And the set bit operations also work:

Branching towards confusion

An issue arose after implementing the CSR unit into the existing pipeline, one which caused erroneous branching and forward progress when a CSR instruction immediately followed a branch. It took a while to track this down, although in hindsight with a better simulator testbench program it would have been obvious – which I will show:

Above is a shot of the simulator waveform, centered on a section of code with a branch instruction targeted at a CSR instruction. Instead of the instruction following the cycle CSR read being the cycleh read, there is a second execution of cycle CSR read, and then a branch to 0x00000038. The last signal in the waveform – shouldBranch – is the key here. The shouldBranch signal is controlled by the ALU, and in the first implementation of the CSR unit, the ALU was disabled completely if an CSR instruction was found by the decoder. This meant the shouldBranch signal was not reset after the previous branch (0x0480006f – j 7c) executed. What really needs to happen, is the ALU remain active, but understand that it has no side effects when a CSR operation is in place. Doing this change, results in the following – correct execution of the instruction sequence.

Conclusion

I’ve not really explained all of the CSR issues present in RISC-V during this part of the series. There is a whole section in the spec on Field Specifications, whereby bits within CSRs can behave as though there are no external side effects, however, bits can be hard wired to 0 on read. This is so that field edits which are currently reserved in an implementation will be able to run in future without unexpected consequences. Thankfully, while I did not go into the specifics here, my implementation allows for these operations to be implemented to spec. It’s just a matter of time.

This turned into quite a long one to write, on what I thought would be a fairly simple addition to the CPU. It was anything but simple! You can see the current VHDL code for the unit over on github.

Exceptions/Interrupts are coming next time, now that the relevant status registers are implemented. We’ll also touch on those pesky timer ‘CSRs’.

Thanks for reading, as always – feel free to ask any follow up questions to me on twitter @domipheus.

Flashback: Quake 3/RTCW Level Design

I recently discovered an old Level Design portfolio I had made whilst in school circa 2000-2002. This was before I started university and any foray into C/C++ coding. They all have a name watermark in a particularly awful font!

I had actually coded windows utility apps before this – in Visual Basic! Maybe I will document those in a future post.

For today, here is a list of some maps I made/released. The Quake 3 mod these were designed for was Quake 3 Fortress(Q3F), and the Return to Castle Wolfenstein(RTCW) mod was Wolftactics. I was not on the Q3F dev team – those maps are unreleased. However, I was part of the Wolftactics team – so those maps were released with the mod in 2003. For that mod, I did level design and most of the user interface scripting.

Finding these screenshots brought back great memories of the old modding communities I was part of. I feel this may be something that isn’t as widespread nowadays – but back then, modding was perfect for dipping toes into game development.

q3f_halls

2000 – Quake 3 Fortress, unreleased, but match tested

q3f_garden

2000 – Quake 3 Fortress, unreleased

q3f_generator

2001 – Quake 3 Fortress, unreleased

wt_mach

2002 – Wolftactics RTCW Mod – released

wt_forts

2002 – Wolftactics RTCW Mod – released

wt_rct

2002 – Wolftactics RTCW Mod – released

And some bonus items for those who scrolled to the end: my first taste of something I helped create being reviewed in the games press.

Ahh, the memories.



Designing a RISC-V CPU in VHDL, Part 17: DDR3 Memory Controller, Clock domain crossing

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and Iโ€™d recommend they are read before continuing.

In the last part we got to the point where RISC-V code, built with GCC, could run and display text over HDMI and blink LEDs. However, this could only run from the 192KB of Block RAM we initialized within the Spartan7 FPGA on our Digilent Arty S7-50 board. Whilst 192KB is a nice amount of on-FPGA fast storage, we have a 256Mbyte DDR3 chip sitting next to the FPGA which is crying out for use. This post follows on from part 16 and integrates a DDR3 memory controller provided by Xilinx, and using an SD card Pmod adapter, load code from the SD card into that memory for use.

The DDR3 memory chip on the ArtyS7 seems to vary from board to board, but the timing specifications should be compatible for all. For clarity, the chip on my development board is a PMF511816EBR-KADN.

It connects with address and data signals direct to the FPGA. There are many other control signals, but we are not going to get into how DDR3 memory works in this post, as Xilinx Vivado comes with a wizard for generating memory interfaces. The Memory Interface Generator (MIG) seems to be an older utility and I had a few problems with it. However, in the end, I did get a working interface generated.

Memory Interface Generator

Digilents documentation for the ArtyS7 board includes various files for importing into MIG to assist with generation of the memory interface. In MIG itself, you will not see an option for importing settings explicitly โ€“ you need to select that you are verifying an existing design. Before the following section walking through the wizard, ensure your Vivado project is located in a local drive โ€“ I generally worked on a mapped network drive, and the build will fail to implement the controller at synthesis time if the project is on a network drive. Another item to note โ€“ you must go through the IP Catalog. When I tried to I initiate the MIG through the Block Design view that I used for generating BRAM in the previous article, VHDL solutions could not be generated.

We go through the wizard, using the Digilent files to configure the memory controller.Generally it is a case of hitting next though each screen, except from a few pages at the start where you need to input the file paths, and validate the pins โ€“ but the validation should succeed with a few warnings.

The next screen was quite something, with the wizard saying the browse buttons for file paths were not supported on Windows 10.

Validating the pins above will only generate a few warnings. Click next through the rest of the wizard and generate the design.

When the implementation of the memory controller has completed, which took minutes on a Ryzen 2700X+SSD, we can see it in the IP pane. At this point, you can open an example project by right clicking the IP in the hierarchy.

The example project can run in the simulator very slowly (initialization took many seconds in the waveform before real simulation begins). You can see the VHDL component definition to use for the controller, and I copied that into the RPU project.

The generated design uses the User Interface(UI) implementation to the memory controller, which preserves things like request ordering. There is a Xilinx document UG586 which details the signals used in this interface, including example waveforms for various read and write request states.

The UI used to utilize the memory controller is good for the use cases that I want: request ordering is preserved, burst reads/writes, writes with byte enable. I originally wanted to use the Native interface โ€“ which you can read about in document UG586 if you want to know more – but what I wanted was very basic. The native interface is significantly more involved to use as things like out of order request completion can occur.

In addition to the main data and control busses youโ€™d expect, there are a few other signals which need connected in VHDL for the memory controller to operate. On ArtyS7, we need to feed it both 100 and 200MHz clocks. Additionally, we need to ensure the controller is not used after a reset until the initialization complete signal is asserted from the memory controller internals. One other major item of importance โ€“ the controller must be driven off of a ยผ memory interface clock, which for us is 81.25MHz. This is a good bit down on our current 200MHz CPU clock, but for now I just ran everything at this clock frequency, which is provided by the controller itself as an output.

The documentation from Xilinx shows how read and write transactions operate. The interface to the controller supports multiple back to back reads and writes for highest performance, but we only want basic single requests.

Writes follow this pattern:

Reads follow this:

We can implement these easily in our VHDL using the already existing MEM_proc memory request process. We will just add some new states to implement the various stages of a new state machine, and we should be good. The writes can be implemented in one of three ways, as explained in the above waveform diagrams. We are using method 1, asserting the *wdf signals for the write command data at the same time as the initiation of the commands using app_cmd.

As briefly mentioned earlier, we need to handle resetting of the memory controller. We activate the reset signals for the memory controller and CPU core for the same length of time. This is implemented with a decreasing 100MHz counter, initialized to some value in the thousands, with the resets being deactivated at 0. The DDR3 physical interface always initiates a calibration sequence after reset. When this operation is complete, a controller output signal is asserted which we should check for before sending memory requests.

Whilst making the requests and dealing with those states was straightforward, addressing and the resultant data output ordering was far from easy. I knew from MIG documentation that this controller would address 16 bits, so we just need to lop off the least significant bit from our memory request before passing it to the memory controller UI. However, after writing a small test program โ€“ which just wrote a known, incrementing pattern to contiguous memory addresses โ€“ it was obvious that my understanding of something else was awry. The data seemed to write 32 bits okay, but reading at different address offsets was quite the challenge. The Xilinx documentation does not explain the data organisation for the 128-bit data signals which come out of the controller. I assumed a simple burst read of [addr, addr+1, addr+2] data would present itself, but this didnโ€™t seem to be the case. I memory mapped the internal signals in and out of the Memory Controller, and wrote another test which read and wrote to DDR3 memory, and dumped the 128-bit read buffer as well. As you can see from below, there were some really odd things going on:

I managed to find patterns for all the different read modes we needed, and the naturally aligned addresses for 4, 2 and 1 byte reads and writes. For writes, I used some lookup tables to assist with generating the byte enable signal โ€“ the 128bit write data input was byte masked, so that writing values which are not 128 buts in length do not appear to require a read-modify-write operation.

As it stands, I have still not found out why the read operations result in odd looking data from the controller. All things considered, itโ€™s likely I have made an error early on in the implementation of the data swizzling for writes using the controller, which then dominos into the read operations. Itโ€™s something Iโ€™ll be returning to. This is fine for us just now, but I was intending on using the burst data to fill caches in later work on the CPU, and for that to work efficiently we will need to fully understand how this burst data organization works.

So. 256MB of ram unlocked, and passing fairly basic read/write tests. But the latency is very slow! Much slower than expected. I added some counters to the read state machine, and it seems to take around 22 cycles for a read. At 82MHz this is a long time. However, when searching online, it seems others have also seen this kind of latency, so I did not look into it โ€“ and instead tried to decouple the DDR memory clocking from that of the CPU and the rest of the SoC, like block rams.

Clock Domains

A clock domain can be imagined by separating out all aspects of your design into blocks. For each block which runs off a certain clock, it is on that clocks domain. Generally, blocks which synchronize off the same clock, and therefore are in the same domain, can pass signals between them relatively easily.

Currently, we have many clock domains in the SoC, but only 1 in the CPU side of things. The other clock domains are for HDMI output and due to the Block Rams being dual ported with dual clock inputs, we do not need to pass any raw signals across clock domains. For our DDR3 and CPU clocks to be different however, multiple signals would need to cross clock domains.

There are numerous issues that can arise when trying to pass synchronized signals across different clock domains. The simplest issue is when trying to read a fast clock signal from a slow clock โ€“ and the slow clock completely missing a strobe of the faster signal.

There are various methods of crossing signals from one clock to another, with some basic techniques starting with using multiple flip flops to latch signals into other domains. In VHDL, this just looks like a process which reads and latches the value from one signal into another on the destination clock.

As we are dealing with single read and write transactions, I decided to cross the CPU to Memory Controller clock domain using state handshaking. We will use two integer signals, which have multiple versions for stability in each clock domain, to implement a state machine process which crosses the clock safely. This can introduce several cycles of additional latency, however at the moment we are trying for a working solution โ€“ not a very efficient one. We can give up a few cycles of latency in the CPU clock domain if this means we can run the CPU much faster than the memory controller. The dataflow for a read looks like the following.

And writes:

The way this is implemented is with two processes, one clocked off the Memory Controllers 81.25MHz, the other off the higher CPU clock. The CPU process is actually just the MEM_proc one from before, to make interfacing with the CPUs own memory request logic easier. Each process reads the state of the other using stable signals which are latched into their own clock domain in an attempt to avoid metastability.
When both differently clocked processes are in a known state, I read the memory data required across the clock domain. I am not sure this is strictly safe โ€“ some say things such as data buses should be passed through a FIFO.

It took a few attempts to get this right, mainly due to realising that now I could not run the CPU at previously attainable speeds. It seems that just having the DDR3 controller synthesized into the SoC meant that the CPU core could not get over 166MHz. I have settled on 142.8MHz for now (1GHz/7 โ€“ easily attainable from my existing clocking system) as the CPU clock โ€“ fast enough for most needs but low enough that any timing issues do not arise. Many additional timing warnings appear on synthesis when the DDR3 controller is included, which was unexpected since this controller was generated by the internal tools. Like I mentioned last part, I do intend to look into these warnings and understand them with an aim of resolving them to unlock higher clocks. For now, 142MHz will do fine โ€“ and the additional memory space of DDR3 is very welcome!

With the DDR3 memory integrated, I also made quick microSD PMOD adaptor out of a level-shifting microSD adaptor intended for 5v Arduino projects using SPI. The level shifter is not needed for our purposes, as the FPGA already runs on 3.3v logic. No modification of the part was required โ€“ just soldering the pins to the 0.1โ€ PMOD headers was enough to work.

The SPI port is exposed externally on Pmod JD. The way the SPI port is accessed is the same as discussed in the previous posts โ€“ just different memory mapped addresses. I created a new bootloader for the internal BRAM to perform a small memory test (which validates the endian byte swizzling), then initialize SPI and mount an SD card if present. It will examine a BOOT elf file and copy it into the required physical memory location, which can be in DDR3 or BRAM. It will then jump to that elf entry point to continue execution.

FPGA Utilization

With this all coming together to form a SoC design, I looked into the utilization report, which tells you how many resources your design is using. The report looks as follows:

The DDR3 memory controller takes up the largest amount of resources โ€“ not unexpected, they are incredibly complex! Iโ€™m quite happy with the current utilization of the RPU core itself. Iโ€™ve not been looking into optimization for resources, so having the CPU take up less than 6% of the Spartan S7-50 seems good. Looking at the breakdown of where the utilization of resources comes from, it seems the internal management of memory requests is taking a larger amount of resourced than Iโ€™d expect. It will be interesting to see how this changes when the endian swizzle logic is brought into the CPU core, and eventual cache logic added.

That bring this part to a close. I am now looking into implementing the required RISC-V Control and Status Registers (CSRs) into RPU. Currently I have memory-mapped data that should be available though CSR instructions. Adding them will be interesting as it will require changes to the CPU interface. Read about that next time!

As mentioned last time, the code for the DDR3 controller is already up on github. I still need to check that Iโ€™ve implemented the read/write data swizzling to the DDR3 correctly โ€“ Iโ€™ve a feeling there is a mistake in the writes somewhere, which then impacts the reads.
Thanks for reading, and as always, any questions can be directed at myself on twitter @domipheus.