Designing a CPU in VHDL, Part 15: Introducing RPU

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

It’s been a while. Despite the length of time and lack of posts, rest assured a significant amount of progress has been made on my VHDL CPU over the last year. I’ve hinted at that fact multiple times on twitter, as various project milestones have been hit. But what’s changed?

First and foremost, the CPU now consumes RISC-V. Its decoder, ALU and datapaths have been updated. With that, the data width is now 32 bits. The decoder accepts the RV32I base integer instruction set.

I’d been putting off multiple side-projects with my existing TPU implementation for a while. Its 16-bit nature really made integrating the 32MB SDRAM into the system a rather pointless affair. I’m all for reminiscing but I did not want to go down the rabbit hole of memory bank switching and the baggage that would entail. The toolchain for creating software was already at the limit – not the limit in terms of what could be done – but the limit in what I’m prepared to do in order to perform basic tasks. We all love bare metal assembly, but for the sake of my own free time, I wanted to just drop some C in, and I was not going to make my own compiler. I looked into creating a backend for LLVM, but it’s really just another distraction.

As a reminder, here is where we left off.

Block diagram of TPU Core

So, what’s actually changed from the old 16-bit TPU?

  • Many more registers
  • Datapath widened to 32-bit
  • Decoder and ALU updates for RV32I ISA
  • Glue logic/datapath updated for new functions

The register file is basically the same as before with the 16-bit TPU, extended to 32 entries of 32 bits, named x0-x31. In RV32I, x0 is hardwired to 0x00000000. I’ve left it as a real register entry, but never allow a write to x0 to progress, keeping the entry always zero. The decoder and ALU are fairly standard really – there are a small number of instruction formats, and where immediates are concerned there is some bit swapping here and there, but the sign bit is always in the same place, which makes the decode logic easier to understand.
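To make the "bit swapping" concrete, here is a small C model of the idea. It is only a sketch of the RV32I encoding rules, not the RPU VHDL itself, and the function and type names (`imm_i`, `imm_s`, `imm_b`, `regfile_t`) are made up for illustration. Note how every format takes its sign from instruction bit 31, and how writes to x0 are simply dropped.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of `value` to 32 bits. */
int32_t sign_extend(uint32_t value, unsigned bits)
{
    uint32_t sign = 1u << (bits - 1);
    value &= (sign << 1) - 1u;              /* keep only the low `bits` bits */
    return (int32_t)((value ^ sign) - sign);
}

/* I-type: imm[11:0] = inst[31:20] */
int32_t imm_i(uint32_t inst)
{
    return sign_extend(inst >> 20, 12);
}

/* S-type: imm[11:5] = inst[31:25], imm[4:0] = inst[11:7] */
int32_t imm_s(uint32_t inst)
{
    uint32_t imm = (((inst >> 25) & 0x7Fu) << 5) | ((inst >> 7) & 0x1Fu);
    return sign_extend(imm, 12);
}

/* B-type: imm[12] = inst[31], imm[11] = inst[7],
 *         imm[10:5] = inst[30:25], imm[4:1] = inst[11:8], imm[0] = 0 */
int32_t imm_b(uint32_t inst)
{
    uint32_t imm = (((inst >> 31) & 0x1u)  << 12)
                 | (((inst >> 7)  & 0x1u)  << 11)
                 | (((inst >> 25) & 0x3Fu) << 5)
                 | (((inst >> 8)  & 0xFu)  << 1);
    return sign_extend(imm, 13);
}

/* Register file model: x0 is a real entry, but writes to it never progress. */
typedef struct { uint32_t x[32]; } regfile_t;

void regfile_write(regfile_t *rf, unsigned rd, uint32_t value)
{
    rd &= 31u;
    if (rd != 0u) {
        rf->x[rd] = value;
    }
}

uint32_t regfile_read(const regfile_t *rf, unsigned rs)
{
    return rf->x[rs & 31u];
}
```

For example, `imm_i(0xFFF00093)` (the encoding of `addi x1, x0, -1`) gives -1, and a write to register 0 leaves it reading as zero.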

The last item about datapath changes follows on from how RISC-V branch instructions operate. TPU was always incredibly simple, which meant branching was quite an involved process when it came to function calling and attempting to get a standardized calling convention. There was no call instruction, and the amount of operations required for saving a calculated return from function address, setting up call arguments on the stack and eventually calling via an indirect register was rather irritating. With the limited amount of registers available on TPU, simplification via register parameter passing didn’t solve much as you would be required to save/restore register contents to the stack regardless.

RISC-V JAL instruction definition screenshot from ISA

RISC-V has several branch and jump instructions with significant changes to dataflow from that of TPU. In addition to calculating the branch target – which is all TPU was able to do – the jump-and-link instructions (JAL and JALR) calculate a return address at PC+4, which is then written to a register. This means our new ALU needs two outputs: the standard result of a calculation, and the branch target. The branch target, along with the shouldBranch output status, feeds into our existing PC unit, while the result feeds the register file/memory address logic as normal. The new connections are shown below in yellow.
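The two-output idea can be sketched in C for the JAL case. This is a behavioural model under my own naming (`alu_out_t`, `alu_jal` and `imm_j` are illustrative, not signal names from the RPU source): one output is the link value PC+4 for the register file, the other is the branch target for the PC unit.

```c
#include <stdbool.h>
#include <stdint.h>

/* The two ALU outputs plus the branch status flag. */
typedef struct {
    uint32_t result;         /* written to rd: the return address, PC + 4 */
    uint32_t branch_target;  /* handed to the PC unit                     */
    bool     should_branch;
} alu_out_t;

/* J-type: imm[20] = inst[31], imm[19:12] = inst[19:12],
 *         imm[11] = inst[20], imm[10:1] = inst[30:21], imm[0] = 0 */
int32_t imm_j(uint32_t inst)
{
    uint32_t imm = (((inst >> 31) & 0x1u)   << 20)
                 | (((inst >> 12) & 0xFFu)  << 12)
                 | (((inst >> 20) & 0x1u)   << 11)
                 | (((inst >> 21) & 0x3FFu) << 1);
    uint32_t sign = 1u << 20;
    return (int32_t)((imm ^ sign) - sign);   /* sign-extend from 21 bits */
}

/* Model of the ALU handling JAL: both outputs are produced together. */
alu_out_t alu_jal(uint32_t pc, uint32_t inst)
{
    alu_out_t out;
    out.result        = pc + 4u;
    out.branch_target = pc + (uint32_t)imm_j(inst);
    out.should_branch = true;
    return out;
}
```

With `pc = 0x100` and the instruction word `0x008000EF` (`jal x1, 8`), the result is `0x104` and the branch target is `0x108`.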

RPU block diagram

In terms of old TPU systems, the old interrupt system needs significant updating and so is disabled. It’s still got block ram storage, UART, and relies on the same underlying memory subsystem, which in all honesty is super bloated and pretty poor. The memory system is currently my main focus – it’s still made up from a single large process at the CPU core clock. It reads like an imperative language function, which is not at all suitable for the CPU moving forward. The CPU needs to interface with various different components, from the block rams, UART, to SPI and SDRAM controllers. These can be running at different clocks and the signals all need to remain valid across these systems. With everything being accessed as memory-mapped IO, the memory system is super important, and I’ve already run into several gotchas when extending it with an SDRAM controller. More information on that later.

Toolchain

As I mentioned earlier, one of the main reasons for moving to RISC-V was toolchain considerations. I have been developing the software for my ‘soc’ with GCC.

With Windows 10, you can now use the Windows Subsystem for Linux (WSL) to build Linux tools within Windows. This makes compiling the RISC-V toolchain for your particular ISA variant a super simple process.

https://riscv.org/software-tools/ has some details of how to build relevant toolchains, and Microsoft have details of how to install the Windows Subsystem for Linux.

Data Storage

With the RISC-V GCC toolchain built, targeting RV32I, I was able to write a decent amount of code in order to create a bootloader which existed inside the FPGA block ram. This bootloader looked for an SD card in the miniSpartan6+ SD card slot. It initialized the SD card into slow SPI transfer mode so we could get at its contents.

Actually getting to data stored on an SD card is incredibly simple, and has been written up nicely here. In terms of how the CPU accessed the SD card, I found a SPI master VHDL module and threw it into my project, and memory mapped it so you could use it from the CPU.

#include <stdint.h>

#define SPI_M1_CONFIG_REG_ADDR 0x00009300
#define SPI_M1_BUSY_REG_ADDR   0x00009304
#define SPI_M1_DATA_REG_ADDR   0x00009308

// SPI_CONFIG_REG_BIT_RESET, the bit index of the active-low reset flag,
// is defined elsewhere in the project.

void spi_sd_reset(void)
{
	volatile uint32_t* spi_config_reg = (volatile uint32_t*)SPI_M1_CONFIG_REG_ADDR;

	uint32_t current_reg = *spi_config_reg;
	uint32_t reset_bit = 1U << SPI_CONFIG_REG_BIT_RESET;
	uint32_t reset_bitmask = ~reset_bit;

	// Multiple writes give the controller time to reset - its clock
	// can be slower than this CPU
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;
	*spi_config_reg = current_reg & reset_bitmask;

	// return to original, but ensure the reset bit is set (active low),
	// in case the register was previously clobbered by some other operation
	*spi_config_reg = current_reg | reset_bit;
}

uint8_t spi_sd_xchg(uint8_t dat)
{
	volatile uint32_t* spi_busy_reg = (volatile uint32_t*)SPI_M1_BUSY_REG_ADDR;
	volatile uint32_t* spi_data_reg = (volatile uint32_t*)SPI_M1_DATA_REG_ADDR;

	// Writing the data register starts the transfer
	*spi_data_reg = (uint32_t)dat;

	// Spin until the SPI master reports the exchange has completed
	while (*spi_busy_reg != 0);

	// The received byte is in the low 8 bits of the data register
	return (uint8_t)(*spi_data_reg & 0xFFu);
}

A few helper functions later, the bootloader did a very simple FAT32 root search for a file called BOOT, and copied it into ram before jumping to it. It also copied a file called BIOS to memory location 0x00000000 – which had a table of I/O functions so I could fix/extend functionality without needing to recompile my “user” code.

typedef struct {
    FN_sys_console_put_stringn    sys_console_put_stringn;
    FN_sys_console_put_string     sys_console_put_string;
    FN_sys_console_setcursor      sys_console_setcursor;
    FN_sys_console_setpen         sys_console_setpen;
    FN_sys_console_clear          sys_console_clear;
    FN_sys_console_getcursor_col  sys_console_getcursor_col;
    FN_sys_console_getcursor_row  sys_console_getcursor_row;
    FN_sys_console_getpen         sys_console_getpen;
    FN_sys_clk_get_cyclecount     sys_clk_get_cyclecount;
    
    FN_spi_sd_set_clkdivider      spi_sd_set_clkdivider;
    FN_spi_sd_set_deviceaddr      spi_sd_set_deviceaddr;
    FN_spi_sd_xchg                spi_sd_xchg;
    FN_spi_sd_reset               spi_sd_reset;
    
    FN_uart_tx1_send              uart_tx1_send;
    FN_uart_rx1_recv              uart_rx1_recv;
    FN_uart_trx1_set_baud_divisor uart_trx1_set_baud_divisor;
    FN_uart_trx1_get_baud_divisor uart_trx1_get_baud_divisor;
    
//...
    
// Above functions are basic bios and provided by BIOS.c/BIN
// Below are extensions, so null checks must be performed to check 
// initialization state
    FN_pf_open    pf_open;
    FN_pf_read    pf_read;
    FN_pf_opendir pf_opendir;
    FN_pf_readdir pf_readdir;
} SYSCALL_TABLE;
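As a rough illustration of the null check mentioned in the comments, "user" code might wrap an extension entry like this. The table below is a cut-down stand-in with hypothetical names and signatures (`EXAMPLE_SYSCALL_TABLE`, `FN_example_open`, `example_open_stub`), not the real SYSCALL_TABLE layout or the real `pf_open` prototype.

```c
#include <stddef.h>

/* Hypothetical signature for an extension entry; the real typedefs live
 * in the BIOS headers and may differ. */
typedef int (*FN_example_open)(const char *path);

/* Cut-down stand-in for the table: one core entry, one extension entry. */
typedef struct {
    void          (*put_string)(const char *str); /* core: always present   */
    FN_example_open pf_open;                      /* extension: may be NULL */
} EXAMPLE_SYSCALL_TABLE;

/* Stub implementation, used only so the sketch can be exercised on a host. */
int example_open_stub(const char *path)
{
    return (path != NULL) ? 0 : -2;
}

/* Call the extension only if it has been installed in the table.
 * Returns -1 when the entry has not been initialized. */
int call_pf_open(const EXAMPLE_SYSCALL_TABLE *table, const char *path)
{
    if (table == NULL || table->pf_open == NULL) {
        return -1;
    }
    return table->pf_open(path);
}
```

On the real system the table pointer would refer to wherever the BIOS image places it (the post says the BIOS file is copied to 0x00000000); here it is passed in explicitly so the check can be tried anywhere.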

With the FPGA set to look for this BOOT file off an SD card, testing various items became a whole lot easier. Not having to change the block ram and rebuild the FPGA configuration to run a set of tests saved lots of time, and I had a decent little setup going. However, I was severely limited by RAM – many block rams were now in use, and yet I still did not have enough. I could have optimized for space here and there, but with a 32MB SDRAM chip on the FPGA dev board that seemed rather pointless. It was finally time to get the SDRAM integrated.

SDRAM

Getting an SDRAM interface controller for the chip on the miniSpartan6+ was very easy. Mike Field had done this already and made his code available. The problem was that whatever I tried, the data out from the SDRAM was garbage.

Or, should I say, out by one. Everything to do with computers has an out by one at some point.

Anyway, data was always arriving delayed by one request. If I asked to read data at 0x00000100, I’d get 0. If I asked for data at 0x00000104, I’d get the data at 0x00000100. I could tell that writing seemed to be fine – but reading was broken. Thankfully discovering this didn’t take too much time, due to being able to boot code off the SD card. I wrote little memory checking binaries and executed them via a UART command line, the debug data being displayed on the HDMI out. It was pretty cool, despite me getting wound up at how broken reading from the SDRAM was.
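A memory check along these lines can be very small. The sketch below is not the original test binary; it is a minimal example with invented function names that writes a unique, index-derived word to every location and reads them back, which is enough to expose a "delayed by one request" read because each word then carries its neighbour's value.

```c
#include <stddef.h>
#include <stdint.h>

/* Value expected at word index i: unique per location. */
uint32_t memtest_pattern(size_t i)
{
    return (uint32_t)i ^ 0xA5A5A5A5u;
}

/* Fill `words` 32-bit words starting at `base` with the pattern. */
void memtest_fill(volatile uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = memtest_pattern(i);
    }
}

/* Read everything back. Returns the index of the first mismatching word,
 * or `words` if the whole region verified correctly. */
size_t memtest_verify(volatile uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        if (base[i] != memtest_pattern(i)) {
            return i;
        }
    }
    return words;
}
```

On the target, `base` would be the start of the SDRAM window in the memory map; on a PC it can be pointed at an ordinary array to try the logic out.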

memory test

It took a long time to track down the actual cause of this read corruption. For the most part, I thought it was timing based in the SDRAM controller. I edited the whole memory system multiple times, with no changes to the behavior. This should have been a red flag for me, as the issue was being caused by the constant through all of these tests – the RPU core!

In RPU, the signals that comprise the input and output of each pipeline stage get passed around and moved from unit to unit each core clock. This is where my issue was – not the “external” memory controller!

Despite my CPU being okay with waiting for the results of memory reads from a slower device, it was not set up for correctly forwarding that data on when the device eventually said the request had completed. A fairly basic error on my part, made harder to track down by the fact the old TPU CPU design and thus RPU had memory requests going through the ALU, then control unit, and some secondary smaller memory controller which didn’t really do anything apart from add further latency to the request.

I fixed this, which immediately made all the SDRAM memory space available. Hurrah.

It’s rather annoying this issue did not present itself sooner. There are already many “slow” memory mapped devices connected to the CPU core, but they generally work off of FIFOs – so despite them being slow, they are accessed via queries to lines to say whether there is data in the FIFO, and what the data is. These reads are actually pretty fast, the latency is eaten elsewhere – like spinning on a memory read. Basic SDRAM read latency however for this was in the 20 cycle range at 100MHz, way more than previously tested.

Having the additional memory was great. Now that we have a full GCC toolchain, I could grab the FatFs library, and have a fully usable filesystem mounted from the SD card. Listing directories and printing the contents of files seemed like such a milestone.

Next steps

So that’s where we are at. However, things be changing. I have switched over to Xilinx Spartan 7, using a Digilent Arty S7 board (the larger XC7S50 variant). I’m currently porting my work over to it.

If you saw my last blog post, you’d have seen that HDMI out at least works, and now I’m working on porting the CPU over, using block rams and SD card for memory/storage. Lastly will be a DDR3 memory controller. 256MB of 650MHz memory gives much more scope for messing up my timing even more. It should be great fun!

I hope to document progress more frequently than over the past year. These articles are my project’s documentation, and provide an outlet to confirm I actually understand my solutions, so the lack of posts has made things all the more disjointed. The next post shall be soon!

Thanks for reading. Let me know of any queries on twitter @domipheus.

HDMI over Pmod using the Arty Spartan 7 FPGA board

tl;dr: This post shows that driving DVI-D over an HDMI cable, directly connected to the High Speed Pmod connector of Digilent’s Arty S7 board, is very much possible, even at high resolution.

I’ve been working away on my RISC-V FPGA based computer ‘kit’, which is based on my VHDL CPU: ported to RISC-V. I wanted to get a new development board with faster ram, and found it hard to find boards with DDR3 memory, a large enough FPGA, SD card interface, and HDMI out.

The SD card was not really a problem – it’s low speed, you can just connect it with slow SPI I/O. HDMI is certainly considered high speed – bandwidth across the 4 serial channels tops 4.4Gbps – but is driving HDMI through basic I/O interfaces possible?

It most certainly is!

A caveat, though: There is no protection circuitry. Do this at your own risk 🙂

The Digilent Arty S7 board can be had for sub £100, and my XC7S50 variant cost £119 shipped. It has DDR3, but no on-board HDMI. However, it turns out you can get perfectly acceptable (for my needs, anyway) output from the high-speed Pmod I/O ports. This will go very nicely with the small USB powered HDMI screens you can find, which are very handy.

Image showing development board connected via breadboard wires to an HDMI connector breakout board, which is driving a small HDMI display.

I have put a 720p output test on GitHub. It should load in Xilinx Vivado out of the box. You need to connect an HDMI cable to the Pmod, by either splicing a cable, or getting something like the Adafruit breakout board (seen above). It connects to Pmod JA, circled.

Image showing what I/O port to use.

In the constraints file for the project, the Pmod pins are specified to use the TMDS_33 standard. The pinout is defined as follows:

A diagram showing what Pmod pins are for what HDMI purpose.

The VHDL code in the example uses a simplified 1280x720x60 pixel clock of 75MHz – not the required 74.25MHz as per the standard. Due to this, some TVs/monitors may not accept the signal. It runs fine on my Acer XF240H and Samsung LU28E590DS using a 1m cable. I have not tried a television – they tend to be more picky. You can get much closer to 74.25MHz by chaining clock generators, but I have not done that in this instance. The refresh rate reports as 61Hz with this clock, likely due to it being out of spec.

If you want to learn more about how my example project generates a HDMI/DVI-D signal, you can find details here, which is part of my Designing a CPU in VHDL blog series.

Picture showing a monitor displaying a 720p signal.

A video of it in action, using timings applicable for my 800×480 module.

You can change the pixel clock and the VGA timings (sync starts, ends, porches) in the code to generate different resolutions. For 1080p at 60Hz the pixel rate is supposed to be 148.5MHz, but again my monitor will accept a rate of 150MHz, and show 61Hz.
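The 61Hz figure falls out of simple arithmetic: refresh rate is the pixel clock divided by the total pixels per frame, blanking included. The sketch below assumes the standard CEA-861 frame totals (1650x750 for 720p60 and 2200x1125 for 1080p60); the example project's "simplified" timings may use different totals, so treat the numbers as indicative.

```c
/* Refresh rate in Hz for a given pixel clock and total frame size.
 * h_total and v_total include the blanking intervals. */
double refresh_hz(double pixel_clock_hz, unsigned h_total, unsigned v_total)
{
    return pixel_clock_hz / ((double)h_total * (double)v_total);
}

/* Convenience wrappers using the standard CEA-861 totals (an assumption
 * about the timings, not taken from the example project). */
double refresh_720p(double pixel_clock_hz)
{
    return refresh_hz(pixel_clock_hz, 1650u, 750u);
}

double refresh_1080p(double pixel_clock_hz)
{
    return refresh_hz(pixel_clock_hz, 2200u, 1125u);
}
```

With those totals, 74.25MHz and 148.5MHz both give 60Hz, while 75MHz and 150MHz both give roughly 60.6Hz, which a monitor would round up to 61Hz.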

Even 1440p30 is possible. 1440p60 did not work, but I didn’t try hard to get a more accurate pixel clock in that instance. When you get to higher clocks, you can start to get timing constraint issues. At 1080p and 1440p I had some timing failures listed in the implementation report, but they did run. If you were using this in a real system, you’d have to fix those timing issues. That’s out of the scope of this blog, though 🙂

So there you have it. You really can bodge HDMI/DVI-D output directly through a Pmod!

Thanks for reading. Let me know what you think on twitter @domipheus.

The Boat PC – a marine based Raspberry Pi project

Motivation

In late 2015 I was doing my usual head-scratching about what gifts to get various family members for the holiday season. My wife mentioned making something electronic for my father-in-law’s boat, and after a few hours of collecting thoughts I came up with an idea:

  • A Raspberry Pi computer, which could be powered off the boat’s 12V batteries.
  • This computer would have sensors which made sense on a boat. Certainly GPS.
  • I’d have some software which collated the sensor data and displayed it nicely.
  • This could plug into the onboard TV using HDMI.
  • It would all be put into a suitable enclosure.

Excellent – a plan. I expected the hardware part to be easy, the enclosure part fairly straightforward, and the software part to be an absolute disaster. I started searching for an already-existing project to take care of the software side of things.

That’s when I came upon a project called OpenPlotter. It’s a fully-featured Linux distribution for Raspberry Pi, specifically for use on a boat, and includes the relevant software for calibrating, collating and transforming data from various sensors into a form that can be used practically. I’ve got to be honest here – OpenPlotter is solid, does exactly what it advertises, and is very simple for someone familiar with RPi/Linux to set up and use.

After firmly deciding on OpenPlotter for the software, and knowing I’d be using an old Raspberry Pi 2 I had collecting dust, I looked at what hardware OpenPlotter supported. The list is fairly long, and gave me ideas I had not thought of previously – for example using a USB DVB-T television dongle as an AIS receiver with Software Defined Radio (SDR), allowing real-time data of nearby ships to be displayed. MarineTraffic uses this AIS data, but of course on a boat you can’t rely on an internet connection to pull data from – it’s much better to get the data directly from the VHF signals.

In addition to AIS and GPS, I’d add an Inertial Measurement Unit (IMU – basically an accelerometer, gyroscope and magnetometer in one) in the form of an InvenSense MPU-9150, and also a USB to RS422 converter. RS422 is specified as part of the protocol standard for NMEA 0183, which in turn is the communication specification used in marine electronics. Supporting input and output of direct NMEA using RS422 would allow for some extendibility, for example depth sensors that are already present can feed data into OpenPlotter using this port.

After going and purchasing all of these sensors, I realised that actually using the TV inside the boat isn’t going to be useful, as it’s not visible from the helm. Thankfully, OpenPlotter allows for headless operation, and will automatically set up a WiFi hotspot so you can connect a phone/tablet to the Raspberry Pi and control it using VNC or other software.

The Build

So, to clarify, all the hardware gubbins required:

  • Raspberry Pi 2
  • Invensense MPU9150 board
  • RTL2832U DVB-T USB
  • USB to RS422 Converter
  • USB GPS module
  • USB wifi module

Of course, we need some associated utility to make this into an actual device;

  • 12V to 5V power converter
  • Power switch & connector
  • Status LED
  • Enclosure

When I’ve done projects in the past (the biggest one being PiOnTheWall from years ago), I’ve spent a significant amount of time searching for the right enclosure to put the hardware in. It’s not just a case of going and getting something that’s big enough to fit the contents: you need to know how thick the sides are, what kind of plastic it is, whether PCB standoffs are included, and whether there are vent holes.

After several days, I came up with the following which I got off ebay.

enclosure

I knew already the RTL2832U SDR dongle could run quite hot – so ventilation holes were a must. It’s the hottest part of this hardware, easily 60C+, whilst the Broadcom SoC of the Raspberry Pi will have to be working fairly hard to hit 45C. I did not plan to heatsink anything, and in the end it works fine without any heatsinks. I did make a conscious choice though to have the SDR board at the highest point in the enclosure, closest to the vents.

The design was simple – switch and status LED at the front, RS422, SDR antenna, Power In and Raspberry Pi Micro USB/HDMI/Audio out at the back. I removed all plastic covers from any USB devices, as they just bloated the inside, and I knew removing USB connectors would be a requirement. Laying out the components, I found an arrangement which worked well.

layout_annotated

The Raspberry Pi would be put on metal standoffs – I used some spares I had from various PC motherboards and cases. I just drilled straight through the bottom of the plastic case with a bit size such that the thread would drive into the plastic.

In my previous Raspberry Pi project I butchered the board, and I’m pleased to say the only thing I had to do in this instance was make the fixing holes on the PCB slightly larger to accommodate the screws for standoffs.

rpi_drill

standoffs

The GPS and WiFi modules remained as dongles, simply connected into one dual header on the Raspberry Pi. To aid fitting all the boards into the enclosure, the male USB connector of the RTL2832U SDR dongle was placed on a ribbon cable. Additionally, the miniUSB cable for the RS422 converter was made small enough to fit in the limited space available. These two boards were physically fixed to the rear panel via bolts, and in the SDR board’s case, a little shelf made from spare plastic.

422_sdr_cables

422_sdr_affixed

I’m not very good at making good panel openings, so sadly my HDMI and microUSB ports are very poor. At least they are at the back, where nobody should be able to see them 😉

Internally, all that was left was to connect the 12V->5V DC-DC converter to the Pi, put a power switch inline with the input 12v Power jack, attach the LED to 3v3 (there is a resistor in the LED leg heat-shrink), and fix the rest of it down with the same standoffs. It ended up looking fairly neat and tidy.

complete_internals

For those wondering, I connected the 5V output from the DC-DC converter direct to the 5V rail of the Pi. It bypasses some input protection which exists on the microUSB power input. For me this is okay, as I hoped it would allow the SDR USB dongle to draw more power than is ‘technically’ allowed from the onboard USB ports. I knew that was an issue back in the Raspberry Pi 1 days, and couldn’t remember if that was still the case with RPi 2.

The final rear panel:

complete_rear

The front of the enclosure, unit powered and closed.

complete_front

You will notice the USB socket on the front; I thought it could be useful to trickle charge phones or the tablet that would connect through WiFi to offer controls. I connected the unit to an HDMI monitor to do first-time OpenPlotter setup, making sure the sensors worked, and then switched it into headless mode, with VNC and NMEA 0183 output over its own ad-hoc WiFi hotspot.

Testing on the boat!

One thing that I could not test at home and needed to do on the boat was calibrate and test the AIS Receiver. There was a long gap between the hardware being “complete” in summer 2016, and testing it on-board in spring 2017.

AIS runs off VHF frequencies of around 162MHz, a wavelength of 1.85 meters. The boat has a marine antenna already which will work fine, but when I brought the device for testing I did not have the correct connector to interface with the SDR dongle.

antenna

Because of this, I made a quick and dirty 1-wire, quarter wavelength antenna. I used a good quality coax, with one end exposing only the inner core to a length of 46 centimeters. I then hooked this around a bit of the boat outside. It wouldn’t get long range, but I hoped I’d get some ship signatures in the marina – and it did! After following the calibration instructions on the OpenPlotter guide, I rebooted and after a few minutes the tablet (now connected to the RPi using WiFi) displayed the following:
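The 46 centimeter figure comes straight from the free-space wavelength. As a quick check, ignoring any velocity factor or end-effect correction (so it is only a first approximation), the sums look like this:

```c
/* Speed of light in a vacuum, metres per second. */
#define SPEED_OF_LIGHT_M_PER_S 299792458.0

/* Free-space wavelength in metres for a given frequency in Hz. */
double wavelength_m(double freq_hz)
{
    return SPEED_OF_LIGHT_M_PER_S / freq_hz;
}

/* Length in metres of a quarter-wave element at that frequency.
 * No velocity factor or end-effect correction is applied. */
double quarter_wave_m(double freq_hz)
{
    return wavelength_m(freq_hz) / 4.0;
}
```

At 162MHz this gives a wavelength of about 1.85 meters and a quarter-wave element of about 0.46 meters.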

tablet_ais

We used an Android app called SailTracker which takes the collated NMEA datastream and displays the data in an appropriate format. There are several paid apps that come complete with nautical maps, which is neat.

installed

And that’s it! All installed, wired into the 12V supply, and also now using the VHF antenna at the top of the mast. I’m quite proud of how this one turned out, and I’m very impressed with the OpenPlotter distribution for allowing this project to work as well as it did.

What I’d change

There are 3 things I’d change if I was to do this again:

  1. Changing the front panel LED to RGB, and making it a real status LED rather than just a power indicator. For example,
    • solid blue: OS booting,
    • flashing green: OpenPlotter starting services,
    • solid green: WiFi hotspot up,
    • red would be an error condition.
  2. Mounting the SDR dongle further in, allowing me to wire up the antenna input from the onboard mini MCX to a PL259 VHF connector on the rear panel. This would have eliminated some of the external complexity of needing various converters.
  3. I’d have a large cover over the microUSB/HDMI/audio raspberry pi connectors, as they are really only needed for debug, and it would have stopped me from making the messy cuts I did 🙂

Thanks for reading. If you have any questions or queries feel free to contact me at @domipheus.

Porting my VHDL Character Generator to Spartan3: Reducing clock speeds and pipelining

This is an article on porting my VHDL character generator from a Xilinx Spartan6 device to one with a Spartan3. It starts off as a simple port, analyzing device primitive differences and accounting for them in the design. Along the way, there were considerations on how clocks were generated, characteristics of block ram timing, and general algorithmic design. I’ll assume you’ve read the sections of my Designing a CPU in VHDL series specifically detailing the implementation of the character generator.

Reading time: 10 minutes

When I first attempted to synthesize my TPU CPU Core design on to the miniSpartan3 developer board (made by the great folks at Scarab Hardware), the bulk of the code went without a hitch. The processor core itself contains no primitive parts specific to a single vendor. However, the rest – Block Rams, Clock Generators – used instantiations of specific device primitives. These are different from family to family of FPGAs and those are where the most thought and investigation is needed, as changes can have knock-on impacts to operations further along the device path.

High level device differences are fairly minimal. On the board itself, we have a 32MHz clock input on the miniSpartan3 board instead of the 50MHz input on the miniSpartan6 setup. So we will need to change ratios of how we generate pixel and ram clocks for the DVI-D/HDMI video output. The FTDI chip for serial communication is similar and a communications channel is connected to the FPGA. We will need to change the constraints file for the pin definitions as well, but that’s always expected.

Clocks

The Spartan6 TPU design utilizes multiple clocks:

  • Base Clock 50MHz
  • CPU Core 100MHz
  • Pixel 25MHz
  • 5x Pixel 125MHz
  • 5x Pixel Inverted 125MHz
  • Char/Text Clock 250MHz

The CPU Core and Read/Write Port A’s of all Block Rams use the CPU Core clock. The UART Baud Generator uses the Base Clock. The VGA/Graphics signal generator uses the Pixel clock. The TMDS/DVI-D/HDMI encoders and output buffers use the 5x Pixel and 5x Pixel Inverted clocks. The Character generator, and relevant Port B block rams (Text and Font Rams) use the Char/Text Clock.

When porting to Spartan3, we need to use Digital Clock Managers (DCMs) instead of the Phase Locked Loops (PLLs) on Spartan6. The interface to DCMs is considerably different, but the base terms remain and you can understand what needs to change in the design without much thought.

s6_pll

One of the main issues is that DCMs have far fewer outputs than the PLLs. On the Spartan6 implementation, a single PLL primitive is used to drive all of the different clocks required. On Spartan3, we will need a DCM for each frequency.

s3_dcm

Due to this, we will require 3 DCM objects. Our Spartan3 chip XC3S200A only has 4 in total, so we are using a significant amount of resources to generate these clocks. However, we do have the available DCMs to get started immediately.

The DCMs themselves have multiple configurations to set up. We use the clock synthesizer (DFS) to get our 25MHz pixel clock from our 32MHz input. The maximum ranges for the DFS are outlined in the Spartan-3A datasheet.

dfs_limits

To generate the pixel clock, we multiply our 32MHz input by 15 to get 480MHz, then divide by 19 to get approximately 25.26MHz.

-- 32MHz -> ~25MHz
DCM_SP_inst : DCM_SP
generic map (
  CLKDV_DIVIDE => 2.0, --  Divide by: 1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,6.0,6.5
                       --     7.0,7.5,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0 or 16.0
  CLKFX_DIVIDE => 19,         --  Can be any integer from 1 to 32
  CLKFX_MULTIPLY => 15,       --  Can be any integer from 1 to 32
  CLKIN_DIVIDE_BY_2 => FALSE, --  TRUE/FALSE to enable CLKIN divide by two feature
  CLKIN_PERIOD => 31.25,      --  Period of input clock in ns (32MHz = 31.25ns)
  CLKOUT_PHASE_SHIFT => "NONE", --  Specify phase shift of "NONE", "FIXED" or "VARIABLE" 
  CLK_FEEDBACK => "1X",         --  Specify clock feedback of "NONE", "1X" or "2X" 
  DESKEW_ADJUST => "SYSTEM_SYNCHRONOUS", -- "SOURCE_SYNCHRONOUS", "SYSTEM_SYNCHRONOUS" or
                                         --     an integer from 0 to 15
  DLL_FREQUENCY_MODE => "LOW",     -- "HIGH" or "LOW" frequency mode for DLL
  DUTY_CYCLE_CORRECTION => TRUE,   --  Duty cycle correction, TRUE or FALSE
  PHASE_SHIFT => 0,        --  Amount of fixed phase shift from -255 to 255
  STARTUP_WAIT => FALSE)   --  Delay configuration DONE until DCM_SP LOCK, TRUE/FALSE
port map (
  CLK0 => CLK0,     -- 0 degree DCM CLK output
  CLK180 => CLK180, -- 180 degree DCM CLK output
  CLK270 => CLK270, -- 270 degree DCM CLK output
  CLK2X => open,    -- 2X DCM CLK output
  CLK2X180 => open, -- 2X, 180 degree DCM CLK out
  CLK90 => open,    -- 90 degree DCM CLK output
  CLKDV => open,    -- Divided DCM CLK out (CLKDV_DIVIDE)
  CLKFX => clock_pixel_unbuffered,   -- DCM CLK synthesis out (M/D)
  CLKFX180 => CLKFX180, -- 180 degree CLK synthesis out
  LOCKED => LOCKED, -- DCM LOCK status output
  PSDONE => PSDONE, -- Dynamic phase adjust done output
  STATUS => open,   -- 8-bit DCM status bits output
  CLKFB => CLKFB,   -- DCM clock feedback
  CLKIN => clk32_buffered,   -- Clock input (from IBUFG, BUFG or DCM)
  PSCLK => open,    -- Dynamic phase adjust clock input
  PSEN => open,     -- Dynamic phase adjust enable input
  PSINCDEC => open, -- Dynamic phase adjust increment/decrement
  RST => '0'        -- DCM asynchronous reset input
);
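The multiply/divide arithmetic behind the CLKFX settings above is easy to sanity-check outside the FPGA tools. This is just a sketch in C with made-up helper names; it also shows where a period value for a 32MHz input comes from.

```c
/* Output frequency of the DFS: f_out = f_in * multiply / divide (all in MHz). */
double dfs_output_mhz(double clkin_mhz, unsigned multiply, unsigned divide)
{
    return clkin_mhz * (double)multiply / (double)divide;
}

/* Clock period in nanoseconds for a frequency given in MHz. */
double period_ns(double freq_mhz)
{
    return 1000.0 / freq_mhz;
}
```

For the values used here, `dfs_output_mhz(32.0, 15, 19)` is about 25.26MHz, five times that is about 126.3MHz for the serialization clock, and `period_ns(32.0)` is 31.25ns.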
	

Block Rams

The block rams on Spartan3 are very similar to their Spartan6 counterparts. They do have different characteristics in terms of timings and therefore maximum operating frequency. The Block Ram primitives on my Spartan3 are not rated for the ~260MHz that those on the Spartan6 run at – so there will be changes required to the Character generator to account for additional latency in the memory operations.

UART

Thankfully, Xilinx provide the PicoBlaze UART objects for Spartan3 as they do for Spartan6, so there was very little work required in porting these over, apart from using different library objects. The Baud clock routine was changed to strobe correctly using the 32MHz base clock instead of 50MHz on the miniSpartan6. That was the only significant change here.
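To give a feel for the baud strobe change: the PicoBlaze UART macros take an enable pulse at 16 times the baud rate, so the strobe counter's period is the base clock divided by 16 times the baud rate. The baud rate is not stated in this post, so 115200 in the note below is purely an assumed example, and `baud_strobe_divisor` is an illustrative helper, not code from the project.

```c
#include <stdint.h>

/* Number of base-clock cycles between 16x-baud enable strobes,
 * rounded to the nearest integer. */
uint32_t baud_strobe_divisor(uint32_t clock_hz, uint32_t baud)
{
    uint32_t strobe_hz = 16u * baud;
    return (clock_hz + strobe_hz / 2u) / strobe_hz;
}
```

At an assumed 115200 baud, the 50MHz miniSpartan6 clock needs a divisor of 27, while the 32MHz miniSpartan3 clock needs 17.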

Differential Signalling Buffers

The OBUFDS output buffers used before can be used on Spartan3, along with the ODDR2 Double Data Rate registers for generating the 10x HDMI signalling.

Character Generator

Most of the work was on the Character Generator, because the base algorithm needed slight amendments to account for the increased memory latencies and slower clocks. However, I think it’s useful to see what happens if we ignore all of that for a second and simply ‘blind port’ the system, disregarding the rated maximum frequencies.

In fact, the blind port was my first attempt. And this was the result:

There are a few points of interest that you can take from this footage. I’ve singled out a frame to make them easier to identify.

corruption

  1. The colours are correct for the areas they should be
  2. The glyphs seem to be correct
  3. The corruption, while random, is tied to the X position: there are bars which are consistent across character locations.

If we look again at the state diagram for the character generator:

text_mode_diagram

Since the characters and their colours are correct, we know it’s not data corruption in transfer. However, the vertical banding occurs at the start of each character, indicating the glyph row data is not getting to the system in time.

blockrammhz

In this situation I went to the datasheets and application notes to find maximum frequency ratings for the clock synthesizers and block rams. The 250MHz char/pixel clock is well within specification for generation, but the block rams are only rated for 200MHz. Instead of redesigning the character generator to run off a new, slightly slower clock (200MHz), I started modifying it so that it would operate correctly at the 5x pixel serialization clock. That clock runs at half the rate of the old character clock, which keeps it well under the block ram limit, and reusing it frees up another DCM object, reducing our utilization from 3 to 2.

The way I started with this problem was to delay the character generator by a single pixel, allowing the memory requests to be pipelined over two pixels instead of one. At the 5x clock, two pixels still give us the 10 sub-cycles that each pixel’s requests need.

pipeline_diagram

Table A) shows how 10 sub-cycles are required, table B) shows how they would fit together into a pipelined 5 sub-cycle state machine and C) shows that optimized, as certain stages need only occur when you crossover into a new character. The latching and fetching of the glyph data is idempotent and does not incur additional costs, as the tram data which derives the addresses for glyph rows is only fetched on each 8-pixel character transition.

My first implementation of this seemed to work well enough, apart from there being a duplicated pixel in the glyph.

lastorfirstduplicated

It is harder than you’d think to tell whether this duplicated pixel was at the start or end of glyph processing, so I forced the background colour of the screen to flip-flop between character locations as they were output, allowing you to see the specific zone that a glyph should reside within.
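
A debug aid like this can be very small. The following is only a sketch of the idea, not the code I used: the entity, port names, widths and the debug colour are all assumptions. It relies on characters being 8 pixels wide, so bit 3 of the X coordinate toggles once per character cell.

  library ieee;
  use ieee.std_logic_1164.all;

  -- Hypothetical debug helper: alternate the background per character cell.
  entity debug_cell_bars is
    port (
      I_x  : in  std_logic_vector(11 downto 0);  -- pixel X coordinate (width assumed)
      I_bg : in  std_logic_vector(23 downto 0);  -- normal background colour (RGB assumed)
      O_bg : out std_logic_vector(23 downto 0)   -- background actually drawn
    );
  end entity;

  architecture rtl of debug_cell_bars is
    -- Arbitrary colour chosen for the example so odd cells stand out.
    constant DEBUG_BG : std_logic_vector(23 downto 0) := x"400040";
  begin
    -- Bit 3 of X flips every 8 pixels, i.e. once per character cell.
    O_bg <= DEBUG_BG when I_x(3) = '1' else I_bg;
  end architecture;

The one thing to be careful of is that I_x must be the coordinate of the pixel actually being output. If the generator is delayed by a pixel, the bars need the same delay, or they will mislead you about where each cell starts.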

bars_debug

From here, I could tell it was the first pixel which was causing the problem. The first pixel encapsulates all memory requests, with the further 7 pixels in a glyph row only using a cached version of the data, so it was time to go into the simulator and look at some internal signals.

In the Simulator

The first thing to notice was a disparity between the pixel X coordinate and the actual pixel/5x pixel clocks. Because the pixel operations are driven from the 5x clock, the coordinates could change mid-request. The fix was to add a process driven off the pixel clock which latches the X and Y coordinates, so that the logic driven from the 5x clock can use the latched values instead.

latching_coords_2

You can see the issue clearly in the above waveform. At (a) we kick off a new initial glyph row request (State 1 is only ever entered on the first pixel of a character row). If we do not latch the coordinate, it could flip halfway through the request, at (b).

Since we already had a process running from the pixel clock to manage the blinker flags, this was a simple addition.

  -- This process latches the X and Y on the pixel clock
  -- Also manages the blinker.
  process(I_clk_pixel)
  begin
    if rising_edge(I_clk_pixel) then
      blinker_count <= blinker_count + 1;
      x_latched <= I_x;
      y_latched <= I_y;
    end if;
  end process;

The last issue was a really irritating one, as it was a very basic bug in my code. A simple state < 4 check should have been <= 4, meaning state 4 was prolonged by an additional cycle, throwing the first pixel off. Easily fixed, and easily spotted in the simulator.

good

The last thing to do was to also try it on my miniSpartan6+ project, and it worked first time – which is great 🙂

Wrap Up

We now have the character generator running off of a 5x pixel clock, with the font/text ram read ports also running at that slower clock. As well as keeping us within the rated specs of the Spartan3 FPGA I have, it will additionally allow for higher resolutions in the future, especially on the Spartan6 variant.

Thanks for reading, let me know what you think on Twitter @domipheus.