The Boat PC – a marine based Raspberry Pi project

Motivation

In late 2015 I was doing my usual head-scratching about what gifts to get various family members for the holiday season. My wife mentioned making something electronic for my father-in-law’s boat, and after a few hours of collecting my thoughts I came up with an idea:

  • A Raspberry Pi computer, which could be powered off the boat’s 12V batteries.
  • This computer would have sensors which made sense on a boat. Certainly GPS.
  • I’d have some software which collated the sensor data and displayed it nicely.
  • This could plug into the onboard TV using HDMI.
  • It would all be put into a suitable enclosure.

Excellent – a plan. I expected the hardware part to be easy, the enclosure part fairly straightforward, and the software part to be an absolute disaster. I started searching for an already-existing project to take care of the software side of things.

That’s when I came upon a project called OpenPlotter. It’s a fully-featured Linux distribution for Raspberry Pi, specifically for use on a boat, and includes the relevant software for calibrating, collating and transforming data from various sensors into a form that can be used practically. I’ve got to be honest here – OpenPlotter is solid, does exactly what it advertises, and is very simple for someone familiar with RPi/Linux to set up and use.

After firmly deciding on OpenPlotter for the software, and knowing I’d be using an old Raspberry Pi 2 I had collecting dust, I looked at what hardware OpenPlotter supported. The list is fairly long, and gave me ideas I had not thought of previously – for example using a USB DVB-T television dongle as an AIS receiver with Software Defined Radio (SDR), allowing real-time data of nearby ships to be displayed. MarineTraffic uses this AIS data, but of course on a boat you can’t rely on an internet connection to pull data from – it’s much better to get the data directly from the VHF signals.

In addition to AIS and GPS, I’d add an Inertial Measurement Unit (IMU – basically an accelerometer, gyroscope and magnetometer in one) in the form of an InvenSense MPU-9150, and also a USB to RS422 converter. RS422 is specified as part of the protocol standard for NMEA 0183, which in turn is the communication specification used in marine electronics. Supporting input and output of direct NMEA using RS422 would allow for some extensibility; for example, depth sensors that are already present can feed data into OpenPlotter using this port.

After going and purchasing all of these sensors, I realised that actually using the TV inside the boat isn’t going to be useful, as it’s not visible from the helm. Thankfully, OpenPlotter allows for headless operation, and will automatically set up a WiFi hotspot so you can connect a phone/tablet to the Raspberry Pi and control it using VNC or other software.

The Build

So, to clarify, all the hardware gubbins required:

  • Raspberry Pi 2
  • InvenSense MPU-9150 board
  • RTL2832U DVB-T USB
  • USB to RS422 Converter
  • USB GPS module
  • USB WiFi module

Of course, we need some supporting parts to make this into an actual device:

  • 12V to 5V power converter
  • Power switch & connector
  • Status LED
  • Enclosure

When I’ve done projects in the past (the biggest one being PiOnTheWall from years ago), I’ve spent a significant amount of time searching for the right enclosure to put the hardware in. It’s not just a case of going and getting something that’s big enough to fit the contents; you need to know how thick the sides are, what kind of plastic it is, whether PCB standoffs are included, and whether there are vent holes.

After several days, I came up with the following, which I got off eBay.

enclosure

I already knew the RTL2832U SDR dongle could run quite hot – so ventilation holes were a must. It’s the hottest part of this hardware, easily 60C+, whilst the Broadcom SoC of the Raspberry Pi will have to be working fairly hard to hit 45C. I did not plan to heatsink anything, and in the end it works fine without any heatsinks. I did make a conscious choice though to have the SDR board at the highest point in the enclosure, closest to the vents.

The design was simple – switch and status LED at the front; RS422, SDR antenna, Power In and Raspberry Pi microUSB/HDMI/Audio out at the back. I removed all plastic covers from any USB devices, as they just bloated the inside, and I knew removing USB connectors would be a requirement. Laying out the components, I found a layout which worked well.

layout_annotated

The Raspberry Pi would be put on metal standoffs – I used some spares I had from various PC motherboards and cases. I just drilled straight through the bottom of the plastic case with a bit size such that the thread would drive into the plastic.

In my previous Raspberry Pi project I butchered the board, and I’m pleased to say the only thing I had to do in this instance was make the fixing holes on the PCB slightly larger to accommodate the screws for standoffs.

rpi_drill

standoffs

The GPS and WiFi modules remained as dongles, simply connected into one dual header on the Raspberry Pi. To aid fitting all the boards into the enclosure, the male USB connector of the RTL2832U SDR dongle was placed on a ribbon cable. Additionally, the miniUSB cable for the RS422 converter was made small enough to fit in the limited space available. These two boards were physically fixed to the rear panel via bolts, and in the SDR board’s case, a little shelf made from spare plastic.

422_sdr_cables

422_sdr_affixed

I’m not very good at making neat panel openings, so sadly the cutouts for my HDMI and microUSB ports are very poor. At least they are at the back, where nobody should be able to see them 😉

Internally, all that was left was to connect the 12V->5V DC-DC converter to the Pi, put a power switch inline with the 12V power input jack, attach the LED to 3V3 (there is a resistor in the LED leg heat-shrink), and fix the rest of it down with the same standoffs. It ended up looking fairly neat and tidy.

complete_internals

For those wondering, I connected the 5V output from the DC-DC converter direct to the 5V rail of the Pi. It bypasses some input protection which exists on the microUSB power input. For me this is okay; I hoped it would allow the SDR USB dongle to draw more power than is ‘technically’ allowed from the onboard USB ports. I knew that was an issue back in the Raspberry Pi 1 days, and couldn’t remember if that was still the case with the RPi 2.

The final rear panel:

complete_rear

The front of the enclosure, unit powered and closed.

complete_front

You will notice the USB socket on the front; I thought it could be useful to trickle charge phones or the tablet that would connect through WiFi to offer controls. I connected the unit to an HDMI monitor to do first-time OpenPlotter setup, making sure the sensors worked, and then switched it into headless mode, with VNC and NMEA 0183 output over its own ad-hoc WiFi hotspot.

Testing on the boat!

One thing that I could not test at home and needed to do on the boat was calibrate and test the AIS Receiver. There was a long gap between the hardware being “complete” in summer 2016, and testing it on-board in spring 2017.

AIS runs on VHF frequencies of around 162MHz, a wavelength of 1.85 meters. The boat already has a marine antenna which will work fine, but when I brought the device for testing I did not have the correct connector to interface it with the SDR dongle.

antenna

Because of this, I made a quick and dirty 1-wire, quarter wavelength antenna. I used a good quality coax, with one end exposing only the inner core to a length of 46 centimeters. I then hooked this around a bit of the boat outside. It wouldn’t get long range, but I hoped it would pick up some ship signatures in the marina – and it did! After following the calibration instructions in the OpenPlotter guide, I rebooted, and after a few minutes the tablet (now connected to the RPi using WiFi) displayed the following:

tablet_ais
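
As a rough check of the antenna figures quoted above, the wavelength and the quarter-wave element length can be computed directly. This is a minimal sketch that assumes free-space propagation and ignores the wire's velocity factor and end effects, so the result is a starting length rather than a tuned one. The function names are illustrative only.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def wavelength_m(frequency_hz: float) -> float:
    """Free-space wavelength in metres for a given frequency."""
    return SPEED_OF_LIGHT_M_PER_S / frequency_hz


def quarter_wave_cm(frequency_hz: float) -> float:
    """Length of a quarter-wave element in centimetres (no velocity factor applied)."""
    return wavelength_m(frequency_hz) / 4 * 100


AIS_FREQUENCY_HZ = 162e6  # "around 162MHz", as stated in the text

print(round(wavelength_m(AIS_FREQUENCY_HZ), 2))  # 1.85
print(round(quarter_wave_cm(AIS_FREQUENCY_HZ)))  # 46
```

Both values agree with the 1.85 meter wavelength and the 46 centimeter exposed core used for the temporary antenna.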

We used an Android app called SailTracker which takes the collated NMEA datastream and displays the data in an appropriate format. There are several paid apps that come complete with nautical maps, which is neat.

installed

And that’s it! All installed, wired into the 12V supply, and also now using the VHF antenna at the top of the mast. I’m quite proud of how this one turned out, and I’m very impressed with the OpenPlotter distribution for allowing this project to work as well as it did.

What I’d change

There are 3 things I’d change if I was to do this again:

  1. Change the front panel LED to RGB, and make it a real status LED rather than just a power indicator. For example,
    • solid blue: OS booting,
    • flashing green: OpenPlotter starting services,
    • solid green: WiFi hotspot up,
    • red would be an error condition.
  2. Mount the SDR dongle further in, allowing me to wire up the antenna input from the onboard mini MCX to a PL259 VHF connector on the rear panel. This would have eliminated some of the external complexity of needing various converters.
  3. Fit a large cover over the microUSB/HDMI/audio Raspberry Pi connectors, as they are really only needed for debug, and it would have stopped me from making the messy cuts I did 🙂

Thanks for reading. If you have any questions or queries feel free to contact me at @domipheus.

Porting my VHDL Character Generator to Spartan3: Reducing clock speeds and pipelining

This is an article on porting my VHDL character generator from a Xilinx Spartan6 device to one with a Spartan3. It starts off as a simple port, analyzing device primitive differences and accounting for them in the design. Along the way, there were considerations on how clocks were generated, characteristics of block ram timing, and general algorithmic design. I’ll assume you’ve read the sections of my Designing a CPU in VHDL series specifically detailing the implementation of the character generator.

Reading time: 10 minutes

When I first attempted to synthesize my TPU CPU Core design on to the miniSpartan3 developer board (made by the great folks at Scarab Hardware), the bulk of the code went without a hitch. The processor core itself contains no primitive parts specific to a single vendor. However, the rest – Block Rams, Clock Generators – used instantiations of specific device primitives. These are different from family to family of FPGAs and those are where the most thought and investigation is needed, as changes can have knock-on impacts to operations further along the device path.

High level device differences are fairly minimal. On the board itself, we have a 32MHz clock input on the miniSpartan3 board instead of the 50MHz input on the miniSpartan6 setup, so we will need to change the ratios used to generate the pixel and ram clocks for the DVI-D/HDMI video output. The FTDI chip for serial communication is similar and a communications channel is connected to the FPGA. We will need to change the constraints file for the pin definitions as well, but that’s always expected.

Clocks

The Spartan6 TPU design utilizes multiple clocks:

  • Base Clock 50MHz
  • CPU Core 100MHz
  • Pixel 25MHz
  • 5x Pixel 125MHz
  • 5x Pixel Inverted 125MHz
  • Char/Text Clock 250MHz

The CPU Core and Read/Write Port A’s of all Block Rams use the CPU Core clock. The UART Baud Generator uses the Base Clock. The VGA/Graphics signal generator uses the Pixel clock. The TMDS/DVI-D/HDMI encoders and output buffers use the 5x Pixel and 5x Pixel Inverted clocks. The Character generator, and relevant Port B block rams (Text and Font Rams) use the Char/Text Clock.

When porting to Spartan3, we need to use Digital Clock Managers (DCMs) instead of the Phase Locked Loops (PLLs) on Spartan6. The interface to DCMs is considerably different, but the base terms remain and you can understand what needs to change in the design without much thought.

s6_pll

One of the main issues is that DCMs have far fewer outputs than the PLLs. On the Spartan6 implementation, a single PLL primitive is used to drive all of the different clocks required. On Spartan3, we will need a DCM for each frequency.

s3_dcm

Due to this, we will require 3 DCM objects. Our Spartan3 chip XC3S200A only has 4 in total, so we are using a significant amount of resources to generate these clocks. However, we do have the available DCMs to get started immediately.

The DCMs themselves have multiple configurations to set up. We use the clock synthesizer (DFS) to get our 25MHz pixel clock from our 32MHz input. The maximum ranges for the DFS are outlined in the Spartan-3A datasheet.

dfs_limits

To generate the pixel clock, we multiply our 32MHz input by 15 to get 480MHz, then divide by 19 to get approximately 25.26MHz.

-- 32MHz -> ~25MHz
DCM_SP_inst : DCM_SP
generic map (
  CLKDV_DIVIDE => 2.0, --  Divide by: 1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,6.0,6.5
                       --     7.0,7.5,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0 or 16.0
  CLKFX_DIVIDE => 19,         --  Can be any integer from 1 to 32
  CLKFX_MULTIPLY => 15,       --  Can be any integer from 2 to 32
  CLKIN_DIVIDE_BY_2 => FALSE, --  TRUE/FALSE to enable CLKIN divide by two feature
  CLKIN_PERIOD => 31.25,      --  Specify period of input clock in ns (32MHz -> 31.25 ns)
  CLKOUT_PHASE_SHIFT => "NONE", --  Specify phase shift of "NONE", "FIXED" or "VARIABLE" 
  CLK_FEEDBACK => "1X",         --  Specify clock feedback of "NONE", "1X" or "2X" 
  DESKEW_ADJUST => "SYSTEM_SYNCHRONOUS", -- "SOURCE_SYNCHRONOUS", "SYSTEM_SYNCHRONOUS" or
                                         --     an integer from 0 to 15
  DLL_FREQUENCY_MODE => "LOW",     -- "HIGH" or "LOW" frequency mode for DLL
  DUTY_CYCLE_CORRECTION => TRUE,   --  Duty cycle correction, TRUE or FALSE
  PHASE_SHIFT => 0,        --  Amount of fixed phase shift from -255 to 255
  STARTUP_WAIT => FALSE)   --  Delay configuration DONE until DCM_SP LOCK, TRUE/FALSE
port map (
  CLK0 => CLK0,     -- 0 degree DCM CLK output
  CLK180 => CLK180, -- 180 degree DCM CLK output
  CLK270 => CLK270, -- 270 degree DCM CLK output
  CLK2X => open,    -- 2X DCM CLK output
  CLK2X180 => open, -- 2X, 180 degree DCM CLK out
  CLK90 => open,    -- 90 degree DCM CLK output
  CLKDV => open,    -- Divided DCM CLK out (CLKDV_DIVIDE)
  CLKFX => clock_pixel_unbuffered,   -- DCM CLK synthesis out (M/D)
  CLKFX180 => CLKFX180, -- 180 degree CLK synthesis out
  LOCKED => LOCKED, -- DCM LOCK status output
  PSDONE => PSDONE, -- Dynamic phase adjust done output
  STATUS => open,   -- 8-bit DCM status bits output
  CLKFB => CLKFB,   -- DCM clock feedback
  CLKIN => clk32_buffered,   -- Clock input (from IBUFG, BUFG or DCM)
  PSCLK => open,    -- Dynamic phase adjust clock input
  PSEN => open,     -- Dynamic phase adjust enable input
  PSINCDEC => open, -- Dynamic phase adjust increment/decrement
  RST => '0'        -- DCM asynchronous reset input
);
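
The DFS arithmetic above can be sanity-checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it checks the multiply/divide arithmetic and the input period, not the datasheet frequency limits.

```python
def dfs_output_mhz(clkin_mhz: float, multiply: int, divide: int) -> float:
    """CLKFX frequency for a given CLKFX_MULTIPLY / CLKFX_DIVIDE pair."""
    return clkin_mhz * multiply / divide


def period_ns(frequency_mhz: float) -> float:
    """Clock period in nanoseconds, the unit CLKIN_PERIOD is given in."""
    return 1000.0 / frequency_mhz


CLKIN_MHZ = 32.0
pixel_mhz = dfs_output_mhz(CLKIN_MHZ, multiply=15, divide=19)

print(round(pixel_mhz, 3))      # 25.263
print(round(pixel_mhz * 5, 3))  # 126.316, the matching 5x pixel clock
print(period_ns(CLKIN_MHZ))     # 31.25
```

The pixel clock lands a little above the nominal 25MHz, and the 5x serialization clock scales with it.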
	

Block Rams

The block rams on Spartan3 are very similar to their Spartan6 counterparts. They do have different characteristics in terms of timings and therefore maximum operating frequency. The Block Ram primitives on my Spartan3 are not rated for the ~260MHz that those on the Spartan6 run at – so there will be changes required to the Character generator to account for additional latency in the memory operations.

UART

Thankfully, Xilinx provide the PicoBlaze UART objects for Spartan3 as they do for Spartan6, so there was very little work required in porting these over, apart from using different library objects. The Baud clock routine was changed to strobe correctly using the 32MHz base clock instead of the 50MHz one on the miniSpartan6. That was the only significant change here.

Differential Signalling Buffers

The OBUFDS output buffers used before can be used on Spartan3, along with the ODDR2 Double Data Rate registers for generating the 10x HDMI signalling.

Character Generator

Most work was on the Character Generator. This was due to the base algorithm of the system needing slight amendments to account for the increased memory latencies and slower clocks. However, I think it’s useful to see what things can happen if we ignore all of that for a second, and simply ‘blind port’ the system ignoring the rated maximum frequencies, just to see what happens.

In fact, the blind port was my first attempt. And this was the result:

There are a few points of interest that you can take from this footage. I’ve singled out a frame to identify them more easily.

corruption

  1. The colours are correct for the areas they should be in.
  2. The glyphs seem to be correct.
  3. The corruption, while random, occurs along the X direction, as there are bars which are consistent across character locations.

If we look again at the state diagram for the character generator:

text_mode_diagram

As we can tell that the character along with the colour is correct, we know it’s not data corruption in transfer. However, the vertical banding is occurring at the start of the character, indicating the glyph row data is not getting to the system in time.

blockrammhz

In this situation I went to the datasheets and application notes to find maximum frequency ratings for the clock synthesizers and block rams. The 250MHz char/pixel clock is well within specification for generation, but the block rams are only rated for 200MHz. Instead of attempting to redesign the character generator to run off a new slightly slower clock (200MHz), I started modifying it so that it would operate correctly at the 5x pixel serialization clock – as this would free up another DCM object and reduce our utilization from 3 to 2.

I approached this problem by delaying the character generator by a single pixel, allowing the memory requests to be pipelined over two pixels instead of one. This would then give us the 10 sub-cycles per pixel.

pipeline_diagram

Table A) shows how 10 sub-cycles are required, table B) shows how they would fit together into a pipelined 5 sub-cycle state machine and C) shows that optimized, as certain stages need only occur when you crossover into a new character. The latching and fetching of the glyph data is idempotent and does not incur additional costs, as the tram data which derives the addresses for glyph rows is only fetched on each 8-pixel character transition.
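
The sub-cycle budget behind this change is simple arithmetic on the nominal clock figures from the Spartan6 clock list and the 200MHz block ram rating quoted earlier. The sketch below just makes that arithmetic explicit; the helper names are hypothetical and not taken from the design.

```python
BLOCK_RAM_MAX_MHZ = 200  # block ram rating quoted in the text for this Spartan3 part
PIXEL_CLOCK_MHZ = 25     # nominal pixel clock


def sub_cycles_per_pixel(fast_clock_mhz: int, pixel_clock_mhz: int = PIXEL_CLOCK_MHZ) -> int:
    """Number of fast-clock cycles available during one pixel."""
    cycles, remainder = divmod(fast_clock_mhz, pixel_clock_mhz)
    if remainder:
        raise ValueError("fast clock must be an integer multiple of the pixel clock")
    return cycles


def within_block_ram_rating(clock_mhz: int) -> bool:
    """True when the block ram read port can legally run at this clock."""
    return clock_mhz <= BLOCK_RAM_MAX_MHZ


print(sub_cycles_per_pixel(250), within_block_ram_rating(250))  # 10 False
print(sub_cycles_per_pixel(125), within_block_ram_rating(125))  # 5 True
print(2 * sub_cycles_per_pixel(125))                            # 10, spread over two pixels
```

Running at the 5x clock halves the cycles per pixel, and the one-pixel delay recovers the original ten by spreading a request across two pixels.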

My first implementation of this seemed to work well enough, apart from there being a duplicated pixel in the glyph.

lastorfirstduplicated

It is harder than you’d think to tell whether this duplicated pixel was at the start or end of glyph processing, so I forced the background colour of the screen to flip-flop between character locations as they were output, allowing you to see the specific zone that a glyph should reside within.

bars_debug

From here, I could tell it was the first pixel which was causing the problem. The first one encapsulates all memory requests – with the further 7 pixels in a glyph row only utilizing a cached version of the data, so from here it was time to go into the simulator and look at some internal signals.

In the Simulator

The first thing to notice was a disparity between the pixel x coordinate and the actual pixel/5x pixel clocks. Due to the pixel operations being driven from the 5x clock, there could be instances where the coordinates changed mid-request. The way to fix this was to have a process driven off of the pixel clock which latched the X and Y coordinates, so that the various other logic driven from the 5x clock could use the latched values.

latching_coords_2

You can see the issue clearly in the above waveform. At (a) we kick off a new initial glyph row request (State 1 is only ever entered on the first pixel of a character row). If we do not latch the coordinate, halfway through the request at (b) we could have the coordinate flip.

Since we already had a process running from the pixel clock to manage the blinker flags, this was a simple addition.

  -- This process latches the X and Y on the pixel clock
  -- Also manages the blinker.
  process(I_clk_pixel)
  begin
    if rising_edge(I_clk_pixel) then
      blinker_count <= blinker_count + 1;
      x_latched <= I_x;
      y_latched <= I_y;
    end if;
  end process;

The last issue was a really irritating one, as it was a very basic bug in my code. A simple state < 4 check should have been <= 4, which meant state 4 was prolonged by an additional cycle, throwing the first pixel off. It was easily spotted in the simulator, and easily fixed.

good

The last thing to do was to also try it on my miniSpartan6+ project, and it worked first time – which is great 🙂

Wrap Up

We now have the character generator running off of a 5x pixel clock, with the font/text ram read ports also running at that slower clock. As well as allowing us to run within the specs of the Spartan3 FPGA I have, it will additionally allow for higher resolutions in the future – especially on the Spartan6 variant.

Thanks for reading, let me know what you think on Twitter @domipheus.

Designing a CPU in VHDL, Part 14: ISA changes, software interrupts and bugfixing that BIOS code

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

It’s finally that time! I have committed the latest TPU VHDL, assembler and ISA to my GitHub repository. Fair warning: The assembler is _horrid_. The VHDL contains an ISE project for the LX25 miniSpartan6+ variant, but the font ROM is empty. The font ROM I’ve been testing with doesn’t have a license attached and I don’t want to blindly add it here. You can, however, simulate with ISim, and directly inspect the Text Ram blocks in ASCII mode to see any text displayed. I will explain that more later in this part.

The ‘BIOS’ code

So just now we have a TPU program which prints a splash message, checks for the size of the contiguous memory area from 0x0000 (slowly), and then waits for a command from the input UART. This UART is connected to the FTDI chip on port B, so appears as a COM port to a computer when the USB cable is connected to the miniSpartan6+. It only accepts a single character command just now, mainly because I have chosen the path of progress here rather than the path of lovely command-line words, which involves several string functions that honestly nobody should need to waste time writing (yes, looking at you, tech interviewers).

Getting to this point, without even writing significant code to handle a command, I realized that the code was so big (~1.5KB) that I’d have trouble fitting it into a single block ram. TASM, the Tpu ASseMbler, currently only outputs a single block ram initializer, and the VHDL duplicates that single initializer across each memory bank, so it would be a lot of work to fix all of that. I instead wanted to look at why exactly there is so much code for so little functionality.

I edited TASM to output instruction metrics, simply a count for each operation. I then checked what was the biggest, ignoring the define word (dw) operation, as string constants were significant but not really relevant. This was the result:

before_instcount_pie

If you look at that and scan for counts that don’t make any sense, there are a few that jump right out. For instance, Branch Immediate (bi) is only used twice in all this code? 1400 lines of assembly and only 2 branch immediates?

When investigating the code, I realized why. The 16 bit op codes don’t leave a lot of room for immediate values, even if they are shifted to account for 2-byte alignment of instructions. To play it safe, I was instead loading into the high and low bytes of registers, and branching to a register. So 4 instructions (load.l, load.h, or, br) instead of a single branch immediate. There were also lots of function calls, and those branches were performed using Branch Register.

More puzzling was the difference between the write.w and read.w counts. 111 write words, versus 38 read words? Given most of the reads and writes in the code were stack manipulation for the various function calls, it seemed like there was significant overspill from saving registers that then never needed to be read back (basically, myself being over cautious).

So, I decided to perform two changes to the ISA:

First, I would remove the Branch Immediate instruction, replacing it with a Branch Immediate Relative Offset. This will allow for the largest possible range, and help with eventual Position Independent Code, as the immediate is a signed offset from the current program counter. Perfect for use when jumping around inside functions.

Second, I will introduce an interrupt instruction, int. This will allow for the software-defined invocation of the interrupt handler, with an immediate value being provided to the Interrupt Event Field. Then I will be able to replicate the old IBM/DOS BIOS routines – where interrupts were signaled to perform low-level tasks.

So, here are the two new instructions. They are part of ISA 1.5, which is on github:

int_biro_defns
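
To illustrate how a signed, PC-relative immediate behaves, here is a minimal sketch of encoding and decoding such a branch. It is not the real biro encoding: the 8-bit field width and the halving of the offset to account for 2-byte alignment are assumptions made for the example, and the function names are hypothetical. The actual layout is defined in ISA 1.5.

```python
ASSUMED_IMM_BITS = 8  # hypothetical field width, NOT taken from the TPU ISA


def encode_relative_branch(pc: int, target: int, imm_bits: int = ASSUMED_IMM_BITS) -> int:
    """Return the two's-complement immediate field for a branch from pc to target."""
    byte_offset = target - pc
    if byte_offset % 2 != 0:
        raise ValueError("instructions are 2-byte aligned")
    word_offset = byte_offset // 2  # assumed: the immediate counts instructions, not bytes
    lowest, highest = -(1 << (imm_bits - 1)), (1 << (imm_bits - 1)) - 1
    if not lowest <= word_offset <= highest:
        raise ValueError("target out of range for a relative branch")
    return word_offset & ((1 << imm_bits) - 1)


def decode_relative_branch(pc: int, field: int, imm_bits: int = ASSUMED_IMM_BITS) -> int:
    """Recover the branch target from the current pc and the immediate field."""
    sign_bit = 1 << (imm_bits - 1)
    word_offset = (field ^ sign_bit) - sign_bit  # sign-extend the field
    return pc + word_offset * 2


print(encode_relative_branch(0x0100, 0x0100))       # 0, a branch to self
print(hex(encode_relative_branch(0x0100, 0x00FE)))  # 0xff, one instruction backwards
print(hex(decode_relative_branch(0x0100, 0xFF)))    # 0xfe
```

Because only the distance is encoded, the same instruction bytes work wherever the code is loaded, which is what makes this form useful for position independent code.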

Calling conventions

Currently, my code sets up a stack at a known high address. This could then be reset to another location after the memory test, but isn’t done just yet. The stack is pointed to by r7, and currently I’ve defined all other registers r0-r6 to be volatile for the purpose of calling conventions. This simply means that if you need the value of a register preserved across a function call, you must save the value onto the stack before you call. I mentioned function calling and the instructions used in a previous part, but here is a reminder of how it’s done:

# Call setcursor (9, 2)

subi    r7, r7, 2             #reserve stack for preserve
write.w r7, r0, 0             #we want to preserve r0
  
subi    r7, r7, 6             #reserve stack for call
load.l  r1, 9                 #2 word arguments and a return address = 6 bytes
load.l  r2, 2
write.w r7, r1, 2             #arg0 x=9
write.w r7, r2, 4             #arg1 y=2

spc     r6                    #get the current PC value
addi    r6, r6, 14            #offset that value past the branch below
write.w r7, r6, 0             #save the return pc

load.h  r0, $setcursor.h      
load.l  r1, $setcursor.l
or      r0, r0, r1            #build a register containing function location
br      r0                    #branch to the function label
   
addi    r7, r7, 6             #restore stack 

read.w  r0, r7, 0             #read the old r0 back
addi    r7, r7, 2             #restore stack (preserve)

So lots of writes and bloat for every call, but it works well enough. I have return values passed back in registers. By implementing BIOS routines for these I/O functions using software interrupts, we should reduce the code size considerably.

BIOS Routines

The new int instruction allows user code to jump into the interrupt handler, with a user-defined Interrupt Event Field. The space in the immediate area of the opcode allows for 64 values, so we define Interrupt Event Field values less than 64 to be only applicable to software routines.

In the interrupt handler, all registers except r7 are immediately saved to the stack. Of course this means the stack must always have enough space for this, but assume that is true for now. If the Event Field is < 64, we use this value to get a function address from a table. We then set up a function call and branch. On return, we save r0 back to the stack, directly where it will be restored from. This allows for data to be returned from BIOS function calls. Currently, the BIOS table is as follows:

interrupts
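
The save, dispatch and restore flow described above can be modelled in a few lines. This is a behavioural sketch in Python, not the TPU handler itself: the register file is a plain list, the routine and its event number are hypothetical, and only the rule that values below 64 are software routines comes from the text.

```python
from typing import Callable, Dict, List

SOFTWARE_EVENT_LIMIT = 64  # event fields below this are software (BIOS) routines
STACK_POINTER = 7          # r7 is the stack pointer and is not saved

BiosRoutine = Callable[[List[int]], int]


def handle_interrupt(event: int, regs: List[int], bios_table: Dict[int, BiosRoutine]) -> None:
    """Save r0-r6, run the BIOS routine for this event, return its result in r0."""
    saved = regs[:STACK_POINTER]  # stands in for pushing r0-r6 to the stack
    if event < SOFTWARE_EVENT_LIMIT and event in bios_table:
        result = bios_table[event](regs)
        saved[0] = result & 0xFFFF  # overwrite the saved r0 slot with the return value
    regs[:STACK_POINTER] = saved  # stands in for popping r0-r6 back


def add_arguments(regs: List[int]) -> int:
    """Hypothetical routine: adds r1 and r2, clobbering r3 along the way."""
    regs[3] = 0xDEAD
    return regs[1] + regs[2]


HYPOTHETICAL_ADD_EVENT = 5  # made-up table index for illustration
regs = [0x1111, 3, 4, 0x3333, 0, 0, 0, 0x7FF0]
handle_interrupt(HYPOTHETICAL_ADD_EVENT, regs, {HYPOTHETICAL_ADD_EVENT: add_arguments})
print(regs[0], hex(regs[3]))  # 7 0x3333
```

The caller sees the result in r0 while every other register comes back untouched, even though the routine clobbered r3 internally.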

So the basics are there just now. Enough to make a test with input from UART and output to text-mode display. Note the division by zero entry – there is no hardware divide. I have a naive division function, which throws an int 0 when division by zero is encountered.
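
As an illustration of what a naive software divide with a division-by-zero trap can look like, here is a sketch using repeated subtraction. The algorithm choice is an assumption, since the actual TPU routine is not shown, and the exception class simply stands in for executing int 0.

```python
class SoftwareInterrupt(Exception):
    """Stand-in for executing the int instruction with a given event field."""

    def __init__(self, event: int) -> None:
        super().__init__(f"int {event}")
        self.event = event


DIVIDE_BY_ZERO_EVENT = 0  # the text states division by zero raises int 0


def naive_divide(dividend: int, divisor: int) -> tuple:
    """Unsigned division by repeated subtraction; returns (quotient, remainder)."""
    if divisor == 0:
        raise SoftwareInterrupt(DIVIDE_BY_ZERO_EVENT)
    quotient = 0
    while dividend >= divisor:
        dividend -= divisor
        quotient += 1
    return quotient, dividend


print(naive_divide(1400, 16))  # (87, 8)
```

Repeated subtraction is slow for large quotients, but it needs nothing beyond subtract and compare, which suits a core with no hardware divider.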

There are obviously restrictions to this. For example, you really should not use the int instruction when inside an interrupt handler, but nothing stops you from doing it. Keeping to these restrictions will be tricky. I have started looking at how the hardware can disable/restore interrupt enable states and also handle the case when an invalid int instruction is encountered.

I went and moved all the code of my simple bios example over to this new system based on software interrupts. I also changed Branch Immediates to Branch Immediate Relative Offset, but didn’t fish out other opportunities for it significantly.

The output on screen was corrupt. Data was being clobbered somewhere, so I needed to debug.

Debugging the issue

TPU does not have debugging functionality for the software running on it, but you can use the ISim simulator, and edit the assembly code to at least allow for some ‘what are the registers at point X’ information.

memory__regs

It is incredibly time consuming, but invaluable when things go wrong, which they inevitably did when I tried out the int instruction. I’ve mentioned in previous parts that in ISim you can inspect the contents of the internal Block Ram memories and other signals. This allows you to see register values, and the contents of the text ram. You can set up the data view to show ASCII, and get a representation of the characters you’d want displayed.

memory_tram

What I tend to do is place a branch to self such as “biro 0” where I want to inspect registers, then run the code in the simulator. I could also use the TPU emulator, but I’ve yet to implement the full set of opcodes in it. Doing this in various places led me to discover that the r0 register was being overwritten on execution of int instructions. The value written was always 0x0008 – which those who have read all previous TPU parts may recognize as the interrupt vector.

The way the ALU works is that various signals are always calculated and made available, but then the relevant one is selected, usually by a signal from the decoder.

when OPCODE_SPEC_F2_INT =>
    s_result(15 downto 0) <= ADDR_INTVEC;
    s_interrupt_rpc <= std_logic_vector(unsigned(I_PC) + 2);
    s_interrupt_register <= X"00" & "00" & I_dataIMM(7 downto 2);
    
    s_prev_interrupt_enable <= s_interrupt_enable;
    s_interrupt_enable <= '0';
    s_shouldBranch <= '1';	

The instruction is implemented in the ALU simply – like a standard branch. If s_shouldBranch is ‘1’, the PC unit takes the next PC from the output of the ALU, which is s_result(15 downto 0) – the Interrupt Vector ADDR_INTVEC (0x0008).
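
The bit manipulation in that ALU case can be mirrored in Python to make the field widths explicit. The slice I_dataIMM(7 downto 2), the PC + 2 return address and the 0x0008 vector come from the VHDL above; the Python function names are illustrative only.

```python
ADDR_INTVEC = 0x0008  # interrupt vector, as quoted in the text


def interrupt_event_field(data_imm: int) -> int:
    """Mirror of X"00" & "00" & I_dataIMM(7 downto 2): a zero-extended 6-bit field."""
    return (data_imm >> 2) & 0x3F


def interrupt_return_pc(pc: int) -> int:
    """Mirror of unsigned(I_PC) + 2, wrapped to 16 bits."""
    return (pc + 2) & 0xFFFF


print(interrupt_event_field(0b11111100))  # 63, the largest software event
print(interrupt_event_field(0b00000100))  # 1
print(hex(interrupt_return_pc(0x0120)))   # 0x122
```

Six bits give exactly the 64 software event values mentioned earlier.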

Looking back at the int instruction layout:

int_inst

You can see that the bits where a destination register should be (11-8) in an Rd-form instruction are all 0. The issue we have is that register r0 is having 8 written to it after an int instruction, and in the same way a branch uses the output of the ALU, if a register write is requested, the same value – the output of the ALU – will be used as the source data.

Looking at the decoder, I saw that I’d forgotten to add the new instruction. I’d only added it to the ALU block. The decoder sets the write enable signal for the destination register:

case I_dataInst(IFO_F2_BEGIN downto IFO_F2_END) is
  when OPCODE_SPEC_F2_GETPC =>
    O_regDwe <= '1';
  when OPCODE_SPEC_F2_GETSTATUS =>
    O_regDwe <= '1';
  when others =>
end case;

With O_regDwe not being set in the OPCODE_SPEC_F2_INT case, it would preserve its previous state. The instruction before the int was usually an immediate load into a register – meaning the regDwe signal would be left enabled, producing exactly the incorrect behavior I saw.

case I_dataInst(IFO_F2_BEGIN downto IFO_F2_END) is
  when OPCODE_SPEC_F2_GETPC =>
    O_regDwe <= '1';
  when OPCODE_SPEC_F2_GETSTATUS =>
    O_regDwe <= '1';
  when OPCODE_SPEC_F2_INT =>
    O_regDwe <= '0';
  when others =>
end case;

Quickly adding the case for our new INT instruction, things worked perfectly.

New instruction stats

Now that we have the int instruction, I collated the number of instructions for the same functions, and got the expected reduction in code size.

after_instcount_pie

instruction_count

Finding more bugs

As the amount of TPU code I write grows, I expect to find more issues in various parts of its implementation. Some are obvious bugs, whilst others may be hazards in the CPU itself. One such bug was in the Memory Interrupt Process, introduced only in the last part. It only ever fired once, because a check for the state-machine state was in the wrong place:

...
if rising_edge(cEng_clk_core) and MEM_access_int_state = 0 then
    if MEM_Access_error = '1' then
      I_int <= '1';
...

The check for MEM_access_int_state should have been after the clock edge check:

...
if rising_edge(cEng_clk_core) then
    if MEM_Access_error = '1' and MEM_access_int_state = 0 then
      I_int <= '1';
...

With this edit in place, multiple memory violation interrupt requests can take place. This is required for some code that I wrote which displays the ‘memory map’ of the system. A read is performed at each 2KB boundary, and if a memory violation interrupt is raised it is assumed the whole bank/block is unmapped. This continues until all 32 banks are checked, with the result displayed on screen as it happens.

memmap

Explanation:

  • 0x0000 – 0x3FFF: RAM
  • 0x9000 – 0x97FF: Memory mapped I/O
  • 0xA000 – 0xAFFF: Font RAM
  • 0xB000 – 0xBFFF: Text RAM
  • 0xC000 – 0xEFFF: VRAM
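
The scan itself is easy to model on a PC. The following Python sketch is not the TPU code; it assumes the mapped ranges listed above, uses a hypothetical `read` function that raises an exception where the hardware would raise a memory violation interrupt, and probes the first address of each 2KB bank.

```python
BANK_SIZE = 0x800   # 2KB
NUM_BANKS = 32      # 32 * 2KB covers the 64KB address space

# Mapped regions from the list above (inclusive start and end addresses).
MAPPED = [
    (0x0000, 0x3FFF, "RAM"),
    (0x9000, 0x97FF, "Memory mapped I/O"),
    (0xA000, 0xAFFF, "Font RAM"),
    (0xB000, 0xBFFF, "Text RAM"),
    (0xC000, 0xEFFF, "VRAM"),
]


class MemoryViolation(Exception):
    """Stands in for the memory violation interrupt."""


def read(addr):
    """Hypothetical read: succeeds only inside a mapped region."""
    for start, end, name in MAPPED:
        if start <= addr <= end:
            return name
    raise MemoryViolation(addr)


def scan():
    """Probe one address per bank; a violation marks the whole bank unmapped."""
    result = []
    for bank in range(NUM_BANKS):
        try:
            read(bank * BANK_SIZE)
            result.append(True)
        except MemoryViolation:
            result.append(False)
    return result


bank_map = scan()
print("".join("#" if mapped else "." for mapped in bank_map))
```

Each `#` in the printed line is a bank whose probe read succeeded and each `.` is a bank that was treated as unmapped.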

Getting back to the start of this post, where we discussed simple command execution, this memory map prints out if the ‘m’ command is entered through the UART. There are two other commands currently, ‘l’ for lighting all LEDs and ‘d’ for darkening them.

There is also a current issue where UART input seems to generate multiple instances of the same character from my read function. I’ve yet to look at that problem in detail.

Wrap Up

So in this part, we’ve investigated some ‘BIOS’ code, modified the branch immediate instruction, added a software interrupt instruction, and fixed some issues in their implementation.

At the moment, I’m finding the hardest part of TPU is bad tooling. TASM for assembling code was good and served its purpose when I needed a quick fix for generating small tests. However, now that I’m thinking of sending program code over UART which is ‘compiled’ as position independent code, TASM is showing significant weaknesses. I’ll need to get around that somehow.

For now that’s the end of this part. Many thanks for reading. As always, if you have any comments or queries, please grab me on Twitter @domipheus.

Getting Started with the miniSpartan3 FPGA board

The folks over at Scarab Hardware, who make the miniSpartan6+ board I do most of my FPGA tinkering on, kindly provided me with one of their other devices – the miniSpartan3.

miniSpartan3 is a smaller board, with fewer features and a Spartan3 Xilinx FPGA instead of the newer generation Spartan6. However, it is very competitively priced, with the board I received costing only $39 – which is a bargain for a small dev board with HDMI out, really.

I thought I’d write a small post about how to get this board set up and running a “hello world” test. To do this, we need a few things:

  • Xilinx ISE 14 WebPack
  • A HDL design to run
  • A UCF constraints file for the miniSpartan3 board
  • xc3sprog to program the board

Let’s get started!

ISE WebPack

This is the official Xilinx development environment for the Spartan3 FPGA family. It has a free version, but is a significant download – you can find it here. The instructions on my “Designing a CPU in VHDL” part are still correct, so you can follow them to obtain the installer.

It’s worth noting that ISE has problems on Windows 10 systems. It installs, but crashes whenever a file-open dialog appears. There is a patched set of DLLs which aim to solve this, and a few videos you can find on YouTube explaining the steps. I’ve used these and it seems to run fine now.

ise_start

On starting the ISE Project Navigator (I run the 64-bit version) you can start a new project. Throw in a name, location, and optional description. Our top-level source is HDL.

ise_newproj

Now we must specify the settings for the project. The miniSpartan3 board (unsurprisingly) uses a Spartan3A family FPGA part. Specifically, my board has the XC3S200A part, in the VQ100 package. If you are following me in VHDL, set that as your preferred language and progress to the next screen.

ise_projsettings

The next screen is simply a summary. Ensure here that the device and package are correct 🙂

ise_summary

Once the project is created we are presented with an empty project hierarchy. We shall add some new source to it.

ise_new_source

The source we want is a VHDL module. I’ve called mine led_top, as it is going to use the LEDs and is our top-level module.

ise_new_module

ise_define_module

The module will have two input ports and one output port. The inputs are the clock and a 2-bit bus for our two on-board switches. The output is the three LEDs present on the miniSpartan3 board. We can use the new source wizard to define the interface boilerplate for us – the 3 LEDs would be defined as an output bus, with 3 bits – MSB 2 down to LSB 0. Completing the wizard creates the source for us, and we can then add our additional logic to it.

ise_first_source

Our Hello World project

The end goal for our hello world project is simple:

  • One switch will enable our LED counter output, or fix them all as lit
  • The second switch will select a fast binary count, or slow binary count

The LEDs will count in binary from 0 to 7. Internally, there will be a 32-bit counter, which increments every cycle of our 32MHz input clock. The second switch simply selects which three-bit window of this 32-bit number is shown. Because the counter increments every cycle, the windows need to sit in the high bit range, say bits 24, 25 and 26. This lets you see the LEDs change, instead of them being a blur.

To achieve this, we need to introduce our counter. It is a signal within the behavioral led_top architecture definition.

architecture Behavioral of led_top is
  signal count: unsigned(31 downto 0) := X"00000000";
  
  ...snip...

This count signal is of type unsigned, which is closely related to the STD_LOGIC_VECTOR type of our LED output and converts to it with a simple cast. Unlike STD_LOGIC_VECTOR, it has arithmetic operations defined. To use them in a project, uncomment the arithmetic package line near the top of the autogenerated boilerplate.

-- Uncomment the following library declaration if using
-- arithmetic functions with Signed or Unsigned values
use IEEE.NUMERIC_STD.ALL;

The LEDs are a vector of 3 std_logic bits. We will also define a signal to hold their state, which will capture the relevant values of bits in the counter each cycle.

  signal threeBits: std_logic_vector(2 downto 0) := "000";

We then define a process block, which is ‘run’ every time the state of I_clk changes.

  process(I_clk)
  begin
    if rising_edge(I_clk) then
      count <= count + 1;

      if I_switches(1) = '1' then
        threeBits <= std_logic_vector(count(25 downto 23));
      else
        threeBits <= std_logic_vector(count(28 downto 26));
      end if;
    end if;
  end process;

Following this code, on each rising edge of the input clock we increment the unsigned count signal by 1. Then, depending on the state of our second switch, we capture three bits of data from that counter and place them in our ‘threeBits’ signal.
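
As a rough check on how fast each window appears to count, the arithmetic can be worked through in Python. This assumes the clock really is 32MHz and uses the bit ranges from the process above; the helper name `step_period` is just for illustration.

```python
CLOCK_HZ = 32_000_000  # input clock frequency, assuming the 32MHz clock


def step_period(lowest_bit):
    """Seconds between changes of a window whose lowest bit is `lowest_bit`.

    Bit n of a free-running counter changes once every 2**n clock cycles,
    so the displayed 3-bit value steps at that rate.
    """
    return (2 ** lowest_bit) / CLOCK_HZ


fast = step_period(23)   # window count(25 downto 23)
slow = step_period(26)   # window count(28 downto 26)

print(f"fast window: {fast:.3f} s per step, {8 * fast:.2f} s per 0-7 cycle")
print(f"slow window: {slow:.3f} s per step, {8 * slow:.2f} s per 0-7 cycle")
```

So the fast setting steps roughly four times a second, and the slow one about once every two seconds.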

Outside of the process, we have a concurrent (asynchronous) assignment to the output LEDs.

O_leds <= threeBits when I_switches(0) = '1' else "111";

This will output the three bits of our counter to the LEDs when our first switch is active, otherwise output a state in which all LEDs are on.

The full source for this small sample, for clarity, is below.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity led_top is
    Port ( I_clk : in  STD_LOGIC;
           O_leds : out  STD_LOGIC_VECTOR (2 downto 0);
           I_switches : in STD_LOGIC_VECTOR (1 downto 0));
end led_top;

architecture Behavioral of led_top is
  signal count: unsigned(31 downto 0) := X"00000000";
  signal threeBits: std_logic_vector(2 downto 0) := "000";
begin

  process(I_clk)
  begin
    if rising_edge(I_clk) then
      count <= count + 1;

      if I_switches(1) = '1' then
        threeBits <= std_logic_vector(count(25 downto 23));
      else
        threeBits <= std_logic_vector(count(28 downto 26));
      end if;
    end if;
  end process;

  O_leds <= threeBits when I_switches(0) = '1' else "111";

end Behavioral;

Now that we have a top-level VHDL module for the behavior we require, we need to ensure those inputs and outputs are mapped to the correct pins on the miniSpartan3 board.

Creating the constraints file

The constraints file contains mappings from net names to pins on our board, along with relevant standards for what kind of I/O is used. We create a User Constraints File (.ucf) using the New Source option as previously used for our VHDL module.

ise_ucf

We need to define mappings for our I_clk, I_switches and O_leds nets to the correct pins on the FPGA package. For that, we need to inspect the schematic of the miniSpartan3 board, which is provided on the GitHub project from Scarab Hardware.

pins

We can see from the schematic that our LEDs, switches and clock input are clearly defined. The UCF file is a list of NET names mapped to LOC pin locations, along with modifiers. For example, we can see the I_clk input is on pin P85, so we define it like so:

# 32 MHz clock
NET "I_clk" LOC = "P85"; 

We can also follow the lines from the LEDs to the FPGA pins, and get their mappings.

# Leds
NET "O_leds<2>" LOC = "P16";  
NET "O_leds<1>" LOC = "P19";  
NET "O_leds<0>" LOC = "P20"; 

When it comes to the switches, however, we need to give some more information. If you look at the schematic, you can see the switches connect directly from the FPGA, through the switch, to ground. This means that we need the input to be ‘pulled up’ to the high logic level via an internal resistor. In other words, unless the input pin is forced to ground, it will read high. To request this, we state that the net is also PULLUP.

# Switches
NET "I_switches<0>" LOC = "P98" | PULLUP;
NET "I_switches<1>" LOC = "P99" | PULLUP;
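
A typo in a net name or a forgotten pin only shows up later in the toolchain, so a quick scripted check of the constraints can be handy. The Python sketch below is not part of the ISE flow; it assumes the six NET lines shown in this section, and the `parse_ucf` helper and its regular expression are illustrative and only handle this simple NET/LOC form.

```python
import re

UCF_TEXT = '''
# 32 MHz clock
NET "I_clk" LOC = "P85";
# Leds
NET "O_leds<2>" LOC = "P16";
NET "O_leds<1>" LOC = "P19";
NET "O_leds<0>" LOC = "P20";
# Switches
NET "I_switches<0>" LOC = "P98" | PULLUP;
NET "I_switches<1>" LOC = "P99" | PULLUP;
'''

# Every bit of every port declared in the led_top entity.
EXPECTED_NETS = {
    "I_clk",
    "O_leds<0>", "O_leds<1>", "O_leds<2>",
    "I_switches<0>", "I_switches<1>",
}

NET_RE = re.compile(r'^\s*NET\s+"([^"]+)"\s+LOC\s*=\s*"([^"]+)"')


def parse_ucf(text):
    """Return a {net: pin} dict for the simple NET/LOC lines used here."""
    pins = {}
    for line in text.splitlines():
        match = NET_RE.match(line)
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


pins = parse_ucf(UCF_TEXT)
missing = EXPECTED_NETS - set(pins)
duplicated = len(set(pins.values())) != len(pins)
print("missing nets:", sorted(missing), "| duplicate pins:", duplicated)
```

If a port bit is left out of the UCF text, its name appears in the missing list; two nets sharing one pin flips the duplicate flag.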

With those 6 NET definitions in our UCF file, we can now go ahead and create a programming file.

generate

This will create a led_top.bit file in our project folder. We will flash this file onto the miniSpartan3 board.

Flashing the miniSpartan3

Normally you flash a small SPI memory chip, which the FPGA reads from on power-up, but the easiest way to get your design working is to flash the FPGA directly on a one-time basis. This means you need to re-flash after each power cycle and you cannot reset it, but that is fine for this Hello World example. To do this, you can use a program called xc3sprog. More information is available here.

“xc3sprog is a suite of utilities for programming Xilinx FPGAs, CPLDs, and EEPROMs with the Xilinx Parallel Cable and other JTAG adapters under Linux.”

Despite the blurb above, it also works under Windows, which is my primary platform just now. You can find a prebuilt Windows binary on SourceForge.

xc3sprog -c ftdi led_top.bit

xc3sprog

This one-time flashes the device, and we get a working Hello World example! Note there is a failure message in the output, but it works.

Wrap Up

I hope this is useful to folks! If you have any questions, you can find me on Twitter @domipheus. This project is available on GitHub.