The Boat PC – a marine based Raspberry Pi project

Motivation

In late 2015 I was doing my usual head-scratching about what gifts to get various family members for the holiday season. My wife mentioned making something electronic for my father-in-law’s boat, and after a few hours of collecting thoughts I came up with an idea:

  • A Raspberry Pi computer, which could be powered off the boat’s 12V batteries.
  • This computer would have sensors which made sense on a boat. Certainly GPS.
  • I’d have some software which collated the sensor data and displayed it nicely.
  • This could plug into the onboard TV using HDMI.
  • It would all be put into a suitable enclosure.

Excellent – a plan. I expected the hardware part to be easy, the enclosure part fairly straightforward, and the software part to be an absolute disaster. I started searching for an already-existing project to take care of the software side of things.

That’s when I came upon a project called OpenPlotter. It’s a fully-featured Linux distribution for Raspberry Pi, specifically for use on a boat, and includes the relevant software for calibrating, collating and transforming data from various sensors into a form that can be used practically. I’ve got to be honest here – OpenPlotter is solid, does exactly what it advertises, and is very simple for someone familiar with RPi/Linux to set up and use.

After firmly deciding on OpenPlotter for the software, and knowing I’d be using an old Raspberry Pi 2 I had collecting dust, I looked at what hardware OpenPlotter supported. The list is fairly long, and gave me ideas I had not thought of previously – for example using a USB DVB-T television dongle as an AIS receiver with Software Defined Radio (SDR), allowing real-time data of nearby ships to be displayed. MarineTraffic uses this AIS data, but of course on a boat you can’t rely on an internet connection to pull data from – it’s much better to get the data directly from the VHF signals.

In addition to AIS and GPS, I’d add an Inertial Measurement Unit (IMU – basically an accelerometer, gyroscope and magnetometer in one) in the form of an InvenSense MPU-9150, and also a USB to RS422 converter. RS422 is specified as part of the protocol standard for NMEA 0183, which in turn is the communication specification used in marine electronics. Supporting input and output of direct NMEA using RS422 would allow for some extensibility; for example, depth sensors that are already present can feed data into OpenPlotter using this port.

After going and purchasing all of these sensors, I realised that actually using the TV inside the boat isn’t going to be useful, as it’s not visible from the helm. Thankfully, OpenPlotter allows for headless operation, and will automatically set up a WiFi hotspot so you can connect a phone/tablet to the Raspberry Pi and control it using VNC or other software.

The Build

So, to clarify, all the hardware gubbins required:

  • Raspberry Pi 2
  • Invensense MPU9150 board
  • RTL2832U DVB-T USB
  • USB to RS422 Converter
  • USB GPS module
  • USB wifi module

Of course, we need some associated utility parts to make this into an actual device:

  • 12V to 5V power converter
  • Power switch & connector
  • Status LED
  • Enclosure

When I’ve done projects in the past (the biggest one being PiOnTheWall from years ago), I’ve spent a significant amount of time searching for the right enclosure to put the hardware in. It’s not just a case of going and getting something that’s big enough to fit the contents; you need to know how thick the sides are, what kind of plastic it is, whether PCB standoffs are included, and whether there are vent holes.

After several days, I came up with the following, which I got off eBay.

enclosure

I already knew the RTL2832U SDR dongle could run quite hot – so ventilation holes were a must. It’s the hottest part of this hardware, easily 60C+, whilst the Broadcom SoC of the Raspberry Pi has to be working fairly hard to hit 45C. I did not plan to heatsink anything, and in the end it works fine without heatsinks. I did make a conscious choice, though, to have the SDR board at the highest point in the enclosure, closest to the vents.

The design was simple – switch and status LED at the front; RS422, SDR antenna, power in and the Raspberry Pi microUSB/HDMI/audio out at the back. I removed all plastic covers from any USB devices, as they just bloated the inside, and I knew removing USB connectors would be a requirement. After trying a few ways of laying out the components, I found a layout which worked well.

layout_annotated

The Raspberry Pi would be put on metal standoffs – I used some spares I had from various PC motherboards and cases. I just drilled straight through the bottom of the plastic case with a bit size such that the thread would drive into the plastic.

In my previous Raspberry Pi project I butchered the board, and I’m pleased to say the only thing I had to do in this instance was make the fixing holes on the PCB slightly larger to accommodate the screws for standoffs.

rpi_drill

standoffs

The GPS and WiFi modules remained as dongles, simply connected into one dual header on the Raspberry Pi. To aid fitting all the boards into the enclosure, the male USB connector of the RTL2832U SDR dongle was placed on a ribbon cable. Additionally, the miniUSB cable for the RS422 converter was made small enough to fit in the limited space available. These two boards were physically fixed to the rear panel via bolts and, in the SDR board’s case, a little shelf made from spare plastic.

422_sdr_cables

422_sdr_affixed

I’m not very good at making good panel openings, so sadly my HDMI and microUSB ports are very poor. At least they are at the back, where nobody should be able to see them 😉

Internally, all that was left was to connect the 12V->5V DC-DC converter to the Pi, put a power switch inline with the 12V power input jack, attach the LED to 3v3 (there is a resistor in the LED leg heat-shrink), and fix the rest of it down with the same standoffs. It ended up looking fairly neat and tidy.

complete_internals

For those wondering, I connected the 5V output from the DC-DC converter direct to the 5V rail of the Pi. This bypasses some input protection which exists on the microUSB power input. For me this is okay; I hoped it would allow the SDR USB dongle to draw more power than is ‘technically’ allowed from the onboard USB ports. I knew that was an issue back in the Raspberry Pi 1 days, and couldn’t remember if that was still the case with the RPi 2.

The final rear panel:

complete_rear

The front of the enclosure, unit powered and closed.

complete_front

You will notice the USB socket on the front; I thought it could be useful to trickle charge phones or the tablet that would connect through WiFi to offer controls. I connected the unit to an HDMI monitor to do first-time OpenPlotter setup, making sure the sensors worked, and then switched it into headless mode, with VNC and NMEA 0183 output over its own ad-hoc WiFi hotspot.

Testing on the boat!

One thing that I could not test at home and needed to do on the boat was calibrate and test the AIS Receiver. There was a long gap between the hardware being “complete” in summer 2016, and testing it on-board in spring 2017.

AIS runs off VHF frequencies of around 162MHz, a wavelength of 1.85 meters. The boat already has a marine antenna which will work fine, but when I brought the device for testing I did not have the correct connector to interface it with the SDR dongle.

antenna

Because of this, I made a quick and dirty 1-wire, quarter wavelength antenna. I used a good quality coax, with one end exposing only the inner core to a length of 46 centimeters. I then hooked this around a bit of the boat outside. It wouldn’t get long range, but I hoped it would pick up some ship signatures in the marina – and it did! After following the calibration instructions in the OpenPlotter guide, I rebooted, and after a few minutes the tablet (now connected to the RPi using WiFi) displayed the following:

tablet_ais
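
The 46 centimeter figure for the quick antenna follows directly from the 162MHz AIS frequency. As a rough sketch of that arithmetic (the function names are my own for illustration, and this is the free-space figure only, ignoring any real-world trimming):

```c
/* Speed of light in metres per second. */
#define SPEED_OF_LIGHT_M_S 299792458.0

/* Free-space wavelength in metres for a frequency given in Hz. */
double wavelength_m(double freq_hz) {
    return SPEED_OF_LIGHT_M_S / freq_hz;
}

/* Length of a quarter-wave element in centimetres. */
double quarter_wave_cm(double freq_hz) {
    return wavelength_m(freq_hz) * 100.0 / 4.0;
}
```

Calling wavelength_m(162e6) gives a little over 1.85 meters, and quarter_wave_cm(162e6) a little over 46 centimeters, matching the numbers used above.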

We used an Android app called SailTracker which takes the collated NMEA datastream and displays the data in an appropriate format. There are several paid apps that come complete with nautical maps, which is neat.

installed

And that’s it! All installed, wired into the 12V supply, and also now using the VHF antenna at the top of the mast. I’m quite proud of how this one turned out, and I’m very impressed with the OpenPlotter distribution for allowing this project to work as well as it did.

What I’d change

There are three things I’d change if I was to do this again:

  1. Change the front panel LED to RGB, and make it a real status LED rather than just a power indicator. For example,
    • solid blue: OS booting,
    • flashing green: OpenPlotter starting services,
    • solid green: WiFi hotspot up,
    • red would be an error condition.
  2. Mounting the SDR dongle further in, allowing me to wire up the antenna input from the onboard mini MCX to a PL259 VHF connector on the rear panel. This would have eliminated some of the external complexity of needing various converters.
  3. I’d have a large cover over the microUSB/HDMI/audio Raspberry Pi connectors, as they are really only needed for debug, and it would have stopped me from making the messy cuts I did 🙂

Thanks for reading. If you have any questions or queries feel free to contact me at @domipheus.

A UART Implementation in VHDL

I’m still working on my Soft-CPU TPU, but wanted to implement a communications channel for it to use in order to get some form of input and output from it. The easiest way to do this is to use a UART, and connect it to a USB to Serial converter for logic-level asynchronous communications.

Knowing that I’m still pretty new to VHDL and working with FPGA systems in general at this level, I decided to develop my own UART implementation. Some may roll their eyes at this, knowing there are plenty out there, and even constructs to utilize real hardware on the Spartan 6 FPGA I’m using; but I’m a fan of learning by doing.

Serial Communications

What I’m implementing is a transmitter and receiver which can operate at any baud rate, with 8 data bits, no parity and 1 stop bit. It should be able to communicate over a COM port to a PC, or to another UART. It’s working at logic-level voltages, which is very important – you need to use a logic-level USB-Serial cable for this. Using an RS232 serial cable will damage things if it uses the higher voltages specified.

Looking at how we transmit, the waveform looks as follows:

tx

Assuming that the ‘baud’ clock is running at the correct frequency we require, you can see that it’s fairly simple how all of this works. The idle state for the TX line is always logic high. This may seem weird, but historically the distances the wires crossed meant they were susceptible to damage, and having the idle state high meant if any problem occurred with the physical wires, you’d know about it very quickly.

To transmit an 8-bit byte, a start bit is emitted which is logic low. One ‘baud tick’ later, the least significant bit of the byte is sent, and then on every following baud tick the next bit is sent, until the most significant bit has gone out. Finally, a stop bit is sent, which is logic high. At this point another byte can be sent immediately – or the line left idle to transmit later, after a delay.
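
To make the bit ordering concrete, here is a small C sketch (not part of the VHDL project; the function name is my own) that lays out the ten line levels of one 8N1 frame, one entry per baud tick:

```c
#include <stdint.h>

#define UART_FRAME_BITS 10 /* 1 start + 8 data + 1 stop */

/* Fill out[] with the TX line level for each baud tick of one 8N1 frame. */
void uart_frame_8n1(uint8_t byte, uint8_t out[UART_FRAME_BITS]) {
    out[0] = 0;                        /* start bit: logic low */
    for (int i = 0; i < 8; i++) {
        out[1 + i] = (byte >> i) & 1;  /* data bits, least significant first */
    }
    out[9] = 1;                        /* stop bit: logic high */
}
```

For the byte 0x48 (‘H’) this gives the sequence 0, 0,0,0,1,0,0,1,0, 1: the start bit, the data bits from bit 0 up to bit 7, and the stop bit.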

Transmitter States

The transmitter is very simple. There is a data byte input, and a txSig port which is used to signal that the bits on the data input should be sent. When txSig is asserted, state moves from idle to a start state where a start bit is issued. From there, we progress to the data state, where the 8 bits of data are pushed to the output, least significant bit first. Finally there is the stop bit state, before moving back to idle, or straight back to start in the case data is being streamed out.

tx_states

For the states, I use an integer signal as it seemed the simplest and generally the most obvious way to go about it. The whole transmitter code is below.

tx_proc: process (tx_clk, I_reset, I_txSig, tx_state)
begin
  -- TX runs off the TX baud clock
  if rising_edge(tx_clk) then
    if I_reset = '1' then
      tx_state <= 0;
      tx_data <= X"00";
      tx_rdy <= '1';
      tx <= '1';
    else
      if tx_state = 0 and I_txSig = '1' then
        tx_state <= 1;
        tx_data <= I_txData;
        tx_rdy <= '0';
        tx <= '0'; -- start bit
      elsif tx_state < 9 and tx_rdy = '0' then
        tx <= tx_data(0);
        tx_data <= '0' & tx_data (7 downto 1);
        tx_state <= tx_state + 1;
      elsif tx_state = 9 and tx_rdy = '0' then
        tx <= '1'; -- stop bit
        tx_rdy <= '1';
        tx_state <= 0;
      end if;
    end if;
  end if;
end process;

As you can see from the VHDL, the start state is tx_state=0, the data state is tx_state=1..8 and the stop state is tx_state=9. The process is idle when tx_state is 0 with I_txSig=0. The tx_clk baud clock is generated from the higher-frequency system clock using a counter:

  -- TX standard baud clock, no reset
  if tx_clk_counter = 0 then
    -- chop off LSB to get a clock
    tx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 1)));
    tx_clk <= not tx_clk;
  else
    tx_clk_counter <= tx_clk_counter - 1;
  end if;

The way the I_clk_baud_count value is set is as follows:

I_clk in Hz / expected baud = I_clk_baud_count

So, for 9600bps on a system using a 50MHz clock, one would assign 50000000/9600, or 5208, to I_clk_baud_count. The TX clock is generated by negating the clock signal every 5208/2 counts of the system clock.
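
The divisor arithmetic can be checked with a few lines of C. This is only a sketch with names of my own choosing, and it is idealised: it treats one bit period as exactly I_clk_baud_count system clocks and ignores any extra cycle the counter reload may add.

```c
#include <stdint.h>

/* System clock cycles per baud period; integer division truncates. */
uint32_t baud_count(uint32_t clk_hz, uint32_t baud) {
    return clk_hz / baud;
}

/* Counter reload value for the TX clock: the same "chop off the LSB"
 * step as the VHDL slice (15 downto 1), i.e. half a baud period. */
uint32_t tx_half_period(uint32_t count) {
    return count >> 1;
}

/* Percentage error between the requested baud rate and the rate an
 * integer divisor actually produces (idealised, see note above). */
double baud_error_percent(uint32_t clk_hz, uint32_t baud) {
    double actual = (double)clk_hz / (double)baud_count(clk_hz, baud);
    return (actual - (double)baud) * 100.0 / (double)baud;
}
```

With a 50MHz clock, baud_count(50000000, 9600) is 5208 and tx_half_period(5208) is 2604. The truncation error at 9600bps is well under 0.01%, which is negligible next to the tolerance serial links normally allow.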

Testing the transmitter

Testing this is fairly simple. Auto-generating a test bench, all we need to do is put data on the in port, and then toggle the txSig signal input.

-- send hello\r - 0x48, 0x45, 0x4c, 0x4c, 0x4f, 0x0d
-- H
I_txData <= X"48";
I_txSig <= '1';
wait until O_txRdy= '0';
I_txSig <= '0';

-- E
wait until O_txRdy= '1';
I_txData <= X"45";
I_txSig <= '1';
wait until O_txRdy= '0';
I_txSig <= '0';

...snip...

tx_tb_working

The simulation waveform results in correct TX output, which is great. I wrote up a top-level VHDL module and flashed the miniSpartan6+ board with the same style of test (obviously not using waits – it just endlessly looped over an array containing ‘hello’), and connecting to the FPGA via PuTTY showed the TX did indeed work for this case. Time to implement receive!

Receiver

Receiving needs to be implemented differently from transmit. That statement is obvious, but it’s all about how the timing is managed and where the serial input is sampled.

If we use our existing example of how we generated the TX output, and use those methods for RX, the below waveform will be the ideal situation.

rx_ideal

Sampling on the rising edge of baud_clk we can see the sampled data is correct: a start bit, 8 data bits, and the stop bit (named ‘e’ just to differentiate). However, we do not control the timing of the RX input. It can be out of phase with the clock, and once it’s out of phase significantly the samples can result in incorrect data, as shown below. Additionally, there is a percentage error allowed in serial communications, and as this error accumulates it can confuse the receiver.

rx_bad

We need to use a higher-rate clock, so as to lower the accumulated error across a received frame.

Super-sampling (or not)

Generally when working with other UART RX hardware you’ll see mention of a clock at higher than the baud rate. This is due to the internals super-sampling the RX input and then trying to get the ideal sample area, right in the middle of the transmitted bit. For my implementation I cheated, and still only sample once per bit, but I use a 16x baud tick along with a counter for working out where the next bit is likely to be.

The falling edge of the start bit is always sampled at the system clock, in my case, 50MHz. When found, the 16x baud counter is reset, which re-aligns the baud ticks with the start bit. A counter is reset too; as there are 16 baud ticks per bit when receiving, I then sample the start bit when the counter reaches 7, move to the next state, and reset. It will then sample for data bit 0 when the counter reaches 16, and so on, until we have a whole byte of data and an end bit.
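
As a rough software model of this scheme (a sketch only, not a translation of the VHDL), the C below decodes one frame from an array holding the RX level at each 16x baud tick, with index 0 being the first tick after the falling edge. It assumes an idealised spacing of exactly 16 ticks between samples, and the start-bit check is my own addition for illustration. All names are hypothetical.

```c
#include <stdint.h>

#define TICKS_PER_BIT 16
#define START_OFFSET   7  /* ticks after the falling edge: mid start bit */

/* Tick index, counted from the falling edge, at which frame bit `bit`
 * is sampled. bit 0 = start, 1..8 = data (LSB first), 9 = stop. */
int sample_tick(int bit) {
    return START_OFFSET + bit * TICKS_PER_BIT;
}

/* Decode one 8N1 frame from line[] (at least 160 entries).
 * Returns 0 on success, -1 on a false start, -2 on a framing error. */
int uart_decode_8n1(const uint8_t *line, uint8_t *out) {
    if (line[sample_tick(0)] != 0) {
        return -1;                  /* start bit not low: just a blip */
    }
    uint8_t value = 0;
    for (int i = 0; i < 8; i++) {
        value >>= 1;                /* shift right ... */
        if (line[sample_tick(1 + i)]) {
            value |= 0x80;          /* ... new bit enters at the top */
        }
    }
    if (line[sample_tick(9)] != 1) {
        return -2;                  /* stop bit must be high */
    }
    *out = value;
    return 0;
}
```

Because every sample lands near the middle of its bit, the receiver tolerates the incoming frame being out of phase with its own clock by several ticks in either direction.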

sampling_diagram

You can see the 16 baud ticks per transmitted bit in the simulator:

16x_baud_sample_ticks

And, even see the ticks reset to re-align them with the RX when the start bit is sampled:

rx_start_bit_edge_clk_reset_cropped

The RX Code

The RX clock is based off two things: the main system clock, and a counter which generates a ‘tick’ at 16x the rate of the baud clock (the one TX uses). To generate the ticks, we use the code below.

-- RX baud 'ticks' generated for sampling, with reset
if rx_clk_counter = 0 then
  -- x16 sampled - so chop off 4 LSB
  rx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 4)));
  rx_clk_baud_tick <= '1';
else
  if rx_clk_reset = '1' then
    rx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 4)));
  else
    rx_clk_counter <= rx_clk_counter - 1;
  end if;
  rx_clk_baud_tick <= '0';
end if;

There are a few things that could probably be optimized here, but I kept it simple for readability reasons. Note that the RX counter generating the baud ticks has a reset, unlike the transmit clock.

The actual code for the RX is presented below, with the reset, and initial start bit detection.

rx_proc: process (I_clk, I_reset, I_rx, I_rxCont)
begin
  -- RX runs off the system clock, and operates on baud 'ticks'
  if rising_edge(I_clk) then
    if rx_clk_reset = '1' then
      rx_clk_reset <= '0';
    end if;
    if I_reset = '1' then
      rx_state <= 0;
      rx_sig <= '0';
      rx_sample_count <= 0;
      rx_sample_offset <= OFFSET_START_BIT;
      rx_data <= X"00";
      O_rxData <= X"00";
    elsif I_rx = '0' and rx_state = 0 and I_rxCont = '1' then
      -- first encounter of falling edge start
      rx_state <= 1; -- start bit sample stage
      rx_sample_offset <= OFFSET_START_BIT;
      rx_sample_count <= 0;
      
      -- need to reset the baud tick clock to line up with the start 
      -- bit leading edge.
      rx_clk_reset <= '1';

Skipping the clock reset, which needs to be in this process (this process writes that signal, the other clock-generating process reads it), we have the initial state for the receiver, rx_state=0. This is the initial detection of the start bit, which is rx='0', sampled every system clock cycle (50MHz). Once we find this, and the rxCont input is active (which is basically RX enable), we move to state 1 and set the sample offset to OFFSET_START_BIT, which I can assure you is 7!

    elsif rx_clk_baud_tick = '1' and I_rx = '0' and rx_state = 1 then
      -- inc sample count
      rx_sample_count <= rx_sample_count + 1;
      if rx_sample_count = rx_sample_offset then
        -- start bit sampled, time to enable data
        -- this should check RX here. if it =1, should revert to state 0
        rx_sig <= '0';
        rx_state <= 2;
        rx_data <= X"00";
        rx_sample_offset <= OFFSET_DATA_BITS; 
        rx_sample_count <= 0;
      end if;
    elsif rx_clk_baud_tick = '1' and rx_state >= 2  and rx_state < 10 then
      -- sampling data
      if rx_sample_count = rx_sample_offset then
        rx_data(6 downto 0) <= rx_data(7 downto 1);
        rx_data(7) <= I_rx;
        rx_sample_count <= 0;
        rx_state <= rx_state + 1;
      else
        rx_sample_count <= rx_sample_count + 1;
      end if;
    elsif rx_clk_baud_tick = '1' and rx_state = 10 then
      if rx_sample_count = OFFSET_STOP_BIT then
        rx_state <= 0;
        rx_sig <= '1';
        O_rxData <= rx_data; -- latch data out
        
        if I_rx = '1' then 
          -- stop bit correct
          rx_frameError <= '0';
        else
          -- stop bit is always high, if we don't see it, there 
          -- has been an issue. Signal an error.
          rx_frameError <= '1';
        end if;
      else
        rx_sample_count <= rx_sample_count + 1;
      end if;
    end if;
  end if;
end process;

The second half simply moves across all the states, sampling the start bit when our sample count gets to the offset required (for the start bit, 7). It then moves state to sampling the 8 data bits, with offsets of 16 at a time, before checking the stop bit and indicating whether a frame error has occurred (the stop bit should be ‘1’).

Note that there is no real error checking, or input validation here. Technically, if rx_state=1 and our RX input is high, we should reset the RX system and go back to state 0 – as it was most likely a blip on the input, and not a real serial frame of data. I’ll probably add that later.

Other modules using this RX will watch the O_rxSig output, which indicates that data has been received, and grab the new byte from the data output port. The signal stays high until a new frame begins receiving.

Testing the Receiver

Like the testing for transmit, I created a standard test bench, and filled it out with the usual content. For RX, I created a second clock with the same time period as the baud clock the UART is configured to use. I then have an array of the serial bitstream, and on each rising edge of that clock I push the next bit across. You can see the whole test on GitHub. Running it in the simulator, it works:

rx_tb_working

I also created a loopback test, which uses the transmitter test bench at two speeds, feeding the TX line into the RX to ensure the data is correct. I’ve got the waveform from a run of that test below, also zoomed into the area where 115200bps is active (at the start).

loopback_tb

Running On Hardware

Running on hardware is easy: just assign the RX and TX ports/nets to external pins. I created a loopback top-level module so I can type into a PuTTY serial session and see what I type echoed back.

The loopback module for the hardware uses another state machine depending on the various UART module signals. The process is fairly simple, and will just push any input from the RX to the TX and set relevant states at the correct times.

loopback: process(I_clk_50mhz, O_rxSig)
begin
  if rising_edge(I_clk_50mhz) then
    if O_rxSig = '1' and last_msg_valid ='0' and new_message = '1' then
      last_msg <= O_rxData;
      last_msg_valid <= '1';
      new_message <= '0';
    elsif O_txRdy = '1' and I_txSig = '0' and last_msg_valid = '1' then
      I_txData <= last_msg;
      I_txSig <= '1';
    elsif I_txSig = '1' and O_txRdy = '0' and last_msg_valid = '1' then
      I_txSig <= '0';
      last_msg_valid <= '0';
    elsif O_rxSig = '0' then
      new_message <= '1';
    end if; 
  end if;
end process;

mess

Sadly, I could not find my USB to Serial TTL converter in my lab mess. It’s in there somewhere. But I did find an old Teensy 3.1 (it’s actually from my TeensyZ80 build) which I used to forward serial to and from the miniSpartan6+. Keys I typed were echoed back, at 115200bps. So a successful test.

putty_loopback

spartan_teensy

Wrap Up

That pretty much finishes this post off. It’s by no means a finished implementation but works for what I need. I’ll be using it with TPU as a peripheral, memory mapping the various ports so as to control it from software. I think I’ll add some FIFO buffers to the input and output data lines to ensure I don’t lose data, implement the RX error checking I mentioned earlier, and also add a ‘number of frames’ counter for software-side error checking.

It should be made clear though, that there are probably UART constructs available within any recent FPGA that will take up far fewer resources than this, and they should be used in final projects where possible and sensible!

Thanks for reading, the code for this is available on GitHub.

Teensy Z80 Homebrew Computer – Part 6 – Asynchronous Clocking Fail

This is the sixth part of a series of posts detailing steps required to get a simple Z80 based computer running, facilitated by a Teensy microcontroller. It’s a bit of fun, fusing old and new hobbyist technologies. See Part 1, Part 2, Part 3, Part 4, and Part 5, if you’ve missed them.

Attempt 1

Making TeensyZ80 run with a faster, asynchronous clock seems a simple change at first, but it’s proving tricky. The high level plan is:

  1. The clock signal is provided by another source (an Arduino Nano at present)
  2. The MREQ and IOREQ lines are used to latch the WAIT line of the Z80 to allow the Teensy to respond to the request.
  3. The Teensy senses the WAIT line, performs any actions, and then resets the latch to bring the WAIT line high again (it’s active low).
  4. The Z80 continues as normal.

So with some simple 74 series logic, the MREQ and IOREQ pins are NAND’d together, producing a rising signal edge if either Z80 output goes active low. This is fed into a 74HC74 flip flop as its clock, with the data pin tied logic high. This allows us to connect the Z80 WAIT input to the notQ output. The clear pin of the D-type flip flop is connected to the Teensy so it can reset it and allow the WAIT line to return high, letting the Z80 continue.
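
To check my understanding of that circuit, it helps to write it as a tiny behavioural model. The C below is only a sketch of the logic as described (NAND gate, edge-triggered flip flop with D tied high, WAIT taken from notQ). It ignores propagation delays and the real part’s pin polarities, and all names are my own.

```c
#include <stdbool.h>

typedef struct {
    bool q;         /* flip flop Q output */
    bool last_clk;  /* previous clock level, for rising edge detection */
} wait_latch_t;

void wait_latch_init(wait_latch_t *l) {
    l->q = false;
    l->last_clk = false;
}

/* Feed in the current MREQ/IOREQ pin levels (true = high = inactive).
 * Returns the level driven onto WAIT (true = high = Z80 runs). */
bool wait_latch_step(wait_latch_t *l, bool mreq_n, bool ioreq_n) {
    bool clk = !(mreq_n && ioreq_n);   /* NAND: high if either is low */
    if (clk && !l->last_clk) {
        l->q = true;                   /* rising edge latches D = 1 */
    }
    l->last_clk = clk;
    return !l->q;                      /* WAIT is wired to notQ */
}

/* The Teensy pulsing the clear pin: Q drops, so WAIT returns high. */
void wait_latch_clear(wait_latch_t *l) {
    l->q = false;
}
```

Stepping the model with MREQ going low returns false (WAIT asserted); after wait_latch_clear the next step returns true again, even though MREQ is still low, because the flip flop only reacts to a fresh rising edge on its clock.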

I had the Teensy set up to perform an interrupt routine on a falling edge of WAIT. Sadly, this didn’t seem to work. In fact, I could not confirm the interrupt was being called at all. I’ll have to look into this in detail but using interrupts really is an optimization in this case, so I soldiered on.

Teensy Rant

I’ve had several problems with Teensy microcontrollers during these posts. I had two units: one is completely bricked, and the other is very unstable. It seems that if Pin 33 is low and configured as an input when a program is uploaded to the ARM, the Mini54 chip can fail in some way. The Mini54 chip controls the bootloading process of uploading new code, so this effectively bricks the device. It is an issue that should really be given more prominence; if there had been an announcement stating that Pin 33 should never be used in certain ways, I’d have two fully working Teensy devices. But sadly, all the documentation still states it is a fully configurable digital pin capable of input and output.

End Teensy Rant

Instead of using an interrupt, to try to get something working I created a tight loop() function that did nothing while WAIT was high. As soon as it detected a low signal, it would perform the actions required. I disabled Z80 mode-2 interrupts for now, and removed the I/O debounce code. A very simple example seemed to work – but it was still quite slow, despite an Arduino Nano driving a clock at around 200kHz, which is faster than what the Teensy was providing when running in synchronous mode.

I tried a larger example, one which printed text to the console, and it was obvious something was not right – the output slowly became corrupted. However, there were signs of promise. I was able to input a 4MHz clock and things were failing/corrupting in a somewhat similar way. Still corrupt, but it was the same behavior.

4mhz

The problem

The issue was that I failed to include the RD/WR lines from the Z80 in my latching circuit. You can see from the timing diagram that, especially in the write cases, we need to WAIT when those are active too, not just MREQ or IOREQ.

timing_memory

Attempt 2

I redesigned the latch circuit.

wait_line_latch

This worked a lot better! I could only use the I/O port which put characters to the screen, but it was running well – and my simple test program, which printed “Welcome to TeensyZ80!” in an infinite loop, was stable even at 1MHz. I’d love to break the MHz barrier for this, but given we’re still on a breadboard and I don’t have a scope capable of inspecting this to the detail some of the issues require, I may need to settle for much less. So this simple test at 1MHz is very encouraging. I tried clocking it at 1.5MHz, but some artifacts in the printing arose.

latchcircuit

The previous design with I/O ports

When implementing my serial, display and filesystem devices, which are accessed as I/O reads/writes, I created a system which relied on implied state behind the scenes on the Teensy. To set the colour of the characters being printed to the screen, two writes to the same port would write the high and low values. It’s even worse for the serial device, where you had to write command packets to the I/O ports followed by a variable amount of data. I think I’m going to need to redesign all of the previous work to operate on separate ports. For example, there will be a ColourHi and a ColourLow port which together define the 16-bit colour of the console. It’s not much work, but it is something I’d overlooked and will take time.
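
A minimal sketch of what the separate-port idea could look like on the Teensy side is below. The port numbers and function names are placeholders I have made up for illustration, not the ones the project uses. The point is that each write fully describes itself, so the handler needs no hidden "which half comes next" state.

```c
#include <stdint.h>

/* Hypothetical port numbers, chosen only for this example. */
#define PORT_COLOUR_HI  0x10
#define PORT_COLOUR_LOW 0x11

static uint16_t console_colour = 0;

/* Handle one Z80 I/O write. Each port updates its own half of the
 * 16-bit colour, so the two writes can arrive in either order. */
void io_write(uint8_t port, uint8_t value) {
    switch (port) {
    case PORT_COLOUR_HI:
        console_colour = (uint16_t)((console_colour & 0x00FFu) |
                                    ((uint16_t)value << 8));
        break;
    case PORT_COLOUR_LOW:
        console_colour = (uint16_t)((console_colour & 0xFF00u) | value);
        break;
    default:
        break;  /* other devices would be handled here */
    }
}

uint16_t console_colour_get(void) {
    return console_colour;
}
```

Writing 0x34 to the low port and then 0xF8 to the high port leaves the colour at 0xF834, exactly as if the writes had been made the other way round.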

This is a very quick update on the Teensy Z80 work; it’s still very much ongoing. I’m also working on another project involving the miniSpartan6+ FPGA board. That’s another bit of fun – who doesn’t want to design their own processor?

Let me know any thoughts, as always, via twitter @domipheus !

Teensy Z80 Homebrew Computer – Part 5 – Implementing preemptive multithreading

This is the fifth part of a series of posts detailing steps required to get a simple Z80 based computer running, facilitated by a Teensy microcontroller. It’s a bit of fun, fusing old and new hobbyist technologies. See Part 1, Part 2, Part 3 and Part 4 if you’ve missed them.

setup

At the moment, whilst running slowly due to the lock-step synchronous nature of the clock driving the Z80 from the Teensy, we do have a fairly well spec’d out little machine. So, in this fairly short post (it was short, then I went and implemented more than expected!), I thought I’d delve a bit into software, and in particular, multithreading.

Booting Teensy Z80, running C code

Before that, though, I wanted to share how the Teensy Z80 boots up, and how I am now using the Small Device C Compiler (sdcc) to compile my program code. The steps by which the Teensy initializes its ‘Z80 RAM’ and resets the Z80 to start executing code are as follows:

  1. The Teensy starts up with the Z80 unclocked. The Teensy has a global array to represent the Z80 RAM address space (there is no ROM). The Teensy has its Z80 RAM initialized to a small bootloader binary, which is assembled at offset 0h, and usually sets the stack pointer, defines some global data such as the interrupt vector table, and then jumps to a known location higher in memory.
  2. The Teensy setup() routine, after mounting the SD card volume, tries to locate ‘kernel.bin’.
  3. If it’s found, it is loaded into the Z80 RAM array at a known location. If it’s not found, the RAM remains in the initial state, which at the moment simply puts ‘?’ to the top left of the screen.
  4. The last thing the Teensy setup() routine does is reset the Z80 and start clocking it, so it starts executing from PC 0x0000 when the loop() routine starts running.
  5. The Z80 is now in control.
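
The RAM preparation in steps 1 to 3 can be sketched in plain C as below. This is not the actual Teensy sketch: the function name, the return codes and the way the file contents are passed in as a buffer are assumptions for illustration, and the 0x800 load address is simply the code offset used for the examples in this post.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KERNEL_OFFSET 0x0800u /* known location the bootloader jumps to */

/* Build the initial Z80 RAM image.
 * Returns 1 if a kernel was copied in, 0 if only the bootloader is
 * present (kernel.bin not found), and -1 if something does not fit. */
int z80_build_ram(uint8_t *ram, size_t ram_size,
                  const uint8_t *boot, size_t boot_len,
                  const uint8_t *kernel, size_t kernel_len) {
    if (boot_len > KERNEL_OFFSET || KERNEL_OFFSET > ram_size) {
        return -1;
    }
    memset(ram, 0, ram_size);
    memcpy(ram, boot, boot_len);      /* bootloader assembled at 0h */
    if (kernel == NULL) {
        return 0;                     /* RAM stays in its initial state */
    }
    if (kernel_len > ram_size - KERNEL_OFFSET) {
        return -1;
    }
    memcpy(ram + KERNEL_OFFSET, kernel, kernel_len);
    return 1;
}
```

After this the only remaining job is step 4: reset the Z80 and start clocking it, so that it begins executing the bootloader from address 0x0000.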

Previously the initial Z80 bootloader was the whole program. My simple shell example in the previous post was implemented this way, but it was tedious needing to recompile the Teensy code and re-upload the sketch every time I made a small code change. Now the ‘kernel.bin’ binary is compiled from C using sdcc.

A concern with SDCC is that I’ve yet to find comprehensive ABI details such as calling conventions and register use for its Z80 backend, so I’ve just had to play it by ear. Otherwise, it does have some really nice extensions so that ports can be represented by C variables:

__sfr __at 0x03 ioConsolePutChar;
__sfr __banked __at 0x07FFF ioVRAMBankDisable;

This is really useful, especially the __banked version that uses the 16-bit I/O as explained in part 4. You can then use the names as though they were byte variables. Writing to the console is as simple as:

  ioConsolePutChar = 'H';
  ioConsolePutChar = 'i';
  ioConsolePutChar = '!';

I’ve been writing a TeensyZ80.h with all of the port definitions, but I’ve kept everything in a single C file for the following multithreading example. To build the binary, we simply compile without the crt, at the same code offset as the bootloader expects (0x800 in the following cases). SDCC generates an ihx file, which you need to convert to a binary with hex2bin. Putting that in the root of the SD card as ‘kernel.bin’ runs it automatically.

Time-slice multithreading

The multithreading I want is time-slice multithreading, where different threads only run for a certain time called the time slice, before being preemptively swapped for another thread.

The high level idea is we have the Teensy fire an interrupt to the Z80 each ‘time slice interval’ and the interrupt handler will then context switch to a new thread. That should be all we need, really. We’ll use the same mode-2 Z80 interrupts as before. For this example all other interrupt vectors have been disabled.

Global State

We need some state stored globally. For our example we will assume a maximum of four threads. For each thread, we need to know the function it starts at, the arguments to that function, some flags, and a context containing the current running state. We throw all that in a struct, and make an array for our 4 possible threads. We will fix the main process thread as the first thread in this array. Global state like this is fine for this implementation. We can guarantee certain access patterns to ensure we don’t get any nasty race conditions, and define rules as to who owns and can write the thread structures to prevent locking requirements.

typedef char zthread_t;
typedef int (*startFunc_t) (void*);

typedef struct internal_thread_s {
  startFunc_t startFunc;
  void* arg;
  char flags;
  char active;
  unsigned short stack_start;
  internal_context_t ctx;
} internal_thread_t;

internal_thread_t threads[MAX_THREADS];
char num_threads;
char current_thread;

The main part of getting multithreading working in this style will be the interrupt handler which is fired every timeslice switch. The handler takes the following shape:

  • Disable interrupts
  • Save the current running state
  • Choose the next thread to run
  • Restore the state of the new thread
  • Enable interrupts
  • Return to the location where we were in the new thread to continue execution

Each thread, as you can see from the internal_thread_t structure above, has its own stack area, defined by stack_start. 256 bytes are reserved for each thread’s stack at fixed locations in Z80 RAM. To make things easier, the hl, bc, af, de, ix and iy registers are pushed to the thread’s own stack as the context save. The stack pointer itself is saved to the thread structure within the ctx field, through a write to a scratch memory location (i.e. ld (_stackLocationScratch), sp).

The program counter does not need explicit saving, as it’s already on the stack. When an interrupt is signalled on the Z80 INT pin, after the current instruction has completed, the PC of the next instruction is placed on the stack, and then a vectored jump through the interrupt table lands you in the interrupt handler routine. We can use this fact to restore the PC very simply: we just return from the interrupt routine with the stack pointer set to that of the new thread we want to execute.
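
To make the fixed stack layout concrete, here is a minimal sketch of how the thread table could be initialised. The region address STACK_REGION_TOP, the numeric flag values, the one-field internal_context_t and the threads_init name are all assumptions of mine, not taken from the project. Since the Z80 stack grows downwards, stack_start is treated as the top of each 256-byte area.

```c
#include <assert.h>

#define MAX_THREADS        4
#define THREAD_STACK_SIZE  256u     /* per-thread stack size, from the text */
#define STACK_REGION_TOP   0x7000u  /* assumed address, for illustration only */

/* Assumed flag values; only the names appear in the original code. */
enum { ZTHREAD_HDL_FREE = 0, ZTHREAD_NOT_STARTED, ZTHREAD_RUNNING, ZTHREAD_WAIT_JOIN };

typedef struct { unsigned short sp; } internal_context_t;  /* minimal stand-in */

typedef struct internal_thread_s {
  char flags;
  unsigned short stack_start;  /* top of this thread's stack area */
  internal_context_t ctx;
} internal_thread_t;

static internal_thread_t threads[MAX_THREADS];
static char current_thread;

/* Give every slot its own 256-byte stack and mark slot 0 (the main
 * thread) as the one that is currently running. */
static void threads_init(void) {
  unsigned char i;
  assert(MAX_THREADS * THREAD_STACK_SIZE <= STACK_REGION_TOP);
  for (i = 0; i < MAX_THREADS; i++) {
    threads[i].flags = ZTHREAD_HDL_FREE;
    threads[i].stack_start =
        (unsigned short)(STACK_REGION_TOP - i * THREAD_STACK_SIZE);
    threads[i].ctx.sp = threads[i].stack_start;  /* empty stack */
  }
  threads[0].flags = ZTHREAD_RUNNING;
  current_thread = 0;
}
```

With these assumed numbers, thread 1 gets 0x6F00 as its stack top and thread 3 gets 0x6D00, so no two stacks overlap as long as each stays within its 256 bytes.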

The simplest interrupt routine, which will do round robin scheduling, and assumes 4 active threads, is listed below.

short stackLocationScratch;
void ihdr_timer_timeSlice( void ) __naked {
  // save state (PC is already on the stack from the interrupt acknowledge)
  __asm
    di
    push hl
    push bc
    push af
    push de
    push ix
    push iy
    ld (_stackLocationScratch), sp
    exx
    ex af, af'
  __endasm;

  // save stack to current thread ctx
  threads[current_thread].ctx.sp = stackLocationScratch;

  // Choose next thread to run (doesn't check
  // if they are in a running state)
  current_thread++;
  if (current_thread >= MAX_THREADS) current_thread = 0;

  // load stack of next thread ctx
  stackLocationScratch = threads[current_thread].ctx.sp;

  // restore registers
  __asm
    ex af, af'
    exx
    ld sp, (_stackLocationScratch)
    pop iy
    pop ix
    pop de
    pop af
    pop bc
    pop hl
    ei
    reti
  __endasm;
}

The Z80 actually has two banks of registers internally. The exx instruction, along with the ex af, af' instruction, swaps the currently active bank. This is useful if we need lots of registers and want to avoid the stack, but it is not essential here. If the code to choose the next thread were any more complex, we would need to load a dedicated stack pointer into the sp register for use in kernel routines, so as not to use up the thread stack, which may be nearly full. The restore-registers block is the mirror of the state save, so the reti instruction should find the correct PC on the stack for the thread we have swapped to, as that thread, upon its own entry to the interrupt routine, would have had its PC pushed to the stack.

Starting Threads

Using this makes starting threads rather easy. When we create a thread, it’s flagged as ZTHREAD_NOT_STARTED, so it’s not selected in the scheduler within the interrupt handler. When the zthread_start function is called, we know the first time the thread can actually be started is when it’s selected within the interrupt handler. Looking at the handler, and how the restore of state for a thread is performed, we can construct the stack of this thread to make it look as though it was preempted exactly at the entry to the start function.

Knowing this, before setting the thread as ZTHREAD_RUNNING, if we populate the stack locations of the thread as per the table below, we can let the interrupt handler take care of the rest!

[Image: stack]

Preparing the internal_thread_t structure within the zthread_start function then looks like:

  threads[handle].ctx.sp = ((unsigned short)threads[handle].stack_start)-18;
  stack = (short*)threads[handle].ctx.sp;
  stack[0] = 0; //  pop iy
  stack[1] = 0; //  pop ix
  stack[2] = 0; //  pop de
  stack[3] = 0; //  pop af
  stack[4] = 0; //  pop bc
  stack[5] = 0; //  pop hl
  stack[6] = (short)threads[handle].startFunc;
  stack[7] = (short)_TZL_thread_exited;
  stack[8] = (short)threads[handle].arg;
  threads[handle].flags = ZTHREAD_RUNNING;

With this set up, when our thread is selected to run by the round robin scheduler within the interrupt routine, the registers will all be set to 0, and then the return from interrupt instruction will load startFunc into PC for the next instruction fetch. From here, the calling conventions dictate the return PC is next on the stack, followed by the function arguments. Therefore when startFunc() returns, we will load the _TZL_thread_exited() function address into the PC, to begin the thread exit logic. At this moment, we can just ignore that function, and try out what happens if we launch some threads which simply print characters.
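
The prepared frame can be checked on a host machine by simulating the pops the handler performs. This is only a sketch: the ram array, poke16, pop16 and prepare_stack are hypothetical helpers, and the addresses used with them are made up. It mirrors the nine 16-bit words written by zthread_start, stored little-endian as the Z80 does.

```c
#include <assert.h>
#include <stdint.h>

#define RAM_SIZE 0x8000u  /* assumed size, for the simulation only */
static uint8_t ram[RAM_SIZE];

/* Store a 16-bit word the way the Z80 does: low byte at the lower address. */
static void poke16(uint16_t addr, uint16_t value) {
  ram[addr] = (uint8_t)(value & 0xFF);
  ram[addr + 1] = (uint8_t)(value >> 8);
}

/* Simulate a Z80 pop: read the word at sp, then advance sp by two. */
static uint16_t pop16(uint16_t *sp) {
  uint16_t value = (uint16_t)(ram[*sp] | (ram[*sp + 1] << 8));
  *sp = (uint16_t)(*sp + 2);
  return value;
}

/* Build the initial frame for a new thread and return its saved sp. */
static uint16_t prepare_stack(uint16_t stack_start, uint16_t start_func,
                              uint16_t exit_func, uint16_t arg) {
  uint16_t sp = (uint16_t)(stack_start - 18);  /* 9 words of 2 bytes */
  int i;
  assert(stack_start >= 18 && stack_start <= RAM_SIZE);
  for (i = 0; i < 6; i++) {
    poke16((uint16_t)(sp + 2 * i), 0);  /* iy, ix, de, af, bc, hl */
  }
  poke16((uint16_t)(sp + 12), start_func);  /* popped into PC by reti */
  poke16((uint16_t)(sp + 14), exit_func);   /* return address of startFunc */
  poke16((uint16_t)(sp + 16), arg);         /* first stack argument */
  return sp;
}
```

Popping six words and then one more for reti leaves the stack pointer on the exit function address, which is where the start function expects its return address to be, with the argument directly above it.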

int startFunc_print(void* args) {
  char c = (char)args;
  while (1) {
    con_putChar(c);
  }
}

int main( int argc, char* argv[] ) {

  zthread_t threadA;
  zthread_t threadB;

  zthread_create(&threadA, startFunc_print, (char*)(short)'A');
  zthread_create(&threadB, startFunc_print, (char*)(short)'B');

  zthread_start(threadA);
  zthread_start(threadB);

  while(1) {
    con_putChar('M');
  }

  return 0;
}

As thread ID 0 is fixed to the ‘main thread’, we will have that thread begin by calling main. We simply make a special case of this thread, and set it up manually before calling directly into it, after enabling interrupts. By registering our interrupt handler at a vector which the Teensy fires every few hundred milliseconds, enabling interrupts starts the scheduler. Hundreds of milliseconds is a very long timeslice, but TeensyZ80 is running very slowly in a synchronous clock mode, so the Z80 is only running at tens of kilohertz. A larger timeslice also lets us see what is happening much more clearly. (For this video, the scheduler assumes only 3 threads.)

Thread Joining

Joining threads is a basic operation that must be supported. Joining is the act of suspending one thread until another has completed or exited. We can implement this in a very simple way, by having a ZTHREAD_WAIT_JOIN state, in which the thread will not be scheduled to run, and then when other threads exit, we can check in the _TZL_thread_exited() function if threads exist in a wait state that are waiting for the thread that has just completed. If we find threads that have the ZTHREAD_WAIT_JOIN flag, with state_data set to our zthread_t handle, we can set their flag to be runnable, and clear the state_data.

void _TZL_thread_exited( void ) {
  char idx = 0;
  zthread_t thisThread = zthread_getThread();

  // if any threads are joining to us, tell them they can
  // continue now
  for (; idx < MAX_THREADS; idx++) {
    if ((threads[idx].flags == ZTHREAD_WAIT_JOIN)
      && (threads[idx].state_data == thisThread)) {
      threads[idx].flags = ZTHREAD_RUNNING;
      threads[idx].state_data = 0;
    }
  }

  // For now, just set the flag as free.
  // Really we should set as exited and we can then
  // look to get any return value.
  threads[thisThread].flags = ZTHREAD_HDL_FREE;

  // this thread ends here. halt so we can be swapped out.
  while (1) {
    __asm
      halt
    __endasm;
  }
}

Halting the Z80 means that no code will run until the timeslice interrupt fires. It’s placed in a while(1) block in case another interrupt, one which is not for the scheduler, is fired. We do not encounter this in our example, though.

A side effect of having waiting threads is that we can now deadlock, for example by having two threads join to each other. We can check for this simple case directly in the join() call, but there can be longer chains that are harder to decipher. We can also add code to the scheduler that detects when there are no threads available to run, and signals a deadlock.

  // Choose next thread to run
  thread_schedule_counter = 0;
  do {
    thread_schedule_counter++;
    current_thread++;
    if (current_thread >= MAX_THREADS) {
      current_thread = 0;
    }
  } while ((threads[current_thread].flags != ZTHREAD_RUNNING)
    && (thread_schedule_counter <= MAX_THREADS));

  if (thread_schedule_counter > MAX_THREADS) {
    // swap to the kernel stack for this
    __asm
      ; load the stack pointer to the kernel stack
      ld sp, #0x07F0
    __endasm;

    panic_deadlock();
  }
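
The selection loop above can also be written as a plain function, which makes it easy to test away from the Z80. This is a sketch under my own assumptions: the flag values are invented, and pick_next_thread and NO_RUNNABLE_THREAD are hypothetical names rather than part of the project.

```c
#include <assert.h>

/* Assumed flag values; only the names appear in the original code. */
enum { ZTHREAD_HDL_FREE = 0, ZTHREAD_NOT_STARTED, ZTHREAD_RUNNING, ZTHREAD_WAIT_JOIN };

#define NO_RUNNABLE_THREAD (-1)

/* Round robin: starting from the slot after `current`, return the first
 * slot whose flag is ZTHREAD_RUNNING. The current slot is checked last,
 * so a lone runnable thread keeps running. If a full lap finds nothing,
 * report it so the caller can treat it as a deadlock. */
static int pick_next_thread(const char *flags, int num_slots, int current) {
  int step;
  assert(num_slots > 0 && current >= 0 && current < num_slots);
  for (step = 1; step <= num_slots; step++) {
    int candidate = (current + step) % num_slots;
    if (flags[candidate] == ZTHREAD_RUNNING) {
      return candidate;
    }
  }
  return NO_RUNNABLE_THREAD;
}
```

In the interrupt handler, a NO_RUNNABLE_THREAD result would correspond to the branch that swaps to the kernel stack and calls panic_deadlock().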

The panic_deadlock() function can print a message to the user along with some state about each thread for easy debugging. Note the stack is modified to be at a safe known location as the thread stacks may not have enough size left in them to call the panic function, and also we may want to debug them at a later date, so it’s best to leave them unchanged. The complete join function is below.

int zthread_join(zthread_t handle) {
  zthread_t thisThread = zthread_getThread();
  if (threads[thisThread].flags != ZTHREAD_RUNNING) {
    return ZTHREAD_THREAD_NOT_RUNNING;
  }

  // if the thread we want to join with is marked
  // as free, assume it's already exited and so
  // return. This should be the exited flag, really
  if (threads[handle].flags == ZTHREAD_HDL_FREE) {
    return 0;
  }

  if (threads[handle].flags != ZTHREAD_RUNNING) {
    if (threads[handle].flags != ZTHREAD_WAIT_JOIN) {
      return ZTHREAD_THREAD_NOT_RUNNING;
    }
  }

  threads[thisThread].state_data = handle;
  threads[thisThread].flags = ZTHREAD_WAIT_JOIN;

  __asm
    halt
  __endasm;

  return 0;
}
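
The direct check for threads joining each other, mentioned earlier, could be sketched as a walk along the chain of waiting threads before the caller goes to sleep. Everything here is an assumption for illustration: the flag values, the cut-down wait_info_t type and the join_would_deadlock name are mine, and state_data is assumed to always hold a valid thread index while a thread is in ZTHREAD_WAIT_JOIN.

```c
#include <assert.h>

/* Assumed flag values; only the names appear in the original code. */
enum { ZTHREAD_HDL_FREE = 0, ZTHREAD_NOT_STARTED, ZTHREAD_RUNNING, ZTHREAD_WAIT_JOIN };

/* Cut-down view of a thread slot: just what the check needs. */
typedef struct {
  char flags;
  char state_data;  /* handle being waited on when flags == ZTHREAD_WAIT_JOIN */
} wait_info_t;

/* Returns 1 if making `self` wait on `target` could never be woken:
 * either the wait chain from `target` leads back to `self`, or it is
 * already stuck in a loop of its own. Returns 0 otherwise. */
static int join_would_deadlock(const wait_info_t *t, int num_slots,
                               int self, int target) {
  int hops;
  int cur = target;
  assert(num_slots > 0 && self >= 0 && self < num_slots);
  assert(target >= 0 && target < num_slots);
  for (hops = 0; hops < num_slots; hops++) {
    if (cur == self) {
      return 1;  /* chain loops back to the caller */
    }
    if (t[cur].flags != ZTHREAD_WAIT_JOIN) {
      return 0;  /* chain ends at a thread that is not waiting */
    }
    cur = t[cur].state_data;  /* follow the wait to the next thread */
  }
  return 1;  /* more hops than slots: an existing cycle */
}
```

A join implementation could call this before setting ZTHREAD_WAIT_JOIN and return an error code instead of halting. The scheduler-level check is still worth keeping as a backstop.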

Critical Sections

There will be times when we do not want other threads to run, such as when we are manipulating multiple bytes of data. Examples of this are writing to the screen, setting the colour, and setting the row/column we are writing to. Those functions are not thread safe. The join function, too, may be better within a critical section, except for the halt at the end. This is to ensure all threads have updated and consistent state before they have a chance to run. On the Z80, single-byte writes are atomic with respect to interrupts, as the interrupt pin is only sampled at the end of an instruction.

Critical sections can be implemented very easily: we simply disable interrupts for the duration we need. This will stop all other threads running and stop things that depend on interrupts, so we need to account for that, but it’s easy to add and perfectly fine for this use case.
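
One possible shape for this is a pair of enter/leave helpers with a nesting counter, so a function that uses a critical section can safely call another one that does too. This is a sketch, not the project’s code: critical_enter, critical_leave and the IRQ macros are hypothetical names. When built with SDCC the macros emit di/ei, and on any other compiler they toggle a plain variable so the logic can be tested on a host.

```c
#include <assert.h>

#ifdef __SDCC
/* On the Z80 target: really mask and unmask maskable interrupts. */
#define IRQ_DISABLE() __asm__("di")
#define IRQ_ENABLE()  __asm__("ei")
#else
/* On a host build: model the interrupt-enable state with a variable. */
static int irq_enabled = 1;
#define IRQ_DISABLE() (irq_enabled = 0)
#define IRQ_ENABLE()  (irq_enabled = 1)
#endif

static unsigned char critical_depth = 0;

/* Disable interrupts first, then count the nesting level. No timeslice
 * switch can happen between the two steps. */
static void critical_enter(void) {
  IRQ_DISABLE();
  critical_depth++;
}

/* Only the outermost leave re-enables interrupts. */
static void critical_leave(void) {
  assert(critical_depth > 0);
  critical_depth--;
  if (critical_depth == 0) {
    IRQ_ENABLE();
  }
}
```

A screen routine that sets the cursor position and then writes a character would call critical_enter() before the first port write and critical_leave() after the last, keeping the section as short as possible since no other thread runs in between.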

The end result

We have zthread_create, zthread_start, zthread_join, the ability to create critical sections, and a round robin scheduler. The test below runs as shown in the video (apologies for shaky-cam!).

int startFunc_print2(void* args) {
  char c = (char)args;
  short num = 400;
  while (num--) {
    con_putChar(c);
  }
  return 0;
}

int startFunc_print_deadlock(void* args) {
  char c = (char)args;
  unsigned char num = 140; // 140 does not fit in a signed char
  while (num--) {
    con_putChar(c);
  }

  // main thread always id 0
  ASSERT(! zthread_join(0));
  return 0;
}

int main( int argc, char* argv[] ) {
  zthread_t threadA;
  zthread_t threadB;
  zthread_t threadC;

  argc;
  argv;

  ASSERT(! zthread_create(&threadA, startFunc_print2, (char*)(short)'A'));
  ASSERT(! zthread_create(&threadB, startFunc_print2, (char*)(short)'B'));
  ASSERT(! zthread_create(&threadC, startFunc_print2, (char*)(short)'C'));

  ASSERT(! zthread_start(threadA));
  ASSERT(! zthread_start(threadB));
  ASSERT(! zthread_start(threadC));

  ASSERT(! zthread_join(threadA));
  ASSERT(! zthread_join(threadB));
  ASSERT(! zthread_join(threadC));

  con_putString(" Thread A,B & C has exited, main thread can continue to deadlock detection test! ");

  ASSERT(! zthread_create(&threadA, startFunc_print_deadlock, (char*)(short)'!'));
  ASSERT(! zthread_start(threadA));

  startFunc_print2((char*)(short)'P');

  ASSERT(! zthread_join(threadA));

  while(1) {
    con_putChar('M');
  }

  return 0;
}

Things we would want implemented next are true exiting of the threads, with return value capture. I’d call that a good enough implementation for Teensy Z80. I don’t think I’ll be making much use of threads in anything I write for this, especially given the current speed of the system. The next thing on my to-do list is to get Teensy Z80 faster.

Code, as always, is on my GitHub. I hope you’ve been enjoying this Teensy Z80 project. If you have, let me know on Twitter @domipheus!