Teensy Z80 Homebrew Computer – Part 5 – Implementing preemptive multithreading

Posted Feb 1, 2015, Reading time: 13 minutes.

This is the fifth part of a series of posts detailing steps required to get a simple Z80 based computer running, facilitated by a Teensy microcontroller. It’s a bit of fun, fusing old and new hobbyist technologies. See Part 1, Part 2, Part 3 and Part 4 if you’ve missed them.

At the moment, whilst running slowly due to the lock-step synchronous nature of the clock driving the Z80 from the Teensy, we do have a fairly well spec’d out little machine. So, in this fairly ~~short post~~ (it was short, then I went and implemented more than expected!), I thought I’d delve a bit into software, and in particular, multithreading.

Booting Teensy Z80, running C code

Before that, though, I wanted to share how the Teensy Z80 boots up, and how I am now using the Small Device C Compiler (sdcc) to compile my program code. The steps by which the Teensy initializes its ‘Z80 RAM’ and resets the Z80 to start executing code are as follows:

The Teensy starts up with the Z80 unclocked. The Teensy has a global array to represent the Z80 RAM address space (there is no ROM). The Teensy has its Z80 RAM initialized to a small bootloader binary, which is assembled at offset 0h, and usually sets the stack pointer, defines some global data such as the interrupt vector table, and then jumps to a known location higher in memory.
In the Teensy setup() routine, after mounting the SD card volume, the Teensy tries to locate ‘kernel.bin’.
If it’s found, it is loaded into the Z80 RAM array at a known location. If it’s not found, the RAM remains in the initial state, which at the moment simply puts ‘?’ to the top left of the screen.
The last thing the Teensy setup() routine does is reset the Z80 and start clocking it, so it starts executing from PC 0x0000 when the loop() routine starts running.
The Z80 is now in control.

Previously the initial Z80 bootloader was the whole program. My simple shell example in the previous post was implemented this way, but it was tedious needing to recompile the Teensy code and re-upload the sketch every time I made a small code change. Now the ‘kernel.bin’ binary is compiled from C using sdcc.

A concern with SDCC is that I’ve yet to find comprehensive ABI details such as calling conventions and register use for its Z80 backend, so I’ve just had to play it by ear. Otherwise, it does have some really nice extensions so that ports can be represented by C variables:

__sfr __at 0x03 ioConsolePutChar;
__sfr __banked __at 0x07FFF ioVRAMBankDisable;

This is really useful, especially the __banked version that uses the 16-bit I/O as explained in part 4. You can then use the names as though they were byte variables. Writing to the console is as simple as:

ioConsolePutChar = 'H';
ioConsolePutChar = 'i';
ioConsolePutChar = '!';
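The examples later in this post call con_putChar() and con_putString(), which are not listed here. As a rough sketch of what such helpers could look like on top of the port variable (an assumption, not the actual code from the repository), with a host-side stand-in buffer so the logic can be tried without a Z80:

```c
#include <assert.h>
#include <stddef.h>

#ifdef __SDCC
__sfr __at 0x03 ioConsolePutChar;      /* console port, declared as in the post */
#else
/* Host stand-in (assumption): capture console writes in a small buffer. */
static char   console_log[64];
static size_t console_len;
#endif

/* Hypothetical helper: write one character to the console. */
static void con_putChar(char c)
{
#ifdef __SDCC
    ioConsolePutChar = c;              /* a plain assignment becomes a port write */
#else
    if (console_len < sizeof console_log - 1) {
        console_log[console_len++] = c; /* buffer stays NUL terminated */
    }
#endif
}

/* Hypothetical helper: write a NUL-terminated string, one port write per char. */
static void con_putString(const char *s)
{
    assert(s != NULL);
    while (*s) {
        con_putChar(*s++);
    }
}
```

Note that each character is a separate port write, so a string printed by one thread can be interleaved with output from another unless the call is protected.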

I’ve been writing a TeensyZ80.h with all of the port definitions, but I’ve kept everything in a single C file for the following multithreading example. To build the binary, we simply compile without the crt, at the same code offset as the bootloader expects (0x800 in the following cases). SDCC generates an ihx file, which you need to convert to a binary with hex2bin. Putting that in the root of the SD card as ‘kernel.bin’ runs it automatically.
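For the curious, the ihx file is Intel HEX text, and the conversion step boils down to decoding records of the form ":LLAAAATT<data>CC" into a flat image. The sketch below illustrates that for a single record; it is not hex2bin itself, the function names are made up for this example, and it only handles data (00) and end-of-file (01) records.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Read two hex digits at s as one byte, or -1 on malformed input. */
static int hex_byte(const char *s)
{
    int hi = hex_digit(s[0]);
    int lo = (hi < 0) ? -1 : hex_digit(s[1]);
    return (lo < 0) ? -1 : ((hi << 4) | lo);
}

/*
 * Decode one Intel HEX record into image[].
 * `base` is the load address of image[0] (0x800 for the kernel in this post).
 * Returns 1 for a data record, 0 for end-of-file, -1 on any error.
 * Data bytes are stored as they are read, before the checksum is verified.
 */
static int ihx_apply_record(const char *line, uint8_t *image,
                            size_t image_size, uint16_t base)
{
    int field[4];           /* byte count, address high, address low, type */
    unsigned sum = 0;
    const char *p;
    int i, value;

    assert(line != NULL && image != NULL);
    if (line[0] != ':') return -1;
    p = line + 1;

    for (i = 0; i < 4; i++, p += 2) {
        if ((field[i] = hex_byte(p)) < 0) return -1;
        sum += (unsigned)field[i];
    }

    for (i = 0; i < field[0]; i++, p += 2) {
        unsigned addr = ((unsigned)field[1] << 8) + (unsigned)field[2] + (unsigned)i;
        if ((value = hex_byte(p)) < 0) return -1;
        sum += (unsigned)value;
        if (field[3] == 0x00) {
            if (addr < base || addr - base >= image_size) return -1;
            image[addr - base] = (uint8_t)value;
        }
    }

    /* The final byte makes the sum of all record bytes zero modulo 256. */
    if ((value = hex_byte(p)) < 0) return -1;
    if (((sum + (unsigned)value) & 0xFFu) != 0) return -1;

    if (field[3] == 0x00) return 1;
    if (field[3] == 0x01) return 0;
    return -1;              /* other record types are not handled here */
}
```

Feeding every line of the ihx file through this in order, with base set to 0x800, would leave image[] holding the bytes that end up in ‘kernel.bin’.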

Time-slice multithreading

The multithreading I want is time-slice multithreading, where different threads only run for a certain time called the time slice, before being preemptively swapped for another thread.

The high level idea is we have the Teensy fire an interrupt to the Z80 each ‘time slice interval’ and the interrupt handler will then context switch to a new thread. That should be all we need, really. We’ll use the same mode-2 Z80 interrupts as before. For this example all other interrupt vectors have been disabled.

Global State

We need some state stored globally. For our example we will assume a maximum of four threads. For each thread, we need to know the function it starts at, the arguments to that function, some flags, and a context containing the current running state. We throw all that in a struct, and make an array for our 4 possible threads. We will fix the main process thread as the first thread in this array. Global state like this is fine for this implementation. We can guarantee certain access patterns to ensure we don’t get any nasty race conditions, and define rules as to who owns and can write the thread structures to prevent locking requirements.

typedef char zthread_t;
typedef int (*startFunc_t) (void*);

typedef struct internal_thread_s {
  startFunc_t startFunc;
  void* arg;
  char flags;
  char active;
  unsigned short stack_start;
  internal_context_t ctx;
} internal_thread_t;

internal_thread_t threads[MAX_THREADS];
char num_threads;
char current_thread;

The main part of getting multithreading working in this style will be the interrupt handler which is fired every timeslice switch. The handler takes the following shape:

Disable interrupts
Save the current running state
Choose the next thread to run
Restore the state of the new thread
Enable interrupts
Return to the location where we were in the new thread to continue execution

Each thread, as you can see from the internal_thread_t structure above, has its own stack area, defined by stack_start. 256 bytes are reserved for each thread’s stack at fixed locations in Z80 RAM. To make things easier, the hl, bc, af, de, ix and iy registers are pushed to the thread’s own stack as the context save. The stack pointer itself is saved to the ctx field of the thread structure, through a write to a scratch memory location (that is, ld (_stackLocationScratch), sp). The program counter does not need explicit saving, as it’s already on the stack: when an interrupt is signalled on the Z80 INT pin, after the current instruction has completed, the PC of the next instruction is pushed to the stack, and then a vectored jump through the interrupt table lands you in the interrupt handler routine. We can use this fact to restore the PC incredibly simply, by just returning from the interrupt routine with the stack pointer set to that of the new thread we want to execute.

The simplest interrupt routine, which will do round robin scheduling, and assumes 4 active threads, is listed below.

short stackLocationScratch;
void ihdr_timer_timeSlice( void ) __naked {
  // save state (PC is already on stack from interrupt ack)
  __asm
    di
    push hl
    push bc
    push af
    push de
    push ix
    push iy
    ld (_stackLocationScratch), sp
    exx
    ex af, af'
  __endasm;

  // save stack to current thread ctx
  threads[current_thread].ctx.sp = stackLocationScratch;

  // Choose next thread to run (doesn't check
  // if they are in a running state)
  current_thread++;
  if (current_thread >= MAX_THREADS) current_thread = 0;

  // load stack of next thread ctx
  stackLocationScratch = threads[current_thread].ctx.sp;

  // restore registers
  __asm
    ex af, af'
    exx
    ld sp, (_stackLocationScratch)
    pop iy
    pop ix
    pop de
    pop af
    pop bc
    pop hl
    ei
    reti
  __endasm;
}

The Z80 actually has two banks of registers internally. The exx instruction, along with the ex af, af’ instruction, swaps the currently active bank. This is useful if we need lots of registers and want to avoid the stack, but it is not essential here. If the code to choose the next thread was any more complex, we would need to load a dedicated stack pointer into the sp register for use in kernel routines, so as not to use up the thread stack, which may be nearly full. The restore-registers block is the mirror of the state save, so the reti instruction should find the correct PC on the stack for the thread we have swapped to, as that thread would itself have had its PC pushed to its stack upon entry to the interrupt routine.
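The save/restore symmetry can be checked on a host machine with a small model of one thread stack. This is only a sketch under assumptions: regs_t, push16, pop16, context_save and context_restore are illustrative names, and the model only mimics the push order of the handler and the Z80 byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Register pairs saved by the handler, named after the Z80 pairs. */
typedef struct { uint16_t hl, bc, af, de, ix, iy; } regs_t;

static uint8_t  ram[0x100];   /* model of one 256-byte thread stack */
static uint16_t sp;           /* model stack pointer, an index into ram */

/* PUSH: high byte goes to SP-1, low byte to SP-2, as on the Z80. */
static void push16(uint16_t v)
{
    assert(sp >= 2);
    ram[--sp] = (uint8_t)(v >> 8);
    ram[--sp] = (uint8_t)(v & 0xFF);
}

/* POP: low byte first, then high byte. */
static uint16_t pop16(void)
{
    uint16_t lo, hi;
    assert(sp <= sizeof ram - 2);
    lo = ram[sp++];
    hi = ram[sp++];
    return (uint16_t)(lo | (hi << 8));
}

/* Same push order as the handler; returns the SP value to keep in ctx. */
static uint16_t context_save(const regs_t *r)
{
    push16(r->hl); push16(r->bc); push16(r->af);
    push16(r->de); push16(r->ix); push16(r->iy);
    return sp;
}

/* Mirror image of context_save: reload SP, then pop in reverse order. */
static void context_restore(uint16_t saved_sp, regs_t *r)
{
    sp = saved_sp;
    r->iy = pop16(); r->ix = pop16(); r->de = pop16();
    r->af = pop16(); r->bc = pop16(); r->hl = pop16();
}
```

If a return address is pushed before context_save (standing in for the interrupt acknowledge), then after context_restore the next pop16() yields that same address, which is exactly what reti relies on.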

Starting Threads

Using this makes starting threads rather easy. When we create a thread, it’s flagged as ZTHREAD_NOT_STARTED, so it’s not selected in the scheduler within the interrupt handler. When the zthread_start function is called, we know the first time the thread can actually be started is when it’s selected within the interrupt handler. Looking at the handler, and how the restore of state for a thread is performed, we can construct the stack of this thread to make it look as though it was preempted exactly at the entry to the start function.

Knowing this, before setting the thread as ZTHREAD_RUNNING, if we populate the stack locations of the thread as laid out in the listing below, we can let the interrupt handler take care of the rest!

Preparing the internal_thread_t structure within the zthread_start function then looks like:

threads[handle].ctx.sp = ((unsigned short)threads[handle].stack_start)-18;
stack = (short*)threads[handle].ctx.sp;
stack[0] = 0; //  pop iy
stack[1] = 0; //  pop ix
stack[2] = 0; //  pop de
stack[3] = 0; //  pop af
stack[4] = 0; //  pop bc
stack[5] = 0; //  pop hl
stack[6] = (short)threads[handle].startFunc;
stack[7] = (short)_TZL_thread_exited;
stack[8] = (short)threads[handle].arg;
threads[handle].flags = ZTHREAD_RUNNING;

With this set up, when our thread is selected to run by the round robin scheduler within the interrupt routine, the registers will all be set to 0, and then the return from interrupt instruction will load startFunc into PC for the next instruction fetch. From here, the calling conventions dictate the return PC is next on the stack, followed by the function arguments. Therefore when startFunc() returns, we will load the _TZL_thread_exited() function address into the PC, to begin the thread exit logic. At this moment, we can just ignore that function, and try out what happens if we launch some threads which simply print characters.
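The frame layout described above can be modelled on a host to check the arithmetic: nine 16-bit words, hence the 18-byte offset from stack_start. This is a sketch only; build_initial_frame and the addresses passed to it are hypothetical stand-ins for the real function pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_WORDS 9   /* 6 register pairs + entry PC + return address + argument */

/*
 * Fill the top of a thread stack so it looks as though the thread was
 * pre-empted at the first instruction of its start function.
 * stack_top points one word past the highest word of the stack.
 * Returns the stack pointer value to store in the thread context.
 */
static uint16_t *build_initial_frame(uint16_t *stack_top, uint16_t start_pc,
                                     uint16_t exit_pc, uint16_t arg)
{
    uint16_t *sp;
    int i;

    assert(stack_top != NULL);
    sp = stack_top - FRAME_WORDS;

    for (i = 0; i < 6; i++) {
        sp[i] = 0;          /* popped as iy, ix, de, af, bc, hl */
    }
    sp[6] = start_pc;       /* popped into PC by reti */
    sp[7] = exit_pc;        /* return address seen by the start function */
    sp[8] = arg;            /* first stack argument of the start function */
    return sp;
}
```

After the six pops and the reti, the stack pointer is left pointing at the exit address with the argument directly above it, which is the layout a freshly called function expects under the convention described above.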

int startFunc_print(void* args) {
  char c = (char)args;
  while (1) {
    con_putChar(c);
  }
}

int main( int argc, char* argv[] ) {

  zthread_t threadA;
  zthread_t threadB;

  zthread_create(&threadA, startFunc_print, (char*)(short)'A');
  zthread_create(&threadB, startFunc_print, (char*)(short)'B');

  zthread_start(threadA);
  zthread_start(threadB);

  while(1) {
    con_putChar('M');
  }

  return 0;
}

As thread ID 0 is fixed to the ‘main thread’, we have that thread begin by calling main. We simply make a special case of this thread: we set it up manually and then, after enabling interrupts, call directly into it. Because our interrupt handler is registered at a vector which the Teensy fires every few hundred milliseconds, enabling interrupts starts the scheduler. Hundreds of milliseconds is a very long timeslice, but TeensyZ80 is running very slowly in a synchronous clock mode, so it’s only running at tens of kilohertz. A larger timeslice also lets us see what is happening much more clearly. (For this video, the scheduler assumes only 3 threads.)

Thread Joining

Joining threads is a basic operation that must be supported. Joining is the act of suspending one thread until another has completed or exited. We can implement this in a very simple way, by having a ZTHREAD_WAIT_JOIN state, in which the thread will not be scheduled to run, and then when other threads exit, we can check in the _TZL_thread_exited() function if threads exist in a wait state that are waiting for the thread that has just completed. If we find threads that have the ZTHREAD_WAIT_JOIN flag, with state_data set to our zthread_t handle, we can set their flag to be runnable, and clear the state_data.

void _TZL_thread_exited( void ) {
  char idx = 0;
  zthread_t thisThread = zthread_getThread();

  // if any threads are joining to us, tell them they can
  // continue now
  for (; idx < MAX_THREADS; idx++) {
    if ((threads[idx].flags == ZTHREAD_WAIT_JOIN)
      && (threads[idx].state_data == thisThread)) {
      threads[idx].flags = ZTHREAD_RUNNING;
      threads[idx].state_data = 0;
    }
  }

  // For now, just set the flag as free.
  // Really we should set as exited and we can then
  // look to get any return value.
  threads[thisThread].flags = ZTHREAD_HDL_FREE;

  // this thread ends here. halt so we can be swapped out.
  while (1) {
    __asm
      halt
    __endasm;
  }
}

Halting the Z80 means that no code will run until the timeslice interrupt fires. It’s placed in a while(1) block in case another interrupt, one which is not for the scheduler, is fired. We do not encounter this in our example, though.

A side effect of having wait states is that we can now deadlock, for example by having two threads join each other. We can check for that simple case directly in the join() call, but there can be longer chains that are harder to decipher. We can add code to the scheduler that detects when there are no threads available to run, and signals a deadlock.
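As an illustration of the check that could live in the join call, the sketch below follows the chain of "who is waiting on whom" from the target thread and reports a deadlock if it leads back to the caller. It is a host-side model under assumptions: thread_model_t, the T_* values and join_would_deadlock are made-up names that only mirror the flags and state_data fields used in this post.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 4

/* Stand-ins for the ZTHREAD_* flag values (assumed, not the real ones). */
enum { T_FREE, T_RUNNING, T_WAIT_JOIN };

typedef struct {
    unsigned char flags;       /* one of the T_* values */
    unsigned char state_data;  /* when waiting: index of the thread joined on */
} thread_model_t;

/*
 * Would `self` joining `target` close a cycle of waiting threads?
 * Walks target -> thread it waits on -> ... for at most MAX_THREADS hops.
 * Returns 1 if the chain leads back to `self`, 0 otherwise.
 */
static int join_would_deadlock(const thread_model_t *t, int self, int target)
{
    int cur = target;
    int hops;

    assert(t != NULL);
    assert(self >= 0 && self < MAX_THREADS);
    assert(target >= 0 && target < MAX_THREADS);

    for (hops = 0; hops < MAX_THREADS; hops++) {
        if (cur == self) return 1;                 /* chain reached the caller */
        if (t[cur].flags != T_WAIT_JOIN) return 0; /* chain ends at a live thread */
        cur = t[cur].state_data;
        if (cur >= MAX_THREADS) return 0;          /* corrupt index, stop walking */
    }
    return 0;
}
```

The hop limit keeps the walk bounded even if the table is in an unexpected state. A scheduler-level check is still worth having as a catch-all.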

  // Choose next thread to run
  thread_schedule_counter = 0;
  do {
    thread_schedule_counter++;
    current_thread++;
    if (current_thread >= MAX_THREADS) {
      current_thread = 0;
    }
  } while ((threads[current_thread].flags != ZTHREAD_RUNNING)
    && (thread_schedule_counter <= MAX_THREADS));

  if (thread_schedule_counter > MAX_THREADS) {
    // swap to the kernel stack for this
    __asm
      ; load the stack pointer to the kernel stack
      ld sp, #0x07F0
    __endasm;

    panic_deadlock();
  }

The panic_deadlock() function can print a message to the user along with some state about each thread for easy debugging. Note the stack is modified to be at a safe known location as the thread stacks may not have enough size left in them to call the panic function, and also we may want to debug them at a later date, so it’s best to leave them unchanged. The complete join function is below.
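The selection logic is easy to get subtly wrong, so it can help to pull it out into a pure function and exercise it on a host. The following is a sketch with assumed names (pick_next_thread, SLOT_*), not the handler code itself: it returns the next runnable slot, or -1 where the real handler would call panic_deadlock().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 4

/* Stand-ins for the ZTHREAD_* flag values (assumed, not the real ones). */
enum { SLOT_FREE, SLOT_RUNNING, SLOT_WAIT_JOIN };

/*
 * Round-robin pick: start with the slot after `current` and return the
 * first runnable slot, or -1 if a full lap finds none (the deadlock case).
 * The current thread is tried last, so a lone runnable thread keeps running.
 */
static int pick_next_thread(const unsigned char *flags, int current)
{
    int tried;

    assert(flags != NULL);
    assert(current >= 0 && current < MAX_THREADS);

    for (tried = 0; tried < MAX_THREADS; tried++) {
        current++;
        if (current >= MAX_THREADS) {
            current = 0;                /* wrap with a compare, no division */
        }
        if (flags[current] == SLOT_RUNNING) {
            return current;
        }
    }
    return -1;
}
```

The wrap uses a compare and reset rather than a modulo, since the Z80 has no divide instruction and a modulo would cost a library call inside the interrupt handler.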

int zthread_join(zthread_t handle) {
  zthread_t thisThread = zthread_getThread();
  if (threads[thisThread].flags != ZTHREAD_RUNNING) {
    return ZTHREAD_THREAD_NOT_RUNNING;
  }

  // if the thread we want to join with is marked
  // as free, assume it's already exited and so
  // return. This should be the exited flag, really
  if (threads[handle].flags == ZTHREAD_HDL_FREE) {
    return 0;
  }

  if (threads[handle].flags != ZTHREAD_RUNNING) {
    if (threads[handle].flags != ZTHREAD_WAIT_JOIN) {
      return ZTHREAD_THREAD_NOT_RUNNING;
    }
  }

  threads[thisThread].state_data = handle;
  threads[thisThread].flags = ZTHREAD_WAIT_JOIN;

  __asm
    halt
  __endasm;

  return 0;
}

Critical Sections

There will be times when we do not want other threads to run, such as when we are manipulating multiple bytes of data. Examples of this are writing to the screen, setting the colour, and setting the row/column we are writing to. Those functions are not thread safe. The join function, too, may be better within a critical section, except for the halt at the end. This is to ensure all threads have updated and consistent state before they have a chance to run. On the Z80, single byte writes will actually be atomic, as the interrupt pin is only sampled after a whole instruction has completed.

Critical sections can be implemented very easily: we simply disable interrupts for the duration we need. This will stop all other threads running and stop things that depend on interrupts, so we need to account for that, but it’s easy to add and perfectly fine for this use case.
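One way this could be wrapped up is a pair of enter/leave helpers with a nesting counter, so that a critical section inside another does not re-enable interrupts too early. This is a sketch, not the code from the repository: critical_enter and critical_leave are hypothetical names, the SDCC branch assumes the __asm__("...") inline form is available in your SDCC version, and the host branch merely tracks a flag so the logic can be tested.

```c
#include <assert.h>

#ifdef __SDCC
/* Assumption: SDCC accepts the __asm__("...") inline assembly form. */
#define IRQ_DISABLE() __asm__("di")
#define IRQ_ENABLE()  __asm__("ei")
#else
/* Host stand-in: model the interrupt enable state with a flag. */
static int irq_enabled = 1;
#define IRQ_DISABLE() (irq_enabled = 0)
#define IRQ_ENABLE()  (irq_enabled = 1)
#endif

/* Nesting depth; only touched while interrupts are disabled. */
static unsigned char critical_depth;

/* Enter a critical section: no timeslice interrupt can pre-empt us now. */
static void critical_enter(void)
{
    IRQ_DISABLE();
    critical_depth++;
}

/* Leave a critical section; interrupts return only at the outermost level. */
static void critical_leave(void)
{
    assert(critical_depth > 0);
    critical_depth--;
    if (critical_depth == 0) {
        IRQ_ENABLE();
    }
}
```

A multi-step console operation, such as setting the cursor position and then writing a string, would sit between critical_enter() and critical_leave(). Keep such sections short, since the scheduler cannot run at all while one is active.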

The end result

We have zthread_create, zthread_start, zthread_join, the ability to create critical sections, and a round robin scheduler. The test below runs as shown in the video (apologies for shaky-cam!).

int startFunc_print2(void* args) {
  char c = (char)args;
  short num = 400;
  while (num--) {
    con_putChar(c);
  }
  return 0;
}

int startFunc_print_deadlock(void* args) {
  char c = (char)args;
  unsigned char num = 140; // unsigned: 140 does not fit in a signed char
  while (num--) {
    con_putChar(c);
  }

  // main thread always id 0
  ASSERT(! zthread_join(0));
  return 0;
}

int main( int argc, char* argv[] ) {
  zthread_t threadA;
  zthread_t threadB;
  zthread_t threadC;

  argc;
  argv;

  ASSERT(! zthread_create(&threadA, startFunc_print2, (char*)(short)'A'));
  ASSERT(! zthread_create(&threadB, startFunc_print2, (char*)(short)'B'));
  ASSERT(! zthread_create(&threadC, startFunc_print2, (char*)(short)'C'));

  ASSERT(! zthread_start(threadA));
  ASSERT(! zthread_start(threadB));
  ASSERT(! zthread_start(threadC));

  ASSERT(! zthread_join(threadA));
  ASSERT(! zthread_join(threadB));
  ASSERT(! zthread_join(threadC));

  con_putString(" Thread A,B & C has exited, main thread can continue to deadlock detection test! ");

  ASSERT(! zthread_create(&threadA, startFunc_print_deadlock, (char*)(short)'!'));
  ASSERT(! zthread_start(threadA));

  startFunc_print2((char*)(short)'P');

  ASSERT(! zthread_join(threadA));

  while(1) {
    con_putChar('M');
  }

  return 0;
}

Things we would want implemented next are true exiting of the threads, with return value capture. I’d call that a good enough implementation for Teensy Z80. I don’t think I’ll be making much use of threads in anything I write for this, especially given the current speed of the system. The next thing on my to-do list is to get Teensy Z80 faster.

Code as always is on my github. I hope you’ve been enjoying this Teensy Z80 project. If you have, let me know on twitter @domipheus!

Domipheus Labs

Stuff that interests Colin ‘Domipheus’ Riley