Designing a CPU in VHDL, Part 10: Interrupts and Xilinx block RAMs

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Part 10 was supposed to be a very big part, with a special surprise of TPU working with a cool peripheral device, but that work is still ongoing. It’s taking a long time to do, mostly due to being busy myself over the past few weeks. However, in this update, I’ll look at bringing interrupts to TPU, as well as fixing an issue with the embedded ram that was causing bloating of the synthesized design.


Interrupts are needed on a CPU which is expected to work with multiple asynchronous devices whilst also doing some other computation. You can always have the CPU poll, but sometimes that isn’t wise or suitable given other constraints. Interrupts are also good for keeping time with something – vsync, for example. This is where they come in: a signal fed to the CPU externally can “interrupt” what the CPU is currently executing, and have it perform some other computation before returning to its previous task.

The way I have implemented the interrupts is similar to the Z80 maskable interrupts, with an external interrupt input and an interrupt acknowledge output. The system is simplified and doesn’t have the different types of modes and non-maskable interrupts available on the Z80 but it should be enough for the needs of TPU. You can only handle a single request at a time, and there is only one mode to work with – but it’s powerful enough for most situations.

An overview of how the interrupts will work is as follows:

  • At some point during execution, the system will drive the interrupt input to TPU high, indicating it wants the interrupt handler run.
  • At the next writeback stage of the pipeline, just before migrating to the fetch stage, the interrupt input is sampled.
  • If an interrupt is requested, the control unit will then make the interrupt acknowledge output from TPU active.
  • Once the interrupt ACK signal is seen externally to TPU, 16-bits of data can be placed on the data input to TPU.
  • After a predetermined number of cycles, the bits on the data in bus are stored.
  • The ACK is de-asserted, and the PC of TPU is set to the interrupt handler.
  • The handler can retrieve the data from the data bus via a new instruction, and also return to the previous PC before the interrupt was acknowledged.
  • The external interrupt input is latched: if it simply remains active, it will not trigger another handler invocation until it has gone inactive for at least one cycle.

It’s very important that the interrupt input is only acted upon during the end of the writeback stage. Doing it at any other point can result in an inconsistent execution state, whereby we do not know if the current instruction has executed to completion. Doing the interrupt at the end of a writeback means:

  1. the PC we save (to return to later) is already the ‘next’ PC, be that prev_pc+2, or a branch target;
  2. memory reads have had time to complete successfully; and
  3. any registers have had time to see and act upon write enable signals to store data.

The items that are needed, therefore, are:

  • Internal registers for the stored PC (to return to after interrupt handler), the interrupt data field passed on the data in bus, and an interrupt enable bit
  • Various connections between the parts of the sub-modules for handling storing of the PC and interrupt data
  • Control unit additions for the interrupt handler step
  • New instructions for getting interrupt data and returning from an interrupt

Internal registers & Connections

I added a 16-bit register for the ‘next PC’ and also the ‘interrupt data’ to the ALU itself, rather than adding it to the register file. There are individual set/write control lines and also data lines for them into the ALU. It’s a bit messy and adds a lot of ports to the ALU and control unit, but it worked and I can change this later if I want to tidy things up. Having the registers part of the ALU makes the instructions that access them incredibly simple and self contained.

Control unit additions

The control unit now has an interrupt state, all of the control signals for setting the registers in the ALU and also the logic for managing the phases of calling into the interrupt handler. If interrupts are enabled, the interrupt input is active and it’s the end of the writeback phase, the following occurs:

  1. Interrupt_ack is activated
  2. A cycle of latency is provided
  3. The bits on the data in bus are sampled and the ALU instructed to store this value
  4. The current PC (which is, at this point, the next instruction to execute) is saved by the ALU
  5. The PC unit sets the current PC to the interrupt vector, currently fixed at 0x0008.
  6. The control unit resets its interrupt state, and proceeds to the fetch stage of the pipeline.
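The six steps above can be sketched as a small state machine. This is only an illustration under stated assumptions: the entity, port and state names (int_seq_sketch, I_wb_end, O_store_ief and so on) are hypothetical and not taken from the TPU sources, and the exact number of cycles spent in each phase is a guess. It also folds in the re-arm behaviour, so a request that stays high does not trigger the handler twice.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical sketch of the interrupt entry sequence; port and state
-- names are illustrative, not taken from the TPU sources.
entity int_seq_sketch is
  Port ( I_clk       : in  STD_LOGIC;
         I_reset     : in  STD_LOGIC;
         I_int       : in  STD_LOGIC;  -- external interrupt request
         I_int_en    : in  STD_LOGIC;  -- interrupt enable bit
         I_wb_end    : in  STD_LOGIC;  -- high at the end of writeback
         O_int_ack   : out STD_LOGIC;  -- interrupt acknowledge
         O_store_ief : out STD_LOGIC;  -- ALU: latch data-in bus as event field
         O_save_pc   : out STD_LOGIC;  -- ALU: save the (already next) PC
         O_set_vec   : out STD_LOGIC;  -- PC unit: load the interrupt vector
         O_active    : out STD_LOGIC   -- high while fetch must be held off
       );
end int_seq_sketch;

architecture Behavioral of int_seq_sketch is
  type state_t is (S_IDLE, S_ACK, S_SAMPLE, S_SAVE_PC, S_VECTOR);
  signal state : state_t := S_IDLE;
  signal armed : STD_LOGIC := '1';  -- cleared on accept, set once I_int drops
begin
  process (I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        state <= S_IDLE;
        armed <= '1';
      else
        -- Re-arm only after the request line has been seen inactive.
        if I_int = '0' then
          armed <= '1';
        end if;

        case state is
          when S_IDLE =>
            if I_wb_end = '1' and I_int_en = '1'
               and I_int = '1' and armed = '1' then
              armed <= '0';
              state <= S_ACK;                    -- step 1: ack goes active
            end if;
          when S_ACK     => state <= S_SAMPLE;   -- step 2: one cycle of latency
          when S_SAMPLE  => state <= S_SAVE_PC;  -- step 3: store event field
          when S_SAVE_PC => state <= S_VECTOR;   -- step 4: save next PC
          when S_VECTOR  => state <= S_IDLE;     -- steps 5 and 6: jump, then fetch
        end case;
      end if;
    end if;
  end process;

  O_int_ack   <= '1' when state = S_ACK or state = S_SAMPLE else '0';
  O_store_ief <= '1' when state = S_SAMPLE  else '0';
  O_save_pc   <= '1' when state = S_SAVE_PC else '0';
  O_set_vec   <= '1' when state = S_VECTOR  else '0';
  O_active    <= '0' when state = S_IDLE    else '1';
end Behavioral;
```

Keeping the sequence in its own states means the request is only ever accepted from S_IDLE at the end of writeback, which is the property the rest of this post relies on.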

At the moment, interrupts are not disabled automatically when the handler is invoked, so the first instruction must be a disable interrupt instruction.

New Instructions

There are four new instructions used to manage and handle interrupts.

gief

The Get Interrupt Event Field instruction transfers the value that was on the data bus after the interrupt acknowledge into a register for further use. Using this value, we can work out what caused the interrupt and perform further actions from that point. One example is a UART: the interrupt data field could contain the UART identifier in the high 8 bits, and the byte of data which was received in the lower 8 bits.

bbi

Branch back from Interrupt is similar to the reti instruction in the Z80. It branches back to the PC value which was due to be fetched next before the interrupt handler was invoked.

ei

The enable and disable interrupt instructions are fairly obvious.

The interrupt vector

The interrupt vector is fixed at address 0x0008. The shape of the interrupt handler should be something like the following:

  1. disable interrupts
  2. Save all registers
  3. get the interrupt event data field
  4. Perform action according to interrupt event field, or add the field data to a queue for later processing.
  5. restore all registers
  6. enable interrupts
  7. Branch back to ‘normal’ code.

Saving the registers can be done by saving to the current stack and then restoring before returning from the handler. I’ve been using r7 as a ‘standard’ stack pointer in our very ad-hoc ABI spec, so this can be done. This does use user stack, though, so it needs to be taken into account if stack space is a particular concern.

There are a few issues that could occur, mainly in timing between disabling and enabling the interrupts. There could be a new interrupt waiting to be handled when the enable interrupts instruction is processed, and this interrupt will then be accepted before the bbi instruction branches back. This will destroy the PC value saved when the original interrupt was raised, so I will probably change things around. There are a few solutions to this, one being that interrupts are by definition disabled when the branch to the interrupt vector occurs, and then a bbi instruction implicitly turns interrupts on again. I’ll need to have a think about the best course of action for this.
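As a rough illustration of that one candidate fix (not something TPU does today), the enable bit could be cleared by hardware on handler entry and set again by bbi. All names below are hypothetical, and the explicit enable/disable instruction inputs are kept only so the existing instructions would still work.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical sketch: an interrupt-enable bit that hardware clears on
-- handler entry and sets again on bbi. Signal names are illustrative.
entity int_enable_sketch is
  Port ( I_clk     : in  STD_LOGIC;
         I_reset   : in  STD_LOGIC;
         I_entry   : in  STD_LOGIC;  -- pulse: branching to the interrupt vector
         I_bbi     : in  STD_LOGIC;  -- pulse: bbi instruction completing
         I_set_en  : in  STD_LOGIC;  -- pulse: enable-interrupts instruction
         I_clr_en  : in  STD_LOGIC;  -- pulse: disable-interrupts instruction
         O_int_en  : out STD_LOGIC
       );
end int_enable_sketch;

architecture Behavioral of int_enable_sketch is
  signal int_en : STD_LOGIC := '0';
begin
  process (I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' or I_entry = '1' or I_clr_en = '1' then
        int_en <= '0';          -- masked for the whole handler body
      elsif I_bbi = '1' or I_set_en = '1' then
        int_en <= '1';          -- re-enabled as the saved PC is restored
      end if;
    end if;
  end process;

  O_int_en <= int_en;
end Behavioral;
```

With something like this, the window between the enable instruction and bbi disappears, because the handler never needs to re-enable interrupts itself.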

The makeup of the test interrupt routines I’ve had are like the following (snipped for clarity)

  load.h  r7, 0x08
  subi    r7, r7, 4
  bi      $start
  dw      0x0000
intvec:   #interrupt vector 0x8
  # save the registers
  gief    r0
  #    inspect r0 for interrupt type
  #    branch to some other work
  # restore the registers
  load.l  r0, 0

The interrupt handler, whilst a bit messy in its implementation, works well in simulation. I’ve yet to use it when TPU is running on the FPGA with an external source, but I do not foresee many issues other than the one stated above.

A Look in the simulator

[Figure: interrupts waveform, with numbered events]

The above waveform shows an interrupt being flagged on a UART receive event, with the event field containing the UART ID (1) and the byte value received (0x4f). Walking through the waveform, we get the following:

  1. The UART has received a byte and signaled this.
  2. An interrupt is immediately raised.
  3. Several cycles later the ACK is signaled by the CPU
  4. The interrupt event field (IEF) data is placed on the data in bus after a cycle of delay
  5. The ACK is de-asserted, and the IEF is removed from the data in bus and saved internally (to later be used via the gief instruction)
  6. The CPU branches to the interrupt vector 0x0008, requesting the instruction from memory

The internal RAM

I mentioned previously that the design resources had shot up, and it turns out this is due mainly to the internal ram not being synthesized as a block ram. I was getting an internal compiler error in the Xilinx toolchain when building the existing ram with a larger capacity (I think it was 512 bytes at this point), and to counter this I re-implemented the ram in another way. The way I did it, though, added an asynchronous element which in turn forced the toolchain to implement the RAM via look up tables, instead of utilizing the block ram. This is why there was a jump in resource requirements when using the Spartan6.

Block Rams

I could not get around the internal compiler error without an async element, so off to the documentation for the spartan6 I went. Turns out there is a document specifically on the block rams available on the device I have.

The block rams are used by initializing a generic object in VHDL to various constants, and then interfacing with the ports that object exposes. There are two kinds of block rams available, but I decided to use the 18 kilobit, dual-port one: RAMB16BWER. It is made up of 16Kb for data and 2Kb for parity. ISE has a nice template library for instantiation of primitives, and the block ram I use is included. It can be found within Edit->Language Templates, and then within the VHDL->Device Primitives->Spartan6->RAM/ROM.

[Figure: ISE Language Templates window]

This brings up a window with initialization code to copy and paste into your own design. I took it, and edited the relevant areas to configure it for a 16-bit addressed memory.

Although the existing integrated ram addresses bytes explicitly, I decided against that with the block ram and instead addressed 16-bit values. To the TPU programmer, it still addresses bytes, but internally the data is stored as 16-bit (2-byte) blocks. The main reason for this was latency and complexity. By addressing 16-bit values internally in the block ram, I can implement both 16-bit reads/writes and 8-bit reads/writes using a single port. The RAMB16BWER has a byte-wise write enable, so I can write either the high or low 8 bits of a memory location internal to the block ram, leaving the other half untouched. There is one issue that arises from this method – an unaligned 16-bit read/write (i.e., the address being odd) will result in incorrect behavior. At the moment nothing happens if you try this, but I intend to add a trap/exception. I could maybe invoke the interrupt handler with a known interrupt event field value to specify an unaligned memory operation.
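Detecting the unaligned case is cheap, whatever ends up being done with it. A minimal sketch, assuming the same I_size convention as the ram process in this post ('1' for a byte, '0' for a 16-bit word); the entity name and the O_unaligned output are hypothetical, and how that flag would feed a trap or the interrupt logic is left open.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical sketch: flag a 16-bit access to an odd byte address.
-- I_size follows the convention used in this post ('1' = byte, '0' = word);
-- O_unaligned is an illustrative name for a future trap request.
entity align_check_sketch is
  Port ( I_cs        : in  STD_LOGIC;
         I_size      : in  STD_LOGIC;
         I_addr0     : in  STD_LOGIC;  -- least significant address bit
         O_unaligned : out STD_LOGIC
       );
end align_check_sketch;

architecture Behavioral of align_check_sketch is
begin
  O_unaligned <= '1' when I_cs = '1' and I_size = '0' and I_addr0 = '1'
                 else '0';
end Behavioral;
```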


There were several gotchas I encountered whilst trying the block ram with a testbench. The addressing scheme, first of all, was confusing. As the generic component was initialized with the relevant 16-bit addressing (18-bit when you include parity), I assumed it would transform the address itself into the correct form. This did not seem to be the case after running the test bench. The documentation has a table of mappings and also a formula, but in the end it only took a few minutes of inspection in the simulator to work out what was happening.

[Figure: block RAM address mapping]

The next issue was a rather silly affair! The initialization attributes for the block ram run from most-significant to least-significant. Due to this, 16-bit instructions need to be byte-flipped when read in the code, and they also go from right to left along the initialization attribute.

-- BEGIN TASM RAMB16BWER INIT OUTPUT                                         
INIT_00 => X"06831180E27F00300000004F4C4C454801E102E100EF03E100000CC1E91E088E",

Maps to the instruction forms (only first 3 instructions shown):

X"8E", X"08", -- 0000: load.h  r7 0x08
X"1E", X"E9", -- 0002: subi    r7 r7 4
X"C1", X"0C", -- 0004: bi      0x0018

I will not admit the amount of time spent trying to figure out the issue of byte flipping in the initialization attribute 😉
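To make the flipping concrete, here is a small sketch of the packing rule as a VHDL package. It is illustrative only: TASM does this work in C#, and the package, type and function names here are hypothetical. The result is a std_logic_vector, so a conversion may be needed before it can be used as an initialization attribute.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical helper package; names are illustrative and not part of TASM.
package init_pack_sketch is
  type word_array_t is array (0 to 15) of std_logic_vector(15 downto 0);

  -- Swap the two bytes of one 16-bit instruction word.
  function swap_bytes(w : std_logic_vector(15 downto 0))
    return std_logic_vector;

  -- Pack 16 instruction words into one 256-bit initialization row.
  -- Word 0 (lowest address) ends up in the rightmost 16 bits.
  function pack_row(words : word_array_t) return std_logic_vector;
end package init_pack_sketch;

package body init_pack_sketch is
  function swap_bytes(w : std_logic_vector(15 downto 0))
    return std_logic_vector is
  begin
    return w(7 downto 0) & w(15 downto 8);
  end function;

  function pack_row(words : word_array_t) return std_logic_vector is
    variable row : std_logic_vector(255 downto 0);
  begin
    for i in 0 to 15 loop
      row(16*i + 15 downto 16*i) := swap_bytes(words(i));
    end loop;
    return row;
  end function;
end package body init_pack_sketch;
```

With the first instruction above as word 0 (X"8E08"), swap_bytes gives X"088E", which lands in the rightmost four hex digits of the row, matching the end of the INIT_00 string shown earlier.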

The least significant bit of the address, specifying the high/low byte of the 16-bit memory location, is managed in the VHDL process. I’ve put that process (and other relevant signal operations) below for clarity. It’s a large block of text even without some of the less important generic attributes/initializations, which I have omitted.

RAMB16BWER_inst : RAMB16BWER
 generic map (
    -- DATA_WIDTH_A/DATA_WIDTH_B: 0, 1, 2, 4, 9, 18, or 36
    DATA_WIDTH_A => 18,
    DATA_WIDTH_B => 18,
    -- SIM_COLLISION_CHECK: Collision check enable "ALL", "WARNING_ONLY", "GENERATE_X_ONLY" or "NONE" 
    -- (other generics, including the INIT_xx attributes, omitted)
    -- SIM_DEVICE: Must be set to "SPARTAN6" for proper simulation behavior
    SIM_DEVICE => "SPARTAN6"
 )
 port map (
    -- Port A Data: 32-bit (each) output: Port A data
    DOA => DOA,       -- 32-bit output: A port data output
    DOPA => DOPA,     -- 4-bit output: A port parity output
    -- Port B Data: 32-bit (each) output: Port B data
    DOB => DOB,       -- 32-bit output: B port data output
    DOPB => DOPB,     -- 4-bit output: B port parity output
    -- Port A Address/Control Signals: 14-bit (each) input: Port A address and control signals
    ADDRA => ADDRA,   -- 14-bit input: A port address input
    CLKA => CLKA,     -- 1-bit input: A port clock input
    ENA => ENA,       -- 1-bit input: A port enable input
    REGCEA => REGCEA, -- 1-bit input: A port register clock enable input
    RSTA => RSTA,     -- 1-bit input: A port register set/reset input
    WEA => WEA,       -- 4-bit input: Port A byte-wide write enable input
    -- Port A Data: 32-bit (each) input: Port A data
    DIA => DIA,       -- 32-bit input: A port data input
    DIPA => DIPA,     -- 4-bit input: A port parity input
    -- Port B Address/Control Signals: 14-bit (each) input: Port B address and control signals
    ADDRB => ADDRB,   -- 14-bit input: B port address input
    CLKB => CLKB,     -- 1-bit input: B port clock input
    ENB => ENB,       -- 1-bit input: B port enable input
    REGCEB => REGCEB, -- 1-bit input: B port register clock enable input
    RSTB => RSTB,     -- 1-bit input: B port register set/reset input
    WEB => WEB,       -- 4-bit input: Port B byte-wide write enable input
    -- Port B Data: 32-bit (each) input: Port B data
    DIB => DIB,       -- 32-bit input: B port data input
    DIPB => DIPB      -- 4-bit input: B port parity input
 );

 -- End of RAMB16BWER_inst instantiation

--todo: assertion on non-aligned 16b read?

CLKA <= I_clk;
CLKB <= I_clk;

ENA <= I_cs;
ENB <= '0';--port B unused

ADDRA <= I_addr(10 downto 1) & "0000";

process (I_clk, I_cs)
begin
  if rising_edge(I_clk) and I_cs = '1' then
    if (I_we = '1') then
      if I_size = '1' then
        -- 1 byte: enable only the byte lane selected by the address LSB
        if I_addr(0) = '1' then
          WEA <= "0010";
          DIA <= X"0000" & I_data(7 downto 0) & X"00";
        else
          WEA <= "0001";
          DIA <= X"000000" & I_data(7 downto 0);
        end if;
      else
        -- 16-bit word: write both lanes, byte-swapped
        WEA <= "0011";
        DIA <= X"0000" & I_data(7 downto 0)& I_data(15 downto 8);
      end if;
    else
      WEA <= "0000";
      WEB <= "0000";
      if I_size = '1' then
        if I_addr(0) = '0' then
          data(15 downto 8) <= X"00";
          data(7 downto 0)  <= DOA(7 downto 0);
        else
          data(15 downto 8) <= X"00";
          data(7 downto 0)  <= DOA(15 downto 8);
        end if;
      else
        data(15 downto 8) <= DOA(7 downto 0);
        data(7 downto 0) <= DOA(15 downto 8);
      end if;
    end if;
  end if;
end process;

O_data <= data when I_cs = '1' else "ZZZZZZZZZZZZZZZZ";

Assembler Output

The last thing to do was to add another output file generator to TASM, my C# TPU assembler. This simply outputs the whole 2KB initialization table for the input assembly. It’s then just copy/pasted into the VHDL in the appropriate attribute location.

Wrapping up

That’s it for this part. I really hope to have the next part with TPU talking to a peripheral device (and some changes to the ISA) in the next week or two. Fingers crossed!

Thanks for reading, comments as always to @domipheus.


Designing a CPU in VHDL, Part 9: Byte addressing, memory subsystem and UART

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing. This part is heavy going if you’ve not read the previous posts.

Byte Addressing

TPU currently operates with memory by addressing 16-bit words. It’s a fairly common set-up for custom processors (addressing non-‘byte’ sizes, that is), but I wanted byte addressing as it simplifies some assembly for operations and really shouldn’t be that difficult to add. There are various things that need to change:

  • The PC needs to increment by 2 each instruction cycle, not 1
  • The read and write memory instructions need to have a size flag
  • The assembler needs to now calculate offsets knowing instructions are 2 bytes and now 2 memory addresses wide
  • Our embedded RAM needs to be able to perform operations only on byte-widths, and also address as bytes.

The PC increment is trivial, simply changing the increment operator of the PC unit to add 2 instead of one. The embedded ram changes are not so bad either. We add a new input port to say whether or not the current operation is on bytes or words. When we are reading just a byte, we want to zero out the high byte of the output port and write just the low byte from memory. For a word operation, we perform two memory/ram reads and write them into the low and high parts of the output. It’s worth noting what we are doing here is technically big-endian; in that, when we do a read from a byte address, the most-significant byte is located at that address, followed by the least significant byte. I also added a chip select, which ‘disconnects’ the output when not enabled.

type store_t is array (0 to 103) of std_logic_vector(7 downto 0);
signal ram: store_t := ...

... snip ...

process (I_clk, I_cs)
begin
  if rising_edge(I_clk) and I_cs = '1' then
    if (I_we = '1') then
      if I_size = '1' then
        -- 1 byte
        ram(int_addr) <= I_data(7 downto 0);
      else
        -- 16-bit word, most significant byte first
        ram(int_addr) <= I_data(15 downto 8);
        ram(int_addr+1) <= I_data(7 downto 0);
      end if;
    else
      if I_size = '1' then
        data(15 downto 8) <= X"00";
        data(7 downto 0)  <= b1;
      else
        data(15 downto 8) <= b1;
        data(7 downto 0)  <= b2;
      end if;
    end if;
  end if;
end process;

int_addr <= to_integer(unsigned(I_addr(7 downto 0)));

b1 <= ram(int_addr);
b2 <= ram(int_addr+1);

O_data <= data when I_cs = '1' else "ZZZZZZZZZZZZZZZZ";

The embedded ram gave me some problems when it came to synthesis – I had an internal compiler error in the Xilinx tools. I narrowed this down to a single line, and then a single token – one of the boundaries of a (X downto Y) statement. I re-wrote this component to get around the issue, and it also made me realise that this method of implementing the ram may be inefficient in terms of using the rams on-device. The version listed above is the new, re-written version. You can see there are multiple reads and multiple writes each cycle. I’ll need to look into how the internal block rams of the Spartan6 are used to make sure I’m not causing issues.

For the read and write instructions, I’d conveniently left a single bit of the instruction form free for later use. This is the ‘flag bit’ in position 8. When this bit is set, it instructs the memory system to do a byte operation instead of a word operation.

[Figure: ISA memory read instruction form]

The changes to the assembler were pretty trivial: enable the read and write instructions for byte operations using the flag bit, and make the changes required for byte addressing when calculating label offsets. With those changes, everything worked well and I was able to progress. An interesting point to note now is that I could enforce instructions being on 2-byte boundaries, which would give all immediate branch offsets an extra bit of range over what they have currently. I’ve not done this yet, but I likely will.

Memory Subsystem

I’d always intended to extend the memory subsystem of TPU as to be able to interface with memories that had different latencies. Currently, it’s fixed that memory takes a single cycle to respond. I wanted to expose an interface that would enable connecting up other memory-mapped devices (like a UART) or even enable the interfacing of SDRAM on the miniSpartan6+ board.

For this I’ve created a ‘core’ component which brings everything bar the embedded ram together in a single object. The memory is handled by an internal controller, which works as a state machine and is triggered by a command port. The state machine ports are exposed like the following.

entity mem_controller is
  Port( I_clk : in  STD_LOGIC;
        I_reset : in STD_LOGIC;

        O_ready : out STD_LOGIC;
        I_execute: in STD_LOGIC;
        I_dataWe : in  STD_LOGIC;
        I_address : in  STD_LOGIC_VECTOR (15 downto 0);
        I_data : in  STD_LOGIC_VECTOR (15 downto 0);
        I_dataByteEn : in STD_LOGIC_VECTOR(1 downto 0);
        O_data : out  STD_LOGIC_VECTOR (15 downto 0);
        O_dataReady: out STD_LOGIC;

        MEM_I_ready: in STD_LOGIC;
        MEM_O_cmd: out STD_LOGIC;
        MEM_O_we : out  STD_LOGIC;
        MEM_O_byteEnable : out STD_LOGIC_VECTOR (1 downto 0);
        MEM_O_addr : out  STD_LOGIC_VECTOR (15 downto 0);
        MEM_O_data : out  STD_LOGIC_VECTOR (15 downto 0);
        MEM_I_data : in  STD_LOGIC_VECTOR (15 downto 0);
        MEM_I_dataReady : in STD_LOGIC
      );
end mem_controller;

In this component we have the internal interface, and then the external interface (prefixed with “MEM_”). The order of operations is as follows:

  1. Wait until O_ready is active
  2. Set the various inputs: address, in data or write enable, and the dataByteEn – this enables 16 or 8-bit memory modes.
  3. Signal the I_execute input until O_ready goes inactive.

This then gets routed to the external ports. I wanted to have something like this within the core object in case the need arose for implementing segmentation or other types of memory addressing extensions. Then the external address buses would be wider, supplemented by a register somewhere to set the high order bits. The actual code for the mem_controller is trivial just now:

architecture Behavioral of mem_controller is
  signal we : std_logic := '0';
  signal addr : STD_LOGIC_VECTOR (15 downto 0) := X"0000";
  signal indata: STD_LOGIC_VECTOR (15 downto 0) := X"0000";
  signal byteEnable: STD_LOGIC_VECTOR ( 1 downto 0) := "11";
  signal cmd : STD_LOGIC := '0';
  signal state: integer := 0;

begin
  process (I_clk, I_execute)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        we <= '0';
        cmd <= '0';
        state <= 0;
      elsif state = 0 and I_execute = '1' and MEM_I_ready = '1' then
        we <= I_dataWe;
        addr <= I_address;
        indata <= I_data;
        byteEnable <= I_dataByteEn;
        cmd <= '1';
        O_dataReady <= '0';
        if I_dataWe = '0' then
          -- read
          state <= 1;
        else
          state <= 2;-- write
        end if;
      elsif state = 1 then
        cmd <= '0';
        if MEM_I_dataReady = '1' then
          O_dataReady <= '1';
          state <= 2;
        end if;
      elsif state = 2 then
        cmd <= '0';
        state <= 0;
        O_dataReady <= '0';
      end if;
    end if;
  end process;
  O_ready <= ( MEM_I_ready and not I_execute ) when state = 0 else '0';
  MEM_O_cmd <= cmd;
  O_data <= MEM_I_data;
  MEM_O_byteEnable <= byteEnable;
  MEM_O_data <= indata;
  MEM_O_addr <= addr;
  MEM_O_we <= we;

end Behavioral;

The main point this serves in terms of TPU is that the control unit uses the output signals from the memory controller at the point when deciding to move to the next stage of the pipeline. This means that stages such as the fetch stage, and the memory stage, wait until the memory subsystem has indicated a command has completed:

For a write, we know the command has executed when the O_ready output goes back to active after deactivating after the signalling of I_execute.
For a read, we know the command has executed when the data we requested presents itself on the O_data line, with O_dataReady signalling the data on the line is valid for the previous request.

The controller does not have any buffering or a queue of operations, so everything waits for the memory system before continuing. At the moment, this is what I want as it makes debugging the system so much easier when things go wrong.

The ‘Core’

Our core TPU object looks like the following:

entity core is
  Port (I_clk : in  STD_LOGIC;
        I_reset : in  STD_LOGIC;
        I_halt : in  STD_LOGIC;

        -- memory interface
        MEM_I_ready : IN  std_logic;
        MEM_O_cmd : OUT  std_logic;
        MEM_O_we : OUT  std_logic;
        MEM_O_byteEnable : OUT  std_logic_vector(1 downto 0);
        MEM_O_addr : OUT  std_logic_vector(15 downto 0);
        MEM_O_data : OUT  std_logic_vector(15 downto 0);
        MEM_I_data : IN  std_logic_vector(15 downto 0);
        MEM_I_dataReady : IN  std_logic
       );
end core;

There are a few signals yet to add here; for one, there are no interrupts yet – something I’d like to add. But as you can see, the core TPU object now just exposes the memory interface, along with the clock and some control. If you apply the clock, a 16-bit request to read address 0 will be asserted, as it attempts to fetch its first instruction to execute.

In making a top-level module which we can flash to the FPGA, we need to have one of the cores, and an instance of our embedded ram. We also need to have some sort of logic which can handle the memory system commands from the core – and either let the embedded RAM service it, or some other component – such as our UART.

core_1: core PORT MAP (
  I_clk => I_clk,
  I_reset => I_reset,
  I_halt => I_halt,
  MEM_I_ready => MEM_I_ready,
  MEM_O_cmd => MEM_O_cmd,
  MEM_O_we => MEM_O_we,
  MEM_O_byteEnable => MEM_O_byteEnable,
  MEM_O_addr => MEM_O_addr,
  MEM_O_data => MEM_O_data,
  MEM_I_data => MEM_I_data,
  MEM_I_dataReady => MEM_I_dataReady
);

ebram_1: ebram Port map ( 
  I_clk => I_clk,
  I_cs => CS_ERAM,
  I_we => MEM_O_we,
  I_addr => MEM_O_addr,
  I_data => MEM_O_data,
  I_size => ram_req_size,
  O_data => ram_output_data
);

uart_1: uart_simple PORT MAP (
  I_clk => I_clk,
  I_clk_baud_count => I_clk_baud_count,-- 0x1100 -- R/W
  I_reset => I_reset,             
  I_txData => I_txData,                -- 0x1102  -- W
  I_txSig => I_txSig,                  -- 0x1103  -- W
  O_txRdy => O_txRdy,                  -- 0x1104  -- R
  O_tx => O_tx,
  I_rx => I_rx,
  I_rxCont => I_rxCont,                -- 0x1105  -- W
  O_rxData => O_rxData,                -- 0x1106  -- R
  O_rxSig => O_rxSig,                  -- 0x1107  -- R
  O_rxFrameError => O_rxFrameError     -- 0x1108  -- R
);

In addition to the above parts of the top-level design, there is the LED byte output which ends up mapping to the LEDs on the miniSpartan6+ board. Those are written to at location 0x1000, and looking at the uart_1 definition above, the comments indicate which other signals are mapped and at what location. Many of these are 1-bit signals, but I use a whole byte for simplicity.

Whilst the inputs to the embedded ram are connected directly to the MEM_ signals, the output (O_data) is mapped to another signal, as for now I have defined any address at or above 0x1000 as not existing within the embedded RAM.

ram_req_size <= '1' when MEM_O_byteEnable = "10" else '0';
CS_ERAM <= '1' when MEM_O_addr < X"1000" else '0';
MEM_I_data <= ram_output_data when CS_ERAM = '1' else IO_DATA;

ram_output_data is selected by the CPU core appropriately, and there is an IO_DATA signal for any other memory-mapped device. Additionally, ram_req_size is used to translate between the byte enable signal (2 bits corresponding to byte locations) to the 8/16-bit embedded ram input port input.

Now, there is a process which manages the MEM_cmd and ready signals, and forwards/directs data where it needs to go depending on those commands. I’ll paste the whole thing, but it’s a bit hacky and is my first attempt – I’ll probably end up reimplementing it when I need SDRAM integration.

MEM_proc: process(I_clk)
  begin
    if rising_edge(I_clk) then
      if MEM_readyState = 0 then
        if MEM_O_cmd = '1' then
          -- LEDS memory map
          if MEM_O_addr = X"1000" then
            -- leds
            IO_LEDS <= MEM_O_data( 7 downto 0);
          end if;
          -- UART1 memory map
          case MEM_O_addr is
            when X"1100" =>
              if MEM_O_we = '0' then
                IO_DATA <= I_clk_baud_count;
              else
                I_clk_baud_count <= MEM_O_data;
              end if;
            when X"1102" =>
              if MEM_O_we = '1' then
                I_txData <= MEM_O_data(7 downto 0);
              end if;
            when X"1103" =>
              if MEM_O_we = '1' then
                I_txSig <= MEM_O_data(0);
              end if;
            when X"1104" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_txRdy;
              end if;
            when X"1105" =>
              if MEM_O_we = '1' then
                I_rxCont <= MEM_O_data(0);
              end if;
            when X"1106" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"00" & O_rxData;
              end if;
            when X"1107" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_rxSig;
              end if;
            when X"1108" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_rxFrameError;
              end if;
            when others =>
          end case;
          MEM_I_ready <= '0';
          MEM_I_dataReady  <= '0';
          if MEM_O_we = '1' then
            MEM_readyState <= 2;
          else
            MEM_readyState <= 1;
          end if;
        end if;
      elsif MEM_readyState >= 1 then
        if MEM_readyState = 1 then
          MEM_I_ready <= '1';
          MEM_I_dataReady  <= '1';
          MEM_readyState <= 0;
        elsif MEM_readyState = 2 then
          MEM_I_ready <= '1';
          MEM_I_dataReady  <= '0';
          MEM_readyState <= 0;
        else
          MEM_readyState <= MEM_readyState + 1;
        end if;
      end if;
    end if;
  end process;

As we disable the embedded ram when the address is >= 0x1000, we don’t need to worry about overwriting memory when looking for a memory mapped device. The LEDS are mapped with a simple if block, and the UART with a case. With this, the TPU core can access memory, write status to LEDS and also use the UART by reading and writing known memory locations.

We can simulate it with a simple test and it works well.

[Figure: top_module_tb simulation waveform]

Using the UART

The UART itself is pretty much exactly the same one as in my previous post on the subject. I’ve fixed a few issues in the original code (some of which were mentioned in the article). All the ports are memory mapped, so to use the component from TPU assembly, we just read and write to those addresses. This is where 1) the byte addressing mode and 2) the immediate offset of memory addresses in the new read/write instructions help a lot. The general method for transmitting a byte over the UART is as follows:

  1. Wait until the ready bit is set
  2. Set the data input value
  3. Set the transmit signal bit
  4. Wait until the device signals not ready (i.e, working on a request)
  5. Unset the transmit signal bit

For receive, it’s similar:

  1. Ensure the RX is enabled
  2. Wait until the receive signal bit becomes active
  3. Read the output data value
  4. Wait until the receive signal bit de-asserts

In addition, the baud rate can be set by adjusting a counter value which is memory mapped. If you want a baud rate of 9600, you’d set the value to 5208 – which is the main system clock (50MHz) divided by the bit-rate required (50000000/9600).

In TPU assembly, the send and receive functions look simple enough.

uart_send:
  load.l  r1, 1
  load.h  r2, 0x11              #uart1 memory mapped offset r2=0x1100
us_waitready:
  read.b  r3, r2, 4             #readb [r2+4] (tx_ready)
  cmp.u   r3, r3, r3            #compare
  bro.az  r3, %us_waitready     #loop until tx_ready nonzero
  write.b r2, r0, 2             #write txdata register
  write.b r2, r1, 3             #set txsig register 1
us_waitunready:
  read.b  r3, r2, 4
  cmp.u   r3, r3, r3
  bro.anz r3, %us_waitunready   #loop until tx_ready zero
  load.l  r1, 0
  write.b r2, r1, 3             #clear txsig register
  br r6

  load.l  r1, 1
  load.h  r2, 0x11
  write.b r2, r1, 5             #enable rx
ur_waitsig:
  read.b  r3, r2, 7
  cmp.u   r3, r3, r3
  bro.az  r3, %ur_waitsig       #loop until rx_sig nonzero
  read.b  r0, r2, 6             #read data into r0
ur_waitunsig:
  read.b  r3, r2, 7
  cmp.u   r3, r3, r3
  bro.anz r3, %ur_waitunsig     #loop until rx_sig zero
  br r6

In these examples, the ‘functions’ are called with any argument in r0, the return value placed in r0, and the return PC to jump back to in r6. You can send an ‘H’ over the UART using:

  load.l r0, 0x48
  load.l r6, $L_E
  bi $uart_send
L_E:
  ... snip ...

And receive sets r0 on return to the value sent across the connection.

I’m still assembling these programs and initializing the embedded RAM with those instructions at present, but this UART now opens up the possibility of a fixed bootloader which loads a binary blob from the serial port and then executes it. The assembled instructions can be output as VHDL for easy integration. Below is an example which waits to receive a byte (which is discarded), prints ‘HELLO’, and then echoes everything it receives back to the sender.

X"8D", X"04", -- 0000: load.l  r6 0x0004
X"C1", X"4E", -- 0002: bi      0x004e
X"81", X"48", -- 0004: load.l  r0 0x48
X"8D", X"0A", -- 0006: load.l  r6 0x000a
X"C1", X"34", -- 0008: bi      0x0034
X"81", X"45", -- 000A: load.l  r0 0x45
X"8D", X"10", -- 000C: load.l  r6 0x0010
X"C1", X"34", -- 000E: bi      0x0034
X"81", X"4C", -- 0010: load.l  r0 0x4c
X"8D", X"16", -- 0012: load.l  r6 0x0016
X"C1", X"34", -- 0014: bi      0x0034
X"81", X"4C", -- 0016: load.l  r0 0x4c
X"8D", X"1C", -- 0018: load.l  r6 0x001c
X"C1", X"34", -- 001A: bi      0x0034
X"81", X"4F", -- 001C: load.l  r0 0x4f
X"8D", X"22", -- 001E: load.l  r6 0x0022
X"C1", X"34", -- 0020: bi      0x0034
X"81", X"0D", -- 0022: load.l  r0 0x0d
X"8D", X"28", -- 0024: load.l  r6 0x0028
X"C1", X"34", -- 0026: bi      0x0034
X"81", X"00", -- 0028: load.l  r0 0
X"8D", X"2E", -- 002A: load.l  r6 0x002e
X"C1", X"4E", -- 002C: bi      0x004e
X"8D", X"28", -- 002E: load.l  r6 0x0028
X"C1", X"34", -- 0030: bi      0x0034
X"C1", X"64", -- 0032: bi      0x0064
X"83", X"01", -- 0034: load.l  r1 1
X"84", X"11", -- 0036: load.h  r2 0x11 #uart1 memory mapped offset
X"67", X"44", -- 0038: read.b  r3 r2 4 
X"96", X"6C", -- 003A: cmp.u   r3 r3 r3
X"D3", X"7C", -- 003C: bro.az  r3 -4   #loop until tx_ready nonzero
X"70", X"42", -- 003E: write.b r2 r0 2 #write txdata register
X"70", X"47", -- 0040: write.b r2 r1 3  
X"67", X"44", -- 0042: read.b  r3 r2 4  
X"96", X"6C", -- 0044: cmp.u   r3 r3 r3
X"D7", X"7C", -- 0046: bro.anz r3 -4
X"83", X"00", -- 0048: load.l  r1 0
X"70", X"47", -- 004A: write.b r2 r1 3
X"C0", X"C0", -- 004C: br      r6
X"83", X"01", -- 004E: load.l  r1 1
X"84", X"11", -- 0050: load.h  r2 0x11
X"72", X"45", -- 0052: write.b r2 r1 5
X"67", X"47", -- 0054: read.b  r3 r2 7
X"96", X"6C", -- 0056: cmp.u   r3 r3 r3
X"D3", X"7C", -- 0058: bro.az  r3 -4
X"61", X"46", -- 005A: read.b  r0 r2 6  
X"67", X"47", -- 005C: read.b  r3 r2 7
X"96", X"6C", -- 005E: cmp.u   r3 r3 r3
X"D7", X"7C", -- 0060: bro.anz r3 -4
X"C0", X"C0", -- 0062: br      r6
X"C1", X"64", -- 0064: bi      0x0064
X"00", X"00"  -- 0066: dw      0x0000

Using the miniSpartan6+ FTDI chip

One thing mentioned in the post about my own UART implementation is how I used a Teensy3.1 / external USB->serial TTL cable to interface with the FPGA.

I was contacted on Twitter and made aware that the FTDI USB chip on the miniSpartan6+ is dual channel: as well as interfacing with JTAG to flash the FPGA, there is an additional COM port exposed through USB, and a TX/RX line connected to the FPGA. By connecting the TX and RX lines of my top level design to these pins, I can use the onboard USB to communicate with TPU!

This is done by assigning the external TX and RX ports of TPU, in the UCF constraints file, to the pins on the Spartan6 that are connected to the FTDI USB chip.

Doing so exposes the UART on channel B of the FTDI chip. It seems to communicate at 115200 baud, and in my tests it works well. Executing the TPU assembly above, with this USB->TPU setup, we can connect to the COM port and communicate:

[Image: putty_hello]

That ‘HELLO’ is hella-simple, but it shows a memory-mapped peripheral interface working with TPU, which is quite the milestone!

The State of TPU

TPU now has a decent set of instructions, and the ability to call functions, set up stacks, perform arithmetic and operate memory-mapped peripherals. The current top-level view of the system is below:

[Image: arch_overview_1]

You can see we have some embedded RAM, our UART, and also a memory-mapped register for setting the LEDs on the miniSpartan6+ as well as reading the on-board switches. The core is our new component that simply exposes our memory interface. It’s a bit more than just a CPU, but we need all this to get the system working!

[Image: floorplan]

At the moment we’re using more of the FPGA than I’d like – mostly due to the UART. It takes a real chunk out, bringing utilization up to 33%. I don’t really mind this, as I know there is a lot of stuff implemented in very bad ways. So I’m not worried – this value will fall.

I’m not quite ready to update the GitHub repository with what’s here (this was very rushed) but the new ISA can be found here.

That about wraps up this post. I’m currently looking into interfacing something with the TPU via an additional UART which will be fun (and fairly ridiculous, really) – so there is that to look forward to!

Thanks for reading, comments as always to @domipheus.


A UART Implementation in VHDL

I’m still working on my Soft-CPU TPU, but wanted to implement a communications channel for it to use in order to get some form of input and output from it. The easiest way to do this is to use a UART, and connect it to a USB to Serial converter for logic-level asynchronous communications.

Knowing that I’m still pretty new to VHDL and working with FPGA systems in general at this level, I decided to develop my own UART implementation. Some may roll their eyes at this, knowing there are plenty out there, and even constructs to utilize real hardware on the Spartan 6 FPGA I’m using; but I’m a fan of learning by doing.

Serial Communications

What I’m implementing is a transmitter and receiver which can operate at any baud rate, with 8 data bits, no parity and 1 stop bit. It should be able to communicate over a COM port to a PC, or to another UART. It works at logic-level voltages, which is very important – you need to use a logic-level USB-Serial cable for this. Using a true RS232 serial port will damage things if it uses the higher voltages specified.

Looking at how we transmit, the waveform looks as follows:

[Image: tx]

Assuming that the ‘baud’ clock is running at the frequency we require, you can see that it’s fairly simple how all of this works. The idle state for the TX line is always logic high. This may seem weird, but historically the distances the wires crossed meant they were susceptible to damage, and having the idle state high meant if any problem occurred with the physical wires, you’d know about it very quickly.

To transmit an 8-bit byte, a start bit is emitted which is logic low. One ‘baud tick’ later, the least significant bit of the byte is sent, and each following baud tick sends the next bit until the most significant bit is sent. Finally, a stop bit is sent, which is logic high. At this point another byte can be sent immediately – or the line left idle to transmit later, after a delay.
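
As a rough illustration of that framing, here is a minimal self-checking VHDL sketch. The entity uart_frame_sketch and the function make_frame are hypothetical names used only for illustration; they are not part of the UART described in this post. The function packs the start bit, the eight data bits and the stop bit into a 10-bit vector in which bit 0 is the first bit on the wire.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical, simulation-only sketch; not the UART from this post.
entity uart_frame_sketch is
end entity uart_frame_sketch;

architecture sim of uart_frame_sketch is
  -- Bit 0 is sent first: start bit (0), data LSB..MSB, stop bit (1).
  function make_frame(data : std_logic_vector(7 downto 0))
    return std_logic_vector is
    variable frame : std_logic_vector(9 downto 0);
  begin
    frame := '1' & data & '0';
    return frame;
  end function make_frame;
begin
  check : process
    variable frame : std_logic_vector(9 downto 0);
  begin
    frame := make_frame(X"48"); -- 'H'
    assert frame(0) = '0'
      report "start bit should be low" severity failure;
    assert frame(8 downto 1) = X"48"
      report "data bits misplaced" severity failure;
    assert frame(9) = '1'
      report "stop bit should be high" severity failure;
    wait; -- suspend the process once the checks have run
  end process check;
end architecture sim;
```

Run in a simulator, the process should finish silently; a failed assertion stops the run with its report message.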

Transmitter States

The transmitter is very simple. There is a data byte input, and a txSig port which is used to signal that the bits on the data input should be sent. When txSig is asserted, state moves from idle to a start state where a start bit is issued. From there, we progress to the data state, where the 8 bits of data are pushed to the output least-significant-bit first. Finally there is the stop bit state, before moving back to idle, or straight back to start in the case data is being streamed out.

[Image: tx_states]

For the states, I use an integer signal as it seemed the simplest and generally the most obvious way to go about it. The whole transmitter code is below.

tx_proc: process (tx_clk, I_reset, I_txSig, tx_state)
begin
  -- TX runs off the TX baud clock
  if rising_edge(tx_clk) then
    if I_reset = '1' then
      tx_state <= 0;
      tx_data <= X"00";
      tx_rdy <= '1';
      tx <= '1';
    else
      if tx_state = 0 and I_txSig = '1' then
        tx_state <= 1;
        tx_data <= I_txData;
        tx_rdy <= '0';
        tx <= '0'; -- start bit
      elsif tx_state < 9 and tx_rdy = '0' then
        tx <= tx_data(0);
        tx_data <= '0' & tx_data (7 downto 1);
        tx_state <= tx_state + 1;
      elsif tx_state = 9 and tx_rdy = '0' then
        tx <= '1'; -- stop bit
        tx_rdy <= '1';
        tx_state <= 0;
      end if;
    end if;
  end if;
end process;

As you can see from the VHDL, the start state is tx_state=0, the data state is tx_state=1..8 and the stop state is tx_state=9. The process is idle when tx_state is 0 with I_txSig=0. The tx_clk baud clock is generated from the higher-frequency system clock using a counter:

  -- TX standard baud clock, no reset
  if tx_clk_counter = 0 then
    -- chop off LSB to get a clock
    tx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 1)));
    tx_clk <= not tx_clk;
  else
    tx_clk_counter <= tx_clk_counter - 1;
  end if;

The way the I_clk_baud_count value is set is as follows:

I_clk in Hz / expected baud = I_clk_baud_count

So, for 9600bps on a system using a 50MHz clock, one would assign 50000000/9600, or 5208, to I_clk_baud_count. The TX clock is generated by negating the clock signal every 5208/2 counts of the system clock.
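
The arithmetic above can be checked with a small simulation-only sketch. The entity name baud_count_sketch and the constants CLK_HZ, BAUD and BAUD_COUNT are hypothetical and only stand in for the I_clk_baud_count input; the slices are the same ones the clock-generation code in this post uses.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical, simulation-only sketch of the divisor arithmetic.
entity baud_count_sketch is
end entity baud_count_sketch;

architecture sim of baud_count_sketch is
  constant CLK_HZ : natural := 50_000_000; -- system clock from the post
  constant BAUD   : natural := 9600;

  -- Integer division truncates: 50000000 / 9600 gives 5208.
  constant BAUD_COUNT : std_logic_vector(15 downto 0) :=
    std_logic_vector(to_unsigned(CLK_HZ / BAUD, 16));
begin
  check : process
  begin
    assert to_integer(unsigned(BAUD_COUNT)) = 5208
      report "unexpected baud count" severity failure;
    -- Dropping the LSB halves the count, so the TX clock is
    -- negated roughly every 2604 system clocks.
    assert to_integer(unsigned(BAUD_COUNT(15 downto 1))) = 2604
      report "unexpected TX half period" severity failure;
    -- Dropping 4 LSBs divides by 16 (the slice used by the RX
    -- tick code later in this post).
    assert to_integer(unsigned(BAUD_COUNT(15 downto 4))) = 325
      report "unexpected RX tick count" severity failure;
    wait; -- suspend the process once the checks have run
  end process check;
end architecture sim;
```

Because the division and the slices both truncate, the generated rates are only approximately the requested baud rate, which is normally within the tolerance of a serial link.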

Testing the transmitter

Testing this is fairly simple. Auto-generating a test bench, all we need to do is put data on the in port, and then toggle the txSig signal input.

-- send hello\n - 0x48, 0x45, 0x4c, 0x4c, 0x4f, 0x0d
-- H
I_txData <= X"48";
I_txSig <= '1';
wait until O_txRdy= '0';
I_txSig <= '0';

 -- E
wait until O_txRdy= '1';
I_txData <= X"45";
I_txSig <= '1';
wait until O_txRdy= '0';
I_txSig <= '0';


[Image: tx_tb_working]

The simulation waveform results in correct TX output, which is great. I wrote up a top-level VHDL module and flashed the miniSpartan6+ board with the same style of test (obviously not using waits – it just endlessly looped over an array containing ‘hello’) and connecting to the FPGA via PuTTY showed the TX did indeed work for this case. Time to implement receive!


Receiving needs to be implemented differently from transmit. That statement is obvious, but it’s all about how the timing is managed and where the serial input is sampled.

If we use our existing example of how we generated the TX output, and use those methods for RX, the below waveform will be the ideal situation.

[Image: rx_ideal]

Sampling on the rising edge of baud_clk we can see the sampled data is correct: a start bit, 8 data bits, and the stop bit (named ‘e’ just to differentiate). However, we do not control the timing of the RX input. It can be out of phase with the clock, and once it’s out of phase significantly the samples can result in incorrect data, as shown below. Additionally, there is a percentage error allowed in serial communications, and as this error accumulates it can confuse the receiver.

[Image: rx_bad]

We need to use a higher-rate clock, so as to lower the accumulated error across a received frame.

Super-sampling (or not)

Generally when working with other UART RX hardware you’ll see mention of a clock at higher than the baud rate. This is due to the internals super-sampling the RX input and then trying to get the ideal sample area, right in the middle of the transmitted bit. For my implementation I cheated, and still only sample once per bit, but I use a 16x baud tick along with a counter for working out where the next bit is likely to be.

The falling edge of the start bit is always sampled at the system clock, in my case, 50MHz. When found, the 16x baud counter is reset, which re-aligns the baud ticks with the start bit. A counter is reset too; as there are 16 baud ticks per bit when receiving, I then sample the start bit when the counter reaches 7, move to the next state, and reset. It will then sample for data bit 0 when the counter reaches 16, and so on, until we have a whole byte of data and an end bit.

[Image: sampling_diagram]

You can see the 16 baud ticks per transmitted bit in the simulator:

[Image: 16x_baud_sample_ticks]

And you can even see the ticks reset to re-align them with the RX when the start bit is sampled:

[Image: rx_start_bit_edge_clk_reset_cropped]

The RX Code

The RX clock is based off two things: the main system clock, and a counter which generates a ‘tick’ at 16x the baud clock (the one TX uses). To generate the ticks, we use the code below.

-- RX baud 'ticks' generated for sampling, with reset
if rx_clk_counter = 0 then
  -- x16 sampled - so chop off 4 LSB
  rx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 4)));
  rx_clk_baud_tick <= '1';
else
  if rx_clk_reset = '1' then
    rx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 4)));
  else
    rx_clk_counter <= rx_clk_counter - 1;
  end if;
  rx_clk_baud_tick <= '0';
end if;

There are a few things that could probably be optimized here, but I kept it simple for readability reasons. Note that the RX counter generating the baud ticks has a reset, unlike the transmit clock.

The actual code for the RX is presented below, with the reset, and initial start bit detection.

rx_proc: process (I_clk, I_reset, I_rx, I_rxCont)
begin
  -- RX runs off the system clock, and operates on baud 'ticks'
  if rising_edge(I_clk) then
    if rx_clk_reset = '1' then
      rx_clk_reset <= '0';
    end if;
    if I_reset = '1' then
      rx_state <= 0;
      rx_sig <= '0';
      rx_sample_count <= 0;
      rx_sample_offset <= OFFSET_START_BIT;
      rx_data <= X"00";
      O_rxData <= X"00";
    elsif I_rx = '0' and rx_state = 0 and I_rxCont = '1' then
      -- first encounter of falling edge start
      rx_state <= 1; -- start bit sample stage
      rx_sample_offset <= OFFSET_START_BIT;
      rx_sample_count <= 0;
      -- need to reset the baud tick clock to line up with the start 
      -- bit leading edge.
      rx_clk_reset <= '1';

Skipping the clock reset, which needs to be in this process (this process writes that signal, the other clock-generating process reads it), we have the initial state for the receiver, rx_state=0. This is the initial detection of the start bit, which is rx='0', sampled every system clock cycle (50MHz). Once we find this, and the rxCont input is active (which is basically RX enable), we move to state 1 and set the sample offset to OFFSET_START_BIT, which I can assure you is 7!

    elsif rx_clk_baud_tick = '1' and I_rx = '0' and rx_state = 1 then
      -- inc sample count
      rx_sample_count <= rx_sample_count + 1;
      if rx_sample_count = rx_sample_offset then
        -- start bit sampled, time to enable data
        -- this should check RX here. if it =1, should revert to state 0
        rx_sig <= '0';
        rx_state <= 2;
        rx_data <= X"00";
        rx_sample_offset <= OFFSET_DATA_BITS; 
        rx_sample_count <= 0;
      end if;
    elsif rx_clk_baud_tick = '1' and rx_state >= 2  and rx_state < 10 then
      -- sampling data
      if rx_sample_count = rx_sample_offset then
        rx_data(6 downto 0) <= rx_data(7 downto 1);
        rx_data(7) <= I_rx;
        rx_sample_count <= 0;
        rx_state <= rx_state + 1;
      else
        rx_sample_count <= rx_sample_count + 1;
      end if;
    elsif rx_clk_baud_tick = '1' and rx_state = 10 then
      if rx_sample_count = OFFSET_STOP_BIT then
        rx_state <= 0;
        rx_sig <= '1';
        O_rxData <= rx_data; -- latch data out
        if I_rx = '1' then 
          -- stop bit correct
          rx_frameError <= '0';
        else
          -- stop bit is always high, if we don't see it, there 
          -- has been an issue. Signal an error.
          rx_frameError <= '1';
        end if;
      else
        rx_sample_count <= rx_sample_count + 1;
      end if;
    end if;
  end if;
end process;

The second half simply moves across all the states, sampling the start bit when our sample count gets to the offset required (for the start bit, 7). It then moves state to sampling the 8 data bits, with offsets of 16 at a time, before checking the stop bit and indicating whether a frame error has occurred (the stop bit should be ‘1’).

Note that there is no real error checking, or input validation here. Technically, if rx_state=1 and our RX input is high, we should reset the RX system and go back to state 0 – as it was most likely a blip on the input, and not a real serial frame of data. I’ll probably add that later.
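
A minimal sketch of what that missing check could look like is below. It is not the receiver from this post: the entity rx_start_check_sketch, the I_baud_tick input and the O_start_ok output are hypothetical, and state 2 is only a stand-in for the data and stop states. The point is the single extra branch that drops back to state 0 if the line goes high again before the start bit has been sampled.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch of start-bit validation only. Signal names
-- mirror the post, but this is not the author's receiver.
entity rx_start_check_sketch is
  port (
    I_clk       : in  std_logic;
    I_reset     : in  std_logic;
    I_rx        : in  std_logic;
    I_baud_tick : in  std_logic;  -- assumed: 16x baud tick, one I_clk wide
    O_start_ok  : out std_logic   -- pulses when a start bit is accepted
  );
end entity rx_start_check_sketch;

architecture rtl of rx_start_check_sketch is
  constant OFFSET_START_BIT : integer := 7;
  signal rx_state        : integer range 0 to 2 := 0;
  signal rx_sample_count : integer range 0 to 15 := 0;
begin
  process (I_clk)
  begin
    if rising_edge(I_clk) then
      O_start_ok <= '0';
      if I_reset = '1' then
        rx_state <= 0;
        rx_sample_count <= 0;
      elsif rx_state = 0 and I_rx = '0' then
        -- falling edge seen: start counting towards mid-bit
        rx_state <= 1;
        rx_sample_count <= 0;
      elsif rx_state = 1 and I_rx = '1' then
        -- line went high before mid-bit: treat it as a blip
        rx_state <= 0;
      elsif rx_state = 1 and I_baud_tick = '1' then
        if rx_sample_count = OFFSET_START_BIT then
          O_start_ok <= '1';  -- still low at mid-bit: accept
          rx_state <= 2;
        else
          rx_sample_count <= rx_sample_count + 1;
        end if;
      elsif rx_state = 2 and I_rx = '1' then
        -- stand-in for the data and stop states of a real receiver
        rx_state <= 0;
      end if;
    end if;
  end process;
end architecture rtl;
```

Checking the line on every system clock, rather than only on baud ticks, means even a very short high pulse cancels the pending start bit; a real design might prefer to filter the input first.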

Other modules using this RX will use the O_rxSig output to indicate data received, and grab a new byte from the data output port. It stays high until a new frame begins receiving.

Testing the Receiver

Like the testing for transmit, I created a standard test bench, and filled it out with the usual content. For RX, I created a second clock with the same time period as the baud clock the UART is configured to use. I then have an array of the serial bitstream, and each rising edge of that clock I push the next bit across. You can see the whole test on github. Running it in the simulator, it works:

[Image: rx_tb_working]

I also created a loopback test, which uses the transmitter test bench at two speeds, feeding the TX line into the RX to ensure the data is correct. I’ve got the waveform from a run of that test below, also zoomed into the area where 115200bps is active (at the start).

[Image: loopback_tb]

Running On Hardware

Running on hardware is easy: just assign the RX and TX ports/nets to external pins. I created a loopback top-level module so I can type into a PuTTY serial session and see what I type echoed back.

The loopback module for the hardware uses another state machine depending on the various UART module signals. The process is fairly simple, and will just push any input from the RX to the TX and set relevant states at the correct times.

loopback: process(I_clk_50mhz, O_rxSig)
begin
  if rising_edge(I_clk_50mhz) then
    if O_rxSig = '1' and last_msg_valid ='0' and new_message = '1' then
      last_msg <= O_rxData;
      last_msg_valid <= '1';
      new_message <= '0';
    elsif O_txRdy = '1' and I_txSig = '0' and last_msg_valid = '1' then
      I_txData <= last_msg;
      I_txSig <= '1';
    elsif I_txSig = '1' and O_txRdy = '0' and last_msg_valid = '1' then
      I_txSig <= '0';
      last_msg_valid <= '0';
    elsif O_rxSig = '0' then
      new_message <= '1';
    end if; 
  end if;
end process;

[Image: mess]

Sadly, I could not find my USB to Serial TTL converter in my lab mess. It’s in there somewhere. But I did find an old Teensy 3.1 (it’s actually from my TeensyZ80 build) which I used to forward serial to and from the miniSpartan6+. Keys I typed were echoed back, at 115200bps. So a successful test.

[Image: putty_loopback]

[Image: spartan_teensy]

Wrap Up

That pretty much finishes this post off. It’s by no means a finished implementation but works for what I need. I’ll be using it with TPU as a peripheral, and memory mapping the various ports so as to control it from software. I think I’ll add some FIFO buffers to the input and output data lines to ensure I don’t lose data, implement the RX error checking I mentioned earlier, and also add a ‘number of frames’ counter for software-side error checking.

It should be made clear, though, that there are probably UART constructs available within any recent FPGA that will take up far fewer resources than this, and they should be used in final projects where possible and sensible!

Thanks for reading, the code for this is available on github.

Designing a CPU in VHDL, Part 8: Revisiting the ISA, function calling, assembler

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

We’re at the point now where the CPU can run some more involved examples. The examples we’ve run to date on the simulator have been fairly simple, and more to the point, tailored to what we have available. I wanted to take a look back at the ISA, to see where we can make some worthwhile changes before moving forward.

Our more complex example code

Trivial 16-bit multiply!

It’s incredibly simple, again. But, that’s because we are missing some pretty fundamental functionality from the TPU. Even this tiny example exposes them.

The example I came up with is as follows:

  1. nominate a register for a stack location and set it.
  2. Set up a simple stack frame to execute a multiply function which takes two 16bit operands.
  3. Call the ‘mul16’ function
  4. in mul16()
    1. grab arguments from the stack
    2. perform the multiplication
    3. return our result in r0
  5. perform some sort of jump away to a safe place of code where we halt using an infinite loop.

This example, in code form, is similar to this:

typedef unsigned short ushort;

ushort mul16(ushort a, ushort b) {
  ushort sum = 0;
  while (b != 0) {
    sum += a;
    b--;
  }
  return sum;
}

int main(void) {
  ushort ret = mul16(3, 7);
  while (1) {
    ret |= ret;
  }
}
For this example, I defined r7 as the stack register. It was set to the top of our embedded ram block, and the stack will grow downwards. We need to store the two mul16 parameters, as well as our return address. As we address 16 bit words instead of the more typical 8-bit bytes, we only subtract 3 from the current stack pointer value. We then need to write in at various offsets our parameters:

sp = return PC
sp+1 = ushort a
sp+2 = ushort b

The first thing to notice is we are writing these values to constant offsets of a register value r7 (our SP). At the moment, our ISA only has a write to an address which is located in a register, so we either need to perform writes and additions using a temporary register, or implement new functionality in TPU.

Reads and Writes to memory with offset

Currently our write instruction takes a destination memory address specified in rA and a value to write specified in rB. The Read memory instruction is similar, but uses rD for the destination register, and rA as the address. This is due to rD being the only internal data select path into the register file.

Looking at the old instruction forms we have various unused bits that are enough to hold a significant offset value for our memory operations. In the case of the write instruction, these bits are non-contiguous, but we can solve that in the decoder. Our new read instruction looks like the following.

[Image: read]

The write instruction is a little less clear, because its offset bits are split across the instruction word.


This is when having the immediate data output from the decoder 16-bits becomes useful. We extend the decoder to make those top 8 bits dependent on the instruction opcode, so that when a write is decoded, the immediate offset value is recombined ready for use by the ALU.

  O_dataIMM(15 downto 8) <= I_dataInst(IFO_RD_BEGIN downto IFO_RD_END)
            & I_dataInst(IFO_F2_BEGIN downto IFO_F2_END) & "000";
  O_regDwe <= '0';

The changes to the ALU are minimal, and we just do the inefficient thing of adding another adder. Knowing from the previous part that TPU currently takes up a tiny 3% of the Spartan6 LX25 resources, we can concentrate on getting functionality in rather than optimizing for space.

  -- Write: the result is the address we want.
  -- First 5 bits of the Imm value is an offset.
  s_result(15 downto 0) <= std_logic_vector(signed(I_dataA) + signed(I_dataIMM(15 downto 11)));
  s_shouldBranch <= '0';

  -- Read: the result is the address we want.
  -- Last 5 bits of the Imm value is an offset.
  s_result(15 downto 0) <= std_logic_vector(signed(I_dataA) + signed(I_dataIMM(4 downto 0)));
  s_shouldBranch <= '0';

You can see the ALU code is very similar. We treat the 5-bit immediate as a signed value, as [-16, 15] is a wide enough range of offsets, and being able to offset back as well as forward will come in very handy.
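
To make the signed 5-bit offset concrete, here is a small self-checking sketch. The entity imm_offset_sketch and the function offset_addr are hypothetical and not part of the TPU ALU; the sketch uses an explicit resize() to show the sign extension that the signed addition relies on.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical, simulation-only sketch; not the TPU ALU itself.
entity imm_offset_sketch is
end entity imm_offset_sketch;

architecture sim of imm_offset_sketch is
  -- Add a signed 5-bit offset to a 16-bit base address.
  function offset_addr(base : std_logic_vector(15 downto 0);
                       imm5 : std_logic_vector(4 downto 0))
    return std_logic_vector is
  begin
    -- resize() sign-extends the 5-bit value to 16 bits.
    return std_logic_vector(signed(base) + resize(signed(imm5), 16));
  end function offset_addr;
begin
  check : process
  begin
    assert offset_addr(X"0024", "00010") = X"0026"  -- +2
      report "positive offset failed" severity failure;
    assert offset_addr(X"0024", "11111") = X"0023"  -- -1
      report "negative offset failed" severity failure;
    assert offset_addr(X"0024", "10000") = X"0014"  -- -16
      report "most negative offset failed" severity failure;
    wait; -- suspend the process once the checks have run
  end process check;
end architecture sim;
```

The three cases cover a forward offset, a backward offset and the most negative value the 5-bit field can hold.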

Calling Functions

Getting back to our example, we need to store the program location that we need to return to after executing our mul16 function. Amazingly, we didn’t have an instruction for getting the current PC, so this was impossible. It was very easy to add, though. The current PC is forwarded to the ALU – just use one of the two reserved opcodes we have free to define a set of special state operations.

[Image: spc_sstatus]

The ALU code to serve these instructions is trivial.

when OPCODE_SPEC => 	-- special
  case I_dataIMM(IFO_F2_BEGIN downto IFO_F2_END) is
    when OPCODE_SPEC_F2_GETPC =>
      s_result(15 downto 0) <= I_PC;
    when OPCODE_SPEC_F2_GETSTATUS =>  -- sstatus; constant name assumed here
      s_result(1 downto 0) <= s_result(17 downto 16);
    when others =>
  end case;
  s_shouldBranch <= '0';

The sstatus, or get status instruction, will be used to get overflow and carry status bits – which currently are not implemented.

Now that we can get the current PC value, we can use this to calculate the return address for our callee function to jump to on return. The assembly looks as follows.

  load.l  r7, 0x27    # Top of the stack
  load.l  r1, 7       # constant argument 2
  load.l  r2, 3       # constant argument 1
  subi    r7, r7, 3   # reserve 3 words of stack
  write   r7, r1, 2   # write argument at offset +2
  write   r7, r2, 1   # write argument at offset +1
  spc     r6          # get current pc
  addi    r6, r6, 4   # offset to after the call
  write   r7, r6      # put return PC on stack
  bi      $mul16      # call
  addi    r7, r7, 3   # pop stack

This creates a call stack for mul16 containing its two parameters, and the location of where it should branch to when it returns.

Immediate arithmetic

You may have noticed two new instructions in the above code snippet – addi and subi. These were added to account for the fact simply incrementing/decrementing registers needed an immediate load, which then used up one of our registers.

The add and sub instructions both have two unused flag bits, so one of them was used to signal immediate mode. In this mode, rD and rA are used as normal, but rB is disregarded, and 5-bits are used to represent an unsigned immediate value.

[Image: addi]

I took the decision to use only unsigned versions of this instruction, as I thought if someone was really interested in proper overflow detection, they wouldn’t mind taking the additional register penalty, and use the existing add instruction using a register.

In the VHDL, I again didn’t care about resources, and simply added yet another if conditional with adders.

when OPCODE_ADD =>
  if I_aluop(0) = '0' then
    if I_dataImm(0) = '0' then
      s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned( '0' & I_dataB));
    else
      s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned( '0' & X"000" & I_dataIMM(4 downto 1)));
    end if;
  else
    s_result(16 downto 0) <= std_logic_vector(signed(I_dataA(15) & I_dataA) + signed( I_dataB(15) & I_dataB));
  end if;
  s_shouldBranch <= '0';

The last 8 bits in dataImm always contain the last 8 bits of our instruction word, so we just use that for both the immediate mode check and then for the 5 bits of value itself.

The mul16 Function

Let’s recap the C style version of our function:

ushort mul16(ushort a, ushort b) {
  ushort sum = 0;
  while (b != 0) {
    sum += a;
    b--;
  }
  return sum;
}

And in the TPU assembly written so far, our stack pointed to by r7 resembles the following:

[Image: stack]

The assembly code for the mul16 function is therefore as follows.

mul16:
  read    r1, r7, 2
  read    r2, r7, 1
  load.l  r0, 0
mul16_loop:
  cmp.u   r5, r2, r2
  bro.az  r5, %mul16_fin
  add     r0, r0, r1
  subi.u  r2, r2, 1
  bi      $mul16_loop
mul16_fin:
  read    r6, r7, 0
  br      r6

Pretty simple stuff, but again – a new instruction! bro.az = branch to relative offset when A is zero.

Conditional Branch to relative offset

If you remember our previous parts discussing the conditional branching, and even our first part, you’ll remember that they could only branch to a target stored in a register. It was incredibly inefficient for small loops, taking up a register and bloating the code.

Before implementing relative offset branching, there was a need to make the conditional branching instructions more sane. The conditional bits in the instruction which form the type of condition were split and spread out in the instruction form, despite us not using the rD bits. This was changed, so we have a new instruction coding for conditional jumps:

[Image: bcond]

With this now done, adding relative branch targets was fairly simple. The flag bit (8) is used to detect whether we branch to a register value or an immediate offset from the current PC:

[Image: bro]

The VHDL checks for the flag bit, and selects a different branch target.

  if I_aluop(0) = '1' then
    s_result(15 downto 0) <= std_logic_vector(signed(I_PC) + signed(I_dataIMM(4 downto 0)));
  else
    s_result(15 downto 0) <= I_dataB;
  end if;

You can see the 5-bit immediate is signed, allowing conditional jumps backwards in the instruction stream. As any TIS-100 player will know, JRO’s backwards are very useful – especially in a multiplier 😉

The full multiplier test

I’ve put the full multiplier assembly listing below, which is bulky but I think helps in understanding the flow.

  load.l  r7, 0x27    # Top of the stack
  load.l  r1, 7       # constant argument 1
  load.l  r2, 3       # constant argument 2
  subi    r7, r7, 3   # reserve 3 words of stack
  write   r7, r1, 2   # write argument at offset +2
  write   r7, r2, 1   # write argument at offset +1
  spc     r6          # get current pc
  addi    r6, r6, 4   # offset to after the call
  write   r7, r6      # put return PC on stack
  bi      $mul16     # call
  addi    r7, r7, 3   # pop stack
  bi      $end

# Multiply two u16s. Doesn't check for overflow.
mul16:
  read    r1, r7, 2
  read    r2, r7, 1
  load.l  r0, 0
mul16_loop:
  cmp.u   r5, r2, r2
  bro.az  r5, %mul16_fin
  add     r0, r0, r1
  subi.u  r2, r2, 1
  bi      $mul16_loop
mul16_fin:
  read    r6, r7, 0
  br      r6

halt:
  bi     $halt

end:
  or     r0,r0,r0
  bi     $end

If this test works, we should be able to see r0 containing the result of our multiply (21 or 0x15) and the waveform should show the shouldBranch signal oscillating due to the end jump over an or. If shouldBranch is high at all times, we know we’ve hit halt so something isn’t quite right. I’ve not done typical calling convention things such as saving out volatile registers, but it’s easy to see how that would work. But I’m sure those reading by now will be wondering how I get those assembly listings into my test benches in VHDL.

The TPU Assembler – TASM

I have written a 1-file assembler in C# for the current ISA of TPU. In its thousand lines of uncommented splendour lies an abundance of coding horrors – fit for the Terrible Processing Unit. It works perfectly well for what I want – just don’t look too deep into it.

I wrote this in a few hours early on in the project, because as you can imagine, writing out instruction forms manually is tedious. The assembler is very simple and is fully self contained without any dependencies. It contains definitions for instructions, how to parse instruction forms, and how to write out their binary representation.

The functional flow for the assembler is as follows:

  1. Parse arguments and open the input file
  2. For each line in the input file
    1. If it starts with a ‘#’, ignore it as a comment
    2. Split the line into strings by whitespace and commas
    3. If the first element ends with a ‘:’, treat it as a label and note its location
    4. Add the rest as an instruction definition to a list of inputs
  3. For each input definition, replace label names with actual values
  4. Parse all definitions into a list of Operation Data objects
  5. Open the output file
  6. Output the instruction data using a particular format generator
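
Steps 2 and 3 can be sketched roughly as below. This is Python written for illustration, not the C# TASM source, and the function names are made up. It assumes one memory word per input line, that trailing ‘#’ comments are dropped, and that a relative (%) label resolves to the target address minus the address of the referencing instruction, which is what the assembled listing in this post suggests.

```python
import re


def first_pass(source: str):
    """Collect label addresses and tokenised instruction lines."""
    labels, inputs = {}, []
    for line in source.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                              # blank line or comment line
        line = line.split("#", 1)[0]              # assumption: drop trailing comments
        tokens = [t for t in re.split(r"[\s,]+", line) if t]
        if tokens and tokens[0].endswith(":"):
            labels[tokens[0][:-1]] = len(inputs)  # label marks the next instruction
            tokens = tokens[1:]
        if tokens:
            inputs.append(tokens)
    return labels, inputs


def resolve_labels(labels, inputs):
    """Replace $label with an absolute address and %label with an offset."""
    resolved = []
    for address, tokens in enumerate(inputs):
        out = []
        for tok in tokens:
            if tok[0] == "$" and tok[1:] in labels:
                out.append(f"0x{labels[tok[1:]]:04x}")
            elif tok[0] == "%" and tok[1:] in labels:
                out.append(str(labels[tok[1:]] - address))
            else:
                out.append(tok)
        resolved.append(out)
    return resolved


demo = """
# tiny example
halt:
  bi     $halt
end:
  or     r0,r0,r0
  bi     $end
"""
labels, inputs = first_pass(demo)
for addr, toks in enumerate(resolve_labels(labels, inputs)):
    print(f"{addr:04X}: {' '.join(toks)}")
```

Collecting every label before substituting any of them is what lets a branch refer to a label defined further down the file.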

Assembler Features

The assembler accepts instruction mnemonics as per the ISA document, but will accept some additional ones – like add, which is simply treated as add.u.

There is a data definition (data/dw) which outputs 16-bit hex values directly to the instruction stream. It accepts labels as absolute ($ prefix) and relative (% prefix) values. It does not currently support setting the current location in memory of definitions – the first line is location 0x0000, and it continues from there.

Errors are not handled gracefully, and there is no real input checking. You could pass a relative offset into a conditional branch which is outside of the bounds of the instruction, and it will generate incorrect code. I’ll fix this stuff at a later date.
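
As a sketch of the sort of check that is missing, the snippet below rejects a relative offset that does not fit its immediate field. It is Python for illustration only, not TASM code. The helper names are hypothetical, and both the field width passed in and the assumption that the field is signed two's complement are placeholders rather than facts about the TPU encoding.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a two's-complement field of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def check_relative_offset(offset: int, bits: int, line_no: int) -> int:
    """Return the encoded field, or raise with the offending source line number."""
    if not fits_signed(offset, bits):
        raise ValueError(
            f"line {line_no}: relative offset {offset} does not fit in {bits} signed bits"
        )
    return offset & ((1 << bits) - 1)   # mask to the field width


print(check_relative_offset(4, 5, line_no=17))   # width of 5 is illustrative only
```

Raising an error at assembly time, with the source line number attached, is much easier to debug than silently emitting a truncated offset.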

Output from the assembler is either binary, hex, or ‘eram’. The Embedded Ram (eram) format is basically VHDL initialization, with the original listing and offsets as comments. The example above assembles to the following:

X"8F27", -- 0000: load.l  r7 0x27 # Top of the stack
X"8307", -- 0001: load.l  r1 7 # constant argument 1
X"8503", -- 0002: load.l  r2 3 # constant argument 2
X"1EE7", -- 0003: subi    r7 r7 3 # reserve 3 words of stack
X"70E6", -- 0004: write   r7 r1 2 # write argument at offset +2
X"70E9", -- 0005: write   r7 r2 1 # write argument at offset +1
X"EC00", -- 0006: spc     r6 # get current pc
X"0CC9", -- 0007: addi    r6 r6 4 # offset to after the call
X"70F8", -- 0008: write   r7 r6 # put return PC on stack
X"C10C", -- 0009: bi      0x000c # call
X"0EE7", -- 000A: addi    r7 r7 3 # pop stack
X"C117", -- 000B: bi      0x0017
X"62E2", -- 000C: read    r1 r7 2
X"64E1", -- 000D: read    r2 r7 1
X"8100", -- 000E: load.l  r0 0
X"9A48", -- 000F: cmp.u   r5 r2 r2
X"D3A4", -- 0010:  r5 4
X"0004", -- 0011: add     r0 r0 r1
X"1443", -- 0012: subi.u  r2 r2 1
X"C10F", -- 0013: bi      0x000f
X"6CE0", -- 0014: read    r6 r7 0
X"C0C0", -- 0015: br      r6
X"C116", -- 0016: bi      0x0016
X"2000", -- 0017: or      r0 r0 r0
X"C117", -- 0018: bi      0x0017

This is simply pasted into our VHDL RAM objects. For now it needs padding out by hand to the correct size of the RAM. I want to add this as a feature, so that you pass in the size of the eRAM and the rest is automatically initialized to zero. We can then simulate and see the TPU running well with the ISA additions.
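
A padding feature along those lines could look something like the sketch below. It is Python for illustration rather than part of TASM; format_eram and its parameters are hypothetical names, and the line layout simply copies the eram output shown in this post.

```python
def format_eram(words, listing, ram_size):
    """Emit VHDL initialiser lines, zero-padding up to ram_size words."""
    if len(words) > ram_size:
        raise ValueError("program does not fit in the RAM")
    lines = []
    for addr in range(ram_size):
        sep = "" if addr == ram_size - 1 else ","   # no comma after the final entry
        if addr < len(words):
            lines.append(f'X"{words[addr]:04X}"{sep} -- {addr:04X}: {listing[addr]}')
        else:
            lines.append(f'X"0000"{sep}')           # padding word
    return "\n".join(lines)


print(format_eram([0x8F27, 0x8307], ["load.l  r7 0x27", "load.l  r1 7"], 4))
```

Failing when the program is larger than the RAM, instead of truncating it, keeps an oversized listing from silently losing instructions.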

[Image: mul16_sim]

Wrapping Up

I hope this has shown how easy it was to go in and fix some ISA mistakes made in the past and implement some new functionality. Also, it’s been nice to introduce TASM, despite the assembler itself being about as robust as a matchstick house.

The changes made to the VHDL have increased the resource requirement of the TPU on a Spartan6 LX25 from 3% to 5%, but an increase was expected given so many additional adders.

For next steps, I’m going to concentrate on the top-level VHDL entities for further deployment to the miniSpartan6+.

Thanks for reading, comments as always to @domipheus.