Designing a CPU in VHDL, Part 10: Interrupts and Xilinx block RAMs

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Part 10 was supposed to be a very big part, with a special surprise of TPU working with a cool peripheral device, but that work is still ongoing. It’s taking a long time to do, mostly due to being busy myself over the past few weeks. However, in this update, I’ll look at bringing interrupts to TPU, as well as fixing an issue with the embedded ram that was causing bloating of the synthesized design.


Interrupts are needed on a CPU which is expected to work with multiple asynchronous devices whilst also doing some other computation. You can always have the CPU poll, but sometimes that isn’t wise and/or suitable given other constraints. Interrupts are also good for keeping time with something – vsync, for example. This is where they come in – a signal fed to the CPU externally can “interrupt” what the CPU is currently executing, and have it perform some other computation before returning to its previous task.

The way I have implemented the interrupts is similar to the Z80 maskable interrupts, with an external interrupt input and an interrupt acknowledge output. The system is simplified and doesn’t have the different types of modes and non-maskable interrupts available on the Z80 but it should be enough for the needs of TPU. You can only handle a single request at a time, and there is only one mode to work with – but it’s powerful enough for most situations.

An overview of how the interrupts will work is as follows:

  • At some point during execution, the system will drive the interrupt input to TPU high, indicating it wants the interrupt handler run.
  • At the next writeback stage of the pipeline, just before migrating to the fetch stage, the interrupt input is sampled.
  • If an interrupt is requested, the control unit will then make the interrupt acknowledge output from TPU active.
  • Once the interrupt ACK signal is seen externally to TPU, 16 bits of data can be placed on the data input to TPU.
  • After a predetermined number of cycles, the bits on the data in bus are stored.
  • The ACK is de-asserted, and the PC of TPU is set to the interrupt handler.
  • The handler can retrieve the data from the data bus via a new instruction, and also return to the previous PC before the interrupt was acknowledged.
  • The external interrupt input is latched, so if it simply remains active it will not invoke the handler again; it must go inactive for a cycle before another interrupt can be raised.
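As a rough sketch of that last point, the latching could look something like the following. The entity and port names are made up for illustration and are not the ones used inside TPU; the idea is only that a request is raised once per low-to-high transition of the input, and cleared when the control unit takes it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper, not part of the TPU source: turns a level on I_int
-- into at most one pending request per low-to-high transition.
entity int_latch is
  port (
    I_clk     : in  std_logic;
    I_reset   : in  std_logic;
    I_int     : in  std_logic;  -- external interrupt input
    I_taken   : in  std_logic;  -- pulsed by the control unit when it accepts the request
    O_pending : out std_logic   -- request waiting to be sampled at the end of writeback
  );
end int_latch;

architecture rtl of int_latch is
  signal prev    : std_logic := '0';
  signal pending : std_logic := '0';
begin
  process (I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        prev    <= '0';
        pending <= '0';
      else
        prev <= I_int;
        if I_int = '1' and prev = '0' then
          pending <= '1';  -- new rising edge: raise a request
        elsif I_taken = '1' then
          pending <= '0';  -- request consumed by the control unit
        end if;
      end if;
    end if;
  end process;

  O_pending <= pending;
end rtl;
```

Holding I_int high keeps prev at '1', so no second request is raised until the input has been low for at least one clock.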

It’s very important that the interrupt input is only acted upon during the end of the writeback stage. Doing it at any other point can result in an inconsistent execution state, whereby we do not know if the current instruction has executed to completion. Doing the interrupt at the end of a writeback means:

  1. the PC we save (to return to later) is already the ‘next’ PC, be that prev_pc+2, or a branch target;
  2. memory reads have had time to complete successfully; and
  3. any registers have had time to see and act upon write enable signals to store data.

The items that are needed, therefore, are:

  • Internal registers for the stored PC (to return to after interrupt handler), the interrupt data field passed on the data in bus, and an interrupt enable bit
  • Various connections between the parts of the sub-modules for handling storing of the PC and interrupt data
  • Control unit additions for the interrupt handler step
  • New instructions for getting interrupt data and returning from an interrupt

Internal registers & Connections

I added a 16-bit register for the ‘next PC’ and also the ‘interrupt data’ to the ALU itself, rather than adding it to the register file. There are individual set/write control lines and also data lines for them into the ALU. It’s a bit messy and adds a lot of ports to the ALU and control unit, but it worked and I can change this later if I want to tidy things up. Having the registers part of the ALU makes the instructions that access them incredibly simple and self contained.

Control unit additions

The control unit now has an interrupt state, all of the control signals for setting the registers in the ALU and also the logic for managing the phases of calling into the interrupt handler. If interrupts are enabled, the interrupt input is active and it’s the end of the writeback phase, the following occurs:

  1. Interrupt_ack is activated
  2. A cycle of latency is provided
  3. The bits on the data in bus are sampled and the ALU instructed to store this value
  4. The current PC (which is, at this point, the next instruction to execute) is saved by the ALU
  5. The PC unit sets the current PC to the interrupt vector, currently fixed at 0x0008.
  6. The control unit resets its interrupt state, and proceeds to the fetch stage of the pipeline.
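The sequence above maps quite naturally onto a small state machine. The following is only a sketch under assumptions: the state names, the strobe outputs and the exact number of cycles spent in each phase are invented for illustration, and the real control unit folds this into its existing pipeline state.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sequencer; port and state names are illustrative only.
entity int_sequencer is
  port (
    I_clk       : in  std_logic;
    I_reset     : in  std_logic;
    I_start     : in  std_logic;  -- interrupts enabled, request active, end of writeback
    O_ack       : out std_logic;  -- interrupt acknowledge to the outside world
    O_store_ief : out std_logic;  -- ask the ALU to store the value on the data in bus
    O_save_pc   : out std_logic;  -- ask the ALU to save the current (next) PC
    O_set_pc    : out std_logic;  -- ask the PC unit to load O_vector
    O_vector    : out std_logic_vector(15 downto 0)
  );
end int_sequencer;

architecture rtl of int_sequencer is
  type state_t is (S_IDLE, S_ACK, S_WAIT, S_SAMPLE, S_JUMP);
  signal state : state_t := S_IDLE;
begin
  process (I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        state <= S_IDLE;
      else
        case state is
          when S_IDLE =>
            if I_start = '1' then
              state <= S_ACK;          -- step 1: raise the acknowledge
            end if;
          when S_ACK    => state <= S_WAIT;    -- step 2: a cycle of latency
          when S_WAIT   => state <= S_SAMPLE;  -- steps 3 and 4: store data, save PC
          when S_SAMPLE => state <= S_JUMP;    -- step 5: load the vector
          when S_JUMP   => state <= S_IDLE;    -- step 6: back to normal fetch
        end case;
      end if;
    end if;
  end process;

  O_ack       <= '1' when state = S_ACK or state = S_WAIT or state = S_SAMPLE else '0';
  O_store_ief <= '1' when state = S_SAMPLE else '0';
  O_save_pc   <= '1' when state = S_SAMPLE else '0';
  O_set_pc    <= '1' when state = S_JUMP else '0';
  O_vector    <= X"0008";  -- fixed interrupt vector
end rtl;
```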

At the moment, interrupts are not disabled automatically when the handler is invoked, so the first instruction must be a disable interrupt instruction.

New Instructions

There are four new instructions used to manage and handle interrupts.

gief

The Get Interrupt Event Field instruction transfers the value that was on the data bus just after the interrupt acknowledge into a register for further use. Using this value, we can work out what caused the interrupt and perform further actions from that point. An example of this is a UART: the interrupt data field could contain the UART identifier in the high 8 bits, and the byte of data which was received in the lower 8 bits.

bbi

Branch Back from Interrupt is similar to the reti instruction in the Z80. It branches back to the PC value which was due to be fetched next before the interrupt handler was invoked.

ei

The enable and disable interrupt instructions are fairly obvious.

The interrupt vector

The interrupt vector is fixed at address 0x0008. The shape of the interrupt handler should be something like the following:

  1. disable interrupts
  2. Save all registers
  3. get the interrupt event data field
  4. Perform action according to interrupt event field, or add the field data to a queue for later processing.
  5. restore all registers
  6. enable interrupts
  7. Branch back to ‘normal’ code.

Saving the registers can be done by saving to the current stack and then restoring before returning from the handler. I’ve been using r7 as a ‘standard’ stack pointer in our very ad-hoc ABI spec, so this can be done. This does use user stack, though, so it needs to be taken into account if stack space is a particular concern.

There are a few issues that could occur, mainly in timing between disabling and enabling the interrupts. There could be a new interrupt to be handled when the enable interrupts instruction is processed, and this interrupt will then be accepted before the bbi instruction to branch back. This will destroy the original PC value when the original interrupt was raised, so I will probably change things around. There are a few solutions to this, one being that interrupts are by definition disabled when the branch to the interrupt vector occurs, and then a bbi instruction implicitly turns interrupts on again. I’ll need to have a think about the best course of action for this.
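To make the first of those possible solutions a little more concrete, here is a minimal sketch of an interrupt enable flag that is cleared automatically on entry to the handler and set again by bbi. None of these signal names exist in TPU at the moment; they are assumptions for the sake of the example, and this is just one way it could be wired up.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical interrupt enable flag with implicit disable/enable.
entity int_enable is
  port (
    I_clk     : in  std_logic;
    I_reset   : in  std_logic;
    I_ei      : in  std_logic;  -- enable interrupts instruction executed
    I_di      : in  std_logic;  -- disable interrupts instruction executed
    I_enter   : in  std_logic;  -- branch to the interrupt vector is being taken
    I_bbi     : in  std_logic;  -- bbi instruction executed
    O_enabled : out std_logic
  );
end int_enable;

architecture rtl of int_enable is
  signal enabled : std_logic := '0';
begin
  process (I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' or I_enter = '1' then
        enabled <= '0';  -- entering the handler always masks further interrupts
      elsif I_bbi = '1' then
        enabled <= '1';  -- returning from the handler unmasks them again
      elsif I_di = '1' then
        enabled <= '0';
      elsif I_ei = '1' then
        enabled <= '1';
      end if;
    end if;
  end process;

  O_enabled <= enabled;
end rtl;
```

With something like this, a new interrupt cannot be accepted between the end of the handler and the branch back, because the flag only goes high again as part of the bbi itself.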

The makeup of the test interrupt routines I’ve had are like the following (snipped for clarity)

  load.h  r7, 0x08
  subi    r7, r7, 4
  bi      $start
  dw      0x0000
intvec:   #interrupt vector 0x8
  # save the registers
  gief    r0
  #    inspect r0 for interrupt type
  #    branch to some other work
  # restore the registers
  load.l  r0, 0

The interrupt handler, whilst a bit messy in its implementation, works well in simulation. I’ve yet to use it when TPU is running on the FPGA with an external source, but I do not foresee many issues other than the one stated above.

A Look in the simulator

[Figure: interrupts_waveform_numbered]

The above waveform shows an interrupt being flagged on a UART receive event, with the event field containing the UART ID (1) and the byte value received (0x4f). Walking through the waveform, we get the following:

  1. The UART has received a byte and signaled this.
  2. An interrupt is immediately raised.
  3. Several cycles later the ACK is signaled by the CPU
  4. The interrupt event field (IEF) data is placed on the data in bus after a cycle of delay
  5. The ACK is de-asserted, and the IEF is removed from the data in bus and saved internally (to later be used via the gief instruction)
  6. The CPU branches to the interrupt vector 0x0008, requesting the instruction from memory

The internal RAM

I mentioned previously that the design resources had shot up, and it turns out this is due mainly to the internal ram not being synthesized as a block ram. I was getting an internal compiler error in the Xilinx toolchain when building the existing ram with a larger capacity (I think it was 512 bytes at this point) and to counter this I re-implemented the ram in another way. The way I did it, though, added an asynchronous element which in turn forced the toolchain to implement the RAM via look-up tables, instead of utilizing the block ram. This is why there was a jump in resource requirements when using the Spartan6.

Block Rams

I could not get around the internal compiler error without an async element, so off to the documentation for the spartan6 I went. Turns out there is a document specifically on the block rams available on the device I have.

The block rams are used by instantiating a primitive in VHDL, setting its generics to various constants, and then interfacing with the ports that primitive exposes. There are two kinds of block rams available, but I decided to use the 18 kilobit, dual-port one: RAMB16BWER. It is made up of 16Kb for data and 2Kb for parity. ISE has a nice template library for instantiation of primitives, and the block ram I use is included. It can be found within Edit->Language Templates, and then within VHDL->Device Primitives->Spartan6->RAM/ROM.

[Figure: lang_templates]

This brings up a window with initialization code to copy and paste into your own design. I took it, and edited the relevant areas to configure it for a 16-bit addressed memory.

Despite having the existing integrated ram address bytes explicitly, I decided against that with the block ram and instead addressed 16-bit values. To the TPU programmer, it still addresses bytes, but internally, the data is really stored as 16-bit (2-byte) blocks. The main reason for this was latency and complexity. By addressing 16-bit values internally in the block ram, I can implement both 16-bit reads/writes and also 8-bit reads and writes using a single port. The RAMB16BWER has a byte-wise write enable, so I can write either the high or low 8 bits of a memory location internal to the block ram, leaving the other half untouched. There is one issue that arises from this method – an unaligned 16-bit read/write (i.e., the address being odd) will result in incorrect behavior. At the moment nothing happens if you try this, but I intend to add a trap/exception. I could maybe invoke the interrupt handler with a known interrupt event field value to specify an unaligned memory operation.
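Detecting the unaligned case is simple enough in itself. As a sketch only (this entity does not exist in TPU, and the port names are assumptions modelled on the ram's I_cs, I_size and I_addr inputs, where I_size = '1' means a byte access), the condition and a simulation-time warning could look like this:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical check for an unaligned 16-bit access.
entity unaligned_check is
  port (
    I_cs        : in  std_logic;  -- chip select for the ram
    I_size      : in  std_logic;  -- '1' = byte access, '0' = 16-bit access
    I_addr0     : in  std_logic;  -- least significant bit of the byte address
    O_unaligned : out std_logic   -- could later feed a trap or interrupt request
  );
end unaligned_check;

architecture rtl of unaligned_check is
begin
  O_unaligned <= '1' when I_cs = '1' and I_size = '0' and I_addr0 = '1' else '0';

  -- Simulation aid: report the problem as soon as it appears in a testbench.
  assert not (I_cs = '1' and I_size = '0' and I_addr0 = '1')
    report "unaligned 16-bit memory access"
    severity warning;
end rtl;
```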


There were several gotchas I encountered whilst trying the block ram with a testbench. The addressing scheme, first of all, was confusing. As the generic component was initialized with relevant 16-bit addressing (18-bit when you include parity), I assumed it would transform the address itself into the correct form. This did not seem to be the case after running the test bench. The documentation has a table of mappings and also a formula, but in the end it only took a few minutes of inspection in the simulator to work out what was happening.

[Figure: blockramaddress]

The next issue was a rather silly affair! The initialization attributes for the block ram are written in most-significant to least-significant order. Due to this, the two bytes of each 16-bit instruction need to be flipped relative to how they read in the code, and the instructions also run from right to left along the initialization attribute.

-- BEGIN TASM RAMB16BWER INIT OUTPUT                                         
INIT_00 => X"06831180E27F00300000004F4C4C454801E102E100EF03E100000CC1E91E088E",

Maps to the instruction forms (only first 3 instructions shown):

X"8E", X"08", -- 0000: load.h  r7 0x08
X"1E", X"E9", -- 0002: subi    r7 r7 4
X"C1", X"0C", -- 0004: bi      0x0018
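Another way to look at the mapping is that the byte at offset i simply lands in bits 8*i+7 down to 8*i of the attribute. The following self-checking sketch packs the first three instructions that way and compares the result against the tail of the INIT_00 string above. The function and type names are invented for this example, it assumes the byte array is indexed from 0 and holds at most 32 bytes, and the result would still need converting to whatever type the primitive's generic expects.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity init_pack_tb is
end init_pack_tb;

architecture sim of init_pack_tb is
  type byte_array is array (natural range <>) of std_logic_vector(7 downto 0);

  -- Pack bytes, given in TPU address order, into one 256-bit INIT_xx value.
  -- The byte at offset i ends up in bits 8*i+7 downto 8*i.
  function pack_init (bytes : byte_array) return std_logic_vector is
    variable v : std_logic_vector(255 downto 0) := (others => '0');
  begin
    for i in bytes'range loop
      v(8*i + 7 downto 8*i) := bytes(i);
    end loop;
    return v;
  end function;

  -- load.h r7 0x08 / subi r7 r7 4 / bi 0x0018, as listed above
  constant PROGRAM : byte_array(0 to 5) :=
    (X"8E", X"08", X"1E", X"E9", X"C1", X"0C");

  constant INIT_00 : std_logic_vector(255 downto 0) := pack_init(PROGRAM);
begin
  -- The low 48 bits should match the right-hand end of the TASM output.
  assert INIT_00(47 downto 0) = X"0CC1E91E088E"
    report "INIT packing does not match the expected byte order"
    severity failure;
end sim;
```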

I will not admit the amount of time spent trying to figure out the issue of byte flipping in the initialization attribute 😉

The least significant bit of the address, specifying the high/low byte of the 16-bit memory location, is managed in the VHDL process. I’ve put that process (and other relevant signal operations) below for clarity. It’s a large block of text even without some of the less important generic attributes/initializations, which I have omitted.

 RAMB16BWER_inst : RAMB16BWER
 generic map (
    -- DATA_WIDTH_A/DATA_WIDTH_B: 0, 1, 2, 4, 9, 18, or 36
    DATA_WIDTH_A => 18,
    DATA_WIDTH_B => 18,
    -- SIM_COLLISION_CHECK: Collision check enable "ALL", "WARNING_ONLY", "GENERATE_X_ONLY" or "NONE" 
    -- SIM_DEVICE: Must be set to "SPARTAN6" for proper simulation behavior
    -- ... remaining generics omitted ...
 )
 port map (
    -- Port A Data: 32-bit (each) output: Port A data
    DOA => DOA,       -- 32-bit output: A port data output
    DOPA => DOPA,     -- 4-bit output: A port parity output
    -- Port B Data: 32-bit (each) output: Port B data
    DOB => DOB,       -- 32-bit output: B port data output
    DOPB => DOPB,     -- 4-bit output: B port parity output
    -- Port A Address/Control Signals: 14-bit (each) input: Port A address and control signals
    ADDRA => ADDRA,   -- 14-bit input: A port address input
    CLKA => CLKA,     -- 1-bit input: A port clock input
    ENA => ENA,       -- 1-bit input: A port enable input
    REGCEA => REGCEA, -- 1-bit input: A port register clock enable input
    RSTA => RSTA,     -- 1-bit input: A port register set/reset input
    WEA => WEA,       -- 4-bit input: Port A byte-wide write enable input
    -- Port A Data: 32-bit (each) input: Port A data
    DIA => DIA,       -- 32-bit input: A port data input
    DIPA => DIPA,     -- 4-bit input: A port parity input
    -- Port B Address/Control Signals: 14-bit (each) input: Port B address and control signals
    ADDRB => ADDRB,   -- 14-bit input: B port address input
    CLKB => CLKB,     -- 1-bit input: B port clock input
    ENB => ENB,       -- 1-bit input: B port enable input
    REGCEB => REGCEB, -- 1-bit input: B port register clock enable input
    RSTB => RSTB,     -- 1-bit input: B port register set/reset input
    WEB => WEB,       -- 4-bit input: Port B byte-wide write enable input
    -- Port B Data: 32-bit (each) input: Port B data
    DIB => DIB,       -- 32-bit input: B port data input
    DIPB => DIPB      -- 4-bit input: B port parity input
 );

 -- End of RAMB16BWER_inst instantiation

--todo: assertion on non-aligned 16b read?

CLKA <= I_clk;
CLKB <= I_clk;

ENA <= I_cs;
ENB <= '0';--port B unused

ADDRA <= I_addr(10 downto 1) & "0000";

process (I_clk, I_cs)
begin
  if rising_edge(I_clk) and I_cs = '1' then
    if (I_we = '1') then
      if I_size = '1' then
        -- 1 byte: enable only the byte lane selected by the address LSB
        if I_addr(0) = '1' then
          WEA <= "0010";
          DIA <= X"0000" & I_data(7 downto 0) & X"00";
        else
          WEA <= "0001";
          DIA <= X"000000" & I_data(7 downto 0);
        end if;
      else
        -- 2 bytes: write both lanes, byte-swapped
        WEA <= "0011";
        DIA <= X"0000" & I_data(7 downto 0)& I_data(15 downto 8);
      end if;
    else
      WEA <= "0000";
      WEB <= "0000";
      if I_size = '1' then
        -- 1 byte read: pick the lane, zero the high byte
        if I_addr(0) = '0' then
          data(15 downto 8) <= X"00";
          data(7 downto 0)  <= DOA(7 downto 0);
        else
          data(15 downto 8) <= X"00";
          data(7 downto 0)  <= DOA(15 downto 8);
        end if;
      else
        -- 2 byte read: swap the bytes back
        data(15 downto 8) <= DOA(7 downto 0);
        data(7 downto 0) <= DOA(15 downto 8);
      end if;
    end if;
  end if;
end process;

O_data <= data when I_cs = '1' else "ZZZZZZZZZZZZZZZZ";

Assembler Output

The last thing to do was to add another output file generator to TASM, my C# TPU assembler. This simply outputs the whole 2KB initialization table for the input assembly. It’s then just copy/pasted into the VHDL in the appropriate attribute location.

Wrapping up

That’s it for this part. I really hope to have the next part with TPU talking to a peripheral device (and some changes to the ISA) in the next week or two. Fingers crossed!

Thanks for reading, comments as always to @domipheus.


Designing a CPU in VHDL, Part 9: Byte addressing, memory subsystem and UART

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing. This part is heavy going if you’ve not read the previous posts.

Byte Addressing

TPU currently operates with memory by addressing 16-bit words. It’s a fairly common set-up for custom processors (addressing non-‘byte’ sizes, that is), but I wanted byte addressing as it simplifies some assembly for operations and really shouldn’t be that difficult to add. There are various things that need to change:

  • The PC needs to increment by 2 each instruction cycle, not 1
  • The read and write memory instructions need to have a size flag
  • The assembler needs to now calculate offsets knowing instructions are 2 bytes and now 2 memory addresses wide
  • Our embedded RAM needs to be able to perform operations only on byte-widths, and also address as bytes.

The PC increment is trivial, simply changing the increment operator of the PC unit to add 2 instead of one. The embedded ram changes are not so bad either. We add a new input port to say whether or not the current operation is on bytes or words. When we are reading just a byte, we want to zero out the high byte of the output port and write just the low byte from memory. For a word operation, we perform two memory/ram reads and write them into the low and high parts of the output. It’s worth noting what we are doing here is technically big-endian; in that, when we do a read from a byte address, the most-significant byte is located at that address, followed by the least significant byte. I also added a chip select, which ‘disconnects’ the output when not enabled.

type store_t is array (0 to 103) of std_logic_vector(7 downto 0);
signal ram: store_t := ...

... snip ...

process (I_clk, I_cs)
begin
  if rising_edge(I_clk) and I_cs = '1' then
    if (I_we = '1') then
      if I_size = '1' then
        -- 1 byte
        ram(int_addr) <= I_data(7 downto 0);
      else
        -- 2 bytes, most significant byte first
        ram(int_addr) <= I_data(15 downto 8);
        ram(int_addr+1) <= I_data(7 downto 0);
      end if;
    else
      if I_size = '1' then
        data(15 downto 8) <= X"00";
        data(7 downto 0)  <= b1;
      else
        data(15 downto 8) <= b1;
        data(7 downto 0)  <= b2;
      end if;
    end if;
  end if;
end process;

int_addr <= to_integer(unsigned(I_addr(7 downto 0)));

b1 <= ram(int_addr);
b2 <= ram(int_addr+1);

O_data <= data when I_cs = '1' else "ZZZZZZZZZZZZZZZZ";

The embedded ram gave me some problems when it came to synthesis – I had an internal compiler error in the Xilinx tools. I narrowed this down to a single line, and then a single token – one of the boundaries of a (X downto Y) statement. I re-wrote this component to get around the issue, and it also made me realise that this method of implementing the ram may be inefficient in terms of using the rams on-device. The version listed above is the new, re-written version. You can see there are multiple reads and multiple writes each cycle. I’ll need to look into how the internal block rams of the Spartan6 are used to make sure I’m not causing issues.

For the read and write instructions, I’d conveniently left a single bit of the instruction form free for later use. This is the ‘flag bit’ in position 8. When this bit is set, it instructs the memory system to do a byte operation instead of a word operation.

[Figure: ISAmemoryread]

The changes to the assembler were pretty trivial: enable the read and write instructions for byte operations using the flag bit, and make the changes required for byte addressing when calculating label offsets. With those changes, everything worked well and I was able to progress. An interesting point to note now is that I could enforce instructions being on 2-byte boundaries, which would give every immediate branch offset an extra bit of range, since the lowest bit would always be zero and need not be encoded. I’ve not done this yet, but I likely will.

Memory Subsystem

I’d always intended to extend the memory subsystem of TPU as to be able to interface with memories that had different latencies. Currently, it’s fixed that memory takes a single cycle to respond. I wanted to expose an interface that would enable connecting up other memory-mapped devices (like a UART) or even enable the interfacing of SDRAM on the miniSpartan6+ board.

For this I’ve created a ‘core’ component which brings everything bar the embedded ram together in a single object. The memory is handled by an internal controller, which works as a state machine and is triggered by a command port. The state machine ports are exposed like the following.

entity mem_controller is
  Port( I_clk : in  STD_LOGIC;
        I_reset : in STD_LOGIC;

        O_ready : out STD_LOGIC;
        I_execute: in STD_LOGIC;
        I_dataWe : in  STD_LOGIC;
        I_address : in  STD_LOGIC_VECTOR (15 downto 0);
        I_data : in  STD_LOGIC_VECTOR (15 downto 0);
        I_dataByteEn : in STD_LOGIC_VECTOR(1 downto 0);
        O_data : out  STD_LOGIC_VECTOR (15 downto 0);
        O_dataReady: out STD_LOGIC;

        MEM_I_ready: in STD_LOGIC;
        MEM_O_cmd: out STD_LOGIC;
        MEM_O_we : out  STD_LOGIC;
        MEM_O_byteEnable : out STD_LOGIC_VECTOR (1 downto 0);
        MEM_O_addr : out  STD_LOGIC_VECTOR (15 downto 0);
        MEM_O_data : out  STD_LOGIC_VECTOR (15 downto 0);
        MEM_I_data : in  STD_LOGIC_VECTOR (15 downto 0);
        MEM_I_dataReady : in STD_LOGIC
  );
end mem_controller;

In this component we have the internal interface, and then the external interface (prefixed with “MEM_”). The order of operations is as follows:

  1. Wait until O_ready is active
  2. Set the various inputs: address, in data or write enable, and the dataByteEn – this enables 16 or 8-bit memory modes.
  3. Signal the I_execute input until O_ready goes inactive.
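From the point of view of whatever is driving the controller, those three steps could be written as a testbench procedure along the following lines. This is a simulation-only sketch; the package and procedure names are made up, and it assumes a byte enable of "11" means a full 16-bit access.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench helper; none of these names exist in TPU itself.
package mem_client_pkg is
  procedure mem_write16 (
    signal   clk     : in  std_logic;
    signal   ready   : in  std_logic;                      -- O_ready
    signal   execute : out std_logic;                      -- I_execute
    signal   we      : out std_logic;                      -- I_dataWe
    signal   addr    : out std_logic_vector(15 downto 0);  -- I_address
    signal   data    : out std_logic_vector(15 downto 0);  -- I_data
    signal   byte_en : out std_logic_vector(1 downto 0);   -- I_dataByteEn
    constant a       : in  std_logic_vector(15 downto 0);
    constant d       : in  std_logic_vector(15 downto 0));
end package mem_client_pkg;

package body mem_client_pkg is
  procedure mem_write16 (
    signal   clk     : in  std_logic;
    signal   ready   : in  std_logic;
    signal   execute : out std_logic;
    signal   we      : out std_logic;
    signal   addr    : out std_logic_vector(15 downto 0);
    signal   data    : out std_logic_vector(15 downto 0);
    signal   byte_en : out std_logic_vector(1 downto 0);
    constant a       : in  std_logic_vector(15 downto 0);
    constant d       : in  std_logic_vector(15 downto 0)) is
  begin
    -- 1. Wait until O_ready is active.
    while ready /= '1' loop
      wait until rising_edge(clk);
    end loop;
    -- 2. Set the various inputs.
    addr    <= a;
    data    <= d;
    we      <= '1';
    byte_en <= "11";  -- assumed encoding for a 16-bit access
    -- 3. Hold I_execute until O_ready has gone inactive.
    execute <= '1';
    wait until rising_edge(clk);
    while ready = '1' loop
      wait until rising_edge(clk);
    end loop;
    execute <= '0';
  end procedure mem_write16;
end package body mem_client_pkg;
```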

This then gets routed to the external ports. I wanted to have something like this within the core object in case the need arose for implementing segmentation or other types of memory addressing extensions. Then the external address buses would be wider, supplemented by a register somewhere to set the high order bits. The actual code for the memory_controller is trivial just now:

architecture Behavioral of mem_controller is
  signal we : std_logic := '0';
  signal addr : STD_LOGIC_VECTOR (15 downto 0) := X"0000";
  signal indata: STD_LOGIC_VECTOR (15 downto 0) := X"0000";
  signal byteEnable: STD_LOGIC_VECTOR ( 1 downto 0) := "11";
  signal cmd : STD_LOGIC := '0';
  signal state: integer := 0;
begin

  process (I_clk, I_execute)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        we <= '0';
        cmd <= '0';
        state <= 0;
      elsif state = 0 and I_execute = '1' and MEM_I_ready = '1' then
        we <= I_dataWe;
        addr <= I_address;
        indata <= I_data;
        byteEnable <= I_dataByteEn;
        cmd <= '1';
        O_dataReady <= '0';
        if I_dataWe = '0' then
          -- read
          state <= 1;
        else
          state <= 2;-- write
        end if;
      elsif state = 1 then
        cmd <= '0';
        if MEM_I_dataReady = '1' then
          O_dataReady <= '1';
          state <= 2;
        end if;
      elsif state = 2 then
        cmd <= '0';
        state <= 0;
        O_dataReady <= '0';
      end if;
    end if;
  end process;
  O_ready <= ( MEM_I_ready and not I_execute ) when state = 0 else '0';
  MEM_O_cmd <= cmd;
  O_data <= MEM_I_data;
  MEM_O_byteEnable <= byteEnable;
  MEM_O_data <= indata;
  MEM_O_addr <= addr;
  MEM_O_we <= we;

end Behavioral;

The main point this serves in terms of TPU is that the control unit uses the output signals from the memory controller at the point when deciding to move to the next stage of the pipeline. This means that stages such as the fetch stage, and the memory stage, wait until the memory subsystem has indicated a command has completed:

For a write, we know the command has executed when the O_ready output goes back to active after deactivating after the signalling of I_execute.
For a read, we know the command has executed when the data we requested presents itself on the O_data line, with O_dataReady signalling the data on the line is valid for the previous request.

The controller does not have any buffering or a queue of operations, so everything waits for the memory system before continuing. At the moment, this is what I want as it makes debugging the system so much easier when things go wrong.

The ‘Core’

Our core TPU object looks like the following:

entity core is
  Port (I_clk : in  STD_LOGIC;
        I_reset : in  STD_LOGIC;
        I_halt : in  STD_LOGIC;

        -- memory interface
        MEM_I_ready : IN  std_logic;
        MEM_O_cmd : OUT  std_logic;
        MEM_O_we : OUT  std_logic;
        MEM_O_byteEnable : OUT  std_logic_vector(1 downto 0);
        MEM_O_addr : OUT  std_logic_vector(15 downto 0);
        MEM_O_data : OUT  std_logic_vector(15 downto 0);
        MEM_I_data : IN  std_logic_vector(15 downto 0);
        MEM_I_dataReady : IN  std_logic
  );
end core;

There are a few signals yet to add here; for one, there are no interrupts yet – something I’d like to add. But as you can see, the core TPU object now just exposes the memory interface, along with the clock and some control. If you apply the clock, a 16-bit request to read address 0 will be asserted, as it attempts to fetch its first instruction to execute.

In making a top-level module which we can flash to the FPGA, we need to have one of the cores, and an instance of our embedded ram. We also need to have some sort of logic which can handle the memory system commands from the core – and either let the embedded RAM service it, or some other component – such as our UART.

core_1: core PORT MAP (
  I_clk => I_clk,
  I_reset => I_reset,
  I_halt => I_halt,
  MEM_I_ready => MEM_I_ready,
  MEM_O_cmd => MEM_O_cmd,
  MEM_O_we => MEM_O_we,
  MEM_O_byteEnable => MEM_O_byteEnable,
  MEM_O_addr => MEM_O_addr,
  MEM_O_data => MEM_O_data,
  MEM_I_data => MEM_I_data,
  MEM_I_dataReady => MEM_I_dataReady
);

ebram_1: ebram Port map ( 
  I_clk => I_clk,
  I_cs => CS_ERAM,
  I_we => MEM_O_we,
  I_addr => MEM_O_addr,
  I_data => MEM_O_data,
  I_size => ram_req_size,
  O_data => ram_output_data
);

uart_1: uart_simple PORT MAP (
  I_clk => I_clk,
  I_clk_baud_count => I_clk_baud_count,-- 0x1100 -- R/W
  I_reset => I_reset,             
  I_txData => I_txData,                -- 0x1102  -- W
  I_txSig => I_txSig,                  -- 0x1103  -- W
  O_txRdy => O_txRdy,                  -- 0x1104  -- R
  O_tx => O_tx,
  I_rx => I_rx,
  I_rxCont => I_rxCont,                -- 0x1105  -- W
  O_rxData => O_rxData,                -- 0x1106  -- R
  O_rxSig => O_rxSig,                  -- 0x1107  -- R
  O_rxFrameError => O_rxFrameError     -- 0x1108  -- R
);

In addition to the above parts of the top-level design, there is the LED byte output which ends up mapping to the LEDs on the miniSpartan6+ board. Those are written to at location 0x1000, and looking at the uart_1 definition above, the comments indicate which other signals are mapped and at what location. Many of these are 1-bit signals, but I use a whole byte for simplicity.

Whilst the inputs to the embedded ram are connected directly from the MEM_ signals, the output (O_data) is mapped to another signal, as any address at or above 0x1000 I have defined for now as not existing within the embedded RAM.

ram_req_size <= '1' when MEM_O_byteEnable = "10" else '0';
CS_ERAM <= '1' when MEM_O_addr < X"1000" else '0';
MEM_I_data <= ram_output_data when CS_ERAM = '1' else IO_DATA;

ram_output_data is selected by the CPU core appropriately, and there is an IO_DATA signal for any other memory-mapped device. Additionally, ram_req_size is used to translate between the byte enable signal (2 bits corresponding to byte locations) to the 8/16-bit embedded ram input port input.

Now, there is a process which manages the MEM_cmd and ready signals, and forwards/directs data where it needs to go depending on those commands. I’ll paste the whole thing, but it’s a bit hacky and is my first attempt – I’ll probably end up reimplementing it when I need SDRAM integration.

MEM_proc: process(I_clk)
  begin
    if rising_edge(I_clk) then
      if MEM_readyState = 0 then
        if MEM_O_cmd = '1' then
          -- LEDS memory map
          if MEM_O_addr = X"1000" then
            -- leds
            IO_LEDS <= MEM_O_data( 7 downto 0);
          end if;
          -- UART1 memory map
          case MEM_O_addr is
            when X"1100" =>
              if MEM_O_we = '0' then
                IO_DATA <= I_clk_baud_count;
              else
                I_clk_baud_count <= MEM_O_data;
              end if;
            when X"1102" =>
              if MEM_O_we = '1' then
                I_txData <= MEM_O_data(7 downto 0);
              end if;
            when X"1103" =>
              if MEM_O_we = '1' then
                I_txSig <= MEM_O_data(0);
              end if;
            when X"1104" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_txRdy;
              end if;
            when X"1105" =>
              if MEM_O_we = '1' then
                I_rxCont <= MEM_O_data(0);
              end if;
            when X"1106" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"00" & O_rxData;
              end if;
            when X"1107" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_rxSig;
              end if;
            when X"1108" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_rxFrameError;
              end if;
            when others =>
          end case;
          MEM_I_ready <= '0';
          MEM_I_dataReady  <= '0';
          if MEM_O_we = '1' then
            MEM_readyState <= 2;
          else
            MEM_readyState <= 1;
          end if;
        end if;
      elsif MEM_readyState >= 1 then
        if MEM_readyState = 1 then
          MEM_I_ready <= '1';
          MEM_I_dataReady  <= '1';
          MEM_readyState <= 0;
        elsif MEM_readyState = 2 then
          MEM_I_ready <= '1';
          MEM_I_dataReady  <= '0';
          MEM_readyState <= 0;
        else
          MEM_readyState <= MEM_readyState + 1;
        end if;
      end if;
    end if;
  end process;

As we disable the embedded ram when the address is >= 0x1000, we don’t need to worry about overwriting memory when looking for a memory mapped device. The LEDS are mapped with a simple if block, and the UART with a case. With this, the TPU core can access memory, write status to LEDS and also use the UART by reading and writing known memory locations.

We can simulate it with a simple test and it works well.

[Figure: top_module_tb]

Using the UART

The UART itself is pretty much exactly the same one as in my previous post on the subject. I’ve fixed a few issues in the original code (some of which were mentioned in the article). All the ports are memory mapped, so to use the component from TPU assembly, we just read and write to those addresses. This is where 1) the byte addressing mode and 2) the immediate offset of memory addresses in the new read/write instructions help a lot. The general method for transmitting a byte over the UART is as follows:

  1. Wait until the ready bit is set
  2. Set the data input value
  3. Set the transmit signal bit
  4. Wait until the device signals not ready (i.e, working on a request)
  5. Unset the transmit signal bit

For receive, it’s similar:

  1. Ensure the RX is enabled
  2. Wait until the receive signal bit becomes active
  3. Read the output data value
  4. Wait until the receive signal bit de-asserts

In addition, the baud rate can be set by adjusting a counter value which is memory mapped. If you want a baud rate of 9600, you’d set the value to 5208 – which is the main system clock (50MHz) divided by the bit-rate required (50000000/9600).
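
As a quick check of that arithmetic, here is a small Python sketch. The function name `baud_divisor` is made up for illustration, and truncating to a whole number of cycles is an assumption based on the 5208 figure above.

```python
SYSTEM_CLOCK_HZ = 50_000_000  # main system clock quoted above


def baud_divisor(baud_rate: int, clock_hz: int = SYSTEM_CLOCK_HZ) -> int:
    """Clock cycles per bit, truncated to a whole number of cycles."""
    return clock_hz // baud_rate


print(baud_divisor(9600))    # 5208
print(baud_divisor(115200))  # 434
```

The truncation means the real bit rate is very slightly higher than requested (50000000/5208 is roughly 9600.6), which is well within what a UART tolerates.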

In TPU assembly, the send and receive functions look simple enough.

uart_send:
  load.l  r1, 1
  load.h  r2, 0x11              #uart1 memory mapped offset r2=0x1100
us_waitready:
  read.b  r3, r2, 4             #readb [r2+4] (tx_ready)
  cmp.u   r3, r3, r3            #compare
  bro.az  r3, %us_waitready     #loop until tx_ready nonzero
  write.b r2, r0, 2             #write txdata register
  write.b r2, r1, 3             #set txsig register 1
us_waitunready:
  read.b  r3, r2, 4
  cmp.u   r3, r3, r3
  bro.anz r3, %us_waitunready   #loop until tx_ready zero
  load.l  r1, 0
  write.b r2, r1, 3             #clear txsig register
  br r6

#receive function: returns the received byte in r0
  load.l  r1, 1
  load.h  r2, 0x11
  write.b r2, r1, 5
ur_waitsig:
  read.b  r3, r2, 7
  cmp.u   r3, r3, r3
  bro.az  r3, %ur_waitsig       #loop until rx_sig nonzero
  read.b  r0, r2, 6             #read data into r0
ur_waitunsig:
  read.b  r3, r2, 7
  cmp.u   r3, r3, r3
  bro.anz r3, %ur_waitunsig     #loop until rx_sig zero
  br r6

In these examples, the ‘functions’ are called with any argument in r0, the return value placed in r0, and the return PC to jump back to in r6. You can send an ‘H’ over the UART using:

  load.l r0, 0x48
  load.l r6, $L_E
  bi $uart_send
L_E:
  ... snip ...

And receive sets r0 on return to the value sent across the connection.

I’m still assembling these programs and initializing the embedded RAM with those instructions at present, but this UART now gives the possibility of having a fixed bootloader which loads a binary blob from the serial port and then executes it. The assembled instructions can be output in VHDL for easy integration. Here is an example which receives a byte (which is discarded), prints ‘HELLO’, and then echoes everything it receives back to the sender.

X"8D", X"04", -- 0000: load.l  r6 0x0004
X"C1", X"4E", -- 0002: bi      0x004e
X"81", X"48", -- 0004: load.l  r0 0x48
X"8D", X"0A", -- 0006: load.l  r6 0x000a
X"C1", X"34", -- 0008: bi      0x0034
X"81", X"45", -- 000A: load.l  r0 0x45
X"8D", X"10", -- 000C: load.l  r6 0x0010
X"C1", X"34", -- 000E: bi      0x0034
X"81", X"4C", -- 0010: load.l  r0 0x4c
X"8D", X"16", -- 0012: load.l  r6 0x0016
X"C1", X"34", -- 0014: bi      0x0034
X"81", X"4C", -- 0016: load.l  r0 0x4c
X"8D", X"1C", -- 0018: load.l  r6 0x001c
X"C1", X"34", -- 001A: bi      0x0034
X"81", X"4F", -- 001C: load.l  r0 0x4f
X"8D", X"22", -- 001E: load.l  r6 0x0022
X"C1", X"34", -- 0020: bi      0x0034
X"81", X"0D", -- 0022: load.l  r0 0x0d
X"8D", X"28", -- 0024: load.l  r6 0x0028
X"C1", X"34", -- 0026: bi      0x0034
X"81", X"00", -- 0028: load.l  r0 0
X"8D", X"2E", -- 002A: load.l  r6 0x002e
X"C1", X"4E", -- 002C: bi      0x004e
X"8D", X"28", -- 002E: load.l  r6 0x0028
X"C1", X"34", -- 0030: bi      0x0034
X"C1", X"64", -- 0032: bi      0x0064
X"83", X"01", -- 0034: load.l  r1 1
X"84", X"11", -- 0036: load.h  r2 0x11 #uart1 memory mapped offset
X"67", X"44", -- 0038: read.b  r3 r2 4 
X"96", X"6C", -- 003A: cmp.u   r3 r3 r3
X"D3", X"7C", -- 003C: bro.az  r3 -4   #loop until tx_ready nonzero
X"70", X"42", -- 003E: write.b r2 r0 2 #write txdata register
X"70", X"47", -- 0040: write.b r2 r1 3  
X"67", X"44", -- 0042: read.b  r3 r2 4  
X"96", X"6C", -- 0044: cmp.u   r3 r3 r3
X"D7", X"7C", -- 0046: bro.anz r3 -4
X"83", X"00", -- 0048: load.l  r1 0
X"70", X"47", -- 004A: write.b r2 r1 3
X"C0", X"C0", -- 004C: br      r6
X"83", X"01", -- 004E: load.l  r1 1
X"84", X"11", -- 0050: load.h  r2 0x11
X"72", X"45", -- 0052: write.b r2 r1 5
X"67", X"47", -- 0054: read.b  r3 r2 7
X"96", X"6C", -- 0056: cmp.u   r3 r3 r3
X"D3", X"7C", -- 0058: bro.az  r3 -4
X"61", X"46", -- 005A: read.b  r0 r2 6  
X"67", X"47", -- 005C: read.b  r3 r2 7
X"96", X"6C", -- 005E: cmp.u   r3 r3 r3
X"D7", X"7C", -- 0060: bro.anz r3 -4
X"C0", X"C0", -- 0062: br      r6
X"C1", X"64", -- 0064: bi      0x0064
X"00", X"00"  -- 0066: dw      0x0000
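
One way to sanity-check a listing like this is to pull the `load.l r0` immediates back out of the bytes. The sketch below is Python rather than anything in TASM, and it assumes the load encoding visible in the listing: a 4-bit opcode of 0x8, a 3-bit destination register, a flag bit that is 1 for `load.l`, then the 8-bit immediate. Only the first part of the program (up to the carriage return) is copied in.

```python
# First instructions of the listing above, as (high byte, low byte) pairs.
PROGRAM = [
    (0x8D, 0x04), (0xC1, 0x4E),
    (0x81, 0x48), (0x8D, 0x0A), (0xC1, 0x34),
    (0x81, 0x45), (0x8D, 0x10), (0xC1, 0x34),
    (0x81, 0x4C), (0x8D, 0x16), (0xC1, 0x34),
    (0x81, 0x4C), (0x8D, 0x1C), (0xC1, 0x34),
    (0x81, 0x4F), (0x8D, 0x22), (0xC1, 0x34),
    (0x81, 0x0D), (0x8D, 0x28), (0xC1, 0x34),
]


def r0_immediates(program):
    """Collect the immediates of every 'load.l r0, imm' in the program."""
    values = []
    for high, low in program:
        opcode = high >> 4            # top four bits
        rd = (high >> 1) & 0x7        # destination register
        is_low_load = high & 0x1      # flag bit: 1 = load.l, 0 = load.h
        if opcode == 0x8 and rd == 0 and is_low_load:
            values.append(low)
    return values


message = bytes(r0_immediates(PROGRAM)).decode("ascii")
print(repr(message))  # 'HELLO\r'
```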

Using the miniSpartan6+ FTDI chip

One thing mentioned in the post about my own UART implementation is how I used a Teensy3.1 / external USB->serial TTL cable to interface with the FPGA.

I was contacted on twitter and made aware that the FTDI USB chip on the miniSpartan6+ is dual channel, and as well as interfacing with JTAG to flash the FPGA there is an additional COM port exposed through USB, and a TX/RX line connected to the FPGA. By connecting the TX and RX lines of my top level design to these pins, I can use the onboard USB to communicate with TPU!

This is done by assigning the external TX and RX ports of TPU to the pins on the Spartan6 that are connected to the FTDI USB chip. In the UCF constraints file:


Will expose those pins as channel B of the FTDI chip. It seems to communicate at 115200 baud, and in my tests it works well. Executing the TPU assembly above, with this USB->TPU setup, we can connect to the com port and communicate:

[Image: putty_hello]

That ‘HELLO’ is hella-simple, but it shows a memory-mapped peripheral interface working with TPU, which is quite the milestone!

The State of TPU

TPU now has a decent set of instructions, and the ability to call functions, set up stacks, perform arithmetic and operate memory-mapped peripherals. The current top-level view of the system is below:

[Image: arch_overview_1]

You can see we have some embedded RAM, our UART, and also a memory-mapped register for setting the LEDs on the miniSpartan6+ as well as reading the on-board switches. The core is our new component that simply exposes our memory interface. It’s a bit more than just a CPU, but we need all this to get the system working!

[Image: floorplan]

At the moment we’re using more of the FPGA than I’d like – mostly due to the UART. It takes a real chunk out, bringing utilization up to 33%. I don’t really mind this, as I know there is a lot of stuff implemented in very bad ways. So I’m not worried – this value will fall.

I’m not quite ready to update the GitHub repository with what’s here (this was very rushed) but the new ISA can be found here.

That about wraps up this post. I’m currently looking into interfacing something with the TPU via an additional UART which will be fun (and fairly ridiculous, really) – so there is that to look forward to!

Thanks for reading, comments as always to @domipheus.


Designing a CPU in VHDL, Part 8: Revisiting the ISA, function calling, assembler

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

We’re at the point now where the CPU can run some more involved examples. The examples we’ve run to date on the simulator have been fairly simple, and more to the point, tailored to what we have available. I wanted to take a look back at the ISA, to see where we can make some worthwhile changes before moving forward.

Our more complex example code

Trivial 16-bit multiply!

It’s incredibly simple, again. But, that’s because we are missing some pretty fundamental functionality from the TPU. Even this tiny example exposes them.

The example I came up with is as follows:

  1. Nominate a register for a stack location and set it.
  2. Set up a simple stack frame to execute a multiply function which takes two 16-bit operands.
  3. Call the ‘mul16’ function.
  4. In mul16():
    1. Grab arguments from the stack
    2. Perform the multiplication
    3. Return our result in r0
  5. Perform some sort of jump away to a safe place of code where we halt using an infinite loop.

This example, in code form, is similar to this:

ushort mul16( ushort a, ushort b)
{
  ushort sum = 0;
  while (b != 0) {
    sum += a;
    b--;
  }
  return sum;
}

int main(void)
{
  ushort ret = mul16(3,7);
  while(1) {
    ret |= ret;
  }
}

For this example, I defined r7 as the stack register. It was set to the top of our embedded ram block, and the stack will grow downwards. We need to store the two mul16 parameters, as well as our return address. As we address 16 bit words instead of the more typical 8-bit bytes, we only subtract 3 from the current stack pointer value. We then need to write in at various offsets our parameters:

sp = return PC
sp+1 = ushort a
sp+2 = ushort b

The first thing to notice is we are writing these values to constant offsets of a register value r7 (our SP). At the moment, our ISA only has a write to an address which is located in a register, so we need to perform writes and additions to a temporary register, or we implement new functionality into TPU.

Reads and Writes to memory with offset

Currently our write instruction takes a destination memory address specified in rA and a value to write specified in rB. The Read memory instruction is similar, but uses rD for the destination register, and rA as the address. This is due to rD being the only internal data select path into the register file.

Looking at the old instruction forms we have various unused bits that are enough to hold a significant offset value for our memory operations. In the case of the write instruction, these bits are non-contiguous, but we can solve that in the decoder. Our new read instruction looks like the following.

[Image: read]

Our write instruction is a little less clear, as its offset bits are split between the unused rD field and the two low flag bits.


This is when having the immediate data output from the decoder 16-bits becomes useful. We extend the decoder to make those top 8 bits dependent on the instruction opcode, so that when a write is decoded, the immediate offset value is recombined ready for use by the ALU.

  O_dataIMM(15 downto 8) <= I_dataInst(IFO_RD_BEGIN downto IFO_RD_END)
            & I_dataInst(IFO_F2_BEGIN downto IFO_F2_END) & "000";
  O_regDwe <= '0';

The changes to the ALU are minimal, and we just do the inefficient thing of adding another adder. Knowing from the previous part that TPU currently takes up a tiny 3% of the Spartan6 LX25 resources, we can concentrate on getting functionality in rather than optimizing for space.

when OPCODE_WRITE =>
  -- The result is the address we want.
  -- First 5 bits of the Imm value is an offset.
  s_result(15 downto 0) <= std_logic_vector(signed(I_dataA) + signed(I_dataIMM(15 downto 11)));
  s_shouldBranch <= '0';
when OPCODE_READ =>
  -- The result is the address we want.
  -- Last 5 bits of the Imm value is an offset.
  s_result(15 downto 0) <= std_logic_vector(signed(I_dataA) + signed(I_dataIMM(4 downto 0)));
  s_shouldBranch <= '0';

You can see the ALU code is very similar. We treat the 5-bit immediate as a signed value, as [-16, 15] is a wide enough range of offsets, and being able to offset back as well as forward will come in very handy.
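
To make the offset packing concrete, here is a rough Python model of how the two forms can be decoded. It is a sketch, not part of the TPU sources: the field positions (opcode in bits 15..12, rD in 11..9, rA in 7..5, rB in 4..2, two flag bits in 1..0) are inferred from the decoder snippet above and from the encodings X"62E2", X"70E6" and X"70E9" in the assembled listing later in this post.

```python
def sign_extend_5(value: int) -> int:
    """Interpret a 5-bit field as two's complement, range [-16, 15]."""
    value &= 0x1F
    return value - 32 if value & 0x10 else value


def decode_read(word: int):
    """Return (rD, rA, offset); the offset is the contiguous low 5 bits."""
    return (word >> 9) & 0x7, (word >> 5) & 0x7, sign_extend_5(word)


def decode_write(word: int):
    """Return (rA, rB, offset); the offset is split across rD and the flag bits."""
    rd_bits = (word >> 9) & 0x7   # upper 3 bits of the offset
    f2_bits = word & 0x3          # lower 2 bits of the offset
    offset = sign_extend_5((rd_bits << 2) | f2_bits)
    return (word >> 5) & 0x7, (word >> 2) & 0x7, offset


print(decode_read(0x62E2))   # (1, 7, 2)  read  r1, r7, 2
print(decode_write(0x70E6))  # (7, 1, 2)  write r7, r1, 2
print(sign_extend_5(0b11100))  # -4
```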

Calling Functions

Getting back to our example, we need to store the program location that we need to return to after executing our mul16 function. Amazingly, we didn’t have an instruction for getting the current PC, so this was impossible. It was very easy to add, though. The current PC is forwarded to the ALU – just use one of the two reserved opcodes we have free to define a set of special state operations.

[Image: spc_sstatus]

The ALU code to serve these instructions is trivial.

when OPCODE_SPEC => 	-- special
  case I_dataIMM(IFO_F2_BEGIN downto IFO_F2_END) is
    when OPCODE_SPEC_F2_GETPC =>
      s_result(15 downto 0) <= I_PC;
       s_result(1 downto 0) <= s_result(17 downto 16);
    when others =>
  end case;
  s_shouldBranch <= '0';

The sstatus, or get status instruction, will be used to get overflow and carry status bits – which currently are not implemented.

Now that we can get the current PC value, we can use this to calculate the return address for our callee function to jump to on return. The assembly looks as follows.

  load.l  r7, 0x27    # Top of the stack
  load.l  r1, 7       # constant argument 2
  load.l  r2, 3       # constant argument 1
  subi    r7, r7, 3   # reserve 3 words of stack
  write   r7, r1, 2   # write argument at offset +2
  write   r7, r2, 1   # write argument at offset +1
  spc     r6          # get current pc
  addi    r6, r6, 4   # offset to after the call
  write   r7, r6      # put return PC on stack
  bi      $mul16      # call
  addi    r7, r7, 3   # pop stack

This creates a call stack for mul16 containing its two parameters, and the location of where it should branch to when it returns.

Immediate arithmetic

You may have noticed two new instructions in the above code snippet – addi and subi. These were added to account for the fact simply incrementing/decrementing registers needed an immediate load, which then used up one of our registers.

The add and sub instructions both have two unused flag bits, so one of them was used to signal immediate mode. In this mode, rD and rA are used as normal, but rB is disregarded, and the bits above the mode flag (bits 4 downto 1 of the instruction) hold a 4-bit unsigned immediate value.

[Image: addi]

I took the decision to use only unsigned versions of this instruction, as I thought if someone was really interested in proper overflow detection, they wouldn’t mind taking the additional register penalty, and use the existing add instruction using a register.

In the VHDL, I again didn’t care about resources, and simply added yet another if conditional with adders.

when OPCODE_ADD =>
  if I_aluop(0) = '0' then
    if I_dataImm(0) = '0' then
      s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned( '0' & I_dataB));
    else
      s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned( '0' & X"000" & I_dataIMM(4 downto 1)));
    end if;
  else
    s_result(16 downto 0) <= std_logic_vector(signed(I_dataA(15) & I_dataA) + signed( I_dataB(15) & I_dataB));
  end if;
  s_shouldBranch <= '0';

The last 8 bits in dataImm always contain the last 8 bits of our instruction word, so we just use that for both the immediate mode check (bit 0) and the immediate value itself (bits 4 downto 1).
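
As a cross-check, the same selection can be modelled in a few lines of Python. This is only a sketch of the decode, with a made-up function name; the field positions are inferred from the VHDL slice `I_dataIMM(4 downto 1)` and from the encodings X"0CC9" (addi r6, r6, 4) and X"0004" (add r0, r0, r1) in the assembled listing later in this post.

```python
def decode_add(word: int):
    """Return (mnemonic, rD, rA, operand) for an unsigned add instruction word."""
    rd = (word >> 9) & 0x7
    ra = (word >> 5) & 0x7
    if word & 0x1:                                   # I_dataIMM(0): immediate mode
        return ("addi", rd, ra, (word >> 1) & 0xF)   # I_dataIMM(4 downto 1)
    return ("add", rd, ra, (word >> 2) & 0x7)        # rB register index


print(decode_add(0x0CC9))  # ('addi', 6, 6, 4)
print(decode_add(0x0004))  # ('add', 0, 0, 1)
```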

The mul16 Function

Let’s recap the C style version of our function:

ushort mul16( ushort a, ushort b)
{
  ushort sum = 0;
  while (b != 0) {
    sum += a;
    b--;
  }
  return sum;
}

And in the TPU assembly written so far, our stack pointed to by r7 resembles the following:

[Image: stack]

The assembly code therefore, for the mul16 function, is as follows.

mul16:
  read    r1, r7, 2
  read    r2, r7, 1
  load.l  r0, 0
mul16_loop:
  cmp.u   r5, r2, r2
  bro.az  r5, %mul16_fin
  add     r0, r0, r1
  subi.u  r2, r2, 1
  bi      $mul16_loop
mul16_fin:
  read    r6, r7, 0
  br      r6

Pretty simple stuff, but again – a new instruction! bro.az = branch to relative offset when A is zero.

Conditional Branch to relative offset

If you remember our previous parts discussing the conditional branching, and even our first part, you’ll remember that they could only branch to a target stored in a register. It was incredibly inefficient for small loops, taking up a register and bloating the code.

Before implementing relative offset branching, there was a need to make the conditional branching instructions more sane. The conditional bits in the instruction which form the type of condition were split and spread out in the instruction form, despite us not using the rD bits. This was changed, so we have a new instruction coding for conditional jumps:

[Image: bcond]

With this now done, adding relative branch targets was fairly simple. The flag bit (8) is used to detect whether we branch to a register value or an immediate offset from the current PC:

[Image: bro]

The VHDL checks for the flag bit, and selects a different branch target.

  if I_aluop(0) = '1' then
    s_result(15 downto 0) <= std_logic_vector(signed(I_PC) + signed(I_dataIMM(4 downto 0)));
  else
    s_result(15 downto 0) <= I_dataB;
  end if;

You can see the 5-bit immediate is signed, allowing conditional jumps backwards in the instruction stream. As any TIS-100 player will know, JRO’s backwards are very useful – especially in a multiplier 😉
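
The target calculation is easy to model outside the simulator. The Python below is a sketch of the arithmetic only (the function name is made up); it takes the low 5 bits of the instruction word as a two's complement offset and adds it to the PC, which is what the signed addition in the VHDL does.

```python
def branch_target(pc: int, word: int) -> int:
    """Target of a relative conditional branch: PC plus a signed 5-bit offset."""
    offset = word & 0x1F
    if offset & 0x10:        # sign bit of the 5-bit field is set
        offset -= 32
    return (pc + offset) & 0xFFFF


# X"D3A4" sits at 0x0010 in the multiplier listing and skips forward by 4.
print(hex(branch_target(0x0010, 0xD3A4)))  # 0x14
# An offset field of 0b11100 is -4, i.e. a jump backwards.
print(hex(branch_target(0x003C, 0xD37C)))  # 0x38
```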

The full multiplier test

I’ve put the full multiplier assembly listing below, which is bulky but I think helps in understanding the flow.

  load.l  r7, 0x27    # Top of the stack
  load.l  r1, 7       # constant argument 1
  load.l  r2, 3       # constant argument 2
  subi    r7, r7, 3   # reserve 3 words of stack
  write   r7, r1, 2   # write argument at offset +2
  write   r7, r2, 1   # write argument at offset +1
  spc     r6          # get current pc
  addi    r6, r6, 4   # offset to after the call
  write   r7, r6      # put return PC on stack
  bi      $mul16      # call
  addi    r7, r7, 3   # pop stack
  bi      $end

# Multiply two u16s. Doesn't check for overflow.
mul16:
  read    r1, r7, 2
  read    r2, r7, 1
  load.l  r0, 0
mul16_loop:
  cmp.u   r5, r2, r2
  bro.az  r5, %mul16_fin
  add     r0, r0, r1
  subi.u  r2, r2, 1
  bi      $mul16_loop
mul16_fin:
  read    r6, r7, 0
  br      r6

halt:
  bi     $halt

end:
  or     r0,r0,r0
  bi     $end
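
Before looking at waveforms, the expected register and stack values can be worked out with a rough Python model of the listing. This is a sketch of the program's behaviour rather than of the CPU: memory is a plain list of words, and it assumes `spc` returns the address of the `spc` instruction itself (0x0006), which is what the +4 "offset to after the call" implies.

```python
def run_call_example():
    mem = [0] * 0x28          # word-addressed RAM; 0x27 is the top of the stack
    r7 = 0x27                 # load.l r7, 0x27
    r7 -= 3                   # subi   r7, r7, 3
    mem[r7 + 2] = 7           # write  r7, r1, 2
    mem[r7 + 1] = 3           # write  r7, r2, 1
    r6 = 0x0006 + 4           # spc r6 (at 0x0006); addi r6, r6, 4
    mem[r7] = r6              # write  r7, r6

    # Body of mul16: repeated addition, counting r2 down to zero.
    r1, r2, r0 = mem[r7 + 2], mem[r7 + 1], 0
    while r2 != 0:
        r0 = (r0 + r1) & 0xFFFF
        r2 -= 1
    return_pc = mem[r7]       # read r6, r7, 0; br r6

    r7 += 3                   # addi r7, r7, 3 (pop stack)
    return r0, return_pc, r7


result, return_pc, stack_pointer = run_call_example()
print(result, hex(return_pc), hex(stack_pointer))  # 21 0xa 0x27
```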

If this test works, we should be able to see r0 containing the result of our multiply (21 or 0x15) and the waveform should show the shouldBranch signal oscillating due to the end jump over an or. If shouldBranch is high at all times, we know we’ve hit halt so something isn’t quite right. I’ve not done typical calling convention things such as saving out volatile registers, but it’s easy to see how that would work. But I’m sure those reading by now will be wondering how I get those assembly listings into my test benches in VHDL.

The TPU Assembler – TASM

I have written a 1-file assembler in C# for the current ISA of TPU. In its thousand lines of uncommented splendour lies an abundance of coding horrors – fit for the Terrible Processing Unit. It works perfectly well for what I want – just don’t look too deep into it.

I wrote this in a few hours early on in the project, because as you can imagine, writing out instruction forms manually is tedious. The assembler is very simple and is fully self contained without any dependencies. It contains definitions for instructions, how to parse instruction forms, and how to write out their binary representation.

The functional flow for the assembler is as follows:

  1. Parse arguments and open input file
  2. for each line in the input file
    1. if it starts with a ‘#’, ignore it as a comment.
    2. split the line into strings by whitespace and commas
    3. If the first element ends with a ‘:’ treat it as a label and note its location
    4. Add the rest as instruction definitions to a list of inputs
  3. For each input definition, replace label names with actual values
  4. parse all definitions into a list of Operation Data objects
  5. Open output file
  6. Output the instruction data using a particular format generator
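
The label handling in steps 2 and 3 can be sketched in a few lines. The Python below is not TASM (which is C#) and leaves out instruction encoding entirely; it assumes one instruction per 16-bit word starting at 0x0000, absolute references with a `$` prefix, and relative references with a `%` prefix measured from the address of the referencing instruction, which matches the offsets in the listings in this post. The function name and the sample program are made up for illustration.

```python
import re


def resolve_labels(source: str):
    """Two passes: collect label addresses, then substitute $ and % references."""
    labels = {}
    instructions = []                       # token lists; list index = address
    for raw_line in source.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        tokens = [t for t in re.split(r"[\s,]+", line) if t]
        if tokens[0].endswith(":"):         # label: remember where it points
            labels[tokens[0][:-1]] = len(instructions)
            tokens = tokens[1:]
        if tokens:
            instructions.append(tokens)

    resolved = []
    for address, tokens in enumerate(instructions):
        out = []
        for tok in tokens:
            if tok.startswith("$"):         # absolute address of the label
                out.append(f"0x{labels[tok[1:]]:04x}")
            elif tok.startswith("%"):       # offset from this instruction
                out.append(str(labels[tok[1:]] - address))
            else:
                out.append(tok)
        resolved.append(out)
    return resolved


example = """
# count r2 down to zero
loop:
  cmp.u   r5, r2, r2
  bro.az  r5, %done
  subi.u  r2, r2, 1
  bi      $loop
done:
  br      r6
"""
for address, tokens in enumerate(resolve_labels(example)):
    print(f"{address:04X}: {' '.join(tokens)}")
```

Collecting all labels before substituting is what lets a branch refer to a label defined further down the file, as `%done` does here.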

Assembler Features

The assembler accepts instruction mnemonics as per the ISA document, but will accept some additional ones – like add, which is simply treated as add.u.

There is a data definition (data/dw) which outputs 16-bit hex values directly to the instruction stream, it accepts outputting labels as absolute ($ prefix) and relative (% prefix), but does not currently support the ability to set the current location in memory of definitions – the first line is location 0x0000, and it continues from there.

Errors are not handled gracefully, and there is no real input checking. You could pass a relative offset into a conditional branch which is outside of the bounds of the instruction, and it will generate incorrect code. I’ll fix this stuff at a later date.

Output from the assembler is either binary, hex, or ‘eram’. The Embedded Ram (eram) format is basically VHDL initialization, with the original listing and offsets as comments. The example above assembles to the following:

X"8F27", -- 0000: load.l  r7 0x27 # Top of the stack
X"8307", -- 0001: load.l  r1 7 # constant argument 1
X"8503", -- 0002: load.l  r2 3 # constant argument 2
X"1EE7", -- 0003: subi    r7 r7 3 # reserve 3 words of stack
X"70E6", -- 0004: write   r7 r1 2 # write argument at offset +2
X"70E9", -- 0005: write   r7 r2 1 # write argument at offset +1
X"EC00", -- 0006: spc     r6 # get current pc
X"0CC9", -- 0007: addi    r6 r6 4 # offset to after the call
X"70F8", -- 0008: write   r7 r6 # put return PC on stack
X"C10C", -- 0009: bi      0x000c # call
X"0EE7", -- 000A: addi    r7 r7 3 # pop stack
X"C117", -- 000B: bi      0x0017
X"62E2", -- 000C: read    r1 r7 2
X"64E1", -- 000D: read    r2 r7 1
X"8100", -- 000E: load.l  r0 0
X"9A48", -- 000F: cmp.u   r5 r2 r2
X"D3A4", -- 0010: bro.az  r5 4
X"0004", -- 0011: add     r0 r0 r1
X"1443", -- 0012: subi.u  r2 r2 1
X"C10F", -- 0013: bi      0x000f
X"6CE0", -- 0014: read    r6 r7 0
X"C0C0", -- 0015: br      r6
X"C116", -- 0016: bi      0x0016
X"2000", -- 0017: or      r0 r0 r0
X"C117", -- 0018: bi      0x0017

And this is simply pasted into our VHDL ram objects. We need to pad it out to the correct size of the ram – but that is something I want to add as a feature, so you pass in the size of the eRAM and it automatically initializes the rest to zero. We can then simulate and see the TPU running well with the ISA additions.
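
That padding feature does not exist in TASM yet, so the following is only a sketch of how it could work, written in Python. The name `pad_eram` and its arguments are hypothetical.

```python
def pad_eram(words, size):
    """Format 16-bit words as VHDL hex literals, zero-padded up to `size` entries."""
    if len(words) > size:
        raise ValueError("program does not fit in the eRAM")
    padded = list(words) + [0] * (size - len(words))
    return ",\n".join(f'X"{word:04X}"' for word in padded)


print(pad_eram([0x8F27, 0x8307, 0x8503], 6))
```

On the VHDL side, ending the aggregate with `others => X"0000"` achieves the same thing without generating the extra lines.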

[Image: mul16_sim]

Wrapping Up

I hope this has shown how easy it was to go in and fix some ISA mistakes made in the past and implement some new functionality. Also, it’s been nice to introduce TASM, despite the assembler itself being about as robust as a matchstick house.

The changes made to the VHDL have increased the resource requirement of the TPU on a Spartan6 LX25 from 3% to 5%, but an increase was expected given so many additional adders.

For next steps, I’m going to concentrate on the top-level VHDL entities for further deployment to miniSpartan6+.

Thanks for reading, comments as always to @domipheus.

Designing a CPU in VHDL, Part 7: Memory Operations, Running on FPGA

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Memory Operations

We already have a small RAM which holds our instruction stream, but our TPU ISA defines memory read and write instructions, and we should get those instructions working.

It’s the last major functional implementation we need to complete.

[Image: pipe7]

The fetch stage is simply a memory read with the PC on our address bus. It gives a cycle of latency to allow for our instruction to appear on the data out bus of the RAM, ready for decoding. When we encounter a memory ALU operation, we need the control unit to activate the memory stage of the pipeline, which sits after Execute and before Writeback. The way we want this implemented is that the ALU calculates the memory address during execute, and that address is read during the memory stage, and the data passed to the register file during writeback. For a memory write, the ALU calculates the address, and the data we want to write is always on the dataB bus output from the register file, so we connect that up to the memory input bus.

The control unit is modified to add in the memory stage, and also take the ALU operation as an input to do that check. You can see the new unit here.

The Memory Subsystem

Because we now touch memory in multiple pipeline stages, we need to start routing our signals and selecting destinations depending on the current control state. There are various signal inputs that now come from multiple sources:

  1. Register File data input needs to be either dataResult from ALU, or dataReadOutput(ramRData) from memory – when a memory read.
  2. The Instruction Decoder needs to be connected to the dataReadOutput(ramRData) from memory. As the decoder only decodes during the correct pipeline stage, we don’t care that the input may be different – as long as the instruction data is correct at the decode stage.
  3. The memory write bit needs to know when we are performing a memory write instruction, and not a read.
  4. Memory writes also need to assign the dataWriteInput(ramWData) port with the data we need – contents of the rB register.
  5. The Address sent to the memory needs to be the current PC during fetch, and dataResult when a memory operation.

We can try this without making another functional unit, by just doing some assignments in our test bench source.

ramAddr <= dataResult when en_memory = '1' else PC;
ramWData <= dataB;
ramWE <= '1' when en_memory = '1' and aluop(4 downto 1) = OPCODE_WRITE else '0';

registerWriteData <= ramRData when en_regwrite = '1' and aluop(4 downto 1) = OPCODE_READ else dataResult;
instruction <= ramRData;


We use our existing test bench, with our additional memory system signals. We have a new test instruction stream which we have loaded into the memory which looks like this:

signal ram: store_t := (
  OPCODE_XOR & "000" & '0' & "000" & "000" & "00",
  OPCODE_LOAD & "001" & '1' & X"0f",
  OPCODE_LOAD & "010" & '1' & X"0e",
  OPCODE_LOAD & "110" & '1' & X"0b",
  OPCODE_READ & "100" & '0' & "010" & "100" & "00",
  OPCODE_READ & "101" & '0' & "001" & "100" & "00",
  OPCODE_SUB & "101" & '0' & "101" & "100" & "00",
  OPCODE_WRITE & "000" & '0' & "001" & "101" & "00",
  OPCODE_CMP & "111" & '0' & "101" & "101" & "00",
  OPCODE_JUMPEQ & "000" & '0' & "111" & "110" & "01",
  OPCODE_JUMP & "000" & '1' & X"05",
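  OPCODE_JUMP & "000" & '1' & X"0b",
  X"0000",            -- 0x0c: unused
  X"0000",            -- 0x0d: unused
  X"0001",            -- 0x0e: data 0x0001
  X"0006",            -- 0x0f: data 0x0006
  others => X"0000"
);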

Which, in TPU assembly resembles:

  xor r0, r0, r0
  load.l r1, 0x0f
  load.l r2, 0x0e
  load.l r6, $fin
  read r4, r2
loop:
  read r5, r1
  sub.u r5, r5, r4
  write r1, r5
  cmp.u r7, r5, r5
  jaz r7, r6
  jump $loop
fin:
  jump $fin

  .loc 0x0e
  data 0x0001
  .loc 0x0f
  data 0x0006

This means we expect to see 0x0000 in the memory location 0x0f after 6 iterations of the loop. From the waveform we can see computation finishes within the simulation time. We can go into the memory view of ISim and we see the result is in the correct place.
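
The expected result can be checked by hand with a rough Python model of the loop. This is only a sketch of the program's behaviour, not of the pipeline: registers are plain variables and memory is a dictionary holding the two data words.

```python
def run_memory_test():
    mem = {0x0E: 0x0001, 0x0F: 0x0006}
    r4 = mem[0x0E]                 # read r4, r2 (the step value, read once)
    iterations = 0
    while True:
        r5 = mem[0x0F]             # read r5, r1
        r5 = (r5 - r4) & 0xFFFF    # sub.u r5, r5, r4
        mem[0x0F] = r5             # write r1, r5
        iterations += 1
        if r5 == 0:                # cmp.u / jaz: leave the loop once r5 is zero
            break
    return mem[0x0F], iterations


print(run_memory_test())  # (0, 6)
```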

[Image: first_sim]

This simulation works with one cycle of memory latency, when using our embedded RAM. If we wanted to go to an external ram such as the DRAM on miniSpartan6+, we’d need to introduce multiple cycles of latency. For this, we should stall the pipeline whilst memory operations complete. We won’t go into that just now, as I think we need to take a step back, and look at the top level view of TPU and try to get what we have on an FPGA.

Top level view

[Image: highlevel]

With everything built to date, we can see a pretty general outline of a CPU, with the various control lines, data lines, selects, etc. With this implemented as a black box ‘core’, we can try to implement our CPU in such a way that we can view a working test on actual miniSpartan6+ hardware.

Creating a top level block for FPGA hardware

[Image: minispartan6]

The miniSpartan6+ board has 4 switches and 8 LEDs. The top-level block I created has the clock input, the 4 switch inputs and the 8 LED outputs. I still used the embedded RAM. The code within this block resembles the test bench, except there is a process for detecting when the RAM address line is 0x1000 and writing the data to the LED output pins. I use one of the switch inputs to drive the reset line, which actually doesn’t reset the CPU – it simply resets the control unit. As our registers do not get reset, execution continues once reset is deactivated with some existing state present.

The top level entity definition looks like the following:

entity leds_switch_test_expand is
  Port ( I_clk : in  STD_LOGIC;
         I_switch : in  STD_LOGIC_VECTOR (3 downto 0);
         O_leds : out  STD_LOGIC_VECTOR (7 downto 0));
end leds_switch_test_expand;

And pretty much everything remains the same as the simulation test bench, except we no longer use the simulated clock, and we hack in our LED memory mapping:

process(I_clk, O_address)
begin
  if rising_edge(I_clk) then
    if (O_address = X"1000") then
      leds <= dataB(7 downto 0);
    end if;
  end if;
end process;

O_leds <= leds(7 downto 1) & I_reset;
I_reset <= I_switch(0);

As you can see, I use the first LED to indicate the state of the reset line, which is useful.

With this new top level entity, we can create a test bench and write a very small code example to write a counter to the LED memory location. The code example below simulates and we see the LED output change. I force initialize the LEDs signal to a known good value as a debugging aid.

  load.l r0, 0x01
  load.l r1, 0x01
  load.h r6, 0x10
loop:
  write r6, r0
  add.u r0, r0, r1
  jump $loop

[Image: leds_test_sim]

Now we need to look at how we get this VHDL design actually onto the hardware.

Using the miniSpartan6+ board from Windows

There is a great guide for getting the board running from Michael Field who runs the wiki. You should give it a visit! The page in question is the miniSpartan6+ bringup.

I use the exact same method to get the .bit programming files onto the FPGA. This method needs to be done every time you power the FPGA – it doesn’t write the flash, which would allow for the FPGA design to remain across power resets. Getting that working is for another day.

As explained in the bringup guide, we need to create a ‘User Constraints File’ which at a simple level maps the input and outputs of our entity to real pins on the board. Looking at the miniSpartan6+ schematic we can see what pins are connected where, for example LED6 is connected to the ‘location’ P7.

[Image: switch_led_schematic_pins]

There is a full UCF available for the miniSpartan6+ here, and we can use a subset of it for our uses.

NET "I_clk" PERIOD = 20 ns | LOC = "K3";



The PULLUP parts of the I_SWITCH definitions are very important. My first try at creating this file (before I found the full UCF file on github) omitted the PULLUP, which was never going to work.

[Image: pullup]

Without the PULLUP, regardless of the switch position, we’ll never get logic ‘1’ at the input. The hatched box happens inside the FPGA, pulling the value to ‘1’ when the switch is not connected to ground. Which is what you want!

[Image: generate_prog_file]

Now we have our UCF file done, we want to build our ‘Programming File’ which gets uploaded to our FPGA. We make our entity the top module by right clicking it within Implementation mode and selecting the option. This unlocks the synthesis options, and we run the ‘Generate Programming File’ option. This can take some time, and will raise warnings, but it completes without error. The steps taken to generate the file are below (taken from Xilinx tutorials):

Synthesis – ‘compiles’ the HDL into netlists and other structures
Translate – merges the incoming netlists and constraints into a Xilinx® design file.
Map – fits the design into the available resources on the target device, and optionally, places the design.
Place and Route – places and routes the design to the timing constraints.
Generate Programming File – creates a bitstream file that can be downloaded to the device.

First Flash

The first time I flashed the FPGA, I was stumped as to why the LEDS were remaining on (apart from the reset LED). Then it became obvious. The clock input is 50MHz. There is no way, with the CPU running that fast, we can see the LEDs change!

Frequency Divider

I solved this by adding a frequency divider into the VHDL. The 50MHz I_clk from the ‘outside world’ is slowed down using a very simple module, which basically counts and uses a bit high up the counter as an output clock. This clock output is then what’s fed into the TPU functional units such as the decoder, as the core_clock in the design. The frequency divider is as follows:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity clock_divider is
port (
	clk: in std_logic;
	reset: in std_logic;
	clock_out: out std_logic);
end clock_divider;

architecture Behavioral of clock_divider is
  signal scaler : std_logic_vector(23 downto 0) := (others => '0');
begin
  -- Free-running counter; the reset input is currently unused.
  process(clk)
  begin
    if rising_edge(clk) then   -- rising clock edge
        scaler <= std_logic_vector( unsigned(scaler) + 1);
    end if;
  end process;

  clock_out <= scaler(16);

end Behavioral;

Using that divider, it works, and we get counting LEDS!
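
For a sense of how slow the core clock ends up, the divided frequency can be worked out directly: bit n of a free-running binary counter completes one full cycle every 2^(n+1) input clocks. A small Python sketch of that arithmetic (the function name is made up):

```python
def divided_clock_hz(input_hz: float, tap_bit: int) -> float:
    """Frequency seen on bit `tap_bit` of a counter clocked at `input_hz`."""
    return input_hz / (1 << (tap_bit + 1))


print(divided_clock_hz(50_000_000, 16))  # 381.4697265625
```

So with the 50MHz input and bit 16 as the output, the TPU core runs at roughly 381Hz, slow enough for the LED counter to be visible.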

Wrapping Up

I’ll put the full example top module on github (soon!) as an example, but there is more work to be done in getting it a bit more robust, making the memory mapping actually really mapped (at the moment, a write still actually happens in the RAM but we don’t care or break on it).

For now, it’s pretty cool to see code actually running on a TPU on the FPGA hardware. Additionally, it only uses 3% of the slice resources of the LX25 Spartan6 FPGA, so lots more space to do other things with!

Thanks for reading, comments as always to @domipheus.