Designing a CPU in VHDL, Part 9: Byte addressing, memory subsystem and UART

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing. This part is heavy going if you’ve not read the previous posts.

Byte Addressing

TPU currently operates with memory by addressing 16-bit words. It’s a fairly common set-up for custom processors (addressing non-‘byte’ sizes, that is), but I wanted byte addressing as it simplifies some assembly for operations and really shouldn’t be that difficult to add. There are various things that need to change:

  • The PC needs to increment by 2 each instruction cycle, not 1
  • The read and write memory instructions need to have a size flag
  • The assembler needs to now calculate offsets knowing instructions are 2 bytes and now 2 memory addresses wide
  • Our embedded RAM needs to be able to perform operations only on byte-widths, and also address as bytes.

The PC increment is trivial, simply changing the increment operator of the PC unit to add 2 instead of one. The embedded ram changes are not so bad either. We add a new input port to say whether or not the current operation is on bytes or words. When we are reading just a byte, we want to zero out the high byte of the output port and write just the low byte from memory. For a word operation, we perform two memory/ram reads and write them into the low and high parts of the output. It’s worth noting what we are doing here is technically big-endian; in that, when we do a read from a byte address, the most-significant byte is located at that address, followed by the least significant byte. I also added a chip select, which ‘disconnects’ the output when not enabled.

type store_t is array (0 to 103) of std_logic_vector(7 downto 0);
signal ram: store_t := ...

... snip ...

process (I_clk, I_cs)
  if rising_edge(I_clk) and I_cs = '1' then
    if (I_we = '1') then
      if I_size = '1' then
        -- 1 byte
        ram(int_addr) <= I_data(7 downto 0);
        ram(int_addr) <= I_data(15 downto 8);
        ram(int_addr+1) <= I_data(7 downto 0);
      end if;
      if I_size = '1' then
        data(15 downto 8) <= X"00";
        data(7 downto 0)  <= b1;
        data(15 downto 8) <= b1;
        data(7 downto 0)  <= b2;
      end if;
    end if;
  end if;
end process;

int_addr <= to_integer(unsigned(I_addr(7 downto 0)));

b1 <= ram(int_addr);
b2 <= ram(int_addr+1);

O_data <= data when I_cs = '1' else "ZZZZZZZZZZZZZZZZ";

The embedded ram gave me some problems when it came to synthesis – I had an internal compiler error in the Xilinx tools. I narrowed this down to a single line, and then a single token – one of the boundaries of a (X downto Y) statement. I re-wrote this component to get around the issue, and it also made me realise that this method of implementing the ram may be inefficient in terms of using the rams on-device. The version listed above is the new, re-written version. You can see there are multiple reads and multiple writes each cycle. I’ll need to look into how the internal block rams of the Spartan6 are used to make sure I’m not causing issues.

For the read and write instructions, I’d conveniently left a single bit of the instruction form free for later use. This is the ‘flag bit’ in position 8. When this bit is set, it instructs the memory system to do a byte operation instead of a word operation.

ISAmemoryreadThe changes to the assembler were pretty trivial; enable the read and write instructions for byte operations using the flag bit, and the changes required for byte addressing when calculating label offsets. With those changes, everything worked well and I was able to progress. An interesting point to note now is that I could enforce instructions being on 2-byte boundaries, which would mean all immediate branch offsets could be increased by an extra bit as they are currently. I’ve not done this, yet, but I likely will.

Memory Subsystem

I’d always intended to extend the memory subsystem of TPU as to be able to interface with memories that had different latencies. Currently, it’s fixed that memory takes a single cycle to respond. I wanted to expose an interface that would enable connecting up other memory-mapped devices (like a UART) or even enable the interfacing of SDRAM on the miniSpartan6+ board.

For this I’ve created a ‘core’ component which brings everything bar the embedded ram together in a single object. The memory is handled by an internal controller, which works as a state machine and is triggered by a command port. The state machine ports are exposed like the following.

entity mem_controller is
  Port( I_clk : in  STD_LOGIC;
        I_reset : in STD_LOGIC;

        O_ready : out STD_LOGIC;
        I_execute: in STD_LOGIC;
        I_dataWe : in  STD_LOGIC;
        I_address : in  STD_LOGIC_VECTOR (15 downto 0);
        I_data : in  STD_LOGIC_VECTOR (15 downto 0);
        I_dataByteEn : in STD_LOGIC_VECTOR(1 downto 0);
        O_data : out  STD_LOGIC_VECTOR (15 downto 0);
        O_dataReady: out STD_LOGIC;

        MEM_I_ready: in STD_LOGIC;
        MEM_O_cmd: out STD_LOGIC;
        MEM_O_we : out  STD_LOGIC;
        MEM_O_byteEnable : out STD_LOGIC_VECTOR (1 downto 0);
        MEM_O_addr : out  STD_LOGIC_VECTOR (15 downto 0);
        MEM_O_data : out  STD_LOGIC_VECTOR (15 downto 0);
        MEM_I_data : in  STD_LOGIC_VECTOR (15 downto 0);
        MEM_I_dataReady : in STD_LOGIC
end mem_controller;

In this component we have the internal interface, and then the external interface (prefixed with “MEM_”). The order of operations is as follows:

  1. Wait until O_ready is active
  2. Set the various inputs: address, in data or write enable, and the dataByteEn – this enables 16 or 8-bit memory modes.
  3. Signal the I_execute input until O_ready goes inactive.

This then gets routed to the external ports. I wanted to have something like this within the core object in case the need arose for implementing segmentation or other types of memory addressing extensions. Then the external address buses would be wider, supplemented by a register somewhere to set the high order bits. The actual code for the memory_controller is trivial just now:

architecture Behavioral of mem_controller is
  signal we : std_logic := '0';
  signal addr : STD_LOGIC_VECTOR (15 downto 0) := X"0000";
  signal indata: STD_LOGIC_VECTOR (15 downto 0) := X"0000";
  signal byteEnable: STD_LOGIC_VECTOR ( 1 downto 0) := "11";
  signal cmd : STD_LOGIC := '0';
  signal state: integer := 0;

  process (I_clk, I_execute)
    if rising_edge(I_clk) then
      if I_reset = '1' then
        we <= '0';
        cmd <= '0';
        state <= 0;
      elsif state = 0 and I_execute = '1' and MEM_I_ready = '1' then
        we <= I_dataWe;
        addr <= I_address;
        indata <= I_data;
        byteEnable <= I_dataByteEn;
        cmd <= '1';
        O_dataReady <= '0';
        if I_dataWe = '0' then
          -- read
          state <= 1;
          state <= 2;-- write
        end if;
      elsif state = 1 then
        cmd <= '0';
        if MEM_I_dataReady = '1' then
          O_dataReady <= '1';
          state <= 2;
        end if;
      elsif state = 2 then
        cmd <= '0';
        state <= 0;
        O_dataReady <= '0';
      end if;
    end if;
  end process;
  O_ready <= ( MEM_I_ready and not I_execute ) when state = 0 else '0';
  MEM_O_cmd <= cmd;
  O_data <= MEM_I_data;
  MEM_O_byteEnable <= byteEnable;
  MEM_O_data <= indata;
  MEM_O_addr <= addr;
  MEM_O_we <= we;

end Behavioral;

The main point this serves in terms of TPU is that the control unit uses the output signals from the memory controller at the point when deciding to move to the next stage of the pipeline. This means that stages such as the fetch stage, and the memory stage, wait until the memory subsystem has indicated a command has completed:

For a write, we know the command has executed when the O_ready output goes back to active after deactivating after the signalling of I_execute.
For a read, we know the command has executed when the data we requested presents itself on the O_data line, with O_dataReady signalling the data on the line is valid for the previous request.

The controller does not have any buffering or a queue of operations, so everything waits for the memory system before continuing. At the moment, this is what I want as it makes debugging the system so much easier when things go wrong.

The ‘Core’

Our core TPU object looks like the following:

entity core is
  Port (I_clk : in  STD_LOGIC;
        I_reset : in  STD_LOGIC;
        I_halt : in  STD_LOGIC;

        -- memory interface
        MEM_I_ready : IN  std_logic;
        MEM_O_cmd : OUT  std_logic;
        MEM_O_we : OUT  std_logic;
        MEM_O_byteEnable : OUT  std_logic_vector(1 downto 0);
        MEM_O_addr : OUT  std_logic_vector(15 downto 0);
        MEM_O_data : OUT  std_logic_vector(15 downto 0);
        MEM_I_data : IN  std_logic_vector(15 downto 0);
        MEM_I_dataReady : IN  std_logic
end core;

There are a few signals yet to add here; for one, there are no interrupts yet – something I’d like to add. But as you can see, the core TPU object now just exposes the memory interface, along with the clock and some control. If you apply the clock, a 16-bit request to read address 0 will be asserted, as it attempts to fetch it’s first instruction to execute.

In making a top-level module which we can flash to the FPGA, we need to have one of the cores, and an instance of our embedded ram. We also need to have some sort of logic which can handle the memory system commands from the core – and either let the embedded RAM service it, or some other component – such as our UART.

core_1: core PORT MAP (
  I_clk => I_clk,
  I_reset => I_reset,
  I_halt => I_halt,
  MEM_I_ready => MEM_I_ready,
  MEM_O_cmd => MEM_O_cmd,
  MEM_O_we => MEM_O_we,
  MEM_O_byteEnable => MEM_O_byteEnable,
  MEM_O_addr => MEM_O_addr,
  MEM_O_data => MEM_O_data,
  MEM_I_data => MEM_I_data,
  MEM_I_dataReady => MEM_I_dataReady

ebram_1: ebram Port map ( 
  I_clk => I_clk,
  I_cs => CS_ERAM,
  I_we => MEM_O_we,
  I_addr => MEM_O_addr,
  I_data => MEM_O_data,
  I_size => ram_req_size,
  O_data => ram_output_data

uart_1: uart_simple PORT MAP (
  I_clk => I_clk,
  I_clk_baud_count => I_clk_baud_count,-- 0x1100 -- R/W
  I_reset => I_reset,             
  I_txData => I_txData,                -- 0x1102  -- W
  I_txSig => I_txSig,                  -- 0x1103  -- W
  O_txRdy => O_txRdy,                  -- 0x1104  -- R
  O_tx => O_tx,
  I_rx => I_rx,
  I_rxCont => I_rxCont,                -- 0x1105  -- W
  O_rxData => O_rxData,                -- 0x1106  -- R
  O_rxSig => O_rxSig,                  -- 0x1107  -- R
  O_rxFrameError => O_rxFrameError     -- 0x1108  -- R

In addition to the above parts of the top-level design, there is the LED byte output which ends up mapping to the LEDS on the miniSpartan6+ board. those are written to at location 0x1000, and looking at the uart_1 definition above, the comments indicate which other signals are mapped and at what location. Many of these are 1-bit signals, but I use a whole byte for simplicity.

Whilst the inputs to the embedded ram is connected directly from the MEM_signals, the output (O_data) is mapped to another signal, as any address above 0x1000 I have defined for now as not existing within the embedded RAM.

ram_req_size <= '1' when MEM_O_byteEnable = "10" else '0';
CS_ERAM <= '1' when MEM_O_addr < X"1000" else '0';
MEM_I_data <= ram_output_data when CS_ERAM = '1' else IO_DATA;

ram_output_data is selected by the CPU core appropriately, and there is an IO_DATA signal for any other memory-mapped device. Additionally, ram_req_size is used to translate between the byte enable signal (2 bits corresponding to byte locations) to the 8/16-bit embedded ram input port input.

Now, there is a process which manages the MEM_cmd and ready signals, and forwards/directs data where it needs to go depending on those commands. I’ll paste the whole thing, but it’s a bit hacky and is my first attempt – I’ll probably end up reimplementing it when I need SDRAM integration.

MEM_proc: process(I_clk)
    if rising_edge(I_clk) then
      if MEM_readyState = 0 then
        if MEM_O_cmd = '1' then
          -- LEDS memory map
          if MEM_O_addr = X"1000" then
            -- leds
            IO_LEDS <= MEM_O_data( 7 downto 0);
          end if;
          -- UART1 memory map
          case MEM_O_addr is
            when X"1100" =>
              if MEM_O_we = '0' then
                IO_DATA <= I_clk_baud_count;
                I_clk_baud_count <= MEM_O_data;
              end if;
            when X"1102" =>
              if MEM_O_we = '1' then
                I_txData <= MEM_O_data(7 downto 0);
              end if;
            when X"1103" =>
              if MEM_O_we = '1' then
                I_txSig <= MEM_O_data(0);
              end if;
            when X"1104" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_txRdy;
              end if;
            when X"1105" =>
              if MEM_O_we = '1' then
                I_rxCont <= MEM_O_data(0);
              end if;
            when X"1106" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"00" & O_rxData;
              end if;
            when X"1107" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_rxSig;
              end if;
            when X"1108" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_rxFrameError;
              end if;
            when others =>
          end case;
          MEM_I_ready <= '0';
          MEM_I_dataReady  <= '0';
          if MEM_O_we = '1' then
            MEM_readyState <= 2;
            MEM_readyState <= 1;
          end if;
        end if;
      elsif MEM_readyState >= 1 then
        if MEM_readyState = 1 then
          MEM_I_ready <= '1';
          MEM_I_dataReady  <= '1';
          MEM_readyState <= 0;
        elsif MEM_readyState = 2 then
          MEM_I_ready <= '1';
          MEM_I_dataReady  <= '0';
          MEM_readyState <= 0;
          MEM_readyState <= MEM_readyState + 1;
        end if;
      end if;
    end if;
  end process;

As we disable the embedded ram when the address is >= 0x1000, we don’t need to worry about overwriting memory when looking for a memory mapped device. The LEDS are mapped with a simple if block, and the UART with a case. With this, the TPU core can access memory, write status to LEDS and also use the UART by reading and writing known memory locations.

We can simulate it with a simple test and it works well.

top_module_tbUsing the UART

The UART itself is pretty much exactly the same one in my previous post on the subject. Ive fixed a few issues in the original code (some of which were mentioned in the article). All the ports are memory mapped, so to use the component from TPU assembly, we just read and write to those addresses. This is where 1) the byte addressing mode and 2) the immediate offset of memory addresses in the new read/write instructions help a lot. The general method for transmitting a byte over the UART is as follows:

  1. Wait until the ready bit is set
  2. Set the data input value
  3. Set the transmit signal bit
  4. Wait until the device signals not ready (i.e, working on a request)
  5. Unset the transmit signal bit

For receive, it’s similar:

  1. Ensure the RX is enabled
  2. Wait until the receive signal bit becomes active
  3. Read the output data value
  4. Wait until the receive signal bit de-asserts

In addition, the baud rate can be set by adjusting a counter value which is memory mapped. If you want a baud rate of 9600, you’d set the value to 5208 – which is the main system clock (50MHz) divided by the bit-rate required (50000000/9600).

In TPU assembly, the send and receive functions look simple enough.

  load.l  r1, 1
  load.h  r2, 0x11              #uart1 memory mapped offset r2=0x1100
  read.b  r3, r2, 4             #readb [r2+4] (tx_ready) 
  cmp.u   r3, r3, r3            #compare  r3, %us_waitready     #loop until tx_ready nonzero
  write.b r2, r0, 2             #write txdata register
  write.b r2, r1, 3             #set txsig register 1
  read.b  r3, r2, 4             
  cmp.u   r3, r3, r3 r3, %us_waitunready   #loop until tx_ready zero
  load.l  r1, 0
  write.b r2, r1, 3
  br r6

  load.l  r1, 1
  load.h  r2, 0x11
  write.b r2, r1, 5
  read.b  r3, r2, 7             
  cmp.u   r3, r3, r3  r3, %ur_waitsig    #loop until rx_sig nonzero
  read.b  r0, r2, 6          #read data into r0
  read.b  r3, r2, 7             
  cmp.u   r3, r3, r3 r3, %ur_waitunsig  #loop until rx_sig nonzero
  br r6

In these examples, the ‘functions’ are called with any argument in r0, the return value placed in r0, and the return PC to jump back to in r6. You can send and ‘H’ over the UART using:

  load.l r0, 0x48
  load.l r6, $L_E
  bi $uart_send
  ... snip ...

And receive sets r0 on return to the value sent across the connection.

I’m still assembling these programs and initializing the embedded RAM with those instructions at present; but this UART now gives the possibility of having a fixed bootloader which loads a binary blob from the serial port and then executes it. The assembled instructions can be output in VHDL for easy integration. Here is how an example which prints ‘HELLO’ after receiving a byte (which is discarded) and then continues to echo everything received back to the sender looks.

X"8D", X"04", -- 0000: load.l  r6 0x0004
X"C1", X"4E", -- 0002: bi      0x004e
X"81", X"48", -- 0004: load.l  r0 0x48
X"8D", X"0A", -- 0006: load.l  r6 0x000a
X"C1", X"34", -- 0008: bi      0x0034
X"81", X"45", -- 000A: load.l  r0 0x45
X"8D", X"10", -- 000C: load.l  r6 0x0010
X"C1", X"34", -- 000E: bi      0x0034
X"81", X"4C", -- 0010: load.l  r0 0x4c
X"8D", X"16", -- 0012: load.l  r6 0x0016
X"C1", X"34", -- 0014: bi      0x0034
X"81", X"4C", -- 0016: load.l  r0 0x4c
X"8D", X"1C", -- 0018: load.l  r6 0x001c
X"C1", X"34", -- 001A: bi      0x0034
X"81", X"4F", -- 001C: load.l  r0 0x4f
X"8D", X"22", -- 001E: load.l  r6 0x0022
X"C1", X"34", -- 0020: bi      0x0034
X"81", X"0D", -- 0022: load.l  r0 0x0d
X"8D", X"28", -- 0024: load.l  r6 0x0028
X"C1", X"34", -- 0026: bi      0x0034
X"81", X"00", -- 0028: load.l  r0 0
X"8D", X"2E", -- 002A: load.l  r6 0x002e
X"C1", X"4E", -- 002C: bi      0x004e
X"8D", X"28", -- 002E: load.l  r6 0x0028
X"C1", X"34", -- 0030: bi      0x0034
X"C1", X"64", -- 0032: bi      0x0064
X"83", X"01", -- 0034: load.l  r1 1
X"84", X"11", -- 0036: load.h  r2 0x11 #uart1 memory mapped offset
X"67", X"44", -- 0038: read.b  r3 r2 4 
X"96", X"6C", -- 003A: cmp.u   r3 r3 r3
X"D3", X"7C", -- 003C:  r3 -4   #loop until tx_ready nonzero
X"70", X"42", -- 003E: write.b r2 r0 2 #write txdata register
X"70", X"47", -- 0040: write.b r2 r1 3  
X"67", X"44", -- 0042: read.b  r3 r2 4  
X"96", X"6C", -- 0044: cmp.u   r3 r3 r3
X"D7", X"7C", -- 0046: r3 -4  
X"83", X"00", -- 0048: load.l  r1 0
X"70", X"47", -- 004A: write.b r2 r1 3
X"C0", X"C0", -- 004C: br      r6
X"83", X"01", -- 004E: load.l  r1 1
X"84", X"11", -- 0050: load.h  r2 0x11
X"72", X"45", -- 0052: write.b r2 r1 5
X"67", X"47", -- 0054: read.b  r3 r2 7
X"96", X"6C", -- 0056: cmp.u   r3 r3 r3
X"D3", X"7C", -- 0058:  r3 -4 
X"61", X"46", -- 005A: read.b  r0 r2 6  
X"67", X"47", -- 005C: read.b  r3 r2 7
X"96", X"6C", -- 005E: cmp.u   r3 r3 r3
X"D7", X"7C", -- 0060: r3 -4 
X"C0", X"C0", -- 0062: br      r6
X"C1", X"64", -- 0064: bi      0x0064
X"00", X"00"  -- 0066: dw      0x0000

Using the miniSpartan6+ FTDI chip

One thing mentioned in the post about my own UART implementation is how I used a Teensy3.1 / external USB->serial TTL cable to interface with the FPGA.

I was contacted on twitter and made aware that the FTDI USB chip on the miniSpartan6+ is dual channel, and as well as interfacing with JTAG to flash the FPGA there is an additional COM port exposed through USB, and a TX/RX line connected to the FPGA. By connecting the TX and RX lines of my top level design to these pins, I can use the onboard USB to communicate with TPU!

This is done by assigning the external TX and RX ports of TPU to the pins on the Spartan6 that are connected to the FTDI USB chip. In the UCF constraints file:


Will expose those pins as channel B of the FTDI chip. It seems to communicate at 115200 baud, and in my tests it works well. Executing the TPU assembly above, with this USB->TPU setup, we can connect to the com port and communicate:

putty_helloThat ‘HELLO’ is hella-simple, but it shows a memory-mapped peripheral interface working with TPU, which is quite the milestone!

The State of TPU

TPU now has a decent set of instructions, and the ability to call functions, set up stacks, perform arithmetic and operate memory-mapped peripherals. The current top-level view of the system is below:

arch_overview_1You can see we have some embedded RAM, our UART, and also a memory-mapped register for setting the LEDS on the miniSpartan6+ as well as reading the on-board switches. The core is our new component that simply exposes our memory interface. It’s a bit more than just a CPU, but we need all this to get the system working!

floorplanAt the moment we’re using more of the FPGA than I’d like – mostly due to the UART. It takes a real chunk out, bringing utilization up to 33%. I don’t really mind this, as I know there is a lot of stuff implemented in very bad ways. So I’m not worried – this value will fall.

I’m not quite ready to update the Github with what’s here (this was very rushed) but the new ISA can be found here.

That about wraps up this post. I’m currently looking into interfacing something with the TPU via an additional UART which will be fun (and fairly ridiculous, really) – so there is that to look forward to!

Thanks for reading, comments as always to @domipheus.


A UART Implementation in VHDL

I’m still working on my Soft-CPU TPU, but wanted to implement a communications channel for it to use in order to get some form of input and output from it. The easiest way to do this is to use a UART, and connect it to a USB to Serial converter for logic-level asynchronous communications.

Knowing that I’m still pretty new to VHDL and working with FPGA systems in general at this level, I decided to develop my own UART implementation. Some may roll their eyes at this, knowing there are plenty out there, and even constructs to utilize real hardware on the Spartan 6 FPGA I’m using; but I’m a fan of learning by doing.

Serial Communications

What I’m implementing is a transmitter and receiver which can operate at any baud rate, with 8 data bits, no parity and 1 stop bit. It should be able to communicate over a COM post to a PC, or to another UART. It’s working at Logic-Level voltages, which is very important – you need to use a logic level USB-Serial cable for this. Using an RS232 serial will damage things if it uses the higher voltages specified.

Looking at how we transmit, the waveform looks as follows:

txAssuming that the ‘baud’ clock is running at the correct frequency we require, you can see that it’s fairly simple how all of this works. The idle state for the TX line is always logic high. This may seem weird, but historically the distances the wires crossed meant they were susceptible to damage, and having the idle state high meant if any problem occurred with the physical wires, you’d know about it very quickly.

To transmit an 8-bit byte, a start bit is emitted which is logic low. One ‘baud tick’ later, the least significant bit of the byte is sent, and then every baud tick follows the next bit until the most significant bit is sent. Finally, a stop bit is sent, which is logic high. At this point another byte can be sent immediately – or the line left idle to transmit later, after a delay.

Transmitter States

The transmitter is very simple. There is a data byte input, and a txSig port which is used to signal that the bits on the data output should be sent. When txSig is asserted, state moves from idle to a start state where a start bit is issued. From there, we progress to the data state, where the 8 bits of data are pushed least-significant-bit to output. Finally there is the stop bit state, before moving back to idle, or straight back to start in the case data is being streamed out.

tx_statesFor the states, I use an integer signal as it seemed the simplest and generally the most obvious way to go about it. The whole transmitter code is below.

tx_proc: process (tx_clk, I_reset, I_txSig, tx_state)
  -- TX runs off the TX baud clock
  if rising_edge(tx_clk) then
    if I_reset = '1' then
      tx_state <= 0;
      tx_data <= X"00";
      tx_rdy <= '1';
      tx <= '1';
      if tx_state = 0 and I_txSig = '1' then
        tx_state <= 1;
        tx_data <= I_txData;
        tx_rdy <= '0';
        tx <= '0'; -- start bit
      elsif tx_state < 9 and tx_rdy = '0' then
        tx <= tx_data(0);
        tx_data <= '0' & tx_data (7 downto 1);
        tx_state <= tx_state + 1;
      elsif tx_state = 9 and tx_rdy = '0' then
        tx <= '1'; -- stop bit
        tx_rdy <= '1';
        tx_state <= 0;
      end if;
    end if;
  end if;
end process;

As you can see from the VHDL, the start state is tx_state=0, the data state is tx_state=1..8 and the stop state is tx_state=9. The process is idle when tx_state is 0 with I_txSig=0. The tx_clk baud clock is generated from the higher-frequency system clock using a counter:

  -- TX standard baud clock, no reset
  if tx_clk_counter = 0 then
    -- chop off LSB to get a clock
    tx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 1)));
    tx_clk <= not tx_clk;
    tx_clk_counter <= tx_clk_counter - 1;
  end if;

The way the I_clk_baud_count value is set is as follows:

I_clk in Hz / expected baud = I_clk_baud_count

So, for 9600bps on a system using a 50MHz clock, one would assign 50000000/9600, or 5208, to I_clk_baud_count. The TX clock is generated by negating the clock signal every 5208/2 counts of the system clock.

Testing the transmitter

Testing this is fairly simple. Auto-generating a test bench, all we need to do is put data on the in port, and then toggle the txSig signal input.

-- send hello\n - 0x48, 0x45, 0x4c, 0x4c, 0x4f, 0x0d
-- H
I_txData <= X"48";
I_txSig <= '1';
wait until O_txRdy= '0';
I_txSig <= '0';

 -- E
wait until O_txRdy= '1';
I_txData <= X"45";
I_txSig <= '1';
wait until O_txRdy= '0';
I_txSig <= '0';


tx_tb_workingThe simulation waveform results in correct TX output, which is great. I wrote up a top-level vhdl module and flashed the MiniSparan6+ board with the same style of test (obviously not using waits – it just endlessly looped over an array containing ‘hello’) and connecting to the FPGA via putty showed the TX did indeed work for this case. Time to implement receive!


Receiving needs to be implemented differently from transmit. That statement is obvious, but it’s all about how the timing is managed and where the serial input is sampled.

If we use our existing example of how we generated the TX output, and use those methods for RX, the below waveform will be the ideal situation.

rx_idealSampling on the rising edge of baud_clk we can see the sampled data is correct; a start bit, 8 did bits, and the stop bit (named ‘e’ just to differentiate). However, we do not control the timing of the RX input. It can be out of phase with the clock, and once it’s out of phase significantly the samples can result in incorrect data, as shown below. Additionally, there is a percentage error allowed in serial communications, and as this error accumulates it can confuse the receiver.

rx_badWe need to use a higher-rate clock, as to lower the accumulated error across a received frame.

Super-sampling (or not)

Generally when working with other UART RX hardware you’ll see mention of a clock at higher than the baud rate. This is due to the internals super-sampling the RX input and then trying to get the ideal sample area, right in the middle of the transmitted bit. For my implementation I cheated, and still only sample once per bit, but I use a 16x baud tick along with a counter for working out where the next bit is likely to be.

The falling edge of the start bit is always sampled at the system clock, in my case, 50MHz. When found, the 16x baud counter is reset, which re-aligns the baud ticks with the start bit. A counter is reset too; as there are 16 baud ticks per bit when receiving, I then sample the start bit when the counter reaches 7, move to the next state, and reset. It will then sample for data bit 0 when the counter reaches 16, and so on, until we have a whole byte of data and an end bit.

sampling_diagramYou can see the 16 baud ticks per transmitted bit in the simulator:

16x_baud_sample_ticksAnd, even see the ticks reset to re-align them with the RX when the start bit is sampled:

rx_start_bit_edge_clk_reset_croppedThe RX Code

The RX clock is based off two things, the main system clock, and then a counter which generates a ‘tick’ every 16x baud clock (the one TX uses). To generate the ticks, we use the code below.

-- RX baud 'ticks' generated for sampling, with reset
if rx_clk_counter = 0 then
  -- x16 sampled - so chop off 4 LSB
  rx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 4)));
  rx_clk_baud_tick <= '1';
  if rx_clk_reset = '1' then
    rx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 4)));
    rx_clk_counter <= rx_clk_counter - 1;
  end if;
  rx_clk_baud_tick <= '0';
end if;

There are a few things that could probably be optimized here, but I kept it simple for readability reasons. Note that the RX counter generating the baud ticks has a reset, unlike the transmit clock.

The actual code for the RX is presented below, with the reset, and initial start bit detection.

rx_proc: process (I_clk, I_reset, I_rx, I_rxCont)
  -- RX runs off the system clock, and operates on baud 'ticks'
  if rising_edge(I_clk) then
    if rx_clk_reset = '1' then
      rx_clk_reset <= '0';
    end if;
    if I_reset = '1' then
      rx_state <= 0;
      rx_sig <= '0';
      rx_sample_count <= 0;
      rx_sample_offset <= OFFSET_START_BIT;
      rx_data <= X"00";
      O_rxData <= X"00";
    elsif I_rx = '0' and rx_state = 0 and I_rxCont = '1' then
      -- first encounter of falling edge start
      rx_state <= 1; -- start bit sample stage
      rx_sample_offset <= OFFSET_START_BIT;
      rx_sample_count <= 0;
      -- need to reset the baud tick clock to line up with the start 
      -- bit leading edge.
      rx_clk_reset <= '1';

Skipping the clock reset, which needs to be in this process (this process writes that signal, the other clock-generating process reads it) we have the initial state for the receiver, rx_state=0. This is the initial detection of the start bit, which is rx=’0′, and sampled every system clock cycle (50MHz). Once we find these, and the rxCont input is active (which is basically RX enable) we move to state 1 and set the sample offset to OFFSET_START_BIT, which I can assure you is 7!

    elsif rx_clk_baud_tick = '1' and I_rx = '0' and rx_state = 1 then
      -- inc sample count
      rx_sample_count <= rx_sample_count + 1;
      if rx_sample_count = rx_sample_offset then
        -- start bit sampled, time to enable data
        -- this should check RX here. if it =1, should revert to state 0
        rx_sig <= '0';
        rx_state <= 2;
        rx_data <= X"00";
        rx_sample_offset <= OFFSET_DATA_BITS; 
        rx_sample_count <= 0;
      end if;
    elsif rx_clk_baud_tick = '1' and rx_state >= 2  and rx_state < 10 then
      -- sampling data
      if rx_sample_count = rx_sample_offset then
        rx_data(6 downto 0) <= rx_data(7 downto 1);
        rx_data(7) <= I_rx;
        rx_sample_count <= 0;
        rx_state <= rx_state + 1;
        rx_sample_count <= rx_sample_count + 1;
      end if;
    elsif rx_clk_baud_tick = '1' and rx_state = 10 then
      if rx_sample_count = OFFSET_STOP_BIT then
        rx_state <= 0;
        rx_sig <= '1';
        O_rxData <= rx_data; -- latch data out
        if I_rx = '1' then 
          -- stop bit correct
          rx_frameError <= '0';
          -- stop bit is always high, if we don't see it, there 
          -- has been an issue. Signal an error.
          rx_frameError <= '1';
        end if;
        rx_sample_count <= rx_sample_count + 1;
      end if;
    end if;
  end if;
end process;

The second half simply moves across all the states, sampling the start bit when our sample count gets to the offset required (for the start bit, 7). It then moves state to sampling the 8 data bits, with offsets of 16 a time, before checking the stop bit and indicating whether a frame error has occurred (the stop bit should be ‘1’).

Note that there is no real error checking, or input validation here. Technically, if rx_state=1 and our RX input is high, we should reset the RX system and go back to state 0 – as it was most likely a blip on the input, and not a real serial frame of data. I’ll probably add that later.

Other modules using this RX will use the O_rxSig output to indicate data received, and grab a new byte from the data output port. It stays high until a new frame begins receiving.

Testing the Receiver

Like the testing for transmit, I created a standard test bench, and filled it out with the usual content. For RX, I created a second clock with the same time period as the baud clock the UART is configured to use. I then have an array of the serial bitstream, and each rising edge of that clock I push the next bit across. You can see the whole test on github. Running it in the simulator, it works:

rx_tb_workingI also created a loopback test, which uses the transmitter test bench in two speeds, feeding the TX line into the RX to ensure the data is correct. I’ve got the waveform from a run of that test below, also zoomed into the area where 115200bps is active (at the start).

loopback_tbRunning On Hardware

Running on hardware is easy, just assign the RX and TX ports/nets to external pins. I created a loopback top-level module so I can type into a Putty serial session and see what I type echo back.

The loopback module for the hardware uses another state machine depending on the various UART module signals. The process is fairly simple, and will just push any input from the RX to the TX and set relevant states at the correct times.

loopback: process(I_clk_50mhz, O_rxSig)
  if rising_edge(I_clk_50mhz) then
    if O_rxSig = '1' and last_msg_valid ='0' and new_message = '1' then
      last_msg <= O_rxData;
      last_msg_valid <= '1';
      new_message <= '0';
    elsif O_txRdy = '1' and I_txSig = '0' and last_msg_valid = '1' then
      I_txData <= last_msg;
      I_txSig <= '1';
    elsif I_txSig = '1' and O_txRdy = '0' and last_msg_valid = '1' then
      I_txSig <= '0';
      last_msg_valid <= '0';
    elsif O_rxSig = '0' then
      new_message <= '1';
    end if; 
  end if;
end process;

messSadly, I could not find my USB to Serial TTL converter in my lab mess. It’s in there somewhere. But I did find an old Teensy 3.1 (it’s actually from my TeensyZ80 build) which I used to forward serial to and from the miniSpartan6+. Keys I typed were echoed back, at 115200bps. So a successful test.

putty_loopbackspartan_teensyWrap Up

That pretty much finishes this post off. It’s by no means a finished implementation but works for what I need. I’ll be using it with TPU as a peripheral, and memory mapping the various ports as to control it from software. I think I’ll add some FIFO buffers to the input and output data lines to ensure I don’t loose data, implement the RX error checking I mentioned earlier, and also add a ‘number of frames’ counter for software-side error checking.

It should be made clear though, that there are probably UART constructs available within any recent FPGA what will take up much less resources than this, and they should be used in final projects where possible and sensible!

Thanks for reading, the code for this is available on github.

Designing a CPU in VHDL, Part 8: Revisiting the ISA, function calling, assembler

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

We’re at the point now where the CPU can run some more involved examples. The examples we’ve run to date on the simulator have been fairly simple, and more to the point, tailored to what we have available. I wanted to take a look back at the ISA, to see where we can make some worthwhile changes before moving forward.

Our more complex example code

Trivial 16-bit multiply!

It’s incredibly simple, again. But, that’s because we are missing some pretty fundamental functionality from the TPU. Even this tiny example exposes them.

The example I came up with is as follows:

  1. nominate a register for a stack location and set it.
  2. Set up a simple stack frame to execute a multiply function which takes two 16bit operands.
  3. Call the ‘mul16’ function
  4. in mul16()
    1. grab arguments from the stack
    2. perform the multiplication
    3. return our result in r0
  5. perform some sort of jump away to a safe place of code where we halt using an infinite loop.

This example, in code form, is similar to this:

ushort mul16( ushort a, ushort b)
  ushort sum = 0;
  while (b != 0)
    sum += a;
  return sum;

  ushort ret = mul16(3,7);
  while(1) {
    ret |= ret;

For this example, I defined r7 as the stack register. It was set to the top of our embedded ram block, and the stack will grow downwards. We need to store the two mul16 parameters, as well as our return address. As we address 16 bit words instead of the more typical 8-bit bytes, we only subtract 3 from the current stack pointer value. We then need to write in at various offsets our parameters:

sp = return PC
sp+1 = ushort a
sp+2 = ushort b

The first thing to notice is we are writing these values to constant offsets of a register value r7 (our SP). At the moment, our ISA only has a write to an address which is located in a register, so we need to perform writes and additions to a temporary register, or, we implement new functionality into TPU

Reads and Writes to memory with offset

Currently our write instruction takes a destination memory address specified in rA and a value to write specified in rB. The Read memory instruction is similar, but uses rD for the destination register, and rA as the address. This is due to rD being the only internal data select path into the register file.

Looking at the old instruction forms we have various unused bits that are enough to hold a significant offset value for our memory operations. In the case of the write instruction, these bits are non-contiguous, but we can solve that in the decoder. Our new read instruction looks like the following.

readWith our write instruction a little less clear coming in at


This is when having the immediate data output from the decoder 16-bits becomes useful. We extend the decoder to make those top 8 bits dependant on the instruction opcode, so that when a write is decoded, the immediate offset value is recombined ready for use by the ALU.

  O_dataIMM(15 downto 8) <= I_dataInst(IFO_RD_BEGIN downto IFO_RD_END)
            & I_dataInst(IFO_F2_BEGIN downto IFO_F2_END) & "000";
  O_regDwe <= '0';

The changes to the ALU are minimal, and we just do the inefficient thing of adding another adder. Knowing from the previous part that TPU currently takes up a tiny 3% of the Spartan6 LX25 resources, we can concentrate on getting functionality in rather than optimizing for space.

  -- The result is the address we want.
  -- First 5 bits of the Imm value is an offset.
  s_result(15 downto 0) <= std_logic_vector(signed(I_dataA) + signed(I_dataIMM(15 downto 11)));
  s_shouldBranch <= '0';
  -- The result is the address we want.
  -- Last 5 bits of the Imm value is an offset.
  s_result(15 downto 0) <= std_logic_vector(signed(I_dataA) + signed(I_dataIMM(4 downto 0)));
  s_shouldBranch <= '0';

You can see the ALU code is very similar. We treat the 5-bit immediate as a signed value, as [-16, 15] is a wide enough range of offsets, and being able to offset back as well as forward will come in very handy.

Calling Functions

Getting back to our example, we need to store the program location that we need to return to after executing our mul16 function. Amazingly, we didn’t have an instruction for getting the current PC, so this was impossible. It was very easy to add, though. The current PC is forwarded to the ALU – just use one of the two reserved opcodes we have free to define a set of special state operations.

spc_sstatusThe ALU code to serve these instructions is trivial.

when OPCODE_SPEC => 	-- special
  case I_dataIMM(IFO_F2_BEGIN downto IFO_F2_END) is
    when OPCODE_SPEC_F2_GETPC =>
      s_result(15 downto 0) <= I_PC;
       s_result(1 downto 0) <= s_result(17 downto 16);
    when others =>
  end case;
  s_shouldBranch <= '0';

The sstatus, or get status instruction, will be used to get overflow and carry status bits – which currently are not implemented.

Now that we can get the current PC value, we can use this to calculate the return address for our callee function to jump to on return. The assembly looks as follows.

  load.l  r7, 0x27    # Top of the stack
  load.l  r1, 7       # constant argument 2
  load.l  r2, 3       # constant argument 1
  subi    r7, r7, 3   # reserve 3 words of stack
  write   r7, r1, 2   # write argument at offset +2
  write   r7, r2, 1   # write argument at offset +1
  spc     r6          # get current pc
  addi    r6, r6, 4   # offset to after the call
  write   r7, r6      # put return PC on stack
  bi      $mul16      # call
  addi    r7, r7, 3   # pop stack

This creates a call stack for mul16 containing it’s two parameters, and the location of where it should branch to when it returns.

Immediate arithmetic

You may have noticed two new instructions in the above code snippet – addi and subi. These were added to account for the fact simply incrementing/decrementing registers needed an immediate load, which then used up one of our registers.

The add and sub instructions both have two unused flag bits, so one of them was used to signal intermediate mode. In this mode, rD and rA are used as normal, but rB is disregarded, and 5-bits are used to represent an unsigned immediate value.

addiI took the decision to use only unsigned versions of this instruction, as I thought if someone was really interested in proper overflow detection, they wouldn’t mind taking the additional register penalty, and use the existing add instruction using a register.

In the VHDL, I again didn’t care about resources, and simply added yet another if conditional with adders.

when OPCODE_ADD =>
  if I_aluop(0) = '0' then
    if I_dataImm(0) = '0' then
      s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned( '0' & I_dataB));
      s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned( '0' & X"000" & I_dataIMM(4 downto 1)));
    end if;
    s_result(16 downto 0) <= std_logic_vector(signed(I_dataA(15) & I_dataA) + signed( I_dataB(15) & I_dataB));
  end if;
  s_shouldBranch <= '0';

The last 8 bits in dataImm always contain the last 8 bits of our instruction word, so we just use that for both the immediate mode check and then for the 5 bits of value itself.

The mul16 Function

Lets recap the C style version of our function:

ushort mul16( ushort a, ushort b)
  ushort sum = 0;
  while (b != 0)
    sum += a;
  return sum;

And in the TPU assembly written so far, our stack pointed to by r7 resembles the following:

stackThe assembly code therefore, for the mul16 function, is as follows.

  read    r1, r7, 2
  read    r2, r7, 1
  load.l  r0, 0
  cmp.u   r5, r2, r2  r5, %mul16_fin
  add     r0, r0, r1
  subi.u  r2, r2, 1
  bi      $mul16_loop
  read    r6, r7, 0
  br      r6

Pretty simple stuff, but again – a new instruction! = branch to relative offset when A is zero.

Conditional Branch to relative offset

If you remember our previous parts discussing the conditional branching, and even our first part, you’ll remember that they could only branch to a target stored in a register. It was incredibly inefficient for small loops, taking up a register and bloating the code.

Before implementing relative offset branching, there was a need to make the conditional branching instructions more sane. The conditional bits in the instruction which form the type of condition were split and spread out in the instruction form, despite us not using the rD bits. This was changed, so we have a new instruction coding for conditional jumps:

bcondWith this now done, adding relative branch targets was fairly simple. The flag bit (8) is used to detect whether we branch to a register value or an immediate offset from the current PC:

broThe VHDL checks for the flag bit, and selects a different branch target.

  if I_aluop(0) = '1' then
     s_result(15 downto 0) <= std_logic_vector(signed(I_PC) + signed(I_dataIMM(4 downto 0)));
    s_result(15 downto 0) <= I_dataB;
  end if;

You can see the 5-bit immediate is signed, allowing conditional jumps backwards in the instruction stream. As any TIS-100 player will know, JRO’s backwards are very useful – especially in a multiplier 😉

The full multiplier test

I’ve put the full multiplier assembly listing below, which is bulky but I think helps in understanding the flow.

  load.l  r7, 0x27    # Top of the stack
  load.l  r1, 7       # constant argument 1
  load.l  r2, 3       # constant argument 2
  subi    r7, r7, 3   # reserve 3 words of stack
  write   r7, r1, 2   # write argument at offset +2
  write   r7, r2, 1   # write argument at offset +1
  spc     r6          # get current pc
  addi    r6, r6, 4   # offset to after the call
  write   r7, r6      # put return PC on stack
  bi      $mul16     # call
  addi    r7, r7, 3   # pop stack
  bi      $end

# Multiply two u16s. Doesn't check for overflow.
  read    r1, r7, 2
  read    r2, r7, 1
  load.l  r0, 0
  cmp.u   r5, r2, r2  r5, %mul16_fin
  add     r0, r0, r1
  subi.u  r2, r2, 1
  bi      $mul16_loop
  read    r6, r7, 0
  br      r6

  bi     $halt

  or     r0,r0,r0
  bi     $end

If this test works, we should be able to see r0 containing the result of our multiply (21 or 0x15) and the waveform should show the shouldBranch signal oscillating due to the end jump over an or. If shouldBranch is high at all times, we know we’ve hit halt so something isn’t quite right. I’ve not done typical calling convention things such as saving out volatile registers, but it’s easy to see how that would work. But i’m sure those reading by now will be wondering how I get those assembly listings into my test benches in VHDL.

The TPU Assembler – TASM

I have written a 1-file assembler in c# for the current ISA of TPU. In it’s thousand lines of uncommented splendour lies an abundance of coding horrors – fit for the Terrible Processing Unit. It works perfectly well for what I want – just don’t look too deep into it.

I wrote this in a few hours early on in the project, because as you can imagine, writing out instructions forms manually is tedious. The assembler is very simple and is fully self contained without any dependencies. It contains definitions for instructions, how to parse instruction forms, and how to write out their binary representation.

The functional flow for the assembler is as follows:

  1. Parse arguments and open input file
  2. for each line in the input file
    1. if it starts with a ‘#’, ignore it as a comment.
    2. split the line into strings by whitespace and commas
    3. If the first element ends with a ‘:’ treat it as a label and note it’s location
    4. Add the rest as instruction definitions to a list of inputs
  3. For each input definition, replace label names with actual values
  4. parse all definitions into a list of Operation Data objects
  5. Open output file
  6. Output the instruction data using a particular format generator

Assembler Features

The assembler accepts instruction mnemonics as per the ISA document, but will accept some additional ones – like add, which is simply treated as add.u.

There is a data definition (data/dw) which outputs 16-bit hex values directly to the instruction stream, it accepts outputting labels as absolute ($ prefix) and relative (% prefix), but does not currently support the ability to set the current location in memory of definitions – the first line is location 0x0000, and it continues from there.

Errors are not handled gracefully, and there is no real input checking. You could pass a relative offset into a conditional branch which is outside of the bounds of the instruction, and it will generate incorrect code. I’ll fix this stuff at a later date.

Output from the assembler is either binary, hex, or ‘eram’. The Embedded Ram (eram) format is basically VHDL initialization, with the original listing and offsets as comments. The example above assembles to the following:

X"8F27", -- 0000: load.l  r7 0x27 # Top of the stack
X"8307", -- 0001: load.l  r1 7 # constant argument 1
X"8503", -- 0002: load.l  r2 3 # constant argument 2
X"1EE7", -- 0003: subi    r7 r7 3 # reserve 3 words of stack
X"70E6", -- 0004: write   r7 r1 2 # write argument at offset +2
X"70E9", -- 0005: write   r7 r2 1 # write argument at offset +1
X"EC00", -- 0006: spc     r6 # get current pc
X"0CC9", -- 0007: addi    r6 r6 4 # offset to after the call
X"70F8", -- 0008: write   r7 r6 # put return PC on stack
X"C10C", -- 0009: bi      0x000c # call
X"0EE7", -- 000A: addi    r7 r7 3 # pop stack
X"C117", -- 000B: bi      0x0017
X"62E2", -- 000C: read    r1 r7 2
X"64E1", -- 000D: read    r2 r7 1
X"8100", -- 000E: load.l  r0 0
X"9A48", -- 000F: cmp.u   r5 r2 r2
X"D3A4", -- 0010:  r5 4
X"0004", -- 0011: add     r0 r0 r1
X"1443", -- 0012: subi.u  r2 r2 1
X"C10F", -- 0013: bi      0x000f
X"6CE0", -- 0014: read    r6 r7 0
X"C0C0", -- 0015: br      r6
X"C116", -- 0016: bi      0x0016
X"2000", -- 0017: or      r0 r0 r0
X"C117", -- 0018: bi      0x0017

And this is simply pasted into our VHDL ram objects. We need to pad it out to the correct size of the ram – but that is something I want to add as a feature, so you pass in the size of the eRAM and it automatically initializes the rest to zero. We can then simulate and see the TPU running well with the ISA additions.

mul16_simWrapping Up

I hope this has shown how easy it was to go in and fix some ISA mistakes made in the past and implement some new functionality. Also, it’s been nice to introduce TASM, despite the assembler itself being about as robust as a matchstick house.

The changes made to the VHDL has increased the resource requirement of the TPU on a Spartan6 LX25 from 3% to 5%, but an increase was expected given so many additional adders.

For next steps, I’m going to concentrate on the top-level VHDL entities for further deployment to miniSpartan6+.

Thanks for reading, comments as always to @domipheus.

Designing a CPU in VHDL, Part 7: Memory Operations, Running on FPGA

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Memory Operations

We already have a small RAM which holds our instruction stream, but our TPU ISA defines memory read and write instructions, and we should get those instructions working.

It’s the last major functional implementation we need to complete.

pipe7The fetch stage is simply a memory read with the PC on our address bus. It gives a cycle of latency to allow for our instruction to appear on the data out bus of the RAM, ready for decoding. When we encounter a memory ALU operation, we need the control unit to activate the memory stage of the pipeline, which sits after Execute and before Writeback. The way we want this implemented is that the ALU calculates the memory address during execute, and that address is read during the memory stage, and the data passed to the register file during writeback. For a memory write, the ALU calculates the address, and the data we want to write is always on the dataB bus output from the register file, so we connect that up to the memory input bus.

The control unit is modified to add in the memory stage, and also take the ALU operation as an input to do that check. You can see the new unit here.

The Memory Subsystem

Because we now touch memory in multiple pipeline stages, we need to start routing our signals and selecting destinations depending on the current control state. There are various signal inputs that now come from multiple sources:

  1. Register File data input needs to be either dataResult from ALU, or dataReadOutput(ramRData) from memory – when a memory read.
  2. The Instruction Decoder needs connected to the dataReadOutput(ramRData) from memory, as the decoder only decodes during the correct pipeline stage, we don’t care that the input may be different – as long as the instruction data is correct at the decode stage.
  3. The memory write bit needs to know when we are performing a memory write instruction, and not a read.
  4. Memory writes also need to assign the dataWriteInput(ramWData) port with the data we need – contents of the rB register.
  5. The Address sent to the memory needs to be the current PC during fetch, and dataResult when a memory operation.

We can try this without making another functional unit, by just doing some assignments in our test bench source.

ramAddr <= dataResult when en_memory = '1' else PC;
ramWData <= dataB;
ramWE <= '1' when en_memory = '1' and aluop(4 downto 1) = OPCODE_WRITE else '0';

registerWriteData <= ramRData when en_regwrite = '1' and aluop(4 downto 1) = OPCODE_READ else dataResult;
instruction <= ramRData;


we use our existing test bench, with our additional memory system signals. We have a new test instruction stream which we have loaded into the memory which looks like this:

signal ram: store_t := (
  OPCODE_XOR & "000" & '0' & "000" & "000" & "00",
  OPCODE_LOAD & "001" & '1' & X"0f",
  OPCODE_LOAD & "010" & '1' & X"0e",
  OPCODE_LOAD & "110" & '1' & X"0b",
  OPCODE_READ & "100" & '0' & "010" & "100" & "00",
  OPCODE_READ & "101" & '0' & "001" & "100" & "00",
  OPCODE_SUB & "101" & '0' & "101" & "100" & "00",
  OPCODE_WRITE & "000" & '0' & "001" & "101" & "00",
  OPCODE_CMP & "111" & '0' & "101" & "101" & "00",
  OPCODE_JUMPEQ & "000" & '0' & "111" & "110" & "01",
  OPCODE_JUMP & "000" & '1' & X"05",
  OPCODE_JUMP & "000" & '1' & X"0b",

Which, in TPU assembly resembles:

  xor r0, r0, r0
  load.l r1, 0x0f
  load.l r2, 0x0e
  load.l r6, $fin
  read r4, r2
  read r5, r1
  sub.u r5, r5, r4
  write r1, r5
  cmp.u r7, r5, r5
  jaz r7, r6
  jump $loop
  jump $fin

  .loc 0x0e
  data 0x0001
  .loc 0x0f
  data 0x0006

This means we expect to see 0x0000 in the memory location 0x0f after 6 iterations of the loop. From the waveform we can see computation finishes within the simulation time. We can go into the memory view of ISim and we see the result is in the correct place.

first_simThis simulation works with one cycle of memory latency, when using our embedded RAM. If we wanted to go to an external ram such as the DRAM on miniSpartan6+, we’d need to introduce multiple cycles of latency. For this, we should stall the pipeline whilst memory operations complete. We won’t go into that just now, as I think we need to take a step back, and look at the top level view of TPU and try to get what we have on an FPGA.

Top level view

highlevelWith everything built to date, we can see a pretty general outline of a CPU, with the various control lines, data lines, selects, etc. With this implemented as a black box ‘core’, we can try to implement our CPU in such a way that we can view a working test on actual miniSpartan6+ hardware.

Creating a top level block for FPGA hardware

minispartan6The miniSpartan6+ board has 4 switches and 8 LEDs. The top-level block I created has the clock input, the 4 switch inputs and the 8 LED outputs. I still used the embedded RAM. The code within this block resembles the test bench, except there is a process for detecting when the RAM address line is 0x1000 and writing the data to the LED output pins. I use one of the switch inputs to drive the reset line, which actually doesn’t reset the CPU – it simply resets the control unit. As our registers do not get reset, execution continues once reset is deactivated with some existing state present.

The top level entity definition looks like the following:

entity leds_switch_test_expand is
  Port ( I_clk : in  STD_LOGIC;
         I_switch : in  STD_LOGIC_VECTOR (3 downto 0);
         O_leds : out  STD_LOGIC_VECTOR (7 downto 0));
end leds_switch_test_expand;

And pretty much everything remains the same as the simulation test bench, except we no longer use the simulated clock, and we hack in our LED memory mapping:

process(I_clk, O_address)
  if rising_edge(I_clk) then
    if (O_address = X"1000") then
      leds <= dataB(7 downto 0);
    end if;
  end if;
end process;

O_leds <= leds(7 downto 1) & I_reset;
I_reset <= I_switch(0);

As you can see, I use the first led to indicate the state of the reset line, which is useful.

With this new top level entity, we can create a test bench and write a very small code example to write a counter to the LED memory location. The code example below simulates and we see the LED output change. I force initialize the LEDs signal to a known good value as a debugging aid.

  load.l r0, 0x01
  load.l r1, 0x01
  load.h r6, 0x10
  write r6, r0
  add.u r0, r0, r1
  jump $loop

leds_test_simNow we need to look at how we get this VHDL design actually onto the hardware.

Using the miniSpartan6+ board from Windows

There is a great guide for getting the board running from Michael Field who runs the wiki. You should give it a visit! The page in question is the miniSpartan6+ bringup.

I use the exact same method to get the .bit programming files onto the FPGA. This method needs done every time you power the FPGA – it doesn’t write the flash, which would allow for the FPGA design to remain across power resets. Getting that working is for another day.

As explained in the bringup guide, we need to create a ‘User Constraints File’ which at a simple level maps the input and outputs of our entity to real pins on the board. Looking at the miniSpartan6+ schematic we can see what pins are connected where, for example LED6 is connected to the ‘location’ P7.

switch_led_schematic_pinsThere is a full UCF available for the miniSpartan6+ here[], and we can use a subset of it for our uses.

NET "I_clk" PERIOD = 20 ns | LOC = "K3";



The PULLUP parts of the I_SWITCH definitions is very important. My first try at creating this file (before I found the full UCF file on github) omitted the PULLUP, which was never going to work.

pullupWithout the PULLUP, regardless of the switch position, we’ll never get logic ‘1’ at the input. The hatched box happens inside the FPGA, pulling the value to ‘1’ when the switch is not connected to ground. Which is what you want!

generate_prog_fileNow we have our UCF file done, we want to build our ‘Programming File’ which gets uploaded to our FPGA. We make our entity the top module by right clicking it within Implementation mode and selection the option. This unlocks the synthesis options, and we run the ‘Generate Programming File’ option. This can take some time, and will raise warnings, but it completes without error. The steps taken to generate the file are below (taken from Xilinx tutorials)

Synthesis – ‘compiles’ the HDL into netlists and other structures
Translate – merges the incoming netlists and constraints into a Xilinx® design file.
Map – fits the design into the available resources on the target device, and optionally, places the design.
Place and Route – places and routes the design to the timing constraints.
Generate Programming File – creates a bitstream file that can be downloaded to the device.

First Flash

The first time I flashed the FPGA, I was stumped as to why the LEDS were remaining on (apart from the reset LED). Then it became obvious. The clock input is 50MHz. There is no way, with the CPU running that fast, we can see the LEDs change!

Frequency Divider

I solved this by adding a frequency divider into the VHDL. The 50MHz I_clk from the ‘outside world’ is slowed down using a very simple module, which basically counts and uses a bit high up the counter as an output clock. This clock output is then what’s fed into the TPU functional units such as the decoder, as the core_clock in the design. The frequency divider is as follows:

entity clock_divider is
port (
	clk: in std_logic;
	reset: in std_logic;
	clock_out: out std_logic);
end clock_divider;

architecture Behavioral of clock_divider is
  signal scaler : std_logic_vector(23 downto 0) := (others => '0');

    if rising_edge(clk) then   -- rising clock edge
        scaler <= std_logic_vector( unsigned(scaler) + 1);
    end if;
  end process;

clock_out <= scaler(16);

end Behavioral;

Using that divider, it works, and we get counting LEDS!

Wrapping Up

I’ll put the full example top module on github (soon!) as an example, but there is more work to be done in getting it a bit more robust, making the memory mapping actually really mapped (at the moment, a write still actually happens in the RAM but we don’t care or break on it).

For now, it’s pretty cool to see code actually running on a TPU on the FPGA hardware. Additionally, it only uses 3% of the slice resources of the LX25 Spartan6 FPGA, so lots more space to do other things with!

Thanks for reading, comments as always to @domipheus.