Designing a CPU in VHDL, Part 9: Byte addressing, memory subsystem and UART

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing. This part is heavy going if you’ve not read the previous posts.

Byte Addressing

TPU currently operates with memory by addressing 16-bit words. It’s a fairly common set-up for custom processors (addressing non-‘byte’ sizes, that is), but I wanted byte addressing as it simplifies some assembly for operations and really shouldn’t be that difficult to add. There are various things that need to change:

  • The PC needs to increment by 2 each instruction cycle, not 1
  • The read and write memory instructions need to have a size flag
  • The assembler needs to calculate offsets knowing that instructions are 2 bytes wide and therefore span 2 memory addresses
  • Our embedded RAM needs to be able to perform operations only on byte-widths, and also address as bytes.

The PC increment is trivial: simply change the increment operator of the PC unit to add 2 instead of 1. The embedded ram changes are not so bad either. We add a new input port to say whether the current operation is on bytes or words. When we are reading just a byte, we want to zero out the high byte of the output port and place the byte from memory in the low byte. For a word operation, we perform two memory/ram reads and write them into the high and low parts of the output. It’s worth noting that what we are doing here is technically big-endian: when we read a word from a byte address, the most-significant byte is located at that address, followed by the least-significant byte. I also added a chip select, which ‘disconnects’ the output when not enabled.

type store_t is array (0 to 103) of std_logic_vector(7 downto 0);
signal ram: store_t := ...

... snip ...

process (I_clk, I_cs)
begin
  if rising_edge(I_clk) and I_cs = '1' then
    if (I_we = '1') then
      if I_size = '1' then
        -- 1 byte
        ram(int_addr) <= I_data(7 downto 0);
      else
        -- 2 bytes, most-significant byte first
        ram(int_addr) <= I_data(15 downto 8);
        ram(int_addr+1) <= I_data(7 downto 0);
      end if;
    else
      if I_size = '1' then
        -- 1 byte, zero-extended
        data(15 downto 8) <= X"00";
        data(7 downto 0)  <= b1;
      else
        data(15 downto 8) <= b1;
        data(7 downto 0)  <= b2;
      end if;
    end if;
  end if;
end process;

int_addr <= to_integer(unsigned(I_addr(7 downto 0)));

b1 <= ram(int_addr);
b2 <= ram(int_addr+1);

O_data <= data when I_cs = '1' else "ZZZZZZZZZZZZZZZZ";

The embedded ram gave me some problems when it came to synthesis – I had an internal compiler error in the Xilinx tools. I narrowed this down to a single line, and then a single token – one of the boundaries of a (X downto Y) statement. I re-wrote this component to get around the issue, and it also made me realise that this method of implementing the ram may be inefficient in terms of using the rams on-device. The version listed above is the new, re-written version. You can see there are multiple reads and multiple writes each cycle. I’ll need to look into how the internal block rams of the Spartan6 are used to make sure I’m not causing issues.
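To make the byte/word behaviour concrete, here is a small software model of the embedded ram written in Python. It is only a sketch and not part of TPU: the ByteRam class and its method names are made up for illustration, and it ignores the clock and chip select entirely. It mirrors the big-endian layout described above, with the most-significant byte of a word stored at the lower address.

```python
# Hypothetical software model of the byte-addressed embedded ram.
# ByteRam, read and write are illustrative names, not TPU code.
class ByteRam:
    def __init__(self, size=104):
        self.mem = bytearray(size)

    def write(self, addr, value, byte_op):
        """Store the low byte (byte_op=True) or a big-endian 16-bit word."""
        if byte_op:
            self.mem[addr] = value & 0xFF
        else:
            self.mem[addr] = (value >> 8) & 0xFF   # most-significant byte first
            self.mem[addr + 1] = value & 0xFF

    def read(self, addr, byte_op):
        """Return a zero-extended byte or a big-endian 16-bit word."""
        if byte_op:
            return self.mem[addr]
        return (self.mem[addr] << 8) | self.mem[addr + 1]


ram = ByteRam()
ram.write(0x10, 0x1234, byte_op=False)
print(hex(ram.read(0x10, byte_op=True)))   # high byte of the word
print(hex(ram.read(0x11, byte_op=True)))   # low byte of the word
```

A byte write to 0x11 followed by a word read from 0x10 would return the old high byte combined with the new low byte, which is the behaviour the assembler and any hand-written data tables need to account for.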

For the read and write instructions, I’d conveniently left a single bit of the instruction form free for later use. This is the ‘flag bit’ in position 8. When this bit is set, it instructs the memory system to do a byte operation instead of a word operation.

The changes to the assembler were pretty trivial: enable the read and write instructions for byte operations using the flag bit, and make the changes required for byte addressing when calculating label offsets. With those changes, everything worked well and I was able to progress. An interesting point to note is that I could now enforce instructions being on 2-byte boundaries, which would mean immediate branch offsets could count instructions rather than bytes, giving them an extra bit of range. I’ve not done this yet, but I likely will.

Memory Subsystem

I’d always intended to extend the memory subsystem of TPU to be able to interface with memories that have different latencies. Currently, memory is fixed to take a single cycle to respond. I wanted to expose an interface that would enable connecting up other memory-mapped devices (like a UART) or even enable the interfacing of SDRAM on the miniSpartan6+ board.

For this I’ve created a ‘core’ component which brings everything bar the embedded ram together in a single object. The memory is handled by an internal controller, which works as a state machine and is triggered by a command port. The state machine ports are exposed like the following.

entity mem_controller is
  Port( I_clk : in  STD_LOGIC;
        I_reset : in STD_LOGIC;

        O_ready : out STD_LOGIC;
        I_execute: in STD_LOGIC;
        I_dataWe : in  STD_LOGIC;
        I_address : in  STD_LOGIC_VECTOR (15 downto 0);
        I_data : in  STD_LOGIC_VECTOR (15 downto 0);
        I_dataByteEn : in STD_LOGIC_VECTOR(1 downto 0);
        O_data : out  STD_LOGIC_VECTOR (15 downto 0);
        O_dataReady: out STD_LOGIC;

        MEM_I_ready: in STD_LOGIC;
        MEM_O_cmd: out STD_LOGIC;
        MEM_O_we : out  STD_LOGIC;
        MEM_O_byteEnable : out STD_LOGIC_VECTOR (1 downto 0);
        MEM_O_addr : out  STD_LOGIC_VECTOR (15 downto 0);
        MEM_O_data : out  STD_LOGIC_VECTOR (15 downto 0);
        MEM_I_data : in  STD_LOGIC_VECTOR (15 downto 0);
        MEM_I_dataReady : in STD_LOGIC
  );
end mem_controller;

In this component we have the internal interface, and then the external interface (prefixed with “MEM_”). The order of operations is as follows:

  1. Wait until O_ready is active
  2. Set the various inputs: address, in data or write enable, and the dataByteEn – this enables 16 or 8-bit memory modes.
  3. Signal the I_execute input until O_ready goes inactive.

This then gets routed to the external ports. I wanted to have something like this within the core object in case the need arose for implementing segmentation or other types of memory addressing extensions. The external address buses would then be wider, supplemented by a register somewhere to set the high-order bits. The actual code for mem_controller is trivial just now:

architecture Behavioral of mem_controller is
  signal we : std_logic := '0';
  signal addr : STD_LOGIC_VECTOR (15 downto 0) := X"0000";
  signal indata: STD_LOGIC_VECTOR (15 downto 0) := X"0000";
  signal byteEnable: STD_LOGIC_VECTOR ( 1 downto 0) := "11";
  signal cmd : STD_LOGIC := '0';
  signal state: integer := 0;

begin
  process (I_clk, I_execute)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        we <= '0';
        cmd <= '0';
        state <= 0;
      elsif state = 0 and I_execute = '1' and MEM_I_ready = '1' then
        we <= I_dataWe;
        addr <= I_address;
        indata <= I_data;
        byteEnable <= I_dataByteEn;
        cmd <= '1';
        O_dataReady <= '0';
        if I_dataWe = '0' then
          -- read
          state <= 1;
        else
          -- write
          state <= 2;
        end if;
      elsif state = 1 then
        cmd <= '0';
        if MEM_I_dataReady = '1' then
          O_dataReady <= '1';
          state <= 2;
        end if;
      elsif state = 2 then
        cmd <= '0';
        state <= 0;
        O_dataReady <= '0';
      end if;
    end if;
  end process;
  O_ready <= ( MEM_I_ready and not I_execute ) when state = 0 else '0';
  MEM_O_cmd <= cmd;
  O_data <= MEM_I_data;
  MEM_O_byteEnable <= byteEnable;
  MEM_O_data <= indata;
  MEM_O_addr <= addr;
  MEM_O_we <= we;

end Behavioral;

In terms of TPU, the main point is that the control unit uses the memory controller’s output signals when deciding whether to move to the next stage of the pipeline. Stages such as fetch and memory therefore wait until the memory subsystem has indicated that a command has completed:

  • For a write, we know the command has executed when the O_ready output returns to active, having gone inactive after I_execute was signalled.
  • For a read, we know the command has executed when the data we requested presents itself on the O_data line, with O_dataReady signalling that the data on the line is valid for the previous request.

The controller does not have any buffering or a queue of operations, so everything waits for the memory system before continuing. At the moment, this is what I want as it makes debugging the system so much easier when things go wrong.
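The handshake is easier to follow as a cycle-by-cycle model. The Python below is a rough sketch of the three states in mem_controller (0 idle, 1 waiting for read data, 2 done); the MemController class and the read_latency helper are names invented for this illustration, and the external memory is reduced to the two values passed in on each clock.

```python
# Cycle-level sketch of the mem_controller state machine.
# MemController and read_latency are hypothetical names for illustration.
class MemController:
    def __init__(self):
        self.state = 0        # 0 idle, 1 waiting for read data, 2 done
        self.cmd = 0
        self.we = 0
        self.addr = 0
        self.data_ready = 0

    def ready(self, mem_ready, execute):
        # Combinational O_ready: idle, memory ready, and no execute pending.
        return int(self.state == 0 and mem_ready and not execute)

    def clock(self, execute, we, addr, mem_ready, mem_data_ready):
        """Advance the model by one rising clock edge."""
        if self.state == 0 and execute and mem_ready:
            self.we, self.addr = we, addr
            self.cmd = 1
            self.data_ready = 0
            self.state = 2 if we else 1
        elif self.state == 1:
            self.cmd = 0
            if mem_data_ready:
                self.data_ready = 1
                self.state = 2
        elif self.state == 2:
            self.cmd = 0
            self.state = 0
            self.data_ready = 0


def read_latency(wait_edges):
    """Clock edges from issuing a read until O_dataReady is seen high,
    for a memory that asserts its data-ready after wait_edges extra edges."""
    mc = MemController()
    mc.clock(execute=1, we=0, addr=0x0000, mem_ready=1, mem_data_ready=0)
    edges = 1
    while not mc.data_ready:
        edges += 1
        mc.clock(execute=0, we=0, addr=0, mem_ready=0,
                 mem_data_ready=int(edges > wait_edges))
    return edges


print(read_latency(1), read_latency(3))
```

Because nothing is queued, a slower memory simply stretches the time spent in state 1, and the control unit stalls for exactly that long.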

The ‘Core’

Our core TPU object looks like the following:

entity core is
  Port (I_clk : in  STD_LOGIC;
        I_reset : in  STD_LOGIC;
        I_halt : in  STD_LOGIC;

        -- memory interface
        MEM_I_ready : IN  std_logic;
        MEM_O_cmd : OUT  std_logic;
        MEM_O_we : OUT  std_logic;
        MEM_O_byteEnable : OUT  std_logic_vector(1 downto 0);
        MEM_O_addr : OUT  std_logic_vector(15 downto 0);
        MEM_O_data : OUT  std_logic_vector(15 downto 0);
        MEM_I_data : IN  std_logic_vector(15 downto 0);
        MEM_I_dataReady : IN  std_logic
  );
end core;

There are a few signals yet to add here; for one, there are no interrupts yet – something I’d like to add. But as you can see, the core TPU object now just exposes the memory interface, along with the clock and some control. If you apply the clock, a 16-bit request to read address 0 will be asserted, as it attempts to fetch its first instruction to execute.

In making a top-level module which we can flash to the FPGA, we need to have one of the cores, and an instance of our embedded ram. We also need to have some sort of logic which can handle the memory system commands from the core – and either let the embedded RAM service it, or some other component – such as our UART.

core_1: core PORT MAP (
  I_clk => I_clk,
  I_reset => I_reset,
  I_halt => I_halt,
  MEM_I_ready => MEM_I_ready,
  MEM_O_cmd => MEM_O_cmd,
  MEM_O_we => MEM_O_we,
  MEM_O_byteEnable => MEM_O_byteEnable,
  MEM_O_addr => MEM_O_addr,
  MEM_O_data => MEM_O_data,
  MEM_I_data => MEM_I_data,
  MEM_I_dataReady => MEM_I_dataReady
);

ebram_1: ebram Port map ( 
  I_clk => I_clk,
  I_cs => CS_ERAM,
  I_we => MEM_O_we,
  I_addr => MEM_O_addr,
  I_data => MEM_O_data,
  I_size => ram_req_size,
  O_data => ram_output_data
);

uart_1: uart_simple PORT MAP (
  I_clk => I_clk,
  I_clk_baud_count => I_clk_baud_count,-- 0x1100 -- R/W
  I_reset => I_reset,             
  I_txData => I_txData,                -- 0x1102  -- W
  I_txSig => I_txSig,                  -- 0x1103  -- W
  O_txRdy => O_txRdy,                  -- 0x1104  -- R
  O_tx => O_tx,
  I_rx => I_rx,
  I_rxCont => I_rxCont,                -- 0x1105  -- W
  O_rxData => O_rxData,                -- 0x1106  -- R
  O_rxSig => O_rxSig,                  -- 0x1107  -- R
  O_rxFrameError => O_rxFrameError     -- 0x1108  -- R
);

In addition to the above parts of the top-level design, there is the LED byte output which ends up mapping to the LEDs on the miniSpartan6+ board. Those are written to at location 0x1000, and looking at the uart_1 definition above, the comments indicate which other signals are mapped and at what location. Many of these are 1-bit signals, but I use a whole byte for simplicity.

Whilst the inputs to the embedded ram are connected directly to the MEM_ signals, the output (O_data) is mapped to another signal, as I have defined any address at or above 0x1000 as not existing within the embedded RAM for now.

ram_req_size <= '1' when MEM_O_byteEnable = "10" else '0';
CS_ERAM <= '1' when MEM_O_addr < X"1000" else '0';
MEM_I_data <= ram_output_data when CS_ERAM = '1' else IO_DATA;

ram_output_data is passed to the CPU core when the embedded RAM is selected, and there is an IO_DATA signal for any other memory-mapped device. Additionally, ram_req_size translates the byte enable signal (2 bits corresponding to byte locations) into the 8/16-bit size input of the embedded ram.
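The three assignments amount to a tiny address decoder. As a sanity check, here is the same logic as a Python function; decode and RAM_LIMIT are names I have made up for this sketch, and the function only covers the combinational part, not the clocked process that follows.

```python
# Sketch of the top-level address decode; names are hypothetical.
RAM_LIMIT = 0x1000   # addresses below this are served by the embedded RAM


def decode(addr, byte_enable, ram_data, io_data):
    """Return (cs_eram, ram_req_size, data_to_core) for one access."""
    cs_eram = addr < RAM_LIMIT
    # A byte enable of "10" selects a byte-sized operation on the ram.
    ram_req_size = 1 if byte_enable == 0b10 else 0
    data_to_core = ram_data if cs_eram else io_data
    return cs_eram, ram_req_size, data_to_core


print(decode(0x0FFE, 0b11, 0xAAAA, 0x5555))
print(decode(0x1104, 0b10, 0xAAAA, 0x5555))
```

Anything at 0x1000 or above never reaches the ram, which is what lets the peripheral registers share the same address space safely.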

Now, there is a process which manages the MEM_cmd and ready signals, and forwards/directs data where it needs to go depending on those commands. I’ll paste the whole thing, but it’s a bit hacky and is my first attempt – I’ll probably end up reimplementing it when I need SDRAM integration.

MEM_proc: process(I_clk)
  begin
    if rising_edge(I_clk) then
      if MEM_readyState = 0 then
        if MEM_O_cmd = '1' then
          -- LEDS memory map
          if MEM_O_addr = X"1000" then
            -- leds
            IO_LEDS <= MEM_O_data( 7 downto 0);
          end if;
          -- UART1 memory map
          case MEM_O_addr is
            when X"1100" =>
              if MEM_O_we = '0' then
                IO_DATA <= I_clk_baud_count;
              else
                I_clk_baud_count <= MEM_O_data;
              end if;
            when X"1102" =>
              if MEM_O_we = '1' then
                I_txData <= MEM_O_data(7 downto 0);
              end if;
            when X"1103" =>
              if MEM_O_we = '1' then
                I_txSig <= MEM_O_data(0);
              end if;
            when X"1104" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_txRdy;
              end if;
            when X"1105" =>
              if MEM_O_we = '1' then
                I_rxCont <= MEM_O_data(0);
              end if;
            when X"1106" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"00" & O_rxData;
              end if;
            when X"1107" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_rxSig;
              end if;
            when X"1108" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_rxFrameError;
              end if;
            when others =>
          end case;
          MEM_I_ready <= '0';
          MEM_I_dataReady  <= '0';
          if MEM_O_we = '1' then
            MEM_readyState <= 2;
          else
            MEM_readyState <= 1;
          end if;
        end if;
      elsif MEM_readyState >= 1 then
        if MEM_readyState = 1 then
          MEM_I_ready <= '1';
          MEM_I_dataReady  <= '1';
          MEM_readyState <= 0;
        elsif MEM_readyState = 2 then
          MEM_I_ready <= '1';
          MEM_I_dataReady  <= '0';
          MEM_readyState <= 0;
        else
          MEM_readyState <= MEM_readyState + 1;
        end if;
      end if;
    end if;
  end process;

As we disable the embedded ram when the address is >= 0x1000, we don’t need to worry about overwriting memory when looking for a memory mapped device. The LEDS are mapped with a simple if block, and the UART with a case. With this, the TPU core can access memory, write status to LEDS and also use the UART by reading and writing known memory locations.
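For reference, the register map implied by the port-map comments can be written down as a table. The Python sketch below dispatches reads and writes through such a table; IoBus, UART_REGS and the register names are invented for this illustration. Unlike the VHDL process, this model only latches the LEDs on a write, and it ignores the ready/dataReady timing.

```python
# Table-driven sketch of the memory-mapped registers; names are hypothetical.
LED_ADDR = 0x1000
UART_REGS = {                      # address: (name, access)
    0x1100: ("baud_count", "rw"),
    0x1102: ("tx_data", "w"),
    0x1103: ("tx_sig", "w"),
    0x1104: ("tx_rdy", "r"),
    0x1105: ("rx_cont", "w"),
    0x1106: ("rx_data", "r"),
    0x1107: ("rx_sig", "r"),
    0x1108: ("rx_frame_error", "r"),
}


class IoBus:
    def __init__(self):
        self.leds = 0
        self.regs = {name: 0 for name, _ in UART_REGS.values()}

    def access(self, addr, we, data=0):
        """Perform one access; returns read data (0 for writes or unmapped)."""
        if addr == LED_ADDR and we:
            self.leds = data & 0xFF            # only the low byte drives LEDs
            return 0
        if addr in UART_REGS:
            name, mode = UART_REGS[addr]
            if we and "w" in mode:
                self.regs[name] = data
            elif not we and "r" in mode:
                return self.regs[name]
        return 0


bus = IoBus()
bus.access(0x1100, True, 5208)          # set the baud counter
print(bus.access(0x1100, False))        # read it back
```

Keeping the map in one table makes it obvious which addresses are read-only or write-only, something that is easy to lose track of in a long case statement.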

We can simulate it with a simple test and it works well.

Using the UART

The UART itself is pretty much exactly the same as the one in my previous post on the subject. I’ve fixed a few issues in the original code (some of which were mentioned in the article). All the ports are memory mapped, so to use the component from TPU assembly, we just read and write to those addresses. This is where 1) the byte addressing mode and 2) the immediate offset of memory addresses in the new read/write instructions help a lot. The general method for transmitting a byte over the UART is as follows:

  1. Wait until the ready bit is set
  2. Set the data input value
  3. Set the transmit signal bit
  4. Wait until the device signals not ready (i.e., working on a request)
  5. Unset the transmit signal bit

For receive, it’s similar:

  1. Ensure the RX is enabled
  2. Wait until the receive signal bit becomes active
  3. Read the output data value
  4. Wait until the receive signal bit de-asserts
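
The two sequences above can be sketched as polling loops. The Python below runs them against a toy device; FakeUart and its timing (how many polls the transmitter stays busy, how long the receive signal stays high) are invented purely so the loops have something to talk to. Only the register offsets 2 to 7 come from the memory map.

```python
# Polling sketch of the send/receive sequences against a toy device.
# FakeUart and its timing are invented; only the offsets follow the map.
class FakeUart:
    TXDATA, TXSIG, TXRDY, RXCONT, RXDATA, RXSIG = 2, 3, 4, 5, 6, 7

    def __init__(self, incoming=b""):
        self.sent = bytearray()
        self.incoming = list(incoming)
        self.tx_data = 0
        self.busy = 0          # polls left until the transmitter is ready
        self.rx_cont = 0
        self.rx_data = 0
        self.rx_hold = 0       # polls for which rx_sig stays asserted
        self.rx_idle = True    # rx_sig has been seen low since the last byte

    def write(self, offset, value):
        if offset == self.TXDATA:
            self.tx_data = value & 0xFF
        elif offset == self.TXSIG:
            if value & 1 and not self.busy:
                self.sent.append(self.tx_data)
                self.busy = 3
        elif offset == self.RXCONT:
            self.rx_cont = value & 1

    def read(self, offset):
        if offset == self.TXRDY:
            ready = int(self.busy == 0)
            self.busy = max(0, self.busy - 1)
            return ready
        if offset == self.RXDATA:
            return self.rx_data
        if offset == self.RXSIG:
            if self.rx_hold:
                self.rx_hold -= 1
                return 1
            if self.rx_cont and self.incoming and self.rx_idle:
                self.rx_data = self.incoming.pop(0)
                self.rx_hold = 2
                self.rx_idle = False
                return 1
            self.rx_idle = True
            return 0
        return 0


def uart_send(dev, byte):
    while not dev.read(dev.TXRDY):   # 1. wait until the ready bit is set
        pass
    dev.write(dev.TXDATA, byte)      # 2. set the data input value
    dev.write(dev.TXSIG, 1)          # 3. set the transmit signal bit
    while dev.read(dev.TXRDY):       # 4. wait until the device is busy
        pass
    dev.write(dev.TXSIG, 0)          # 5. unset the transmit signal bit


def uart_recv(dev):
    dev.write(dev.RXCONT, 1)         # 1. ensure RX is enabled
    while not dev.read(dev.RXSIG):   # 2. wait for the receive signal
        pass
    value = dev.read(dev.RXDATA)     # 3. read the output data value
    while dev.read(dev.RXSIG):       # 4. wait for the signal to de-assert
        pass
    return value
```

Step 4 of the send sequence matters: without it, the transmit signal could be cleared before the device has noticed it. Note that uart_recv blocks forever if nothing arrives, just as the assembly version does.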

In addition, the baud rate can be set by adjusting a counter value which is memory mapped. If you want a baud rate of 9600, you’d set the value to 5208 – which is the main system clock (50MHz) divided by the bit-rate required (50000000/9600).

In TPU assembly, the send and receive functions look simple enough.

  load.l  r1, 1
  load.h  r2, 0x11              #uart1 memory mapped offset r2=0x1100
  read.b  r3, r2, 4             #readb [r2+4] (tx_ready)
  cmp.u   r3, r3, r3            #compare
          r3, %us_waitready     #loop until tx_ready nonzero
  write.b r2, r0, 2             #write txdata register
  write.b r2, r1, 3             #set txsig register 1
  read.b  r3, r2, 4
  cmp.u   r3, r3, r3
          r3, %us_waitunready   #loop until tx_ready zero
  load.l  r1, 0
  write.b r2, r1, 3
  br r6

  load.l  r1, 1
  load.h  r2, 0x11
  write.b r2, r1, 5
  read.b  r3, r2, 7
  cmp.u   r3, r3, r3
          r3, %ur_waitsig       #loop until rx_sig nonzero
  read.b  r0, r2, 6             #read data into r0
  read.b  r3, r2, 7
  cmp.u   r3, r3, r3
          r3, %ur_waitunsig     #loop until rx_sig zero
  br r6

In these examples, the ‘functions’ are called with any argument in r0, the return value placed in r0, and the return PC to jump back to in r6. You can send an ‘H’ over the UART using:

  load.l r0, 0x48
  load.l r6, $L_E
  bi $uart_send
  ... snip ...

And receive sets r0 on return to the value sent across the connection.

I’m still assembling these programs and initializing the embedded RAM with those instructions at present; but this UART now gives the possibility of having a fixed bootloader which loads a binary blob from the serial port and then executes it. The assembled instructions can be output in VHDL for easy integration. Here is an example which prints ‘HELLO’ after receiving a byte (which is discarded) and then echoes everything it receives back to the sender:

X"8D", X"04", -- 0000: load.l  r6 0x0004
X"C1", X"4E", -- 0002: bi      0x004e
X"81", X"48", -- 0004: load.l  r0 0x48
X"8D", X"0A", -- 0006: load.l  r6 0x000a
X"C1", X"34", -- 0008: bi      0x0034
X"81", X"45", -- 000A: load.l  r0 0x45
X"8D", X"10", -- 000C: load.l  r6 0x0010
X"C1", X"34", -- 000E: bi      0x0034
X"81", X"4C", -- 0010: load.l  r0 0x4c
X"8D", X"16", -- 0012: load.l  r6 0x0016
X"C1", X"34", -- 0014: bi      0x0034
X"81", X"4C", -- 0016: load.l  r0 0x4c
X"8D", X"1C", -- 0018: load.l  r6 0x001c
X"C1", X"34", -- 001A: bi      0x0034
X"81", X"4F", -- 001C: load.l  r0 0x4f
X"8D", X"22", -- 001E: load.l  r6 0x0022
X"C1", X"34", -- 0020: bi      0x0034
X"81", X"0D", -- 0022: load.l  r0 0x0d
X"8D", X"28", -- 0024: load.l  r6 0x0028
X"C1", X"34", -- 0026: bi      0x0034
X"81", X"00", -- 0028: load.l  r0 0
X"8D", X"2E", -- 002A: load.l  r6 0x002e
X"C1", X"4E", -- 002C: bi      0x004e
X"8D", X"28", -- 002E: load.l  r6 0x0028
X"C1", X"34", -- 0030: bi      0x0034
X"C1", X"64", -- 0032: bi      0x0064
X"83", X"01", -- 0034: load.l  r1 1
X"84", X"11", -- 0036: load.h  r2 0x11 #uart1 memory mapped offset
X"67", X"44", -- 0038: read.b  r3 r2 4 
X"96", X"6C", -- 003A: cmp.u   r3 r3 r3
X"D3", X"7C", -- 003C:  r3 -4   #loop until tx_ready nonzero
X"70", X"42", -- 003E: write.b r2 r0 2 #write txdata register
X"70", X"47", -- 0040: write.b r2 r1 3  
X"67", X"44", -- 0042: read.b  r3 r2 4  
X"96", X"6C", -- 0044: cmp.u   r3 r3 r3
X"D7", X"7C", -- 0046: r3 -4  
X"83", X"00", -- 0048: load.l  r1 0
X"70", X"47", -- 004A: write.b r2 r1 3
X"C0", X"C0", -- 004C: br      r6
X"83", X"01", -- 004E: load.l  r1 1
X"84", X"11", -- 0050: load.h  r2 0x11
X"72", X"45", -- 0052: write.b r2 r1 5
X"67", X"47", -- 0054: read.b  r3 r2 7
X"96", X"6C", -- 0056: cmp.u   r3 r3 r3
X"D3", X"7C", -- 0058:  r3 -4 
X"61", X"46", -- 005A: read.b  r0 r2 6  
X"67", X"47", -- 005C: read.b  r3 r2 7
X"96", X"6C", -- 005E: cmp.u   r3 r3 r3
X"D7", X"7C", -- 0060: r3 -4 
X"C0", X"C0", -- 0062: br      r6
X"C1", X"64", -- 0064: bi      0x0064
X"00", X"00"  -- 0066: dw      0x0000

Using the miniSpartan6+ FTDI chip

One thing mentioned in the post about my own UART implementation is how I used a Teensy3.1 / external USB->serial TTL cable to interface with the FPGA.

I was contacted on twitter and made aware that the FTDI USB chip on the miniSpartan6+ is dual channel, and as well as interfacing with JTAG to flash the FPGA there is an additional COM port exposed through USB, and a TX/RX line connected to the FPGA. By connecting the TX and RX lines of my top level design to these pins, I can use the onboard USB to communicate with TPU!

This is done by assigning the external TX and RX ports of TPU to the pins on the Spartan6 that are connected to the FTDI USB chip. In the UCF constraints file:


Will expose those pins as channel B of the FTDI chip. It seems to communicate at 115200 baud, and in my tests it works well. Executing the TPU assembly above, with this USB->TPU setup, we can connect to the com port and communicate:

That ‘HELLO’ is hella-simple, but it shows a memory-mapped peripheral interface working with TPU, which is quite the milestone!

The State of TPU

TPU now has a decent set of instructions, and the ability to call functions, set up stacks, perform arithmetic and operate memory-mapped peripherals. The current top-level view of the system is below:

You can see we have some embedded RAM, our UART, and also a memory-mapped register for setting the LEDs on the miniSpartan6+ as well as reading the on-board switches. The core is our new component that simply exposes our memory interface. It’s a bit more than just a CPU, but we need all this to get the system working!

At the moment we’re using more of the FPGA than I’d like – mostly due to the UART. It takes a real chunk out, bringing utilization up to 33%. I don’t really mind this, as I know there is a lot of stuff implemented in very bad ways. So I’m not worried – this value will fall.

I’m not quite ready to update the Github with what’s here (this was very rushed) but the new ISA can be found here.

That about wraps up this post. I’m currently looking into interfacing something with the TPU via an additional UART which will be fun (and fairly ridiculous, really) – so there is that to look forward to!

Thanks for reading, comments as always to @domipheus.


A UART Implementation in VHDL

I’m still working on my Soft-CPU TPU, but wanted to implement a communications channel for it to use in order to get some form of input and output from it. The easiest way to do this is to use a UART, and connect it to a USB to Serial converter for logic-level asynchronous communications.

Knowing that I’m still pretty new to VHDL and working with FPGA systems in general at this level, I decided to develop my own UART implementation. Some may roll their eyes at this, knowing there are plenty out there, and even constructs to utilize real hardware on the Spartan 6 FPGA I’m using; but I’m a fan of learning by doing.

Serial Communications

What I’m implementing is a transmitter and receiver which can operate at any baud rate, with 8 data bits, no parity and 1 stop bit. It should be able to communicate over a COM port to a PC, or to another UART. It’s working at logic-level voltages, which is very important – you need to use a logic-level USB-Serial cable for this. Using an RS232 serial cable will damage things if it uses the higher voltages specified.

Looking at how we transmit, the waveform looks as follows:

Assuming that the ‘baud’ clock is running at the correct frequency we require, you can see that it’s fairly simple how all of this works. The idle state for the TX line is always logic high. This may seem weird, but historically the distances the wires crossed meant they were susceptible to damage, and having the idle state high meant if any problem occurred with the physical wires, you’d know about it very quickly.

To transmit an 8-bit byte, a start bit is emitted which is logic low. One ‘baud tick’ later, the least significant bit of the byte is sent, and each following baud tick sends the next bit until the most significant bit has been sent. Finally, a stop bit is sent, which is logic high. At this point another byte can be sent immediately – or the line left idle to transmit later, after a delay.
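As a quick illustration of that framing, the Python below builds the ten line levels for one byte and decodes them again. It is a sketch only: encode_frame and decode_frame are names made up for this example, and each list entry stands for one whole baud period.

```python
# Sketch of 8N1 framing; function names are hypothetical.
def encode_frame(byte):
    """Return the 10 line levels for one frame, in transmission order."""
    bits = [0]                                    # start bit (logic low)
    bits += [(byte >> i) & 1 for i in range(8)]   # data, LSB first
    bits.append(1)                                # stop bit (logic high)
    return bits


def decode_frame(bits):
    """Inverse of encode_frame; raises ValueError on a framing error."""
    if len(bits) != 10 or bits[0] != 0 or bits[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))


print(encode_frame(0x48))   # 'H' -> [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
```

Ten line periods per byte also means the useful throughput is a tenth of the baud rate, so 9600 baud moves at most 960 bytes per second.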

Transmitter States

The transmitter is very simple. There is a data byte input, and a txSig port which is used to signal that the bits on the data input should be sent. When txSig is asserted, state moves from idle to a start state where a start bit is issued. From there, we progress to the data state, where the 8 bits of data are pushed to the output, least-significant bit first. Finally there is the stop bit state, before moving back to idle, or straight back to start in the case data is being streamed out.

For the states, I use an integer signal as it seemed the simplest and generally the most obvious way to go about it. The whole transmitter code is below.

tx_proc: process (tx_clk, I_reset, I_txSig, tx_state)
begin
  -- TX runs off the TX baud clock
  if rising_edge(tx_clk) then
    if I_reset = '1' then
      tx_state <= 0;
      tx_data <= X"00";
      tx_rdy <= '1';
      tx <= '1';
    else
      if tx_state = 0 and I_txSig = '1' then
        tx_state <= 1;
        tx_data <= I_txData;
        tx_rdy <= '0';
        tx <= '0'; -- start bit
      elsif tx_state < 9 and tx_rdy = '0' then
        tx <= tx_data(0);
        tx_data <= '0' & tx_data (7 downto 1);
        tx_state <= tx_state + 1;
      elsif tx_state = 9 and tx_rdy = '0' then
        tx <= '1'; -- stop bit
        tx_rdy <= '1';
        tx_state <= 0;
      end if;
    end if;
  end if;
end process;

As you can see from the VHDL, the start state is tx_state=0, the data state is tx_state=1..8 and the stop state is tx_state=9. The process is idle when tx_state is 0 with I_txSig=0. The tx_clk baud clock is generated from the higher-frequency system clock using a counter:

  -- TX standard baud clock, no reset
  if tx_clk_counter = 0 then
    -- chop off LSB to get a clock
    tx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 1)));
    tx_clk <= not tx_clk;
  else
    tx_clk_counter <= tx_clk_counter - 1;
  end if;

The way the I_clk_baud_count value is set is as follows:

I_clk in Hz / expected baud = I_clk_baud_count

So, for 9600bps on a system using a 50MHz clock, one would assign 50000000/9600, or 5208, to I_clk_baud_count. The TX clock is generated by negating the clock signal every 5208/2 counts of the system clock.
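The arithmetic is simple enough to check in a few lines of Python. This is only a sketch: baud_settings is a made-up helper, and the error figure covers just the rounding from integer division, ignoring the extra system clock cycle the counter spends reloading.

```python
# Sketch of the baud counter arithmetic; baud_settings is a hypothetical helper.
def baud_settings(clk_hz, baud):
    count = clk_hz // baud            # value written to I_clk_baud_count
    achieved = clk_hz / count         # baud rate that count corresponds to
    return {
        "count": count,
        "tx_reload": count >> 1,      # LSB chopped off: half a bit period
        "error_pct": 100.0 * (achieved - baud) / baud,
    }


print(baud_settings(50_000_000, 9600))
print(baud_settings(50_000_000, 115200))
```

For a 50MHz clock the rounding error at both 9600 and 115200 baud comes out well under 0.01%, so the integer divide is not a practical concern at these rates.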

Testing the transmitter

Testing this is fairly simple. Auto-generating a test bench, all we need to do is put data on the in port, and then toggle the txSig signal input.

-- send hello\r - 0x48, 0x45, 0x4c, 0x4c, 0x4f, 0x0d
-- H
I_txData <= X"48";
I_txSig <= '1';
wait until O_txRdy= '0';
I_txSig <= '0';

 -- E
wait until O_txRdy= '1';
I_txData <= X"45";
I_txSig <= '1';
wait until O_txRdy= '0';
I_txSig <= '0';


The simulation waveform results in correct TX output, which is great. I wrote up a top-level VHDL module and flashed the miniSpartan6+ board with the same style of test (obviously not using waits – it just endlessly looped over an array containing ‘hello’) and connecting to the FPGA via putty showed the TX did indeed work for this case. Time to implement receive!


Receiving needs to be implemented differently from transmit. That statement is obvious, but it’s all about how the timing is managed and where the serial input is sampled.

If we use our existing example of how we generated the TX output, and use those methods for RX, the below waveform will be the ideal situation.

Sampling on the rising edge of baud_clk we can see the sampled data is correct; a start bit, 8 data bits, and the stop bit (named ‘e’ just to differentiate). However, we do not control the timing of the RX input. It can be out of phase with the clock, and once it’s out of phase significantly the samples can result in incorrect data, as shown below. Additionally, there is a percentage error allowed in serial communications, and as this error accumulates it can confuse the receiver.

We need to use a higher-rate clock to lower the accumulated error across a received frame.

Super-sampling (or not)

Generally when working with other UART RX hardware you’ll see mention of a clock at higher than the baud rate. This is due to the internals super-sampling the RX input and then trying to get the ideal sample area, right in the middle of the transmitted bit. For my implementation I cheated, and still only sample once per bit, but I use a 16x baud tick along with a counter for working out where the next bit is likely to be.

The falling edge of the start bit is always sampled at the system clock, in my case, 50MHz. When found, the 16x baud counter is reset, which re-aligns the baud ticks with the start bit. A counter is reset too; as there are 16 baud ticks per bit when receiving, I then sample the start bit when the counter reaches 7, move to the next state, and reset. It will then sample for data bit 0 when the counter reaches 16, and so on, until we have a whole byte of data and an end bit.
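To see why sampling mid-bit tolerates a slightly mismatched sender, here is a small Python model of the sample points. It is a sketch under my own assumptions: time is measured in receiver baud ticks from the falling edge of the start bit, the sender’s bit length is given in those same ticks (16.0 is a perfect match), and line_level and receive are names invented for the example.

```python
# Sketch of the once-per-bit sampling scheme; names are hypothetical.
TICKS_PER_BIT = 16
START_OFFSET = 7      # ticks from the falling edge to the start-bit sample


def line_level(t, byte, bit_ticks):
    """Line level t receiver ticks after the start bit's falling edge."""
    bits = [0] + [(byte >> i) & 1 for i in range(8)] + [1]
    index = int(t // bit_ticks)
    # Outside the frame the line idles high.
    return bits[index] if 0 <= index < len(bits) else 1


def receive(byte, bit_ticks=16.0):
    """Sample a frame the way the receiver does: (data, frame_error)."""
    t = START_OFFSET                      # middle of the start bit
    data = 0
    for i in range(8):
        t += TICKS_PER_BIT                # one nominal bit later each time
        data |= line_level(t, byte, bit_ticks) << i
    t += TICKS_PER_BIT
    frame_error = line_level(t, byte, bit_ticks) != 1   # stop bit must be high
    return data, frame_error


print(receive(0xA5, 16.0))   # matched clocks
print(receive(0xA5, 16.5))   # sender about 3% slow
print(receive(0x55, 17.0))   # sender about 6% slow
```

With a sender around 3% off, every sample still lands inside the right bit. At around 6% the drift accumulated by the last data bit is enough to sample the neighbouring bit, which is the accumulated error problem described earlier.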

You can see the 16 baud ticks per transmitted bit in the simulator:

And you can even see the ticks reset to re-align them with the RX when the start bit is sampled:

The RX Code

The RX clock is based off two things: the main system clock, and a counter which generates a ‘tick’ at 16x the rate of the baud clock (the one TX uses). To generate the ticks, we use the code below.

-- RX baud 'ticks' generated for sampling, with reset
if rx_clk_counter = 0 then
  -- x16 sampled - so chop off 4 LSB
  rx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 4)));
  rx_clk_baud_tick <= '1';
else
  if rx_clk_reset = '1' then
    rx_clk_counter <= to_integer(unsigned(I_clk_baud_count(15 downto 4)));
  else
    rx_clk_counter <= rx_clk_counter - 1;
  end if;
  rx_clk_baud_tick <= '0';
end if;

There are a few things that could probably be optimized here, but I kept it simple for readability reasons. Note that the RX counter generating the baud ticks has a reset, unlike the transmit clock.

The actual code for the RX is presented below, with the reset, and initial start bit detection.

rx_proc: process (I_clk, I_reset, I_rx, I_rxCont)
begin
  -- RX runs off the system clock, and operates on baud 'ticks'
  if rising_edge(I_clk) then
    if rx_clk_reset = '1' then
      rx_clk_reset <= '0';
    end if;
    if I_reset = '1' then
      rx_state <= 0;
      rx_sig <= '0';
      rx_sample_count <= 0;
      rx_sample_offset <= OFFSET_START_BIT;
      rx_data <= X"00";
      O_rxData <= X"00";
    elsif I_rx = '0' and rx_state = 0 and I_rxCont = '1' then
      -- first encounter of falling edge start
      rx_state <= 1; -- start bit sample stage
      rx_sample_offset <= OFFSET_START_BIT;
      rx_sample_count <= 0;
      -- need to reset the baud tick clock to line up with the start 
      -- bit leading edge.
      rx_clk_reset <= '1';

Skipping the clock reset, which needs to be in this process (this process writes that signal, the other clock-generating process reads it), we have the initial state for the receiver, rx_state=0. This is the initial detection of the start bit, which is rx=‘0’, sampled every system clock cycle (50MHz). Once we find this, and the rxCont input is active (which is basically RX enable), we move to state 1 and set the sample offset to OFFSET_START_BIT, which I can assure you is 7!

    elsif rx_clk_baud_tick = '1' and I_rx = '0' and rx_state = 1 then
      -- inc sample count
      rx_sample_count <= rx_sample_count + 1;
      if rx_sample_count = rx_sample_offset then
        -- start bit sampled, time to enable data
        -- this should check RX here. if it =1, should revert to state 0
        rx_sig <= '0';
        rx_state <= 2;
        rx_data <= X"00";
        rx_sample_offset <= OFFSET_DATA_BITS; 
        rx_sample_count <= 0;
      end if;
    elsif rx_clk_baud_tick = '1' and rx_state >= 2  and rx_state < 10 then
      -- sampling data
      if rx_sample_count = rx_sample_offset then
        rx_data(6 downto 0) <= rx_data(7 downto 1);
        rx_data(7) <= I_rx;
        rx_sample_count <= 0;
        rx_state <= rx_state + 1;
      else
        rx_sample_count <= rx_sample_count + 1;
      end if;
    elsif rx_clk_baud_tick = '1' and rx_state = 10 then
      if rx_sample_count = OFFSET_STOP_BIT then
        rx_state <= 0;
        rx_sig <= '1';
        O_rxData <= rx_data; -- latch data out
        if I_rx = '1' then
          -- stop bit correct
          rx_frameError <= '0';
        else
          -- stop bit is always high, if we don't see it, there
          -- has been an issue. Signal an error.
          rx_frameError <= '1';
        end if;
      else
        rx_sample_count <= rx_sample_count + 1;
      end if;
    end if;
  end if;
end process;

The second half simply moves through the states, sampling the start bit when our sample count reaches the required offset (for the start bit, 7). It then moves on to sampling the 8 data bits, with offsets of 16 at a time, before checking the stop bit and indicating whether a frame error has occurred (the stop bit should be ‘1’).
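The baud 'ticks' that drive these states come from the separate clock-generating process mentioned earlier. As a rough illustration only, here is a minimal sketch of such a tick generator, assuming 16 ticks per bit period (which is what the offsets of 7 and 16 suggest). The entity name, generics and port names are hypothetical and not taken from the TPU source; the synchronous restart input plays the role of rx_clk_reset.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical oversampling tick generator. Names and generic values are
-- illustrative assumptions, not the original implementation.
entity baud_tick_gen is
  generic (
    CLK_HZ  : positive := 50_000_000;  -- system clock frequency
    BAUD    : positive := 115_200;     -- serial line rate
    SAMPLES : positive := 16           -- assumed ticks per bit period
  );
  port (
    I_clk   : in  std_logic;
    I_reset : in  std_logic;  -- synchronous restart (e.g. rx_clk_reset)
    O_tick  : out std_logic   -- high for one I_clk cycle per tick
  );
end entity;

architecture rtl of baud_tick_gen is
  -- Integer division: 50 MHz / (115200 * 16) gives 27 system clocks per tick
  constant DIVIDER : positive := CLK_HZ / (BAUD * SAMPLES);
  signal count : natural range 0 to DIVIDER - 1 := 0;
begin
  process (I_clk)
  begin
    if rising_edge(I_clk) then
      O_tick <= '0';                 -- default, overridden below when due
      if I_reset = '1' then
        count <= 0;                  -- realign ticks to the start-bit edge
      elsif count = DIVIDER - 1 then
        count  <= 0;
        O_tick <= '1';
      else
        count <= count + 1;
      end if;
    end if;
  end process;
end architecture;
```

Because the divider is truncated to an integer, the tick rate is slightly off the ideal value; restarting the counter on every start bit keeps that error from accumulating across frames.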

Note that there is no real error checking, or input validation here. Technically, if rx_state=1 and our RX input is high, we should reset the RX system and go back to state 0 – as it was most likely a blip on the input, and not a real serial frame of data. I’ll probably add that later.
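For what that missing check could look like, here is a small standalone sketch of start-bit validation. It is not the TPU receiver itself: the entity, its ports and the HOLD state are hypothetical, and it only covers the idle and start-bit stages. The idea is simply that if the line returns high before the mid-bit sample point, the falling edge is treated as a glitch and we return to idle.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical start-bit validator, for illustration only.
entity start_bit_filter is
  generic (
    OFFSET_START_BIT : natural := 7  -- ticks from falling edge to mid-bit
  );
  port (
    I_clk      : in  std_logic;
    I_reset    : in  std_logic;
    I_tick     : in  std_logic;  -- oversampling baud tick
    I_rx       : in  std_logic;  -- serial input, idle high
    O_start_ok : out std_logic   -- one-cycle pulse: start bit confirmed
  );
end entity;

architecture rtl of start_bit_filter is
  type state_t is (IDLE, CHECK_START, HOLD);
  signal state        : state_t := IDLE;
  signal sample_count : natural range 0 to OFFSET_START_BIT := 0;
begin
  process (I_clk)
  begin
    if rising_edge(I_clk) then
      O_start_ok <= '0';
      if I_reset = '1' then
        state        <= IDLE;
        sample_count <= 0;
      elsif state = IDLE then
        if I_rx = '0' then
          -- possible start bit: begin counting towards mid-bit
          state        <= CHECK_START;
          sample_count <= 0;
        end if;
      elsif state = CHECK_START then
        if I_tick = '1' then
          if I_rx = '1' then
            -- line went high again before mid-bit: treat as a blip
            state <= IDLE;
          elsif sample_count = OFFSET_START_BIT then
            O_start_ok <= '1';
            state      <= HOLD;
          else
            sample_count <= sample_count + 1;
          end if;
        end if;
      else
        -- HOLD: a full receiver would sample data bits here; this sketch
        -- just waits for the line to go high before re-arming
        if I_rx = '1' then
          state <= IDLE;
        end if;
      end if;
    end if;
  end process;
end architecture;
```

In the real receiver the equivalent change would be an extra branch in state 1 that sends rx_state back to 0 when I_rx is seen high.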

Other modules using this RX will watch the O_rxSig output, which indicates data received, and grab the new byte from the data output port. The signal stays high until a new frame begins receiving.

Testing the Receiver

Like the testing for transmit, I created a standard test bench, and filled it out with the usual content. For RX, I created a second clock with the same time period as the baud clock the UART is configured to use. I then have an array of the serial bitstream, and on each rising edge of that clock I push the next bit across. You can see the whole test on GitHub. Running it in the simulator, it works.
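The stimulus side of a test like that can be sketched roughly as follows. This is not the test bench from the repository: the entity name, signal names and the example byte (0x55) are assumptions for illustration, and the unit under test is left out because its exact port list is not shown here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stimulus-only test bench, for illustration.
entity rx_stimulus_tb is
end entity;

architecture sim of rx_stimulus_tb is
  -- One bit period at 115200 baud is roughly 8.68 us
  constant BIT_PERIOD : time := 8680 ns;

  -- Two idle bits, start bit (0), 0x55 sent LSB first, stop bit (1)
  constant FRAME : std_logic_vector(0 to 11) := "110101010101";

  signal baud_clk : std_logic := '0';
  signal rx_line  : std_logic := '1';  -- would drive the UART RX input
  signal done     : boolean   := false;
begin
  -- The UART under test would be instantiated here, with rx_line
  -- connected to its RX input.

  -- Second clock running at the baud rate; stops once the test is done
  baud_clk <= not baud_clk after BIT_PERIOD / 2 when not done else '0';

  stimulus: process
  begin
    -- push the next bit of the serial stream on each rising edge
    for i in FRAME'range loop
      wait until rising_edge(baud_clk);
      rx_line <= FRAME(i);
    end loop;
    wait for 2 * BIT_PERIOD;  -- leave the line idle for a while
    done <= true;
    wait;                     -- end of stimulus
  end process;
end architecture;
```

The data bits are listed LSB first because the receiver shifts each new bit in at bit 7 and moves the rest down.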

[Image: rx_tb_working]

I also created a loopback test, which uses the transmitter test bench in two speeds, feeding the TX line into the RX to ensure the data is correct. I’ve got the waveform from a run of that test below, also zoomed into the area where 115200bps is active (at the start).

[Image: loopback_tb]

Running On Hardware

Running on hardware is easy: just assign the RX and TX ports/nets to external pins. I created a loopback top-level module so I can type into a PuTTY serial session and see what I type echoed back.

The loopback module for the hardware uses another state machine depending on the various UART module signals. The process is fairly simple, and will just push any input from the RX to the TX and set relevant states at the correct times.

loopback: process(I_clk_50mhz, O_rxSig)
begin
  if rising_edge(I_clk_50mhz) then
    if O_rxSig = '1' and last_msg_valid ='0' and new_message = '1' then
      last_msg <= O_rxData;
      last_msg_valid <= '1';
      new_message <= '0';
    elsif O_txRdy = '1' and I_txSig = '0' and last_msg_valid = '1' then
      I_txData <= last_msg;
      I_txSig <= '1';
    elsif I_txSig = '1' and O_txRdy = '0' and last_msg_valid = '1' then
      I_txSig <= '0';
      last_msg_valid <= '0';
    elsif O_rxSig = '0' then
      new_message <= '1';
    end if; 
  end if;
end process;

[Image: mess]

Sadly, I could not find my USB to Serial TTL converter in my lab mess. It’s in there somewhere. But I did find an old Teensy 3.1 (it’s actually from my TeensyZ80 build) which I used to forward serial to and from the miniSpartan6+. Keys I typed were echoed back, at 115200bps. So a successful test.

[Image: putty_loopback]

[Image: spartan_teensy]

Wrap Up

That pretty much finishes this post off. It’s by no means a finished implementation but works for what I need. I’ll be using it with TPU as a peripheral, and memory mapping the various ports so it can be controlled from software. I think I’ll add some FIFO buffers to the input and output data lines to ensure I don’t lose data, implement the RX error checking I mentioned earlier, and also add a ‘number of frames’ counter for software-side error checking.

It should be made clear, though, that there are probably UART constructs available within any recent FPGA that will take up far fewer resources than this, and they should be used in final projects where possible and sensible!

Thanks for reading. The code for this is available on GitHub.