Designing a CPU in VHDL, Part 8: Revisiting the ISA, function calling, assembler

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

We’re at the point now where the CPU can run some more involved examples. The examples we’ve run to date on the simulator have been fairly simple, and more to the point, tailored to what we have available. I wanted to take a look back at the ISA, to see where we can make some worthwhile changes before moving forward.

Our more complex example code

Trivial 16-bit multiply!

It’s incredibly simple, again. But that’s because we are missing some pretty fundamental pieces of functionality from the TPU, and even this tiny example exposes them.

The example I came up with is as follows:

  1. Nominate a register for a stack location and set it.
  2. Set up a simple stack frame to execute a multiply function which takes two 16-bit operands.
  3. Call the ‘mul16’ function.
  4. In mul16():
    1. grab the arguments from the stack
    2. perform the multiplication
    3. return our result in r0
  5. Perform some sort of jump away to a safe place of code, where we halt using an infinite loop.

This example, in code form, is similar to this:

ushort mul16( ushort a, ushort b)
{
  ushort sum = 0;
  while (b != 0)
  {
    sum += a;
    b--;
  }
  return sum;
}

main()
{
  ushort ret = mul16(3,7);
  while(1) {
    ret |= ret;
  }
}
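Since the TPU has no multiply instruction, mul16 builds the product by repeated addition. As a sanity check of what we expect the hardware to compute, here is a rough Python model (my own sketch, not part of the project) including the 16-bit register wraparound:

```python
def mul16(a, b):
    # Multiply two u16s by repeated addition, mirroring the C loop above.
    # Register arithmetic wraps at 16 bits; overflow is silently discarded.
    total = 0
    while b != 0:
        total = (total + a) & 0xFFFF  # 16-bit wrap, like a TPU register
        b -= 1
    return total

print(mul16(3, 7))  # the call made in main(): 3 * 7 = 21
```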

For this example, I defined r7 as the stack register. It was set to the top of our embedded RAM block, and the stack grows downwards. We need to store the two mul16 parameters as well as our return address. As we address 16-bit words instead of the more typical 8-bit bytes, we only subtract 3 (not 6) from the current stack pointer value. We then need to write our parameters in at various offsets:

sp = return PC
sp+1 = ushort a
sp+2 = ushort b
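Because the TPU addresses whole 16-bit words, the frame above costs three units of stack pointer, not six bytes. A tiny Python model of the frame setup (the dict-based memory and the placeholder return PC value are mine, just for illustration):

```python
RAM_TOP = 0x27            # top of the embedded RAM block, in words
ram = {}                  # word-addressed memory: one 16-bit value per slot

sp = RAM_TOP - 3          # reserve 3 words: return PC plus two arguments
ram[sp]     = 0x000A      # sp   : return PC (placeholder value)
ram[sp + 1] = 3           # sp+1 : ushort a
ram[sp + 2] = 7           # sp+2 : ushort b
```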

The first thing to notice is that we are writing these values at constant offsets from a register value, r7 (our SP). At the moment, our ISA can only write to an address located in a register, so we either need to perform additions into a temporary register before each write, or implement new functionality in the TPU.

Reads and Writes to memory with offset

Currently our write instruction takes a destination memory address specified in rA and a value to write specified in rB. The Read memory instruction is similar, but uses rD for the destination register, and rA as the address. This is due to rD being the only internal data select path into the register file.

Looking at the old instruction forms we have various unused bits that are enough to hold a significant offset value for our memory operations. In the case of the write instruction, these bits are non-contiguous, but we can solve that in the decoder. Our new read instruction looks like the following.

[read instruction form]

Our write instruction is a little less clear, coming in at:

[write instruction form]

This is where having the immediate data output from the decoder be 16 bits wide becomes useful. We extend the decoder to make the top 8 bits dependent on the instruction opcode, so that when a write is decoded, the immediate offset value is recombined, ready for use by the ALU.

when OPCODE_WRITE =>
  O_dataIMM(15 downto 8) <= I_dataInst(IFO_RD_BEGIN downto IFO_RD_END)
            & I_dataInst(IFO_F2_BEGIN downto IFO_F2_END) & "000";
  O_regDwe <= '0';

The changes to the ALU are minimal, and we just do the inefficient thing of adding another adder. Knowing from the previous part that TPU currently takes up a tiny 3% of the Spartan6 LX25 resources, we can concentrate on getting functionality in rather than optimizing for space.

when OPCODE_WRITE =>
  -- The result is the address we want.
  -- First 5 bits of the Imm value is an offset.
  s_result(15 downto 0) <= std_logic_vector(signed(I_dataA) + signed(I_dataIMM(15 downto 11)));
  s_shouldBranch <= '0';
when OPCODE_READ =>
  -- The result is the address we want.
  -- Last 5 bits of the Imm value is an offset.
  s_result(15 downto 0) <= std_logic_vector(signed(I_dataA) + signed(I_dataIMM(4 downto 0)));
  s_shouldBranch <= '0';

You can see the ALU code is very similar. We treat the 5-bit immediate as a signed value, as [-16, 15] is a wide enough range of offsets, and being able to offset back as well as forward will come in very handy.
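The signed interpretation of that 5-bit field can be modelled in a few lines of Python (the helper name is mine, just to illustrate the sign extension that the VHDL signed() cast performs):

```python
def sext5(imm5):
    # Sign-extend a 5-bit field, as VHDL's signed() interprets it.
    imm5 &= 0x1F
    return imm5 - 0x20 if imm5 & 0x10 else imm5

# The encodable offsets cover exactly [-16, 15]:
print(sext5(0b01111), sext5(0b11111), sext5(0b10000))  # 15 -1 -16
```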

Calling Functions

Getting back to our example, we need to store the program location that we need to return to after executing our mul16 function. Amazingly, we didn’t have an instruction for getting the current PC, so this was impossible. It was very easy to add, though: the current PC is already forwarded to the ALU, so we just use one of the two reserved opcodes we have free to define a set of special state operations.

[spc/sstatus instruction forms]

The ALU code to serve these instructions is trivial.

when OPCODE_SPEC => 	-- special
  case I_dataIMM(IFO_F2_BEGIN downto IFO_F2_END) is
    when OPCODE_SPEC_F2_GETPC =>
      s_result(15 downto 0) <= I_PC;
    when OPCODE_SPEC_F2_GETSTATUS =>
       s_result(1 downto 0) <= s_result(17 downto 16);
    when others =>
  end case;
  s_shouldBranch <= '0';

The sstatus, or get-status instruction, will be used to read the overflow and carry status bits – which are not currently implemented.

Now that we can get the current PC value, we can use this to calculate the return address for our callee function to jump to on return. The assembly looks as follows.

start:
  load.l  r7, 0x27    # Top of the stack
  load.l  r1, 7       # constant argument 2
  load.l  r2, 3       # constant argument 1
  subi    r7, r7, 3   # reserve 3 words of stack
  write   r7, r1, 2   # write argument at offset +2
  write   r7, r2, 1   # write argument at offset +1
  spc     r6          # get current pc
  addi    r6, r6, 4   # offset to after the call
  write   r7, r6      # put return PC on stack
  bi      $mul16      # call
  addi    r7, r7, 3   # pop stack

This creates a call stack for mul16 containing its two parameters and the location it should branch back to when it returns.

Immediate arithmetic

You may have noticed two new instructions in the code snippet above – addi and subi. These were added because simply incrementing or decrementing a register previously needed an immediate load, which used up one of our registers.

The add and sub instructions both have two unused flag bits, so one of them was used to signal immediate mode. In this mode, rD and rA are used as normal, but rB is disregarded, and 5 bits are instead used to represent an unsigned immediate value.

[addi instruction form]

I took the decision to implement only unsigned versions of this instruction, thinking that anyone really interested in proper overflow detection wouldn’t mind taking the additional register penalty and using the existing register form of add.

In the VHDL, I again didn’t care about resources, and simply added yet another if conditional with adders.

when OPCODE_ADD =>
  if I_aluop(0) = '0' then
    if I_dataImm(0) = '0' then
      s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned( '0' & I_dataB));
    else
      s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned( '0' & X"000" & I_dataIMM(4 downto 1)));
    end if;
  else
    s_result(16 downto 0) <= std_logic_vector(signed(I_dataA(15) & I_dataA) + signed( I_dataB(15) & I_dataB));
  end if;
  s_shouldBranch <= '0';

The low 8 bits of dataImm always contain the low 8 bits of our instruction word, so we can use them both for the immediate-mode check and for the bits of the immediate value itself.
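As an aside on why s_result is 17 bits wide in the unsigned cases above: the '0' & widening means the top bit of the sum captures the carry out of the add. A quick Python sketch of that behaviour (the function name is mine):

```python
def add_u16(a, b):
    # Unsigned 16-bit add; the extra bit of the 17-bit sum is the carry,
    # which lands in s_result(16) in the ALU code above.
    full = (a & 0xFFFF) + (b & 0xFFFF)
    return full & 0xFFFF, full >> 16

print(add_u16(0xFFFF, 0x0001))  # wraps to 0 with the carry bit set
```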

The mul16 Function

Let’s recap the C-style version of our function:

ushort mul16( ushort a, ushort b)
{
  ushort sum = 0;
  while (b != 0)
  {
    sum += a;
    b--;
  }
  return sum;
}

And in the TPU assembly written so far, our stack pointed to by r7 resembles the following:

[stack layout diagram]

The assembly code for the mul16 function is therefore as follows.

mul16:
  read    r1, r7, 2
  read    r2, r7, 1
  load.l  r0, 0
mul16_loop:
  cmp.u   r5, r2, r2
  bro.az  r5, %mul16_fin
  add     r0, r0, r1
  subi.u  r2, r2, 1
  bi      $mul16_loop
mul16_fin:
  read    r6, r7, 0
  br      r6

Pretty simple stuff, but again – a new instruction! bro.az = branch to relative offset when A is zero.

Conditional Branch to relative offset

If you remember our previous parts discussing conditional branching – and even our first part – you’ll remember that branches could only target an address stored in a register. This was incredibly inefficient for small loops, taking up a register and bloating the code.

Before implementing relative offset branching, the conditional branch instructions needed to be made more sane. The bits which select the type of condition were split and spread out across the instruction form, despite the rD bits going unused. This was changed, giving a new instruction encoding for conditional jumps:

[bcond instruction form]

With this done, adding relative branch targets was fairly simple. Flag bit 8 is used to detect whether we branch to a register value or to an immediate offset from the current PC:

[bro instruction form]

The VHDL checks the flag bit and selects a different branch target.

when OPCODE_JUMPEQ =>
  if I_aluop(0) = '1' then
     s_result(15 downto 0) <= std_logic_vector(signed(I_PC) + signed(I_dataIMM(4 downto 0)));
  else
    s_result(15 downto 0) <= I_dataB;
  end if;

You can see the 5-bit immediate is signed, allowing conditional jumps backwards in the instruction stream. As any TIS-100 player will know, backwards JROs are very useful – especially in a multiplier 😉

The full multiplier test

I’ve put the full multiplier assembly listing below, which is bulky but I think helps in understanding the flow.

start:
  load.l  r7, 0x27    # Top of the stack
  load.l  r1, 7       # constant argument 2
  load.l  r2, 3       # constant argument 1
  subi    r7, r7, 3   # reserve 3 words of stack
  write   r7, r1, 2   # write argument at offset +2
  write   r7, r2, 1   # write argument at offset +1
  spc     r6          # get current pc
  addi    r6, r6, 4   # offset to after the call
  write   r7, r6      # put return PC on stack
  bi      $mul16     # call
  addi    r7, r7, 3   # pop stack
  bi      $end

# Multiply two u16s. Doesn't check for overflow.
mul16:
  read    r1, r7, 2
  read    r2, r7, 1
  load.l  r0, 0
mul16_loop:
  cmp.u   r5, r2, r2
  bro.az  r5, %mul16_fin
  add     r0, r0, r1
  subi.u  r2, r2, 1
  bi      $mul16_loop
mul16_fin:
  read    r6, r7, 0
  br      r6

halt:
  bi     $halt

end:
  or     r0,r0,r0
  bi     $end

If this test works, we should be able to see r0 containing the result of our multiply (21, or 0x15), and the waveform should show the shouldBranch signal oscillating due to the end jump over an or. If shouldBranch is high at all times, we know we’ve hit halt, so something isn’t quite right. I’ve not done the typical calling-convention things such as saving volatile registers, but it’s easy to see how that would work. I’m sure those reading by now are wondering how I get those assembly listings into my VHDL test benches.

The TPU Assembler – TASM

I have written a one-file assembler in C# for the current ISA of the TPU. In its thousand lines of uncommented splendour lies an abundance of coding horrors – fit for the Terrible Processing Unit. It works perfectly well for what I want – just don’t look too deeply into it.

I wrote this in a few hours early on in the project, because as you can imagine, writing out instruction forms manually is tedious. The assembler is very simple and is fully self-contained, without any dependencies. It contains definitions for instructions, how to parse instruction forms, and how to write out their binary representation.

The functional flow for the assembler is as follows:

  1. Parse arguments and open the input file.
  2. For each line in the input file:
    1. If it starts with a ‘#’, ignore it as a comment.
    2. Split the line into strings by whitespace and commas.
    3. If the first element ends with a ‘:’, treat it as a label and note its location.
    4. Add the rest as instruction definitions to a list of inputs.
  3. For each input definition, replace label names with actual values.
  4. Parse all definitions into a list of Operation Data objects.
  5. Open the output file.
  6. Output the instruction data using a particular format generator.
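The flow above is essentially a two-pass assembler: collect label locations first, then substitute them into operands. A heavily simplified Python sketch of just that idea (this is not the real C# TASM code, and the instruction encoding step is omitted entirely):

```python
def assemble(lines):
    # Pass 1: record label addresses, collect instruction token lists.
    labels, ops = {}, []
    for line in lines:
        line = line.split('#')[0].strip()       # strip comments
        if not line:
            continue
        parts = line.replace(',', ' ').split()
        if parts[0].endswith(':'):              # label: note its location
            labels[parts[0][:-1]] = len(ops)
            parts = parts[1:]
        if parts:
            ops.append(parts)
    # Pass 2: replace $label operands with absolute addresses.
    for op in ops:
        for i, tok in enumerate(op):
            if tok.startswith('$'):
                op[i] = labels[tok[1:]]
    return ops

prog = ["start:", "  load.l r7, 0x27", "loop:", "  bi $loop"]
print(assemble(prog))
```

Because substitution happens after the whole file is read, forward references to labels work too; here $loop resolves to instruction index 1, since labels don’t occupy an output slot themselves.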

Assembler Features

The assembler accepts instruction mnemonics as per the ISA document, but will accept some additional ones – like add, which is simply treated as add.u.

There is a data definition (data/dw) which outputs 16-bit hex values directly into the instruction stream. It accepts labels as absolute ($ prefix) or relative (% prefix) values, but does not currently support setting the current memory location of definitions – the first line is location 0x0000, and it continues from there.

Errors are not handled gracefully, and there is no real input checking. You could pass a relative offset to a conditional branch which is outside the bounds of the instruction encoding, and it would generate incorrect code. I’ll fix this stuff at a later date.

Output from the assembler is either binary, hex, or ‘eram’. The Embedded Ram (eram) format is basically VHDL initialization, with the original listing and offsets as comments. The example above assembles to the following:

X"8F27", -- 0000: load.l  r7 0x27 # Top of the stack
X"8307", -- 0001: load.l  r1 7 # constant argument 2
X"8503", -- 0002: load.l  r2 3 # constant argument 1
X"1EE7", -- 0003: subi    r7 r7 3 # reserve 3 words of stack
X"70E6", -- 0004: write   r7 r1 2 # write argument at offset +2
X"70E9", -- 0005: write   r7 r2 1 # write argument at offset +1
X"EC00", -- 0006: spc     r6 # get current pc
X"0CC9", -- 0007: addi    r6 r6 4 # offset to after the call
X"70F8", -- 0008: write   r7 r6 # put return PC on stack
X"C10C", -- 0009: bi      0x000c # call
X"0EE7", -- 000A: addi    r7 r7 3 # pop stack
X"C117", -- 000B: bi      0x0017
X"62E2", -- 000C: read    r1 r7 2
X"64E1", -- 000D: read    r2 r7 1
X"8100", -- 000E: load.l  r0 0
X"9A48", -- 000F: cmp.u   r5 r2 r2
X"D3A4", -- 0010: bro.az  r5 4
X"0004", -- 0011: add     r0 r0 r1
X"1443", -- 0012: subi.u  r2 r2 1
X"C10F", -- 0013: bi      0x000f
X"6CE0", -- 0014: read    r6 r7 0
X"C0C0", -- 0015: br      r6
X"C116", -- 0016: bi      0x0016
X"2000", -- 0017: or      r0 r0 r0
X"C117", -- 0018: bi      0x0017

And this is simply pasted into our VHDL RAM objects. We need to pad it out to the correct size of the RAM – but that is something I want to add as a feature: you’d pass in the size of the eRAM, and it would automatically initialize the rest to zero. We can then simulate and see the TPU running well with the ISA additions.
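That zero-padding feature is simple enough to sketch. A hypothetical Python generator for the eram format (the function and parameter names are mine, not TASM’s):

```python
def emit_eram(words, ram_size, listing=None):
    # Format 16-bit words as VHDL init lines, zero-padding up to ram_size.
    # 'listing' is an optional parallel list of source-line comments.
    padded = words + [0] * (ram_size - len(words))
    lines = []
    for addr, word in enumerate(padded):
        comment = listing[addr] if listing and addr < len(listing) else ""
        lines.append('X"%04X", -- %04X: %s' % (word, addr, comment))
    return lines

print(emit_eram([0x8F27, 0xC117], 4, ["load.l  r7 0x27", "bi 0x0017"])[0])
```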

[mul16 simulation waveform]

Wrapping Up

I hope this has shown how easy it was to go in and fix some ISA mistakes made in the past and implement some new functionality. Also, it’s been nice to introduce TASM, despite the assembler itself being about as robust as a matchstick house.

The changes made to the VHDL have increased the resource requirement of the TPU on a Spartan6 LX25 from 3% to 5%, but an increase was expected given so many additional adders.

For next steps, I’m going to concentrate on the top-level VHDL entities for further deployment to miniSpartan6+.

Thanks for reading, comments as always to @domipheus.

Designing a CPU in VHDL, Part 7: Memory Operations, Running on FPGA

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Memory Operations

We already have a small RAM which holds our instruction stream, but our TPU ISA defines memory read and write instructions, and we should get those instructions working.

It’s the last major functional implementation we need to complete.

[pipeline diagram]

The fetch stage is simply a memory read with the PC on the address bus. It gives a cycle of latency to allow our instruction to appear on the data-out bus of the RAM, ready for decoding. When we encounter a memory ALU operation, we need the control unit to activate the memory stage of the pipeline, which sits after Execute and before Writeback.

The way we want this implemented is that the ALU calculates the memory address during Execute, that address is read during the Memory stage, and the data is passed to the register file during Writeback. For a memory write, the ALU calculates the address, and the data we want to write is always on the dataB bus output from the register file, so we connect that up to the memory input bus.

The control unit is modified to add in the memory stage, and also take the ALU operation as an input to do that check. You can see the new unit here.

The Memory Subsystem

Because we now touch memory in multiple pipeline stages, we need to start routing our signals and selecting destinations depending on the current control state. There are various signal inputs that now come from multiple sources:

  1. The Register File data input needs to be either dataResult from the ALU, or dataReadOutput (ramRData) from memory when performing a memory read.
  2. The Instruction Decoder needs to be connected to dataReadOutput (ramRData) from memory. As the decoder only decodes during the correct pipeline stage, we don’t care that the input may change – as long as the instruction data is correct at the decode stage.
  3. The memory write-enable bit needs to know when we are performing a memory write instruction, and not a read.
  4. Memory writes also need to assign the dataWriteInput (ramWData) port with the data we need – the contents of the rB register.
  5. The address sent to the memory needs to be the current PC during fetch, and dataResult during a memory operation.

We can try this without making another functional unit, by just doing some assignments in our test bench source.

ramAddr <= dataResult when en_memory = '1' else PC;
ramWData <= dataB;
ramWE <= '1' when en_memory = '1' and aluop(4 downto 1) = OPCODE_WRITE else '0';

registerWriteData <= ramRData when en_regwrite = '1' and aluop(4 downto 1) = OPCODE_READ else dataResult;
instruction <= ramRData;

Simulation

We use our existing test bench, with the additional memory system signals. We have a new test instruction stream loaded into the memory, which looks like this:

signal ram: store_t := (
  OPCODE_XOR & "000" & '0' & "000" & "000" & "00",
  OPCODE_LOAD & "001" & '1' & X"0f",
  OPCODE_LOAD & "010" & '1' & X"0e",
  OPCODE_LOAD & "110" & '1' & X"0b",
  OPCODE_READ & "100" & '0' & "010" & "100" & "00",
  OPCODE_READ & "101" & '0' & "001" & "100" & "00",
  OPCODE_SUB & "101" & '0' & "101" & "100" & "00",
  OPCODE_WRITE & "000" & '0' & "001" & "101" & "00",
  OPCODE_CMP & "111" & '0' & "101" & "101" & "00",
  OPCODE_JUMPEQ & "000" & '0' & "111" & "110" & "01",
  OPCODE_JUMP & "000" & '1' & X"05",
  OPCODE_JUMP & "000" & '1' & X"0b",
  X"0000",
  X"0000",
  X"0001",
  X"0006"
);

Which, in TPU assembly resembles:

  xor r0, r0, r0
  load.l r1, 0x0f
  load.l r2, 0x0e
  load.l r6, $fin
  read r4, r2
loop:
  read r5, r1
  sub.u r5, r5, r4
  write r1, r5
  cmp.u r7, r5, r5
  jaz r7, r6
  jump $loop
fin:
  jump $fin

  .loc 0x0e
  data 0x0001
  .loc 0x0f
  data 0x0006

This means we expect to see 0x0000 in memory location 0x0f after 6 iterations of the loop. From the waveform we can see that computation finishes within the simulation time, and in the memory view of ISim we can see the result in the correct place.
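To double-check those expected memory contents, here is a quick Python model of the program’s effect (my sketch – not cycle-accurate, just the loop’s arithmetic):

```python
# Initial data, matching .loc 0x0e / 0x0f in the listing above.
mem = {0x0e: 0x0001, 0x0f: 0x0006}

r4 = mem[0x0e]              # read r4, r2 - the decrement amount (1)
iterations = 0
while True:
    r5 = mem[0x0f] - r4     # read r5, r1 / sub.u r5, r5, r4
    mem[0x0f] = r5          # write r1, r5
    iterations += 1
    if r5 == 0:             # cmp.u r7, r5, r5 / jaz r7, r6
        break

print(mem[0x0f], iterations)  # 0 after 6 iterations, as expected
```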

[simulation waveform]

This simulation works with one cycle of memory latency, using our embedded RAM. If we wanted to go to an external RAM, such as the DRAM on miniSpartan6+, we’d need to introduce multiple cycles of latency. For this, we should stall the pipeline whilst memory operations complete. We won’t go into that just now; I think we need to take a step back, look at the top-level view of the TPU, and try to get what we have onto an FPGA.

Top level view

[top-level block diagram]

With everything built to date, we can see a pretty general outline of a CPU, with the various control lines, data lines, selects, etc. With this implemented as a black-box ‘core’, we can try to implement our CPU in such a way that we can view a working test on actual miniSpartan6+ hardware.

Creating a top level block for FPGA hardware

[miniSpartan6+ board]

The miniSpartan6+ board has 4 switches and 8 LEDs. The top-level block I created has the clock input, the 4 switch inputs, and the 8 LED outputs. I still used the embedded RAM. The code within this block resembles the test bench, except there is a process for detecting when the RAM address line is 0x1000 and writing the data to the LED output pins. I use one of the switch inputs to drive the reset line, which doesn’t actually reset the CPU – it simply resets the control unit. As our registers do not get reset, execution continues once reset is deactivated, with some existing state present.

The top level entity definition looks like the following:

entity leds_switch_test_expand is
  Port ( I_clk : in  STD_LOGIC;
         I_switch : in  STD_LOGIC_VECTOR (3 downto 0);
         O_leds : out  STD_LOGIC_VECTOR (7 downto 0));
end leds_switch_test_expand;

And pretty much everything remains the same as the simulation test bench, except we no longer use the simulated clock, and we hack in our LED memory mapping:

process(I_clk, O_address)
begin
  if rising_edge(I_clk) then
    if (O_address = X"1000") then
      leds <= dataB(7 downto 0);
    end if;
  end if;
end process;

O_leds <= leds(7 downto 1) & I_reset;
I_reset <= I_switch(0);

As you can see, I use the first led to indicate the state of the reset line, which is useful.

With this new top level entity, we can create a test bench and write a very small code example to write a counter to the LED memory location. The code example below simulates and we see the LED output change. I force initialize the LEDs signal to a known good value as a debugging aid.

  load.l r0, 0x01
  load.l r1, 0x01
  load.h r6, 0x10
loop:
  write r6, r0
  add.u r0, r0, r1
  jump $loop

[LED test simulation waveform]

Now we need to look at how we get this VHDL design actually onto the hardware.

Using the miniSpartan6+ board from Windows

There is a great guide for getting the board running from Michael Field, who runs the hamsterworks.co.nz wiki. You should give it a visit! The page in question is the miniSpartan6+ bringup.

I use the exact same method to get the .bit programming files onto the FPGA. This needs to be done every time you power the FPGA – it doesn’t write the flash, which would allow the FPGA design to persist across power cycles. Getting that working is for another day.

As explained in the bringup guide, we need to create a ‘User Constraints File’ which at a simple level maps the input and outputs of our entity to real pins on the board. Looking at the miniSpartan6+ schematic we can see what pins are connected where, for example LED6 is connected to the ‘location’ P7.

[switch/LED schematic pins]

There is a full UCF available for the miniSpartan6+ here [https://github.com/scarabhardware/miniSpartan6-plus/blob/master/projects/miniSpartan6-plus.ucf], and we can use a subset of it for our purposes.

NET "I_clk" PERIOD = 20 ns | LOC = "K3";

NET "O_LEDS<0>" LOC="P11" | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<1>" LOC="N9"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<2>" LOC="M9"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<3>" LOC="P9"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<4>" LOC="T8"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<5>" LOC="N8"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<6>" LOC="P8"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<7>" LOC="P7"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;

NET "I_SWITCH<0>"   LOC="L1" | IOSTANDARD=LVTTL | PULLUP;
NET "I_SWITCH<1>"   LOC="L3" | IOSTANDARD=LVTTL | PULLUP;
NET "I_SWITCH<2>"   LOC="L4" | IOSTANDARD=LVTTL | PULLUP;
NET "I_SWITCH<3>"   LOC="L5" | IOSTANDARD=LVTTL | PULLUP;

The PULLUP parts of the I_SWITCH definitions are very important. My first try at creating this file (before I found the full UCF on GitHub) omitted the PULLUP, which was never going to work.

[pull-up schematic]

Without the PULLUP, regardless of the switch position, we’d never get logic ‘1’ at the input. The hatched box happens inside the FPGA, pulling the value to ‘1’ when the switch is not connected to ground – which is what you want!

[generate programming file option]

Now that our UCF file is done, we want to build the ‘Programming File’ which gets uploaded to the FPGA. We make our entity the top module by right-clicking it within Implementation mode and selecting the option. This unlocks the synthesis options, and we run ‘Generate Programming File’. This can take some time and will raise warnings, but it completes without error. The steps taken to generate the file are below (taken from Xilinx tutorials):

Synthesis – ‘compiles’ the HDL into netlists and other structures
Translate – merges the incoming netlists and constraints into a Xilinx® design file.
Map – fits the design into the available resources on the target device, and optionally, places the design.
Place and Route – places and routes the design to the timing constraints.
Generate Programming File – creates a bitstream file that can be downloaded to the device.

First Flash

The first time I flashed the FPGA, I was stumped as to why the LEDs were remaining on (apart from the reset LED). Then it became obvious: the clock input is 50MHz. There is no way, with the CPU running that fast, that we could see the LEDs change!

Frequency Divider

I solved this by adding a frequency divider into the VHDL. The 50MHz I_clk from the ‘outside world’ is slowed down using a very simple module, which basically counts and uses a bit high up the counter as an output clock. This clock output is then fed into the TPU functional units, such as the decoder, as the core clock of the design. The frequency divider is as follows:

entity clock_divider is
port (
	clk: in std_logic;
	reset: in std_logic;
	clock_out: out std_logic);
end clock_divider;

architecture Behavioral of clock_divider is
  signal scaler : std_logic_vector(23 downto 0) := (others => '0');
begin

  process(clk)
  begin
    if rising_edge(clk) then   -- rising clock edge
        scaler <= std_logic_vector( unsigned(scaler) + 1);
    end if;
  end process;

clock_out <= scaler(16);

end Behavioral;

Using that divider, it works, and we get counting LEDs!
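Why tapping bit 16 of the counter works as a clock: bit n of a counter running at frequency f toggles at f / 2^(n+1). A quick check of the numbers, assuming the 50MHz board clock (this arithmetic is mine, not from the project):

```python
f_in = 50_000_000        # 50 MHz input clock from the board
tap = 16                 # the scaler(16) bit used as clock_out
f_out = f_in / 2 ** (tap + 1)
print(round(f_out, 2))   # core clock of roughly 381 Hz
```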

Wrapping Up

I’ll put the full example top module on GitHub (soon!) as an example, but there is more work to be done in making it a bit more robust, and in making the memory mapping actually mapped (at the moment, a write to the LED address still happens in the RAM, and we don’t care or trap on it).

For now, it’s pretty cool to see code actually running on a TPU on the FPGA hardware. Additionally, it only uses 3% of the slice resources of the LX25 Spartan6 FPGA, so lots more space to do other things with!

Thanks for reading, comments as always to @domipheus.

Designing a CPU in VHDL, Part 6: Program Counter, Instruction Fetch, Branching

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

The last part paved the way for getting this simple CPU self-sustaining. This means that the test bench doesn’t feed instructions into the decoder; the CPU itself requests and fetches them from a RAM somewhere. There is quite a bit of work to this, in terms of the various ways of connecting up a RAM to the CPU, and in having a unit to manage the program counter and the fetching of the next instruction.

The Program Counter unit

The PC is just a register containing the location of the currently executing instruction. However, we need to operate on it, so will create a unit to manage this.

Our PC unit will obviously hold the current PC and increment it on command. It will have an input for setting the next PC value, and also the ability to stop – staying at the same location – which we need due to our pipeline being several cycles long.

With several possible operations on the PC, we can give the unit an input which dictates which operation to perform:

  • Increment PC
  • Set PC to new value
  • Do nothing (halt)
  • Set PC to our reset vector, which is 0x0000.

We can use a 2-bit opcode input to select one of these operations. Our PC unit then looks like this functional unit.

[PC unit functional diagram]

We get back into the Xilinx ISE project, adding our new pc_unit.vhd with the various input and output ports we need.

entity pc_unit is
    Port ( I_clk : in  STD_LOGIC;
           I_nPC : in  STD_LOGIC_VECTOR (15 downto 0);
           I_nPCop : in  STD_LOGIC_VECTOR (1 downto 0);
           O_PC : out  STD_LOGIC_VECTOR (15 downto 0)
           );
end pc_unit;

The process, triggered by the rising edge of the input clock, checks the I_nPCop port for an opcode to execute on the PC value. We use a case statement, like previous examples in the series. We also have an internal signal to act as a register holding the current PC.

architecture Behavioral of pc_unit is
  signal current_pc: std_logic_vector( 15 downto 0) := X"0000";
begin

  process (I_clk)
  begin
    if rising_edge(I_clk) then
      case I_nPCop is
        when PCU_OP_NOP => 	-- NOP, keep PC the same/halt
        when PCU_OP_INC => 	-- increment
          current_pc <= std_logic_vector(unsigned(current_pc) + 1);
        when PCU_OP_ASSIGN => 	-- set from external input
          current_pc <= I_nPC;
        when PCU_OP_RESET => 	-- Reset
          current_pc <= X"0000";
        when others =>
      end case;
    end if;
  end process;

  O_PC <= current_pc;

end Behavioral;

The constants are in tpu_constants.vhd with the others:

-- PC unit opcodes
constant PCU_OP_NOP: std_logic_vector(1 downto 0):= "00";
constant PCU_OP_INC: std_logic_vector(1 downto 0):= "01";
constant PCU_OP_ASSIGN: std_logic_vector(1 downto 0):= "10";
constant PCU_OP_RESET: std_logic_vector(1 downto 0):= "11";

This simple unit should allow us to reset the PC, increment during normal operation, branch to a new location when we need to, and also halt and spin on the same PC value.
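Behaviourally, the unit is a four-way case on the opcode. As a small Python model of pc_unit (my sketch, mirroring the constants above):

```python
PCU_OP_NOP, PCU_OP_INC, PCU_OP_ASSIGN, PCU_OP_RESET = 0b00, 0b01, 0b10, 0b11

def pc_step(current_pc, op, npc):
    # One rising-edge update of the PC register.
    if op == PCU_OP_INC:
        return (current_pc + 1) & 0xFFFF  # wrap at 16 bits
    if op == PCU_OP_ASSIGN:
        return npc                         # branch target from I_nPC
    if op == PCU_OP_RESET:
        return 0x0000                      # back to the reset vector
    return current_pc                      # NOP: hold (halt/spin)

print(pc_step(0xFFFF, PCU_OP_INC, 0))      # increments and wraps to 0
```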

Integrating a new Testbench

Early in the series we got a test ‘RAM’ made up. We’re going to take that code and copy it into a copy of our existing test bench – but we will initialize the RAM with our test instruction stream. It will look like the following:

entity ram_tb is
    Port ( I_clk : in  STD_LOGIC;
           I_we : in  STD_LOGIC;
           I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
           I_data : in  STD_LOGIC_VECTOR (15 downto 0);
           O_data : out  STD_LOGIC_VECTOR (15 downto 0));
end ram_tb;

architecture Behavioral of ram_tb is
  type store_t is array (0 to 7) of std_logic_vector(15 downto 0);
   signal ram: store_t := (
    OPCODE_LOAD & "000" & '0' & X"fe",
    OPCODE_LOAD & "001" & '1' & X"ed",
    OPCODE_OR & "010" & '0' & "000" & "001" & "00",
    OPCODE_LOAD & "011" & '1' & X"01",
    OPCODE_LOAD & "100" & '1' & X"02",
    OPCODE_ADD & "011" & '0' & "011" & "100" & "00",
    OPCODE_OR & "101" & '0' & "000" & "011" & "00",
    OPCODE_AND & "101" & '0' & "101" & "010" & "00"
    );
begin

  process (I_clk)
  begin
    if rising_edge(I_clk) then
      if (I_we = '1') then
        ram(to_integer(unsigned(I_addr(2 downto 0)))) <= I_data;
      else
        O_data <= ram(to_integer(unsigned(I_addr(2 downto 0))));
      end if;
    end if;
  end process;

end Behavioral;

We can then initialize the unit in our main test bench body. The PC unit should be added too.

component ram_tb
Port ( I_clk : in  STD_LOGIC;
      I_we : in  STD_LOGIC;
      I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
      I_data : in  STD_LOGIC_VECTOR (15 downto 0);
      O_data : out  STD_LOGIC_VECTOR (15 downto 0)
      );
end component;

COMPONENT pc_unit
PORT(
     I_clk : IN  std_logic;
     I_nPC : IN  std_logic_vector(15 downto 0);
     I_nPCop : IN  std_logic_vector(1 downto 0);
     O_PC : OUT std_logic_vector(15 downto 0)
    );
END COMPONENT;


signal ramWE : std_logic := '0';
signal ramAddr: std_logic_vector(15 downto 0);
signal ramRData: std_logic_vector(15 downto 0);
signal ramWData: std_logic_vector(15 downto 0);

signal nPC: std_logic_vector(15 downto 0);
signal pcop: std_logic_vector(1 downto 0);
signal in_pc: std_logic_vector(15 downto 0);

After the begin statement, we define the parts and map the ports to respective signals.

uut_ram: ram_tb Port map (
  I_clk => I_clk,
  I_we => ramWE,
  I_addr => ramAddr,
  I_data => ramWData,
  O_data => ramRData
);
uut_pcunit: pc_unit Port map (
  I_clk => I_clk,
  I_nPC => in_pc,
  I_nPCop => pcop,
  O_PC => PC
);

We don’t use the ramWData or ramWE ports, so assign them some default values. The ramAddr port should always be driven by PC, and the instruction signal we used to set manually is now driven from ramRData – the output of our RAM. We put these assignments outside of a process – they are always active.

ramAddr <= PC;
ramWData <= X"FFFF";
ramWE <= '0';
instruction <= ramRData;

The last input we need to figure out is the PC unit operation. We want to tie this to the current control unit state: the PC should increment at one specific point in the pipeline, and sit in NOP mode the rest of the time.

pcop <= PCU_OP_RESET when reset = '1' else PCU_OP_INC when state(2) = '1' else PCU_OP_NOP;

This resets the PC unit when there is a control unit reset, increments on the ALU execute stage of the pipeline, and nops otherwise.

As for the testbench itself, we no longer feed the instruction signal with our instructions, so we remove all of that. We have a short reset, and then wait until our PC is 8 – at that point we have run off the end of memory, and our test is complete. The ‘CPU’ should be self-sustaining: it executes everything up until the PC equals 0x8, then goes into an indefinite reset state.

stim_proc: process
begin		

  reset<='1'; -- reset control unit
  wait for I_clk_period; -- wait a cycle
  reset <= '0';

  wait until PC = X"0008"; -- we only have 8 instructions
  reset <= '1'; -- reset control unit
  wait;

end process;

Getting back to the PC unit, the point in the pipeline when we increment is very important. We need to account for any latency, for all uses of the PC – as shown in the next two simulation captures.

[Image: latest_onlyPC_inc_at_exec_no_fetch]
[Image: latest_onlyPC_inc_at_writeback_no_fetch]

You can see that when the increment happens at the writeback phase, the instructions are delayed and subsequently offset from each other. The problem is that the increment needs to happen at writeback, as the ALU in a branch instruction could issue an assign to PC – and in the first image (at the yellow circles) you can see that where we’d want to increment/set, the ALU has not got its result ready.

The Fetch Stage

Previously we had a cloud of uncertainty around the fetch area of the instruction pipeline. We need to add it now, to account for any latency in fetching the opcode for the decode stage. I’m adding it by extending our simple control unit with two extra bits of state. The second extra bit will be for the memory stage of the pipeline, which we’ll discuss later – but it’s best to add it now. Simply extend the std_logic_vector, add the extra case statements, and extend the outputs where required. We ignore the last state switch – the memory stage. In the test bench source, we need to add an en_fetch signal and change en_decode, en_regread, etc., to reflect the new bit positions – they are all just shifted up by one.
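The extended control unit is still a one-hot counter, just over six bits instead of four. A hedged Python sketch of the rotation (stage ordering assumed from the text: fetch at bit 0, writeback at bit 4, the ignored memory stage last):

```python
# One-hot state rotation for the extended control unit. Illustrative
# Python model, not the actual VHDL. Bit 0 = fetch, bit 4 = writeback,
# bit 5 = the (currently ignored) memory stage.

def next_state(state):
    """Rotate a 6-bit one-hot state string; fall back to fetch if invalid."""
    if state.count("1") != 1:
        return "000001"                    # reset to the fetch stage
    i = state[::-1].index("1")             # index of the active stage bit
    return format(1 << ((i + 1) % 6), "06b")
```

Starting from "000001" (fetch), four clocks reach "010000" – bit 4, the writeback stage where pcop switches to increment – and two more wrap back around to fetch.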

If we change our assignment to pcop to increment on the writeback phase (bit 4 now, as fetch is bit 0):

pcop <= PCU_OP_RESET when reset = '1' else
        PCU_OP_INC when state(4) = '1' else
        PCU_OP_NOP;

Simulating it results in the following output:

[Image: fetchstage_inc_writeback]

A) Increment PC on the previous writeback cycle; 0x1 is now on the RAM address bus.
B) RAM puts the value at 0x1 on its read output bus; it appears at the decoder instruction input in time for the decode cycle.
C) Decoder outputs the results of the decode.
D) Result from the ALU is ready for writeback.

We can see this extra fetch cycle of latency allows for the increment/PC operation to occur as late as possible, with the ALU result also available when this happens.

Branching

So now, what about branching? We need to set the PC to the output of the ALU if shouldBranch is true. This is, surprisingly, very easy to achieve. We simply amend the assignment to pcop to add the assign condition:

pcop <= PCU_OP_RESET when reset = '1' else
        PCU_OP_ASSIGN when shouldBranch = '1' and state(4) = '1' else
        PCU_OP_INC when shouldBranch = '0' and state(4) = '1' else
        PCU_OP_NOP;
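This priority chain – reset first, then branch-or-increment during writeback, otherwise NOP – can be sketched as a small Python function (illustrative only, mirroring the conditional signal assignment above):

```python
# Model of the pcop selection logic. Hypothetical Python sketch of the
# VHDL conditional signal assignment, not TPU source.

PCU_OP_NOP, PCU_OP_INC, PCU_OP_ASSIGN, PCU_OP_RESET = "00", "01", "10", "11"

def select_pcop(reset, should_branch, in_writeback):
    """Reset wins outright; during writeback we either assign (taken
    branch) or increment; at any other stage the PC unit does nothing."""
    if reset:
        return PCU_OP_RESET
    if in_writeback:
        return PCU_OP_ASSIGN if should_branch else PCU_OP_INC
    return PCU_OP_NOP
```

Note that shouldBranch only matters during the writeback stage; outside it the PC simply holds its value.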

We will change our last instruction in the RAM to jump back to our addition, to make an infinite loop.

signal ram: store_t := (
  OPCODE_LOAD & "000" & '0' & X"fe",
  OPCODE_LOAD & "001" & '1' & X"ed",
  OPCODE_OR & "010" & '0' & "000" & "001" & "00",
  OPCODE_LOAD & "011" & '1' & X"01",
  OPCODE_LOAD & "100" & '1' & X"02",
  OPCODE_ADD & "011" & '0' & "011" & "100" & "00",
  OPCODE_OR & "101" & '0' & "000" & "011" & "00",
  OPCODE_JUMP & "000" & '1' & X"05"
  );

We also need to remove the wait until PC = X"0008"; from the testbench – the test code effectively becomes just a reset and a wait. We must also now connect the PC input of the PC unit to our ALU output, so the PC can be assigned new values.


in_pc <= dataResult;

stim_proc: process
 begin		

  reset<='1'; -- reset control unit
  wait for I_clk_period; -- wait a cycle
  reset <= '0';

  wait;
end process;

And that will then produce this rather fun simulation – self sustaining, with branching!

[Image: self_sustain_branching]

I thought it would be useful to zoom into the branch on that simulation, and step through the transitions.

[Image: latest_onlyPC_inc_at_write_with_fetch_see_correct_branch_2]

At (A), the ALU has decided we should branch. As we are in the writeback phase of the pipeline, pcop becomes assign instead of increment at (B), putting dataResult on in_pc. The assign happens, and on the next cycle, now in fetch (C), the new instruction from our branch target is on its way to the decoder.

To show the issues that can occur without correct timing, here is a simulation without the fetch stage and with the PC operation at the ALU stage. The branch happens, eventually, but the CPU executes one additional instruction first. Some architectures actually do this, and you need to fill the instruction stream after branches with nops or other useful operations – look up ‘delay slots’ if you are interested in learning more. For the TPU design, I didn’t want this.
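The delay-slot behaviour can be sketched with a toy Python model (hypothetical, not TPU code): the instruction after a taken branch has already been fetched, so it executes before control reaches the branch target.

```python
# Toy model of a one-instruction branch delay slot. Illustrative only.

def run_with_delay_slot(program):
    """Execute until 'halt'. The instruction after a taken branch still
    executes, because it was fetched before the branch updated the PC."""
    executed = []
    pc = 0
    fetched = program[pc]        # instruction already in the pipeline
    pc += 1
    while True:
        op = fetched
        executed.append(op[0])
        if op[0] == "halt":
            return executed
        fetched = program[pc]    # fetched *before* the branch resolves
        pc += 1
        if op[0] == "jump":
            pc = op[1]           # redirect only takes effect next fetch
```

Running a stream where a jump at index 1 targets index 4 still executes the instruction at index 2 (the delay slot) before arriving at the target.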

[Image: latest_onlyPC_inc_at_exec_no_fetch_see_wrong_branch]

Conditional branching

We may as well check whether conditional branches work. For this, we just create a new test with a larger RAM, and fill it with the following code:

signal ram: store_t := (
  OPCODE_LOAD & "001" & '1' & X"05",
  OPCODE_LOAD & "010" & '1' & X"03",
  OPCODE_XOR & "000" & '0' & "000" & "000" & "00",
  OPCODE_LOAD & "011" & '1' & X"01",
  OPCODE_LOAD & "110" & '1' & X"0b",
  OPCODE_OR & "100" & '0' & "010" & "010" & "00",
  OPCODE_CMP & "101" & '0' & "100" & "000" & "00",
  OPCODE_JUMPEQ & "000" & '0' & "101" & "110" & "01",
  OPCODE_SUB & "100" & '0' & "100" & "011" & "00",
  OPCODE_ADD & "000" & '0' & "000" & "001" & "00",
  OPCODE_JUMP & "000" & '1' & X"06",
  OPCODE_JUMP & "000" & '1' & X"0b",
  X"0000",
  X"0000",
  X"0000",
  X"0000"
);

This is code for a simple multiply, with the operands in r1 and r2, the result going in r0.

0x00    load.l r1, 0x05
0x01    load.l r2, 0x03
0x02    xor    r0, r0, r0
0x03    load.l r3, 0x01
0x04    load.l r6, $end
0x05    or     r4, r2, r2
      loop:
0x06    cmp.u    r5, r4, r0
0x07    jump.aez r5, r6
0x08    sub.u    r4, r4, r3
0x09    add.u    r0, r0, r1
0x0a    jump $loop
      end:
0x0b    jump $end
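The listing implements multiplication by repeated addition, the same scheme as the mul16 C sketch. A register-level rendering in Python (illustrative; variable names mirror the TPU registers):

```python
# Register-level model of the multiply listing above. Hypothetical
# Python sketch; r-names correspond to the TPU registers in the listing.

def multiply_by_addition(a, b):
    """r0 accumulates the product, r4 counts b down to zero,
    r1 holds a, r3 holds the constant 1."""
    r0 = 0              # xor r0, r0, r0 clears the accumulator
    r1, r3 = a, 1       # load.l r1 / load.l r3
    r4 = b              # or r4, r2, r2 copies b into the counter
    while r4 != 0:      # cmp + jump.aez exits the loop when r4 == 0
        r4 -= r3        # sub.u r4, r4, r3
        r0 += r1        # add.u r0, r0, r1
    return r0 & 0xFFFF  # TPU registers are 16 bits wide
```

With the listing's operands (r1 = 5, r2 = 3) this leaves 15 in r0; note the result wraps modulo 2^16, just as the 16-bit registers would.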

It simulates, and it works. You can see the branch structure from the simulation quite easily, and the result is valid.

[Image: self_sustain_conditional_branching]

There are a few ISA issues apparent when we start drawing up ‘more complex’ TPU assembly examples. Some sort of simple add/sub immediate and a conditional jump to a relative offset would reduce our instruction count. But we can always revisit and change our ISA – one of the benefits of rolling your own CPU!

Wrapping Up

So, we managed to get TPU self-sustaining – that is, it chooses which instructions to consume, and does so indefinitely. The test bench essentially becomes a wait, as the CPU decides what to do based on the instruction stream – it’s a real CPU!

To do this, we had to add our fetch stage to the pipeline, and also investigate issues with when changes to the PC occurred. We added our PC unit to track this, and it’s fairly simple but allows for normal sequential execution and branching.

An issue with our CPU is that it’s still relatively limited by its number of registers, and by the fact we cannot get any real input other than instructions. To get more, we need to give the CPU the ability to manipulate memory – and with that, possibly memory-mapped peripherals. We’ll look into that next.

Thanks for reading, comments as always to @domipheus.

Designing a CPU in VHDL, Part 5: Pipeline and Control Unit

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

This is a disclaimer that the VHDL here is probably not the best you will see, but it gets the job done – in the simulator, at least. If you spot any serious errors, or woeful performance gotchas I’ve fallen for – please let me know at @domipheus. The aim of these posts is to get a very simple 16-bit CPU up and running, and then get stuck into some optimization opportunities later.

We now have our decoder, ALU, and registers, so we can create a test bench to simulate what happens when we connect them together in a fixed, controllable way. The testbench source is on github, so I won’t go too much into creating the boilerplate. Create an ALU test bench, and then manually add the decoder, register file, and associated signals to the file.

Joining functional units together

With our ALU, decoder and regs entities available, we can connect them together with signals like so.

uut_decoder: decode PORT MAP (
    I_clk => I_clk,
    I_dataInst => instruction,
    I_en => en,
    O_selA => selA,
    O_selB => selB,
    O_selD => selD,
    O_dataIMM => dataIMM,
    O_regDwe => dataDwe,
    O_aluop => aluop
  );

uut_alu: alu PORT MAP (
    I_clk => I_clk,
    I_en => en,
    I_dataA => dataA,
    I_dataB => dataB,
    I_dataDwe => dataDwe,
    I_aluop => aluop,
    I_PC => PC,
    I_dataIMM => dataIMM,
    O_dataResult => dataResult,
    O_dataWriteReg => dataWriteReg,
    O_shouldBranch => shouldBranch
  );

uut_reg: reg16_8 PORT MAP (
    I_clk => I_clk,
    I_en => '1',
    I_dataD => dataResult,
    O_dataA => dataA,
    O_dataB => dataB,
    I_selA => selA,
    I_selB => selB,
    I_selD => selD,
    I_we => dataWriteReg
  );

If we include the tpu_constants file, we can use the opcode definitions to write the instruction signal before waiting a cycle. We also need to remember to enable the units using the enable ‘en’ signal, which is connected to all units.

-- Stimulus process
stim_proc: process
begin
  -- hold reset state for 100 ns.
  wait for 100 ns;	

  wait for I_clk_period*10;
  en<='1';

  --load.h r0,0xfe
  instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
  wait for I_clk_period;

  --load.l r1, 0xed
  instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
  wait for I_clk_period;

  --or r2, r0, r1
  instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
  wait for I_clk_period;
  wait;
end process;

When simulating this, you get the wrong result. It’s fairly understandable why – we’ve designed these units to be parts of a pipeline, and we’re only giving a single cycle of latency for a result at the end of it, when we need at least two.

[Image: first_run_1cycle]

If we edit our test to add two cycles of latency, we get a much better result.

  --load.h r0,0xfe
  instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
  wait for I_clk_period;
  wait for I_clk_period;

  --load.l r1, 0xed
  instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
  wait for I_clk_period;
  wait for I_clk_period;

  --or r2, r0, r1
  instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
  wait for I_clk_period;
  wait for I_clk_period;

[Image: 2_first_run_2cycle]

Looking at the memory view for our uut_reg object, we can see r0, r1 and r2 have the correct data.

[Image: 2_first_run_2cycle_regmem]

Success! But what about the following example?

--load.l r3, 1
instruction <= OPCODE_LOAD & "011" & '1' & X"01";
wait for I_clk_period;
wait for I_clk_period;

--load.l r4, 2
instruction <= OPCODE_LOAD & "100" & '1' & X"02";
wait for I_clk_period;
wait for I_clk_period;

--add.u r3, r3, r4
instruction <= OPCODE_ADD & "011" & '0' & "011" & "100" & "00";
wait for I_clk_period;
wait for I_clk_period;

You will see one item of interest here: the add instruction uses the same register as both a source and the destination. We tested the register file before, and we know reading from and writing to the same register is technically fine. But what about when it’s part of a bigger system? It doesn’t go so well.

[Image: wrong_write_dep]

I’ve annotated the simulation timeline with the instructions and two points of interest. Point A is the result of the add – it’s delayed by a cycle. This offsets the result for any potential next instruction – which you can see at point B.

--or r5, r0, r3
instruction <= OPCODE_OR & "101" & '0' & "000" & "011" & "00";
wait for I_clk_period;
wait for I_clk_period;

If we add the above instruction to the mix, we can see how serious an issue this can be.

[Image: wrong_write_dep_proof]

The correct write into r3 doesn’t happen in time for the OR operation, so the result in r5 will be incorrect. We can see this by looking at the uut_reg unit in the memory view.

[Image: wrong_or_reg_r3]

We can see that with the units designed as they are, 3 cycles are needed for a safe decode – execute – writeback. Adding a third wait for I_clk_period; after the add instruction, which allows time for the writeback into the register, allows for correct operation.

[Image: right_r3_extra_writeback_cycle]

Control Unit

What we need for all of this isn’t to add cycles of latency as above, but to add a unit responsible for telling the other units what to do, and when. To figure out what we want, we’re going to need to lay out what we think our pipeline looks like, and what else we need.

[Image: pipe]

I’m going all-out simple here, giving an independent cycle to every part of the pipeline. We don’t execute the next instruction until the current one has traversed every stage. We still don’t know what’s happening in the fetch and memory stages, but we’re fairly certain of the others. We add a read stage to be sure our data is ready for the ALU when needed – we can always remove states later when we optimize the completed design.

Since all of the units we’ve created so far have enable ports, we can get a control unit to synchronize everything up and drive those enable bits. The control unit will have one output, a bitmask showing the pipeline state currently active, as well as a reset and clock input. Each clock cycle the state will increment by one bit, and if reset is high it will reset to initial state. The control unit is technically a state machine, but for now it’s simple enough we can just classify it as a counter. You can do this with integers or other types, I’m just a fan of raw bit vectors.

entity controlsimple is
  Port ( I_clk : in  STD_LOGIC;
         I_reset : in  STD_LOGIC;
         O_state : out  STD_LOGIC_VECTOR (3 downto 0)
         );
end controlsimple;

architecture Behavioral of controlsimple is
  signal s_state: STD_LOGIC_VECTOR(3 downto 0) := "0001";
begin
  process(I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        s_state <= "0001";
      else
        case s_state is
          when "0001" =>
            s_state <= "0010";
          when "0010" =>
            s_state <= "0100";
          when "0100" =>
            s_state <= "1000";
          when "1000" =>
            s_state <= "0001";
          when others =>
            s_state <= "0001";
        end case;
      end if;
    end if;
  end process;

  O_state <= s_state;
end Behavioral;

We can create a new testbench based on our current decode, ALU and register one, and add in the controlsimple component. I won’t post all the code here, but you can see it on github. We attach the enable lines of the components to the relevant bits of our state output:

en_decode <= state(0);
en_regread <= state(1);
en_alu <= state(2);
en_regwrite <= state(3);

The register file is a bit different in terms of the enable bit hookup, to accommodate its two states in the pipeline. The enable bit is tied to both the read and write states, but we ensure writes only happen at the correct stage by gating the write enable input.

uut_reg: reg16_8 PORT MAP (
  I_clk => I_clk,
  I_en => en_regread or en_regwrite,
  I_dataD => dataResult,
  O_dataA => dataA,
  O_dataB => dataB,
  I_selA => selA,
  I_selB => selB,
  I_selD => selD,
  I_we => dataWriteReg and en_regwrite
);
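The gating in that port map can be sketched in Python (illustrative model, not TPU code; the state is the 4-bit one-hot vector from controlsimple, ordered decode, regread, alu, regwrite):

```python
# Sketch of how the register file enables derive from the one-hot
# control state, mirroring the port map above. Hypothetical Python model.

def reg_file_control(state, data_write_reg):
    """state is (en_decode, en_regread, en_alu, en_regwrite) as booleans;
    data_write_reg is the decoder's 'this op writes a register' flag."""
    en_decode, en_regread, en_alu, en_regwrite = state
    return {
        "I_en": en_regread or en_regwrite,       # active for both reg stages
        "I_we": data_write_reg and en_regwrite,  # writes only at writeback
    }
```

So even when the decoder asserts the write flag, the actual write is held off until the regwrite stage; during regread the file is enabled but read-only.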

Now we need slight changes to the test bench process itself. We populate our first instruction whilst the control unit is in a reset state, before enabling it. Then, instead of waiting a fixed number of clock cycles after each instruction is assigned, we wait until the register writeback state is reached. This is the last part of the pipeline, and is the MSB of our state output from the control unit – signal en_regwrite in our test. At this point, it’s safe to assign the next instruction to the decoder input. We repeat this for all instructions like so:

reset <= '1'; -- reset control unit
--load.h r0,0xfe
instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
reset <= '0'; -- enable/start control unit
wait until en_regwrite = '1';

--load.l r1, 0xed
instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
wait until en_regwrite = '1';

--or r2, r0, r1
instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
wait until en_regwrite = '1';

--load.l r3, 1
instruction <= OPCODE_LOAD & "011" & '1' & X"01";
wait until en_regwrite = '1';

--load.l r4, 2
instruction <= OPCODE_LOAD & "100" & '1' & X"02";
wait until en_regwrite = '1';

--add.u r3, r3, r4
instruction <= OPCODE_ADD & "011" & '0' & "011" & "100" & "00";
wait until en_regwrite = '1';

--or r5, r0, r3
instruction <= OPCODE_OR & "101" & '0' & "000" & "011" & "00";
wait until en_regwrite = '1';

[Image: control_1]

Running this in the simulator, we get a nice output waveform. All the instructions executed correctly and ran perfectly in sync. The registers at the end of the simulation are listed in the image above, with a red line linking each to the point when that register was written. I’ve annotated the pipeline stages too. You can see that the next instruction gets assigned to the decoder input in the writeback stage, so there is a little overlap in the pipeline, but that’s to be expected. It’s not real concurrent operation within the pipeline; it’s just ensuring the data is available when it’s needed.

Wrapping Up

Hopefully this has shown the importance of the control unit, and how it conducts the process of execution in this CPU. It has many more functions to manage – especially when we start implementing memory operations, and dealing with that shouldBranch ALU output – more on that later!

Thanks for reading, comments as always to @domipheus.