Designing a CPU in VHDL, Part 7: Memory Operations, Running on FPGA

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Memory Operations

We already have a small RAM which holds our instruction stream, but our TPU ISA defines memory read and write instructions, and we should get those instructions working.

It’s the last major functional implementation we need to complete.

pipe7The fetch stage is simply a memory read with the PC on our address bus. It gives a cycle of latency to allow for our instruction to appear on the data out bus of the RAM, ready for decoding. When we encounter a memory ALU operation, we need the control unit to activate the memory stage of the pipeline, which sits after Execute and before Writeback. The way we want this implemented is that the ALU calculates the memory address during execute, and that address is read during the memory stage, and the data passed to the register file during writeback. For a memory write, the ALU calculates the address, and the data we want to write is always on the dataB bus output from the register file, so we connect that up to the memory input bus.

The control unit is modified to add in the memory stage, and also take the ALU operation as an input to do that check. You can see the new unit here.

The Memory Subsystem

Because we now touch memory in multiple pipeline stages, we need to start routing our signals and selecting destinations depending on the current control state. There are various signal inputs that now come from multiple sources:

  1. Register File data input needs to be either dataResult from ALU, or dataReadOutput(ramRData) from memory – when a memory read.
  2. The Instruction Decoder needs connected to the dataReadOutput(ramRData) from memory, as the decoder only decodes during the correct pipeline stage, we don’t care that the input may be different – as long as the instruction data is correct at the decode stage.
  3. The memory write bit needs to know when we are performing a memory write instruction, and not a read.
  4. Memory writes also need to assign the dataWriteInput(ramWData) port with the data we need – contents of the rB register.
  5. The Address sent to the memory needs to be the current PC during fetch, and dataResult when a memory operation.

We can try this without making another functional unit, by just doing some assignments in our test bench source.

ramAddr <= dataResult when en_memory = '1' else PC;
ramWData <= dataB;
ramWE <= '1' when en_memory = '1' and aluop(4 downto 1) = OPCODE_WRITE else '0';

registerWriteData <= ramRData when en_regwrite = '1' and aluop(4 downto 1) = OPCODE_READ else dataResult;
instruction <= ramRData;

Simulation

we use our existing test bench, with our additional memory system signals. We have a new test instruction stream which we have loaded into the memory which looks like this:

signal ram: store_t := (
  OPCODE_XOR & "000" & '0' & "000" & "000" & "00",
  OPCODE_LOAD & "001" & '1' & X"0f",
  OPCODE_LOAD & "010" & '1' & X"0e",
  OPCODE_LOAD & "110" & '1' & X"0b",
  OPCODE_READ & "100" & '0' & "010" & "100" & "00",
  OPCODE_READ & "101" & '0' & "001" & "100" & "00",
  OPCODE_SUB & "101" & '0' & "101" & "100" & "00",
  OPCODE_WRITE & "000" & '0' & "001" & "101" & "00",
  OPCODE_CMP & "111" & '0' & "101" & "101" & "00",
  OPCODE_JUMPEQ & "000" & '0' & "111" & "110" & "01",
  OPCODE_JUMP & "000" & '1' & X"05",
  OPCODE_JUMP & "000" & '1' & X"0b",
  X"0000",
  X"0000",
  X"0001",
  X"0006"
);

Which, in TPU assembly resembles:

  xor r0, r0, r0
  load.l r1, 0x0f
  load.l r2, 0x0e
  load.l r6, $fin
  read r4, r2
loop:
  read r5, r1
  sub.u r5, r5, r4
  write r1, r5
  cmp.u r7, r5, r5
  jaz r7, r6
  jump $loop
fin:
  jump $fin

  .loc 0x0e
  data 0x0001
  .loc 0x0f
  data 0x0006

This means we expect to see 0x0000 in the memory location 0x0f after 6 iterations of the loop. From the waveform we can see computation finishes within the simulation time. We can go into the memory view of ISim and we see the result is in the correct place.

first_simThis simulation works with one cycle of memory latency, when using our embedded RAM. If we wanted to go to an external ram such as the DRAM on miniSpartan6+, we’d need to introduce multiple cycles of latency. For this, we should stall the pipeline whilst memory operations complete. We won’t go into that just now, as I think we need to take a step back, and look at the top level view of TPU and try to get what we have on an FPGA.

Top level view

highlevelWith everything built to date, we can see a pretty general outline of a CPU, with the various control lines, data lines, selects, etc. With this implemented as a black box ‘core’, we can try to implement our CPU in such a way that we can view a working test on actual miniSpartan6+ hardware.

Creating a top level block for FPGA hardware

minispartan6The miniSpartan6+ board has 4 switches and 8 LEDs. The top-level block I created has the clock input, the 4 switch inputs and the 8 LED outputs. I still used the embedded RAM. The code within this block resembles the test bench, except there is a process for detecting when the RAM address line is 0x1000 and writing the data to the LED output pins. I use one of the switch inputs to drive the reset line, which actually doesn’t reset the CPU – it simply resets the control unit. As our registers do not get reset, execution continues once reset is deactivated with some existing state present.

The top level entity definition looks like the following:

entity leds_switch_test_expand is
  Port ( I_clk : in  STD_LOGIC;
         I_switch : in  STD_LOGIC_VECTOR (3 downto 0);
         O_leds : out  STD_LOGIC_VECTOR (7 downto 0));
end leds_switch_test_expand;

And pretty much everything remains the same as the simulation test bench, except we no longer use the simulated clock, and we hack in our LED memory mapping:

process(I_clk, O_address)
begin
  if rising_edge(I_clk) then
    if (O_address = X"1000") then
      leds <= dataB(7 downto 0);
    end if;
  end if;
end process;

O_leds <= leds(7 downto 1) & I_reset;
I_reset <= I_switch(0);

As you can see, I use the first led to indicate the state of the reset line, which is useful.

With this new top level entity, we can create a test bench and write a very small code example to write a counter to the LED memory location. The code example below simulates and we see the LED output change. I force initialize the LEDs signal to a known good value as a debugging aid.

  load.l r0, 0x01
  load.l r1, 0x01
  load.h r6, 0x10
loop:
  write r6, r0
  add.u r0, r0, r1
  jump $loop

leds_test_simNow we need to look at how we get this VHDL design actually onto the hardware.

Using the miniSpartan6+ board from Windows

There is a great guide for getting the board running from Michael Field who runs the hamsterworks.co.nz wiki. You should give it a visit! The page in question is the miniSpartan6+ bringup.

I use the exact same method to get the .bit programming files onto the FPGA. This method needs done every time you power the FPGA – it doesn’t write the flash, which would allow for the FPGA design to remain across power resets. Getting that working is for another day.

As explained in the bringup guide, we need to create a ‘User Constraints File’ which at a simple level maps the input and outputs of our entity to real pins on the board. Looking at the miniSpartan6+ schematic we can see what pins are connected where, for example LED6 is connected to the ‘location’ P7.

switch_led_schematic_pinsThere is a full UCF available for the miniSpartan6+ here[https://github.com/scarabhardware/miniSpartan6-plus/blob/master/projects/miniSpartan6-plus.ucf], and we can use a subset of it for our uses.

NET "I_clk" PERIOD = 20 ns | LOC = "K3";

NET "O_LEDS<0>" LOC="P11" | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<1>" LOC="N9"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<2>" LOC="M9"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<3>" LOC="P9"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<4>" LOC="T8"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<5>" LOC="N8"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<6>" LOC="P8"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<7>" LOC="P7"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;

NET "I_SWITCH<0>"   LOC="L1" | IOSTANDARD=LVTTL | PULLUP;
NET "I_SWITCH<1>"   LOC="L3" | IOSTANDARD=LVTTL | PULLUP;
NET "I_SWITCH<2>"   LOC="L4" | IOSTANDARD=LVTTL | PULLUP;
NET "I_SWITCH<3>"   LOC="L5" | IOSTANDARD=LVTTL | PULLUP;

The PULLUP parts of the I_SWITCH definitions is very important. My first try at creating this file (before I found the full UCF file on github) omitted the PULLUP, which was never going to work.

pullupWithout the PULLUP, regardless of the switch position, we’ll never get logic ‘1’ at the input. The hatched box happens inside the FPGA, pulling the value to ‘1’ when the switch is not connected to ground. Which is what you want!

generate_prog_fileNow we have our UCF file done, we want to build our ‘Programming File’ which gets uploaded to our FPGA. We make our entity the top module by right clicking it within Implementation mode and selection the option. This unlocks the synthesis options, and we run the ‘Generate Programming File’ option. This can take some time, and will raise warnings, but it completes without error. The steps taken to generate the file are below (taken from Xilinx tutorials)

Synthesis – ‘compiles’ the HDL into netlists and other structures
Translate – merges the incoming netlists and constraints into a Xilinx® design file.
Map – fits the design into the available resources on the target device, and optionally, places the design.
Place and Route – places and routes the design to the timing constraints.
Generate Programming File – creates a bitstream file that can be downloaded to the device.

First Flash

The first time I flashed the FPGA, I was stumped as to why the LEDS were remaining on (apart from the reset LED). Then it became obvious. The clock input is 50MHz. There is no way, with the CPU running that fast, we can see the LEDs change!

Frequency Divider

I solved this by adding a frequency divider into the VHDL. The 50MHz I_clk from the ‘outside world’ is slowed down using a very simple module, which basically counts and uses a bit high up the counter as an output clock. This clock output is then what’s fed into the TPU functional units such as the decoder, as the core_clock in the design. The frequency divider is as follows:

entity clock_divider is
port (
	clk: in std_logic;
	reset: in std_logic;
	clock_out: out std_logic);
end clock_divider;

architecture Behavioral of clock_divider is
  signal scaler : std_logic_vector(23 downto 0) := (others => '0');
begin

  process(clk)
  begin
    if rising_edge(clk) then   -- rising clock edge
        scaler <= std_logic_vector( unsigned(scaler) + 1);
    end if;
  end process;

clock_out <= scaler(16);

end Behavioral;

Using that divider, it works, and we get counting LEDS!

Wrapping Up

I’ll put the full example top module on github (soon!) as an example, but there is more work to be done in getting it a bit more robust, making the memory mapping actually really mapped (at the moment, a write still actually happens in the RAM but we don’t care or break on it).

For now, it’s pretty cool to see code actually running on a TPU on the FPGA hardware. Additionally, it only uses 3% of the slice resources of the LX25 Spartan6 FPGA, so lots more space to do other things with!

Thanks for reading, comments as always to @domipheus.

Designing a CPU in VHDL, Part 6: Program Counter, Instruction Fetch, Branching

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

The last part paved the way for getting this simple CPU self sustaining. This means that the test bench doesn’t feed instructions into the decoder, the CPU itself requests and fetches from a RAM somewhere. There is quite a bit of work to this in terms of the various ways of connecting up a RAM to the CPU and having a unit to manage the program counter and the fetching of the next one.

The Program Counter unit

The PC is just a register containing the location of the currently executing instruction. However, we need to operate on it, so will create a unit to manage this.

Our PC unit will obviously hold the current PC, and on command increment it. It will have an input for setting the next PC value, and also the ability to stop – stay at the same location – which we need due to our pipeline being several cycles long.

With different possibilities for the PC unit operating, we can have an input to it which dictates the operation on the PC to perform:

  • Increment PC
  • Set PC to new value
  • Do nothing (halt)
  • Set PC to our reset vector, which is 0x0000.

We can use a 2-bit opcode input to select one of these operations. Our PC unit then looks like this functional unit.

pcunit_functionalWe get back into the Xilinx ISE project, adding out new pc_unit.vhd with the various input and output ports we need.

entity pc_unit is
    Port ( I_clk : in  STD_LOGIC;
           I_nPC : in  STD_LOGIC_VECTOR (15 downto 0);
           I_nPCop : in  STD_LOGIC_VECTOR (1 downto 0);
           O_PC : out  STD_LOGIC_VECTOR (15 downto 0)
           );
end pc_unit;

The process triggered by the rising edge of the input clock will check the input I_nPCop port for an opcode on which to execute on the PC value. We use a case statement like previous examples in the series. We also have an internal signal to act as a register holding the current PC.

architecture Behavioral of pc_unit is
  signal current_pc: std_logic_vector( 15 downto 0) := X"0000";
begin

  process (I_clk)
  begin
    if rising_edge(I_clk) then
      case I_nPCop is
        when PCU_OP_NOP => 	-- NOP, keep PC the same/halt
        when PCU_OP_INC => 	-- increment
          current_pc <= std_logic_vector(unsigned(current_pc) + 1);
        when PCU_OP_ASSIGN => 	-- set from external input
          current_pc <= I_nPC;
        when PCU_OP_RESET => 	-- Reset
          current_pc <= X"0000";
        when others =>
      end case;
    end if;
  end process;

  O_PC <= current_pc;

end Behavioral;

The constants are in tpu_constants.vhd with the others:

-- PC unit opcodes
constant PCU_OP_NOP: std_logic_vector(1 downto 0):= "00";
constant PCU_OP_INC: std_logic_vector(1 downto 0):= "01";
constant PCU_OP_ASSIGN: std_logic_vector(1 downto 0):= "10";
constant PCU_OP_RESET: std_logic_vector(1 downto 0):= "11";

This simple unit should allow us to reset the PC, increment during normal operation, branch to a new location when we need to, and also halt and spin on the same PC value.

Integrating a new Testbench

Early in the series we got a test ‘RAM’ made up. We’re going to take that code, and copy it into a copy of our existing test bench – but we will initialize the RAM with our test instruction stream. It will look like the following:

entity ram_tb is
    Port ( I_clk : in  STD_LOGIC;
           I_we : in  STD_LOGIC;
           I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
           I_data : in  STD_LOGIC_VECTOR (15 downto 0);
           O_data : out  STD_LOGIC_VECTOR (15 downto 0));
end ram_tb;

architecture Behavioral of ram_tb is
  type store_t is array (0 to 7) of std_logic_vector(15 downto 0);
   signal ram: store_t := (
    OPCODE_LOAD & "000" & '0' & X"fe",
    OPCODE_LOAD & "001" & '1' & X"ed",
    OPCODE_OR & "010" & '0' & "000" & "001" & "00",
    OPCODE_LOAD & "011" & '1' & X"01",
    OPCODE_LOAD & "100" & '1' & X"02",
    OPCODE_ADD & "011" & '0' & "011" & "100" & "00",
    OPCODE_OR & "101" & '0' & "000" & "011" & "00",
    OPCODE_AND & "101" & '0' & "101" & "010" & "00"
    );
begin

  process (I_clk)
  begin
    if rising_edge(I_clk) then
      if (I_we = '1') then
        ram(to_integer(unsigned(I_addr(2 downto 0)))) <= I_data;
      else
        O_data <= ram(to_integer(unsigned(I_addr(2 downto 0))));
      end if;
    end if;
  end process;

end Behavioral;

We can then initialize the unit in our main test bench body. The PC unit should be added too.

component ram_tb
Port ( I_clk : in  STD_LOGIC;
      I_we : in  STD_LOGIC;
      I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
      I_data : in  STD_LOGIC_VECTOR (15 downto 0);
      O_data : out  STD_LOGIC_VECTOR (15 downto 0)
      );
end component;

COMPONENT pc_unit
PORT(
     I_clk : IN  std_logic;
     I_nPC : IN  std_logic_vector(15 downto 0);
     I_nPCop : IN  std_logic_vector(1 downto 0);
     O_PC : OUT std_logic_vector(15 downto 0)
    );
END COMPONENT;


signal ramWE : std_logic := '0';
signal ramAddr: std_logic_vector(15 downto 0);
signal ramRData: std_logic_vector(15 downto 0);
signal ramWData: std_logic_vector(15 downto 0);

signal nPC: std_logic_vector(15 downto 0);
signal pcop: std_logic_vector(1 downto 0);
signal in_pc: std_logic_vector(15 downto 0);

After the begin statement, we define the parts and map the ports to respective signals.

uut_ram: ram_tb Port map (
  I_clk => I_clk,
  I_we => ramWE,
  I_addr => ramAddr,
  I_data => ramWData,
  O_data => ramRData
);
uut_pcunit: pc_unit Port map (
  I_clk => I_clk,
  I_nPC => in_pc,
  I_nPCop => pcop,
  O_PC => PC
);

We don’t use the ramWData or ramWE ports, so assign them to some default values. the ramAddr port should always be assigned to PC, and the instruction port we used to set manually we will assign from ramRData – the output from our ram. We put these assignments outside of a process – they are always true.

ramAddr <= PC;
ramWData <= X"FFFF";
ramWE <= '0';
instruction <= ramRData;

The last input we need to figure out is the PC unit operation. What we need to do is tie this to the current control unit state – in that, we want it to increment at a given point, and we want the rest of the time to be in the NOP mode.

pcop<= PCU_OP_RESET when reset = '1' else PCU_OP_INC when state(2) = '1' else PCU_OP_NOP;

This resets the PC unit when there is a control unit reset, increments on the ALU execute stage of the pipeline, and nops otherwise.

As for the testbench itself, we no longer feed the instruction signal with our instructions, so we remove all that. We have a short reset, and then wait until our PC is 8 – as then we have overflowed memory, and our test is complete. The ‘CPU’ should be self sustaining, execute everything up until the PC equals 0x8, then go into an indefinite reset state.

stim_proc: process
begin		

  reset<='1'; -- reset control unit
  wait for I_clk_period; -- wait a cycles
  reset <= '0';

  wait until PC = X"0008"; -- we only have 8 instructions
  reset <= '1'; -- reset control unit
  wait;

end process;

Getting back to the PC unit, the point in the pipeline when we increment is very important. We need to account for any latency, for all uses of the PC – as shown in the next two simulation captures.

latest_onlyPC_inc_at_exec_no_fetch latest_onlyPC_inc_at_writeback_no_fetchYou can see when the increment happens at the writeback phase, we have an issue that the instructions are delayed and subsequently offset from each other. The problem here is that the increment needs to happen at the writeback, as the ALU in a branch instruction could issue an assign to PC – and in the first image (at yellow circles) you can see where we’d want to increment/set the ALU has not got it’s result ready.

The Fetch Stage

Previously we had a cloud of uncertainty around the fetch area of the instruction pipeline. We need to add it now, to account for any latencies in fetching the opcode for the decode stage. I’m adding it by extending our simple control unit, with two extra bits of state. The second bit will be for the memory stage of the pipeline which we’ll discuss later, but best to add it now. Simply extend the std_logic_vector, add the extra case statements, and extend the outputs where required. We ignore the last state switch – the memory stage. In the test bench source, we need to add an en_fetch and also change en_decode, en_regread, etc, to reflect new bit positions. They are just shifted up one.

If we change our assignment to pcop, to now inc on the writeback phase (bit 4 now, due to fetch now being bit 0):

pcop <= PCU_OP_RESET when reset = '1' else
        PCU_OP_INC when state(4) = '1' else
        PCU_OP_NOP;

Simulating it results in the following output:

fetchstage_inc_writebackA) increment PC, on previous writeback cycle. 0x1 now on ram address bus.
B) ram puts value at 0x1 on read output bus, appears at decoder instruction input in time for decode cycle.
C) decoder outputs results from decode
D) result from alu ready for writeback.

We can see this extra fetch cycle of latency allows for the increment/PC operation to occur as late as possible, with the ALU result also available when this happens.

Branching

So now, what about branching? We need to set the pc to the output of the alu, if shouldBranch is true. This is, surprisingly, very easy to achieve. We simply amend the assignment to the pcop to add the assign condition:

pcop <= PCU_OP_RESET when reset = '1' else
        PCU_OP_ASSIGN when shouldBranch = '1' and state(4) = '1' else
        PCU_OP_INC when shouldBranch = '0' and state(4) = '1' else
        PCU_OP_NOP;

We will change our last instruction in the RAM to jump back to our addition, to make an infinite loop.

signal ram: store_t := (
  OPCODE_LOAD & "000" & '0' & X"fe",
  OPCODE_LOAD & "001" & '1' & X"ed",
  OPCODE_OR & "010" & '0' & "000" & "001" & "00",
  OPCODE_LOAD & "011" & '1' & X"01",
  OPCODE_LOAD & "100" & '1' & X"02",
  OPCODE_ADD & "011" & '0' & "011" & "100" & "00",
  OPCODE_OR & "101" & '0' & "000" & "011" & "00",
  OPCODE_JUMP & "000" & '1' & X"05"
  );

We also need to remove the ‘wait until PC = X”0008″;’ from the testbench. The test code effectively becomes just a reset and a wait. We must also now set the PC input to the PC unit to our ALU output, so we can assign the PC to new values.


in_pc <= dataResult;

stim_proc: process
 begin		

  reset<='1'; -- reset control unit
  wait for I_clk_period; -- wait a cycles
  reset <= '0';

  wait;
end process;

And that will then produce this rather fun simulation – self sustaining, with branching!

self_sustain_branchingI thought it would be useful to go and zoom into the branch on that simulation, and step through the transitions.

latest_onlyPC_inc_at_write_with_fetch_see_correct_branch_2At (A), the ALU has decided we should branch. As we are in the writeback phase of the pipeline, pcop becomes assign instead of increment at (B), putting dataresult on in_pc. the assign happens, and the next cycle, now in fetch (C), we have the new instruction being decoded from our branch target.

To show issues that can occur without correct timing, we have a simulation without the fetch stage and PC operation at the ALU stage. The branch happens, eventually, but the CPU executes an additional instruction first. Some architectures actually do this, and you need to fill out the instruction stream after branches with nops or other useful operations. Look up Delay Slots if you are interested in learning more. For the TPU design, I didn’t want this.

latest_onlyPC_inc_at_exec_no_fetch_see_wrong_branchConditional branching

We may as well check whether conditional branches work. For this, we just create a new test with a larger RAM, and fill it with the following code:

signal ram: store_t := (
  OPCODE_LOAD & "001" & '1' & X"05",
  OPCODE_LOAD & "010" & '1' & X"03",
  OPCODE_XOR & "000" & '0' & "000" & "000" & "00",
  OPCODE_LOAD & "011" & '1' & X"01",
  OPCODE_LOAD & "110" & '1' & X"0b",
  OPCODE_OR & "100" & '0' & "010" & "010" & "00",
  OPCODE_CMP & "101" & '0' & "100" & "000" & "00",
  OPCODE_JUMPEQ & "000" & '0' & "101" & "110" & "01",
  OPCODE_SUB & "100" & '0' & "100" & "011" & "00",
  OPCODE_ADD & "000" & '0' & "000" & "001" & "00",
  OPCODE_JUMP & "000" & '1' & X"06",
  OPCODE_JUMP & "000" & '1' & X"0b",
  X"0000",
  X"0000",
  X"0000",
  X"0000"
);

This is code for a simple multiply, with the operands in r1 and r2, the result going in r0.

0x00    load.l r1, 0x05
0x01    load.l r2, 0x03
0x02    xor    r0, r0, r0
0x03    load.l r3, 0x01
0x04    load.l r6, $END
0x05    or     r4, r2, r2
      loop:
0x06    cmp.u    r5, r4, r0
0x07    jump.aez r5, r6
0x08    sub.u    r4, r4, r3
0x09    add.u    r0, r0, r1
0x0a    jump $loop
      end:
0x0b    jump $end

It simulates, and it works. You can see the branch structure from the simulation quite easily, and the result is valid.

self_sustain_conditional_branchingThere are a few ISA issues apparent when we start drawing up ‘more complex’ TPU assembly examples. Having some sort of simple add/sub immediate and a conditional jump to relative offset would reduce our instruction count. But, we can always revisit and change our ISA – one of the benefits of rolling your own cpu!

Wrapping Up

So, we managed to get TPU self-sustaining – that is, it chooses which instructions to consume, and does so indefinitely. The test bench essentially becomes a wait, as the CPU decides what to do based on the instruction stream – It’s a real CPU!

To do this, we had to add our fetch stage to the pipeline, and also investigate issues with when changes to the PC occurred. We added our PC unit to track this, and it’s fairly simple but allows for normal sequential execution and branching.

An issue with our CPU is that it’s still relatively limited by number of registers, and the fact we cannot get any real input other than instructions. To get more, we need to give the CPU access to manipulate memory. And with that, possibly memory-mapped peripherals. So we’ll look into that next.

Thanks for reading, comments as always to @domipheus.

Designing a CPU in VHDL, Part 5: Pipeline and Control Unit

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

This is a disclaimer that the VHDL here is probably not the best you will see, but it gets the job done – in the simulator, at least. If you spot any serious errors, or woeful performance gotchas I’ve fallen for – please let me know at @domipheus. The aim of these posts is to get a very simple 16-bit CPU up and running, and then get stuck into some optimization opportunities later.

We now have our decoder, ALU, and registers. We can now create a test bench to simulate what happens when we connect those together in a fixed, controllable way. The testbench source is on github, so I won’t go too much into creating the boilerplate. Create an ALU test bench, and then manually add the decoder, register file, and associated signals to the file.

Joining functional units together

With our ALU, decoder and regs entities available, we can connect them together with signals like so.

uut_decoder: decode PORT MAP (
    I_clk => I_clk,
    I_dataInst => instruction,
    I_en => en,
    O_selA => selA,
    O_selB => selB,
    O_selD => selD,
    O_dataIMM => dataIMM,
    O_regDwe => dataDwe,
    O_aluop => aluop
  );

uut_alu: alu PORT MAP (
    I_clk => I_clk,
    I_en => en,
    I_dataA => dataA,
    I_dataB => dataB,
    I_dataDwe => dataDwe,
    I_aluop => aluop,
    I_PC => PC,
    I_dataIMM => dataIMM,
    O_dataResult => dataResult,
    O_dataWriteReg => dataWriteReg,
    O_shouldBranch => shouldBranch
  );

uut_reg: reg16_8 PORT MAP (
    I_clk => I_clk,
    I_en => '1',
    I_dataD => dataResult,
    O_dataA => dataA,
    O_dataB => dataB,
    I_selA => selA,
    I_selB => selB,
    I_selD => selD,
    I_we => dataWriteReg
  );

If we include the tpu_constants file, we can use the opcode definitions to write the instruction signal before waiting a cycle. We also need to remember to enable the units using the enable ‘en’ signal, which is connected to all units.

-- Stimulus process
stim_proc: process
begin
  -- hold reset state for 100 ns.
  wait for 100 ns;	

  wait for I_clk_period*10;
  en<='1';

  --load.h r0,0xfe
  instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
  wait for I_clk_period;

  --load.l r1, 0xed
  instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
  wait for I_clk_period;

  --or r2, r0, r1
  instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
  wait for I_clk_period;
  wait;
end process;

When simulating this, you get the wrong result. It’s fairly understandable why – we’ve designed these units to be part of a pipeline – and we’re only giving a single cycle of latency for a result at the end of it, when we need at least two cycles.

first_run_1cycleIf we edit our test to add two cycles of latency, we get a much better result.

  --load.h r0,0xfe
  instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
  wait for I_clk_period;
  wait for I_clk_period;

  --load.l r1, 0xed
  instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
  wait for I_clk_period;
  wait for I_clk_period;

  --or r2, r0, r1
  instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
  wait for I_clk_period;
  wait for I_clk_period;

2_first_run_2cycleLooking at the memory view, for our uut_reg object, we can see r0, r1 and r2 have the correct data.

2_first_run_2cycle_regmemSuccess! But what about the following example?

--load.l r3, 1
instruction <= OPCODE_LOAD & "011" & '1' & X"01";
wait for I_clk_period;
wait for I_clk_period;

--load.l r4, 2
instruction <= OPCODE_LOAD & "100" & '1' & X"02";
wait for I_clk_period;
wait for I_clk_period;

--add.u r3, r3, r4
instruction <= OPCODE_ADD & "011" & '0' & "011" & "100" & "00";
wait for I_clk_period;
wait for I_clk_period;

You will see one item of interest here. The add instruction uses the same register as a source and destination. We tested the register file before, as we know reading from/writing to the same register is technically fine. But what when it’s part of a bigger system? It doesn’t go so well.

wrong_write_depI’ve annotated the simulation timeline with the instructions and two points of interest. Point A is the result of the add – it’s delayed by a cycle. This offsets the result for any potential next instruction – which you can see at point B.

--or r5, r0, r3
instruction <= OPCODE_OR & "101" & '0' & "000" & "011" & "00";
wait for I_clk_period;
wait for I_clk_period;

If we add the above instruction to the mix we can see how serious of an issue this can be.

wrong_write_dep_proofThe correct write into r3 doesn’t happen in time for the OR operation, so the result in r5 will be incorrect. And we can see this by looking at the uut_reg unit in memory view.

wrong_or_reg_r3We can see that with the units designed as they are, 3 cycles will be needed for safe decode – execute – writeback. Adding a third wait for I_clk_period; after the add instruction, which allows time for the writeback into the register, allows for correct operation.

right_r3_extra_writeback_cycleControl Unit

What we need for all of this, isn’t to add cycles of latency like above, but to add a unit responsible for telling other units what to do, and when. To figure out what we want, we’re going to need to lay out what we think our pipeline looks like, and what else we need.

pipeI’m going all out simple here, giving an independent cycle for every part of the pipeline. We don’t execute the next instruction until the current one has traversed every stage. We still don’t know whats happening in the fetch and memory stages, but fairly certain of the others. We add a read stage to be sure our data is ready for the ALU when needed – we can always remove states later when we optimize the completed design.

Since all of the units we’ve created so far have enable ports, we can get a control unit to synchronize everything up and drive those enable bits. The control unit will have one output, a bitmask showing the pipeline state currently active, as well as a reset and clock input. Each clock cycle the state will increment by one bit, and if reset is high it will reset to initial state. The control unit is technically a state machine, but for now it’s simple enough we can just classify it as a counter. You can do this with integers or other types, I’m just a fan of raw bit vectors.

entity controlsimple is
  Port ( I_clk : in  STD_LOGIC;
         I_reset : in  STD_LOGIC;
         O_state : out  STD_LOGIC_VECTOR (3 downto 0)
         );
end controlsimple;

architecture Behavioral of controlsimple is
  signal s_state: STD_LOGIC_VECTOR(3 downto 0) := "0001";
begin
  process(I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        s_state <= "0001";
      else
        case s_state is
          when "0001" =>
            s_state <= "0010";
          when "0010" =>
            s_state <= "0100";
          when "0100" =>
            s_state <= "1000";
          when "1000" =>
            s_state <= "0001";
          when others =>
            s_state <= "0001";
        end case;
      end if;
    end if;
  end process;

  O_state <= s_state;
end Behavioral;

We can create a new testbench based off of out current decode, alu, register one, and add in the simplecontrol component. I won’t post all the code to that here, but you can see it on github. We attach the enable lines of the components to the relevant bits of our state output:

en_decode <= state(0);
en_regread <= state(1);
en_alu <= state(2);
en_regwrite <= state(3);

The register file is a bit different in terms of the enable bit hookup, as to accommodate two states in the pipeline. The enable bit is tied to both read and write states, but we ensure writes only happen at the correct stage by using it in the write enable input.

uut_reg: reg16_8 PORT MAP (
  I_clk => I_clk,
  I_en => en_regread or en_regwrite,
  I_dataD => dataResult,
  O_dataA => dataA,
  O_dataB => dataB,
  I_selA => selA,
  I_selB => selB,
  I_selD => selD,
  I_we => dataWriteReg and en_regwrite
);

Now we need slight changes to the test bench process itself. We populate our first instruction whilst the control unit is in a reset state, before enabling it. Then, instead of waiting after each instruction is assigned by a specific number of clock cycles, we wait until the register writeback state is reached. This is the last part of the pipeline, and is the MSB of our state output from the control unit – signal en_regwrite in our test. At this point, it’s safe to assign the next instruction to the decoder input. We repeat this for all instructions like so:

reset <= '1'; -- reset control unit
--load.h r0,0xfe
instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
reset <= '0'; -- enable/start control unit
wait until en_regwrite = '1';

--load.l r1, 0xed
instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
wait until en_regwrite = '1';

--or r2, r0, r1
instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
wait until en_regwrite = '1';

--load.l r3, 1
instruction <= OPCODE_LOAD & "011" & '1' & X"01";
wait until en_regwrite = '1';

--load.l r4, 2
instruction <= OPCODE_LOAD & "100" & '1' & X"02";
wait until en_regwrite = '1';

--add.u r3, r3, r4
instruction <= OPCODE_ADD & "011" & '0' & "011" & "100" & "00";
wait until en_regwrite = '1';

--or r5, r0, r3
instruction <= OPCODE_OR & "101" & '0' & "000" & "011" & "00";
wait until en_regwrite = '1';

control_1Running this in the simulator, we get a nice output waveform. All instructions executed correctly, and ran perfectly in sync. The registers at the end of the simulation are listed in the image above, with a red line linking it to when that particular register was written. I’ve annotated the pipeline stages too. You can see that the next instruction gets assigned to the decoder input in the writeback stage, so there is a little bit of overlap in the pipeline, but that’s to be expected. It’s not real concurrent operation within the pipeline, it’s just ensuring the data is available when it’s needed.

Wrapping Up

Hopefully this has shown the importance of the control unit, and how it conducts the process of execution in this CPU. It has many more functions to manage – especially when we start implementing memory operations, and dealing with that shouldBranch ALU output – more on that later!

Thanks for reading, comments as always to @domipheus.

Designing a CPU in VHDL, Part 4: The ALU, Comparisons and Branching

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

This is a little disclaimer that the VHDL here is probably not the best you will see, but it gets the job done – in the simulator, at least. If you spot any serious errors, or woeful performance gotchas I’ve fallen for – please let me know at @domipheus.

The ALU is the next component to implement. It should take the opcode given by the decoder, along with input data read from the register file, and output a result that can then be written to the register file.

aluIn TPU, we are only doing single cycle operations. Whilst a whole instruction will take multiple cycles to complete, the ALU itself will output a result in a single clock cycle. The other cycles are due to decoding, fetching and any writebacks of results which are required.

Because of this fact, you can implement an ALU very simply by performing every operation possible and then select and output only the result we require, given the ‘aluop’ forwarded by the decoder. Another way (the way I have done) is to have a case statement and assign to an internal register the operation of choice, before then writing the contents of that register to the output ports. Underneath I think they would be implemented similarly, with a great many multiplexers.

A result isn’t the only thing the ALU should provide. Some of the operations implemented in TPU are branch/jump instructions, some conditional. So, we also need a signal stating whether or not we should take a branch. We also have our own copy of the register file write enable bit – so we can stop the result being written based on a condition.

The VHDL Module

We create a new VHDL module named ALU, with the following ports:

entity alu is Port (
    I_clk : in  STD_LOGIC;
    I_en :  in  STD_LOGIC;
    I_dataA :   in STD_LOGIC_VECTOR (15 downto 0);
    I_dataB :   in STD_LOGIC_VECTOR (15 downto 0);
    I_dataDwe : in STD_LOGIC;
    I_aluop :   in STD_LOGIC_VECTOR (4 downto 0);
    I_PC :      in STD_LOGIC_VECTOR (15 downto 0);
    I_dataIMM : in STD_LOGIC_VECTOR (15 downto 0);
    O_dataResult :   out  STD_LOGIC_VECTOR (15 downto 0);
    O_dataWriteReg : out STD_LOGIC;
    O_shouldBranch : out std_logic
  );
end alu;

Our ALU process which runs on a rising clock edge, when I_en is active, will immediately enter a case statement dependent on the alu operation forwarded from the decoder. Each when block within the case statement will write to elements of an internal register s_result, which is 18 bits wide, to accommodate carry/overflow status. There is also an internal signal for the shouldBranch output.

The following output is the ALU behavior, minus all operations except add.

architecture Behavioral of alu is
  -- The internal register for results of operations.
  -- 16 bit + carry/overflow
  signal s_result: STD_LOGIC_VECTOR(17 downto 0) := (others => '0');
  signal s_shouldBranch: STD_LOGIC := '0';
begin
  process (I_clk, I_en)
  begin
    if rising_edge(I_clk) and I_en = '1' then
      O_dataWriteReg <= I_dataDwe;
      case I_aluop(4 downto 1) is
        when OPCODE_ADD =>
          if I_aluop(0) = '0' then
            s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned( '0' & I_dataB));
          else
            s_result(16 downto 0) <= std_logic_vector(signed(I_dataA(15) & I_dataA) + signed( I_dataB(15) & I_dataB));
          end if;
          s_shouldBranch <= '0';

        -- ...Other OPCODEs go here...
        when others =>
        s_result <= "00" & X"FEFE";
      end case;
    end if;
  end process;

  O_dataResult <= s_result(15 downto 0);
  O_shouldBranch <= s_shouldBranch;

end Behavioral;

Operations

The Add operation is the first opcode. It has signed and unsigned variants, dependent on the instruction flag which is bit 0 of the aluop input. If the flag is 0, the register values are resized to 17-bit, cast to integer data and an addition performed. If the flag bit is set to indicate signed operands, sign-extension needs to be performed instead of a resize – meaning we concatenate the last bit of the register data to become the 17th bit of our operand. The Addition is performed, and result written back to 17 bits of s_result. this provides data needed for a carry bit in unsigned mode and an overflow bit in signed mode, but we do not use them yet (and overflow needs some more logic). The last thing which is done is that we set the shouldBranch signal to ‘0’, as we should not branch at this time.

The other logical operations are simple, implemented in the following fashion:

when OPCODE_OR =>
  s_result(15 downto 0) <= I_dataA or I_dataB;
  s_shouldBranch <= '0';

The LOAD operation uses the flag in bit 8 to indicate whether an 8-bit immediate value should be placed in the high or low half of the destination register:

when OPCODE_LOAD =>
  if I_aluop(0) = '0' then
    s_result(15 downto 0) <= I_dataIMM(7 downto 0) & X"00";
  else
    s_result(15 downto 0) <= X"00" & I_dataIMM(7 downto 0);
  end if;
  s_shouldBranch <= '0';

The other bits are zero’d out to form our 16-bit result.

The Compare Instruction

Most typical, older 16bit CPUs tend to set status bits in the ALU when certian events occur. On Z80 the status bits change when a large defined set of operations are executed, others have an explicit compare instruction which sets some status bits. Those bits can then be queried via special instructions which read direct from the internal status register. TPU does things a bit differently in that it has very little internal status, and writes the results of compares to a register via an explicit instruction. The register contains a series of bits which indicate the result of a boolean operation. The instruction (from the ISA document) is as follows: cmp_inst_isaThe compare instruction in TPU is of form RRR, taking a destination register and two sources. The flag bit is to do either signed or unsigned comparisons. The instruction calculates rA == rB, rA > rB, rA < rB, rA == 0 and rB ==0 and writes the rersult into specific bit locations of the output register rD. We don’t care for the other bits, so they should be zero’d.

There isn’t much reason for why it’s set up like this, apart from that I wanted to do multiple conditions at once, and get a bitmask out to do further logic on the result if need be. With so few registers, there is a fair argument that this could be a stupid path to take.

The CMP instruction has some extra constants we need to define – the bit locations for the output of those comparisons.

-- cmp output bits
constant CMP_BIT_EQ:  integer := 14;
constant CMP_BIT_AGB: integer := 13;
constant CMP_BIT_ALB: integer := 12;
constant CMP_BIT_AZ:  integer := 11;
constant CMP_BIT_BZ:  integer := 10;

I’ve created a vhdl package file, where I place all constants for TPU. You can include the constants file using the library and use statements, like

library work;
use work.tpu_constants.all;

With output bit locations defined per the ISA, we can start writing into the intermediate s_result.

when OPCODE_CMP =>
  if I_dataA = I_dataB then
    s_result(CMP_BIT_EQ) <= '1';
  else
    s_result(CMP_BIT_EQ) <= '0';
  end if;

  if I_dataA = X"0000" then
    s_result(CMP_BIT_AZ) <= '1';
  else
    s_result(CMP_BIT_AZ) <= '0';
  end if;

  if I_dataB = X"0000" then
    s_result(CMP_BIT_BZ) <= '1';
  else
    s_result(CMP_BIT_BZ) <= '0';
  end if;

These operations don’t care whether the comparisons are signed or not, so we can just do them and write out the correct bit. I tried doing a conditional move here, instead of the big chunk of annoying if/else/assign1or0 logic but it seems that’s only in a newer version of VHDL. So I’ll just keep it at this for now. The final two comparisons are the greater/less than, which do depend on sign. So we check the flag bit for signed mode before writing out the results of the comparison.

if I_aluop(0) = '0' then
  if unsigned(I_dataA) > unsigned(I_dataB) then
    s_result(CMP_BIT_AGB) <= '1';
  else
    s_result(CMP_BIT_AGB) <= '0';
  end if;
  if unsigned(I_dataA) < unsigned(I_dataB) then
    s_result(CMP_BIT_ALB) <= '1';
  else
    s_result(CMP_BIT_ALB) <= '0';
  end if;
else
  if signed(I_dataA) > signed(I_dataB) then
    s_result(CMP_BIT_AGB) <= '1';
  else
    s_result(CMP_BIT_AGB) <= '0';
  end if;
  if signed(I_dataA) < signed(I_dataB) then
    s_result(CMP_BIT_ALB) <= '1';
  else
    s_result(CMP_BIT_ALB) <= '0';
  end if;
end if;

Finally, we zero out the remaining, unused bits – and set our shouldbranch output to 0.

s_result(15) <= '0';
s_result(9 downto 0) <= "0000000000";
s_shouldBranch <= '0';

And that’s it done!

Shifts

Shifts are cumbersome. Currently, I’ve only defined logical, unsigned shifts – by a value stored in a register. That will need to change, but there are enough flag bits for that to work. Originally I used a fixed-length loop with an early out, which was synthesizable, but upon testing the result just never seemed correct. So now, I’m left with the following, which seems horrendous but works.

when OPCODE_SHL =>
  case I_dataB(3 downto 0) is
    when "0001" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 1));
    when "0010" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 2));
    when "0011" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 3));
    when "0100" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 4));
    when "0101" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 5));
    when "0110" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 6));
    when "0111" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 7));
    when "1000" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 8));
    when "1001" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 9));
    when "1010" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 10));
    when "1011" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 11));
    when "1100" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 12));
    when "1101" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 13));
    when "1110" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 14));
    when "1111" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 15));
    when others =>
      s_result(15 downto 0) <= I_dataA;
  end case;
  s_shouldBranch <= '0';

Euugh. It passes the tests. Ship it.

Everybody Jump! Jump! Jump!

Branching, Jumping, Changing the program counter. It’s something that needs to happen at some point. We have a few instructions available for this. A jump to an 8-bit immediate value, and a jump to a 16-bit register value. Those are pretty simple, and set the s_shouldBranch flag to ‘1’, and fill out the s_result with our branch target.

Conditional jumps are the same, really – but the s_shouldBranch flag is dependant on some condition. Looking at the ISA we can see the various conditions available. jumpeq_inst_isaThe C bits are quite a mess here, grouped from the signed flag and the two unused bits at the end of the instruction. I’ll probably change this, allowing for the rD selection bits to be forwarded on to the ALU in some way, possibly as a second ‘function flag’ immediate, to allow their use. That would enable some variant of jump conditionally to relative offset instructions – which are going to be needed when working in nested loops. Without relative immediate conditional jumps, there will be severe pressure on the registers, leaving very few available for actual computation.

Implementation wise, this function is pretty simple – it sets the shouldBranch output to the correct bit in the CMP input.

when OPCODE_JUMPEQ =>
  -- set branch target regardless
  s_result(15 downto 0) <= I_dataB;

   -- the condition to jump is based on aluop(0) and dataimm(1 downto 0);
  case (I_aluop(0) & I_dataIMM(1 downto 0)) is
    when CJF_EQ =>
      s_shouldBranch <= I_dataA(CMP_BIT_EQ);
    when CJF_AZ =>
      s_shouldBranch <= I_dataA(CMP_BIT_Az);
    when CJF_BZ =>
      s_shouldBranch <= I_dataA(CMP_BIT_Bz);
    when CJF_ANZ =>
      s_shouldBranch <= not I_dataA(CMP_BIT_AZ);
    when CJF_BNZ =>
      s_shouldBranch <= not I_dataA(CMP_BIT_Bz);
    when CJF_AGB =>
      s_shouldBranch <= I_dataA(CMP_BIT_AGB);
    when CJF_ALB =>
      s_shouldBranch <= I_dataA(CMP_BIT_ALB);
    when others =>
        s_shouldBranch <= '0';
  end case;

That’s all the ALU needs to do, the actual branching logic happens elsewhere, and we’ll talk about that another time.

Testing

Testing the ALU is easy, but there are so many combinations of input to output that getting good coverage would take a considerable amount of time. Testing the ALU in isolation is easy – simply set the various inputs and expect the correct output, making sure it takes only a single clock cycle.

I_dataA <= X"0001";
I_dataB <= X"0002";
I_aluop <= OPCODE_ADD & '0';
I_dataIMM <= X"F1FA";

wait for I_clk_period;
I_dataA <= X"0005";
I_dataB <= X"0003";
I_aluop <= OPCODE_SUB & '0';

wait for I_clk_period;

I_dataA <= X"FEEE";
I_dataB <= X"0000";
I_aluop <= OPCODE_CMP & '0';

wait for I_clk_period;

Above we have tests for unsigned addition, subtraction, and compare. We can simulate this, and view the data output to verify operation. A larger test is on the github repo, but again, coverage isn’t great. It’s enough to progress, we can validate later (*cough*).

test_simWrapping up

And there we have it, a simple ALU. I’ve finally got the TPU repo up on github, it’s fairly sparse just now, as the ISE project isn’t included. Only the sources which have been previously mentioned are there. The ISA is also there, although obviously incomplete.

Thanks for reading, comments as always to @domipheus, and the next part should be finally starting to look into how these various units work together.