Designing a CPU in VHDL, Part 5: Pipeline and Control Unit

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

This is a disclaimer that the VHDL here is probably not the best you will see, but it gets the job done – in the simulator, at least. If you spot any serious errors, or woeful performance gotchas I’ve fallen for – please let me know at @domipheus. The aim of these posts is to get a very simple 16-bit CPU up and running, and then get stuck into some optimization opportunities later.

We now have our decoder, ALU, and registers. We can now create a test bench to simulate what happens when we connect those together in a fixed, controllable way. The testbench source is on github, so I won’t go too much into creating the boilerplate. Create an ALU test bench, and then manually add the decoder, register file, and associated signals to the file.

Joining functional units together

With our ALU, decoder and regs entities available, we can connect them together with signals like so.

uut_decoder: decode PORT MAP (
    I_clk => I_clk,
    I_dataInst => instruction,
    I_en => en,
    O_selA => selA,
    O_selB => selB,
    O_selD => selD,
    O_dataIMM => dataIMM,
    O_regDwe => dataDwe,
    O_aluop => aluop
  );

uut_alu: alu PORT MAP (
    I_clk => I_clk,
    I_en => en,
    I_dataA => dataA,
    I_dataB => dataB,
    I_dataDwe => dataDwe,
    I_aluop => aluop,
    I_PC => PC,
    I_dataIMM => dataIMM,
    O_dataResult => dataResult,
    O_dataWriteReg => dataWriteReg,
    O_shouldBranch => shouldBranch
  );

uut_reg: reg16_8 PORT MAP (
    I_clk => I_clk,
    I_en => '1',
    I_dataD => dataResult,
    O_dataA => dataA,
    O_dataB => dataB,
    I_selA => selA,
    I_selB => selB,
    I_selD => selD,
    I_we => dataWriteReg
  );

If we include the tpu_constants file, we can use the opcode definitions to write the instruction signal before waiting a cycle. We also need to remember to enable the units using the enable ‘en’ signal, which is connected to all units.

-- Stimulus process
stim_proc: process
begin
  -- hold reset state for 100 ns.
  wait for 100 ns;	

  wait for I_clk_period*10;
  en<='1';

  --load.h r0,0xfe
  instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
  wait for I_clk_period;

  --load.l r1, 0xed
  instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
  wait for I_clk_period;

  --or r2, r0, r1
  instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
  wait for I_clk_period;
  wait;
end process;

When simulating this, you get the wrong result. It’s fairly understandable why – we’ve designed these units to be part of a pipeline – and we’re only giving a single cycle of latency for a result at the end of it, when we need at least two cycles.

[Figure: first_run_1cycle]

If we edit our test to add two cycles of latency, we get a much better result.

  --load.h r0,0xfe
  instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
  wait for I_clk_period;
  wait for I_clk_period;

  --load.l r1, 0xed
  instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
  wait for I_clk_period;
  wait for I_clk_period;

  --or r2, r0, r1
  instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
  wait for I_clk_period;
  wait for I_clk_period;

[Figure: 2_first_run_2cycle]

Looking at the memory view for our uut_reg object, we can see r0, r1 and r2 have the correct data.

[Figure: 2_first_run_2cycle_regmem]

Success! But what about the following example?

--load.l r3, 1
instruction <= OPCODE_LOAD & "011" & '1' & X"01";
wait for I_clk_period;
wait for I_clk_period;

--load.l r4, 2
instruction <= OPCODE_LOAD & "100" & '1' & X"02";
wait for I_clk_period;
wait for I_clk_period;

--add.u r3, r3, r4
instruction <= OPCODE_ADD & "011" & '0' & "011" & "100" & "00";
wait for I_clk_period;
wait for I_clk_period;

You will see one item of interest here: the add instruction uses the same register as a source and destination. We tested the register file before, so we know reading from and writing to the same register is technically fine. But what about when it’s part of a bigger system? It doesn’t go so well.

[Figure: wrong_write_dep]

I’ve annotated the simulation timeline with the instructions and two points of interest. Point A is the result of the add, which is delayed by a cycle. This offsets the result for any potential next instruction, which you can see at point B.

--or r5, r0, r3
instruction <= OPCODE_OR & "101" & '0' & "000" & "011" & "00";
wait for I_clk_period;
wait for I_clk_period;

If we add the above instruction to the mix we can see how serious of an issue this can be.

[Figure: wrong_write_dep_proof]

The correct write into r3 doesn’t happen in time for the OR operation, so the result in r5 will be incorrect. We can see this by looking at the uut_reg unit in memory view.

[Figure: wrong_or_reg_r3]

We can see that with the units designed as they are, 3 cycles will be needed for a safe decode, execute and writeback. Adding a third wait for I_clk_period; after the add instruction gives the writeback into the register time to complete, and allows for correct operation.
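A sketch of that fix in the stimulus process. This is a fragment that reuses the signals and constants from the listings above, and the per-cycle comments are an assumption about which stage each wait covers:

```vhdl
--add.u r3, r3, r4
instruction <= OPCODE_ADD & "011" & '0' & "011" & "100" & "00";
wait for I_clk_period;  -- decode
wait for I_clk_period;  -- execute
wait for I_clk_period;  -- extra cycle: writeback into r3

--or r5, r0, r3
instruction <= OPCODE_OR & "101" & '0' & "000" & "011" & "00";
wait for I_clk_period;
wait for I_clk_period;
```

Hand-tuning waits like this only holds for as long as every instruction needs the same number of cycles.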

[Figure: right_r3_extra_writeback_cycle]

Control Unit

What we need for all of this isn’t to add cycles of latency like above, but a unit responsible for telling other units what to do, and when. To figure out what we want, we’re going to need to lay out what we think our pipeline looks like, and what else we need.

[Figure: pipe]

I’m going all out simple here, giving an independent cycle for every part of the pipeline. We don’t execute the next instruction until the current one has traversed every stage. We still don’t know what’s happening in the fetch and memory stages, but we’re fairly certain of the others. We add a read stage to be sure our data is ready for the ALU when needed; we can always remove states later when we optimize the completed design.

Since all of the units we’ve created so far have enable ports, we can get a control unit to synchronize everything up and drive those enable bits. The control unit will have a clock input, a reset input, and one output: a bitmask showing the pipeline state currently active. Each clock cycle the single set bit moves up one position (a one-hot encoding), wrapping back to the first stage after the last, and if reset is high at a clock edge the unit returns to its initial state. The control unit is technically a state machine, but for now it’s simple enough we can just classify it as a counter. You can do this with integers or other types, I’m just a fan of raw bit vectors.

entity controlsimple is
  Port ( I_clk : in  STD_LOGIC;
         I_reset : in  STD_LOGIC;
         O_state : out  STD_LOGIC_VECTOR (3 downto 0)
         );
end controlsimple;

architecture Behavioral of controlsimple is
  signal s_state: STD_LOGIC_VECTOR(3 downto 0) := "0001";
begin
  process(I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        s_state <= "0001";
      else
        case s_state is
          when "0001" =>
            s_state <= "0010";
          when "0010" =>
            s_state <= "0100";
          when "0100" =>
            s_state <= "1000";
          when "1000" =>
            s_state <= "0001";
          when others =>
            s_state <= "0001";
        end case;
      end if;
    end if;
  end process;

  O_state <= s_state;
end Behavioral;
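For comparison, here is a minimal sketch of the same counter written with an enumerated type instead of a raw bit vector. The entity name controlenum, the type stage_t and the S_* literals are made up for illustration, with stage names borrowed from the pipeline stages described above. It is intended to behave the same as controlsimple, not to replace it.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical alternative to controlsimple: same ports, but the state is
-- held as an enumerated type rather than a raw bit vector.
entity controlenum is
  Port ( I_clk   : in  STD_LOGIC;
         I_reset : in  STD_LOGIC;
         O_state : out STD_LOGIC_VECTOR (3 downto 0)
         );
end controlenum;

architecture Behavioral of controlenum is
  -- One literal per pipeline stage, in execution order.
  type stage_t is (S_DECODE, S_REGREAD, S_ALU, S_REGWRITE);
  signal s_stage : stage_t := S_DECODE;
begin
  process(I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        s_stage <= S_DECODE;
      else
        case s_stage is
          when S_DECODE   => s_stage <= S_REGREAD;
          when S_REGREAD  => s_stage <= S_ALU;
          when S_ALU      => s_stage <= S_REGWRITE;
          when S_REGWRITE => s_stage <= S_DECODE;
        end case;
      end if;
    end if;
  end process;

  -- Translate the stage back into the one-hot bitmask used for the enables.
  with s_stage select O_state <=
    "0001" when S_DECODE,
    "0010" when S_REGREAD,
    "0100" when S_ALU,
    "1000" when S_REGWRITE;
end Behavioral;
```

The enumerated version reads more like a state machine and lets the synthesis tool pick the encoding internally, at the cost of an explicit translation step on the output.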

We can create a new test bench based on our current decode, ALU and register one, and add in the controlsimple component. I won’t post all the code for that here, but you can see it on github. We attach the enable lines of the components to the relevant bits of our state output:

en_decode <= state(0);
en_regread <= state(1);
en_alu <= state(2);
en_regwrite <= state(3);

The register file is a bit different in terms of the enable bit hookup, as it has to accommodate two states in the pipeline. The enable bit is tied to both the read and write states, but we ensure writes only happen at the correct stage by also gating the write enable input with en_regwrite.

uut_reg: reg16_8 PORT MAP (
  I_clk => I_clk,
  I_en => en_regread or en_regwrite,
  I_dataD => dataResult,
  O_dataA => dataA,
  O_dataB => dataB,
  I_selA => selA,
  I_selB => selB,
  I_selD => selD,
  I_we => dataWriteReg and en_regwrite
);
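Using an expression such as en_regread or en_regwrite directly as a port map actual is, as far as I’m aware, only accepted from VHDL-2008 onwards, so older tool settings may reject it. A sketch of an equivalent hookup through intermediate signals follows. The names en_reg and reg_we are hypothetical, and this is a fragment for the test bench architecture rather than a complete design:

```vhdl
-- In the architecture's declarative part, alongside the other signals
-- (en_reg and reg_we are made-up names for this sketch):
signal en_reg : std_logic := '0';
signal reg_we : std_logic := '0';

-- In the architecture body, as concurrent assignments:
en_reg <= en_regread or en_regwrite;     -- active in both register stages
reg_we <= dataWriteReg and en_regwrite;  -- writes only in the writeback stage

uut_reg: reg16_8 PORT MAP (
  I_clk => I_clk,
  I_en => en_reg,
  I_dataD => dataResult,
  O_dataA => dataA,
  O_dataB => dataB,
  I_selA => selA,
  I_selB => selB,
  I_selD => selD,
  I_we => reg_we
);
```

The behaviour is the same as the listing above; the only difference is that the port map sees plain signal names.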

Now we need slight changes to the test bench process itself. We populate our first instruction whilst the control unit is in a reset state, before enabling it. Then, instead of waiting a fixed number of clock cycles after each instruction is assigned, we wait until the register writeback state is reached. This is the last stage of the pipeline, and is the MSB of our state output from the control unit, signal en_regwrite in our test. At this point, it’s safe to assign the next instruction to the decoder input. We repeat this for all instructions like so:

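reset <= '1'; -- reset control unit
--load.h r0,0xfe
instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
-- hold reset across a rising clock edge; without a wait here only the
-- later assignment of '0' would ever take effect on the reset signal
wait for I_clk_period;
reset <= '0'; -- enable/start control unit
wait until en_regwrite = '1';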

--load.l r1, 0xed
instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
wait until en_regwrite = '1';

--or r2, r0, r1
instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
wait until en_regwrite = '1';

--load.l r3, 1
instruction <= OPCODE_LOAD & "011" & '1' & X"01";
wait until en_regwrite = '1';

--load.l r4, 2
instruction <= OPCODE_LOAD & "100" & '1' & X"02";
wait until en_regwrite = '1';

--add.u r3, r3, r4
instruction <= OPCODE_ADD & "011" & '0' & "011" & "100" & "00";
wait until en_regwrite = '1';

--or r5, r0, r3
instruction <= OPCODE_OR & "101" & '0' & "000" & "011" & "00";
wait until en_regwrite = '1';
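Since every instruction follows the same assign-then-wait pattern, it can be wrapped in a small procedure local to the stimulus process. This is only a sketch: the procedure name issue is hypothetical, the instruction signal is assumed to be a 16-bit std_logic_vector, and the fragment belongs inside the existing test bench rather than standing alone.

```vhdl
stim_proc: process
  -- Hypothetical helper: drive one instruction word, then block until the
  -- control unit reaches the writeback stage.
  procedure issue(constant word : in std_logic_vector(15 downto 0)) is
  begin
    instruction <= word;
    wait until en_regwrite = '1';
  end procedure;
begin
  -- (reset handling and the first instruction as in the listing above)

  --load.l r1, 0xed
  issue(OPCODE_LOAD & "001" & '1' & X"ed");

  --or r2, r0, r1
  issue(OPCODE_OR & "010" & '0' & "000" & "001" & "00");

  wait;
end process;
```

A procedure declared inside the process can assign to instruction directly, and the wait statement inside it is legal because the process has no sensitivity list.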

[Figure: control_1]

Running this in the simulator, we get a nice output waveform. All instructions executed correctly, and ran perfectly in sync. The registers at the end of the simulation are listed in the image above, with a red line linking each one to the point at which that register was written. I’ve annotated the pipeline stages too. You can see that the next instruction gets assigned to the decoder input in the writeback stage, so there is a little bit of overlap in the pipeline, but that’s to be expected. It’s not real concurrent operation within the pipeline, it’s just ensuring the data is available when it’s needed.

Wrapping Up

Hopefully this has shown the importance of the control unit, and how it conducts the process of execution in this CPU. It has many more functions to manage – especially when we start implementing memory operations, and dealing with that shouldBranch ALU output – more on that later!

Thanks for reading, comments as always to @domipheus.