Domipheus Labs

Stuff that interests Colin ‘Domipheus’ Riley

Content follows this message
If you have enjoyed my articles, please consider these charities for donation:
  • Young Lives vs Cancer - Donate.
  • Blood Cancer UK - Donate.
  • Children's Cancer and Leukaemia Group - Donate.

Designing a CPU in VHDL, Part 6: Program Counter, Instruction Fetch, Branching

Posted Jul 23, 2015, Reading time: 12 minutes.

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

The last part paved the way for getting this simple CPU self sustaining. This means that the test bench doesn’t feed instructions into the decoder, the CPU itself requests and fetches from a RAM somewhere. There is quite a bit of work to this in terms of the various ways of connecting up a RAM to the CPU and having a unit to manage the program counter and the fetching of the next one.

The Program Counter unit

The PC is just a register containing the location of the currently executing instruction. However, we need to operate on it, so will create a unit to manage this.

Our PC unit will obviously hold the current PC, and on command increment it. It will have an input for setting the next PC value, and also the ability to stop – stay at the same location – which we need due to our pipeline being several cycles long.

With different possibilities for the PC unit operating, we can have an input to it which dictates the operation on the PC to perform:

We can use a 2-bit opcode input to select one of these operations. Our PC unit then looks like this functional unit.

We get back into the Xilinx ISE project, adding out new pc_unit.vhd with the various input and output ports we need.

entity pc_unit is
    Port ( I_clk : in  STD_LOGIC;
           I_nPC : in  STD_LOGIC_VECTOR (15 downto 0);
           I_nPCop : in  STD_LOGIC_VECTOR (1 downto 0);
           O_PC : out  STD_LOGIC_VECTOR (15 downto 0)
           );
end pc_unit;

The process triggered by the rising edge of the input clock will check the input I_nPCop port for an opcode on which to execute on the PC value. We use a case statement like previous examples in the series. We also have an internal signal to act as a register holding the current PC.

architecture Behavioral of pc_unit is
  signal current_pc: std_logic_vector( 15 downto 0) := X"0000";
begin

  process (I_clk)
  begin
    if rising_edge(I_clk) then
      case I_nPCop is
        when PCU_OP_NOP => 	-- NOP, keep PC the same/halt
        when PCU_OP_INC => 	-- increment
          current_pc <= std_logic_vector(unsigned(current_pc) + 1);
        when PCU_OP_ASSIGN => 	-- set from external input
          current_pc <= I_nPC;
        when PCU_OP_RESET => 	-- Reset
          current_pc <= X"0000";
        when others =>
      end case;
    end if;
  end process;

  O_PC <= current_pc;

end Behavioral;

The constants are in tpu_constants.vhd with the others:

-- PC unit opcodes
constant PCU_OP_NOP: std_logic_vector(1 downto 0):= "00";
constant PCU_OP_INC: std_logic_vector(1 downto 0):= "01";
constant PCU_OP_ASSIGN: std_logic_vector(1 downto 0):= "10";
constant PCU_OP_RESET: std_logic_vector(1 downto 0):= "11";

This simple unit should allow us to reset the PC, increment during normal operation, branch to a new location when we need to, and also halt and spin on the same PC value.

Integrating a new Testbench

Early in the series we got a test ‘RAM’ made up. We’re going to take that code, and copy it into a copy of our existing test bench – but we will initialize the RAM with our test instruction stream. It will look like the following:

entity ram_tb is
    Port ( I_clk : in  STD_LOGIC;
           I_we : in  STD_LOGIC;
           I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
           I_data : in  STD_LOGIC_VECTOR (15 downto 0);
           O_data : out  STD_LOGIC_VECTOR (15 downto 0));
end ram_tb;

architecture Behavioral of ram_tb is
  type store_t is array (0 to 7) of std_logic_vector(15 downto 0);
   signal ram: store_t := (
    OPCODE_LOAD & "000" & '0' & X"fe",
    OPCODE_LOAD & "001" & '1' & X"ed",
    OPCODE_OR & "010" & '0' & "000" & "001" & "00",
    OPCODE_LOAD & "011" & '1' & X"01",
    OPCODE_LOAD & "100" & '1' & X"02",
    OPCODE_ADD & "011" & '0' & "011" & "100" & "00",
    OPCODE_OR & "101" & '0' & "000" & "011" & "00",
    OPCODE_AND & "101" & '0' & "101" & "010" & "00"
    );
begin

  process (I_clk)
  begin
    if rising_edge(I_clk) then
      if (I_we = '1') then
        ram(to_integer(unsigned(I_addr(2 downto 0)))) <= I_data;
      else
        O_data <= ram(to_integer(unsigned(I_addr(2 downto 0))));
      end if;
    end if;
  end process;

end Behavioral;

We can then initialize the unit in our main test bench body. The PC unit should be added too.

component ram_tb
Port ( I_clk : in  STD_LOGIC;
      I_we : in  STD_LOGIC;
      I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
      I_data : in  STD_LOGIC_VECTOR (15 downto 0);
      O_data : out  STD_LOGIC_VECTOR (15 downto 0)
      );
end component;

COMPONENT pc_unit
PORT(
     I_clk : IN  std_logic;
     I_nPC : IN  std_logic_vector(15 downto 0);
     I_nPCop : IN  std_logic_vector(1 downto 0);
     O_PC : OUT std_logic_vector(15 downto 0)
    );
END COMPONENT;


signal ramWE : std_logic := '0';
signal ramAddr: std_logic_vector(15 downto 0);
signal ramRData: std_logic_vector(15 downto 0);
signal ramWData: std_logic_vector(15 downto 0);

signal nPC: std_logic_vector(15 downto 0);
signal pcop: std_logic_vector(1 downto 0);
signal in_pc: std_logic_vector(15 downto 0);

After the begin statement, we define the parts and map the ports to respective signals.

uut_ram: ram_tb Port map (
  I_clk => I_clk,
  I_we => ramWE,
  I_addr => ramAddr,
  I_data => ramWData,
  O_data => ramRData
);
uut_pcunit: pc_unit Port map (
  I_clk => I_clk,
  I_nPC => in_pc,
  I_nPCop => pcop,
  O_PC => PC
);

We don’t use the ramWData or ramWE ports, so assign them to some default values. the ramAddr port should always be assigned to PC, and the instruction port we used to set manually we will assign from ramRData – the output from our ram. We put these assignments outside of a process – they are always true.

ramAddr <= PC;
ramWData <= X"FFFF";
ramWE <= '0';
instruction <= ramRData;

The last input we need to figure out is the PC unit operation. What we need to do is tie this to the current control unit state – in that, we want it to increment at a given point, and we want the rest of the time to be in the NOP mode.

pcop<= PCU_OP_RESET when reset = '1' else PCU_OP_INC when state(2) = '1' else PCU_OP_NOP;

This resets the PC unit when there is a control unit reset, increments on the ALU execute stage of the pipeline, and nops otherwise.

As for the testbench itself, we no longer feed the instruction signal with our instructions, so we remove all that. We have a short reset, and then wait until our PC is 8 – as then we have overflowed memory, and our test is complete. The ‘CPU’ should be self sustaining, execute everything up until the PC equals 0x8, then go into an indefinite reset state.

stim_proc: process
begin		

  reset<='1'; -- reset control unit
  wait for I_clk_period; -- wait a cycles
  reset <= '0';

  wait until PC = X"0008"; -- we only have 8 instructions
  reset <= '1'; -- reset control unit
  wait;

end process;

Getting back to the PC unit, the point in the pipeline when we increment is very important. We need to account for any latency, for all uses of the PC – as shown in the next two simulation captures.

You can see when the increment happens at the writeback phase, we have an issue that the instructions are delayed and subsequently offset from each other. The problem here is that the increment needs to happen at the writeback, as the ALU in a branch instruction could issue an assign to PC – and in the first image (at yellow circles) you can see where we’d want to increment/set the ALU has not got it’s result ready.

The Fetch Stage

Previously we had a cloud of uncertainty around the fetch area of the instruction pipeline. We need to add it now, to account for any latencies in fetching the opcode for the decode stage. I’m adding it by extending our simple control unit, with two extra bits of state. The second bit will be for the memory stage of the pipeline which we’ll discuss later, but best to add it now. Simply extend the std_logic_vector, add the extra case statements, and extend the outputs where required. We ignore the last state switch – the memory stage. In the test bench source, we need to add an en_fetch and also change en_decode, en_regread, etc, to reflect new bit positions. They are just shifted up one.

If we change our assignment to pcop, to now inc on the writeback phase (bit 4 now, due to fetch now being bit 0):

pcop <= PCU_OP_RESET when reset = '1' else
        PCU_OP_INC when state(4) = '1' else
        PCU_OP_NOP;

Simulating it results in the following output:

A) increment PC, on previous writeback cycle. 0x1 now on ram address bus.
B) ram puts value at 0x1 on read output bus, appears at decoder instruction input in time for decode cycle.
C) decoder outputs results from decode
D) result from alu ready for writeback.

We can see this extra fetch cycle of latency allows for the increment/PC operation to occur as late as possible, with the ALU result also available when this happens.

Branching

So now, what about branching? We need to set the pc to the output of the alu, if shouldBranch is true. This is, surprisingly, very easy to achieve. We simply amend the assignment to the pcop to add the assign condition:

pcop <= PCU_OP_RESET when reset = '1' else
        PCU_OP_ASSIGN when shouldBranch = '1' and state(4) = '1' else
        PCU_OP_INC when shouldBranch = '0' and state(4) = '1' else
        PCU_OP_NOP;

We will change our last instruction in the RAM to jump back to our addition, to make an infinite loop.

signal ram: store_t := (
  OPCODE_LOAD & "000" & '0' & X"fe",
  OPCODE_LOAD & "001" & '1' & X"ed",
  OPCODE_OR & "010" & '0' & "000" & "001" & "00",
  OPCODE_LOAD & "011" & '1' & X"01",
  OPCODE_LOAD & "100" & '1' & X"02",
  OPCODE_ADD & "011" & '0' & "011" & "100" & "00",
  OPCODE_OR & "101" & '0' & "000" & "011" & "00",
  OPCODE_JUMP & "000" & '1' & X"05"
  );

We also need to remove the ‘wait until PC = X”0008″;’ from the testbench. The test code effectively becomes just a reset and a wait. We must also now set the PC input to the PC unit to our ALU output, so we can assign the PC to new values.

in_pc <= dataResult;

stim_proc: process
 begin		

  reset<='1'; -- reset control unit
  wait for I_clk_period; -- wait a cycles
  reset <= '0';

  wait;
end process;

And that will then produce this rather fun simulation – self sustaining, with branching!

I thought it would be useful to go and zoom into the branch on that simulation, and step through the transitions.

At (A), the ALU has decided we should branch. As we are in the writeback phase of the pipeline, pcop becomes assign instead of increment at (B), putting dataresult on in_pc. the assign happens, and the next cycle, now in fetch (C), we have the new instruction being decoded from our branch target.

To show issues that can occur without correct timing, we have a simulation without the fetch stage and PC operation at the ALU stage. The branch happens, eventually, but the CPU executes an additional instruction first. Some architectures actually do this, and you need to fill out the instruction stream after branches with nops or other useful operations. Look up Delay Slots if you are interested in learning more. For the TPU design, I didn’t want this.

Conditional branching

We may as well check whether conditional branches work. For this, we just create a new test with a larger RAM, and fill it with the following code:

signal ram: store_t := (
  OPCODE_LOAD & "001" & '1' & X"05",
  OPCODE_LOAD & "010" & '1' & X"03",
  OPCODE_XOR & "000" & '0' & "000" & "000" & "00",
  OPCODE_LOAD & "011" & '1' & X"01",
  OPCODE_LOAD & "110" & '1' & X"0b",
  OPCODE_OR & "100" & '0' & "010" & "010" & "00",
  OPCODE_CMP & "101" & '0' & "100" & "000" & "00",
  OPCODE_JUMPEQ & "000" & '0' & "101" & "110" & "01",
  OPCODE_SUB & "100" & '0' & "100" & "011" & "00",
  OPCODE_ADD & "000" & '0' & "000" & "001" & "00",
  OPCODE_JUMP & "000" & '1' & X"06",
  OPCODE_JUMP & "000" & '1' & X"0b",
  X"0000",
  X"0000",
  X"0000",
  X"0000"
);

This is code for a simple multiply, with the operands in r1 and r2, the result going in r0.

0x00    load.l r1, 0x05
0x01    load.l r2, 0x03
0x02    xor    r0, r0, r0
0x03    load.l r3, 0x01
0x04    load.l r6, $END
0x05    or     r4, r2, r2
      loop:
0x06    cmp.u    r5, r4, r0
0x07    jump.aez r5, r6
0x08    sub.u    r4, r4, r3
0x09    add.u    r0, r0, r1
0x0a    jump $loop
      end:
0x0b    jump $end

It simulates, and it works. You can see the branch structure from the simulation quite easily, and the result is valid.

There are a few ISA issues apparent when we start drawing up ‘more complex’ TPU assembly examples. Having some sort of simple add/sub immediate and a conditional jump to relative offset would reduce our instruction count. But, we can always revisit and change our ISA – one of the benefits of rolling your own cpu!

Wrapping Up

So, we managed to get TPU self-sustaining – that is, it chooses which instructions to consume, and does so indefinitely. The test bench essentially becomes a wait, as the CPU decides what to do based on the instruction stream – It’s a real CPU!

To do this, we had to add our fetch stage to the pipeline, and also investigate issues with when changes to the PC occurred. We added our PC unit to track this, and it’s fairly simple but allows for normal sequential execution and branching.

An issue with our CPU is that it’s still relatively limited by number of registers, and the fact we cannot get any real input other than instructions. To get more, we need to give the CPU access to manipulate memory. And with that, possibly memory-mapped peripherals. So we’ll look into that next.

Thanks for reading, comments as always to @domipheus.