Designing a CPU in VHDL, Part 6: Program Counter, Instruction Fetch, Branching

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

The last part paved the way for getting this simple CPU self sustaining. This means that the test bench doesn’t feed instructions into the decoder, the CPU itself requests and fetches from a RAM somewhere. There is quite a bit of work to this in terms of the various ways of connecting up a RAM to the CPU and having a unit to manage the program counter and the fetching of the next one.

The Program Counter unit

The PC is just a register containing the location of the currently executing instruction. However, we need to operate on it, so will create a unit to manage this.

Our PC unit will obviously hold the current PC, and on command increment it. It will have an input for setting the next PC value, and also the ability to stop – stay at the same location – which we need due to our pipeline being several cycles long.

With different possibilities for the PC unit operating, we can have an input to it which dictates the operation on the PC to perform:

  • Increment PC
  • Set PC to new value
  • Do nothing (halt)
  • Set PC to our reset vector, which is 0x0000.

We can use a 2-bit opcode input to select one of these operations. Our PC unit then looks like this functional unit.

pcunit_functionalWe get back into the Xilinx ISE project, adding out new pc_unit.vhd with the various input and output ports we need.

entity pc_unit is
    Port ( I_clk : in  STD_LOGIC;
           I_nPC : in  STD_LOGIC_VECTOR (15 downto 0);
           I_nPCop : in  STD_LOGIC_VECTOR (1 downto 0);
           O_PC : out  STD_LOGIC_VECTOR (15 downto 0)
           );
end pc_unit;

The process triggered by the rising edge of the input clock will check the input I_nPCop port for an opcode on which to execute on the PC value. We use a case statement like previous examples in the series. We also have an internal signal to act as a register holding the current PC.

architecture Behavioral of pc_unit is
  signal current_pc: std_logic_vector( 15 downto 0) := X"0000";
begin

  process (I_clk)
  begin
    if rising_edge(I_clk) then
      case I_nPCop is
        when PCU_OP_NOP => 	-- NOP, keep PC the same/halt
        when PCU_OP_INC => 	-- increment
          current_pc <= std_logic_vector(unsigned(current_pc) + 1);
        when PCU_OP_ASSIGN => 	-- set from external input
          current_pc <= I_nPC;
        when PCU_OP_RESET => 	-- Reset
          current_pc <= X"0000";
        when others =>
      end case;
    end if;
  end process;

  O_PC <= current_pc;

end Behavioral;

The constants are in tpu_constants.vhd with the others:

-- PC unit opcodes
constant PCU_OP_NOP: std_logic_vector(1 downto 0):= "00";
constant PCU_OP_INC: std_logic_vector(1 downto 0):= "01";
constant PCU_OP_ASSIGN: std_logic_vector(1 downto 0):= "10";
constant PCU_OP_RESET: std_logic_vector(1 downto 0):= "11";

This simple unit should allow us to reset the PC, increment during normal operation, branch to a new location when we need to, and also halt and spin on the same PC value.

Integrating a new Testbench

Early in the series we got a test ‘RAM’ made up. We’re going to take that code, and copy it into a copy of our existing test bench – but we will initialize the RAM with our test instruction stream. It will look like the following:

entity ram_tb is
    Port ( I_clk : in  STD_LOGIC;
           I_we : in  STD_LOGIC;
           I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
           I_data : in  STD_LOGIC_VECTOR (15 downto 0);
           O_data : out  STD_LOGIC_VECTOR (15 downto 0));
end ram_tb;

architecture Behavioral of ram_tb is
  type store_t is array (0 to 7) of std_logic_vector(15 downto 0);
   signal ram: store_t := (
    OPCODE_LOAD & "000" & '0' & X"fe",
    OPCODE_LOAD & "001" & '1' & X"ed",
    OPCODE_OR & "010" & '0' & "000" & "001" & "00",
    OPCODE_LOAD & "011" & '1' & X"01",
    OPCODE_LOAD & "100" & '1' & X"02",
    OPCODE_ADD & "011" & '0' & "011" & "100" & "00",
    OPCODE_OR & "101" & '0' & "000" & "011" & "00",
    OPCODE_AND & "101" & '0' & "101" & "010" & "00"
    );
begin

  process (I_clk)
  begin
    if rising_edge(I_clk) then
      if (I_we = '1') then
        ram(to_integer(unsigned(I_addr(2 downto 0)))) <= I_data;
      else
        O_data <= ram(to_integer(unsigned(I_addr(2 downto 0))));
      end if;
    end if;
  end process;

end Behavioral;

We can then initialize the unit in our main test bench body. The PC unit should be added too.

component ram_tb
Port ( I_clk : in  STD_LOGIC;
      I_we : in  STD_LOGIC;
      I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
      I_data : in  STD_LOGIC_VECTOR (15 downto 0);
      O_data : out  STD_LOGIC_VECTOR (15 downto 0)
      );
end component;

COMPONENT pc_unit
PORT(
     I_clk : IN  std_logic;
     I_nPC : IN  std_logic_vector(15 downto 0);
     I_nPCop : IN  std_logic_vector(1 downto 0);
     O_PC : OUT std_logic_vector(15 downto 0)
    );
END COMPONENT;


signal ramWE : std_logic := '0';
signal ramAddr: std_logic_vector(15 downto 0);
signal ramRData: std_logic_vector(15 downto 0);
signal ramWData: std_logic_vector(15 downto 0);

signal nPC: std_logic_vector(15 downto 0);
signal pcop: std_logic_vector(1 downto 0);
signal in_pc: std_logic_vector(15 downto 0);

After the begin statement, we define the parts and map the ports to respective signals.

uut_ram: ram_tb Port map (
  I_clk => I_clk,
  I_we => ramWE,
  I_addr => ramAddr,
  I_data => ramWData,
  O_data => ramRData
);
uut_pcunit: pc_unit Port map (
  I_clk => I_clk,
  I_nPC => in_pc,
  I_nPCop => pcop,
  O_PC => PC
);

We don’t use the ramWData or ramWE ports, so assign them to some default values. the ramAddr port should always be assigned to PC, and the instruction port we used to set manually we will assign from ramRData – the output from our ram. We put these assignments outside of a process – they are always true.

ramAddr <= PC;
ramWData <= X"FFFF";
ramWE <= '0';
instruction <= ramRData;

The last input we need to figure out is the PC unit operation. What we need to do is tie this to the current control unit state – in that, we want it to increment at a given point, and we want the rest of the time to be in the NOP mode.

pcop<= PCU_OP_RESET when reset = '1' else PCU_OP_INC when state(2) = '1' else PCU_OP_NOP;

This resets the PC unit when there is a control unit reset, increments on the ALU execute stage of the pipeline, and nops otherwise.

As for the testbench itself, we no longer feed the instruction signal with our instructions, so we remove all that. We have a short reset, and then wait until our PC is 8 – as then we have overflowed memory, and our test is complete. The ‘CPU’ should be self sustaining, execute everything up until the PC equals 0x8, then go into an indefinite reset state.

stim_proc: process
begin		

  reset<='1'; -- reset control unit
  wait for I_clk_period; -- wait a cycles
  reset <= '0';

  wait until PC = X"0008"; -- we only have 8 instructions
  reset <= '1'; -- reset control unit
  wait;

end process;

Getting back to the PC unit, the point in the pipeline when we increment is very important. We need to account for any latency, for all uses of the PC – as shown in the next two simulation captures.

latest_onlyPC_inc_at_exec_no_fetch latest_onlyPC_inc_at_writeback_no_fetchYou can see when the increment happens at the writeback phase, we have an issue that the instructions are delayed and subsequently offset from each other. The problem here is that the increment needs to happen at the writeback, as the ALU in a branch instruction could issue an assign to PC – and in the first image (at yellow circles) you can see where we’d want to increment/set the ALU has not got it’s result ready.

The Fetch Stage

Previously we had a cloud of uncertainty around the fetch area of the instruction pipeline. We need to add it now, to account for any latencies in fetching the opcode for the decode stage. I’m adding it by extending our simple control unit, with two extra bits of state. The second bit will be for the memory stage of the pipeline which we’ll discuss later, but best to add it now. Simply extend the std_logic_vector, add the extra case statements, and extend the outputs where required. We ignore the last state switch – the memory stage. In the test bench source, we need to add an en_fetch and also change en_decode, en_regread, etc, to reflect new bit positions. They are just shifted up one.

If we change our assignment to pcop, to now inc on the writeback phase (bit 4 now, due to fetch now being bit 0):

pcop <= PCU_OP_RESET when reset = '1' else
        PCU_OP_INC when state(4) = '1' else
        PCU_OP_NOP;

Simulating it results in the following output:

fetchstage_inc_writebackA) increment PC, on previous writeback cycle. 0x1 now on ram address bus.
B) ram puts value at 0x1 on read output bus, appears at decoder instruction input in time for decode cycle.
C) decoder outputs results from decode
D) result from alu ready for writeback.

We can see this extra fetch cycle of latency allows for the increment/PC operation to occur as late as possible, with the ALU result also available when this happens.

Branching

So now, what about branching? We need to set the pc to the output of the alu, if shouldBranch is true. This is, surprisingly, very easy to achieve. We simply amend the assignment to the pcop to add the assign condition:

pcop <= PCU_OP_RESET when reset = '1' else
        PCU_OP_ASSIGN when shouldBranch = '1' and state(4) = '1' else
        PCU_OP_INC when shouldBranch = '0' and state(4) = '1' else
        PCU_OP_NOP;

We will change our last instruction in the RAM to jump back to our addition, to make an infinite loop.

signal ram: store_t := (
  OPCODE_LOAD & "000" & '0' & X"fe",
  OPCODE_LOAD & "001" & '1' & X"ed",
  OPCODE_OR & "010" & '0' & "000" & "001" & "00",
  OPCODE_LOAD & "011" & '1' & X"01",
  OPCODE_LOAD & "100" & '1' & X"02",
  OPCODE_ADD & "011" & '0' & "011" & "100" & "00",
  OPCODE_OR & "101" & '0' & "000" & "011" & "00",
  OPCODE_JUMP & "000" & '1' & X"05"
  );

We also need to remove the ‘wait until PC = X”0008″;’ from the testbench. The test code effectively becomes just a reset and a wait. We must also now set the PC input to the PC unit to our ALU output, so we can assign the PC to new values.


in_pc <= dataResult;

stim_proc: process
 begin		

  reset<='1'; -- reset control unit
  wait for I_clk_period; -- wait a cycles
  reset <= '0';

  wait;
end process;

And that will then produce this rather fun simulation – self sustaining, with branching!

self_sustain_branchingI thought it would be useful to go and zoom into the branch on that simulation, and step through the transitions.

latest_onlyPC_inc_at_write_with_fetch_see_correct_branch_2At (A), the ALU has decided we should branch. As we are in the writeback phase of the pipeline, pcop becomes assign instead of increment at (B), putting dataresult on in_pc. the assign happens, and the next cycle, now in fetch (C), we have the new instruction being decoded from our branch target.

To show issues that can occur without correct timing, we have a simulation without the fetch stage and PC operation at the ALU stage. The branch happens, eventually, but the CPU executes an additional instruction first. Some architectures actually do this, and you need to fill out the instruction stream after branches with nops or other useful operations. Look up Delay Slots if you are interested in learning more. For the TPU design, I didn’t want this.

latest_onlyPC_inc_at_exec_no_fetch_see_wrong_branchConditional branching

We may as well check whether conditional branches work. For this, we just create a new test with a larger RAM, and fill it with the following code:

signal ram: store_t := (
  OPCODE_LOAD & "001" & '1' & X"05",
  OPCODE_LOAD & "010" & '1' & X"03",
  OPCODE_XOR & "000" & '0' & "000" & "000" & "00",
  OPCODE_LOAD & "011" & '1' & X"01",
  OPCODE_LOAD & "110" & '1' & X"0b",
  OPCODE_OR & "100" & '0' & "010" & "010" & "00",
  OPCODE_CMP & "101" & '0' & "100" & "000" & "00",
  OPCODE_JUMPEQ & "000" & '0' & "101" & "110" & "01",
  OPCODE_SUB & "100" & '0' & "100" & "011" & "00",
  OPCODE_ADD & "000" & '0' & "000" & "001" & "00",
  OPCODE_JUMP & "000" & '1' & X"06",
  OPCODE_JUMP & "000" & '1' & X"0b",
  X"0000",
  X"0000",
  X"0000",
  X"0000"
);

This is code for a simple multiply, with the operands in r1 and r2, the result going in r0.

0x00    load.l r1, 0x05
0x01    load.l r2, 0x03
0x02    xor    r0, r0, r0
0x03    load.l r3, 0x01
0x04    load.l r6, $END
0x05    or     r4, r2, r2
      loop:
0x06    cmp.u    r5, r4, r0
0x07    jump.aez r5, r6
0x08    sub.u    r4, r4, r3
0x09    add.u    r0, r0, r1
0x0a    jump $loop
      end:
0x0b    jump $end

It simulates, and it works. You can see the branch structure from the simulation quite easily, and the result is valid.

self_sustain_conditional_branchingThere are a few ISA issues apparent when we start drawing up ‘more complex’ TPU assembly examples. Having some sort of simple add/sub immediate and a conditional jump to relative offset would reduce our instruction count. But, we can always revisit and change our ISA – one of the benefits of rolling your own cpu!

Wrapping Up

So, we managed to get TPU self-sustaining – that is, it chooses which instructions to consume, and does so indefinitely. The test bench essentially becomes a wait, as the CPU decides what to do based on the instruction stream – It’s a real CPU!

To do this, we had to add our fetch stage to the pipeline, and also investigate issues with when changes to the PC occurred. We added our PC unit to track this, and it’s fairly simple but allows for normal sequential execution and branching.

An issue with our CPU is that it’s still relatively limited by number of registers, and the fact we cannot get any real input other than instructions. To get more, we need to give the CPU access to manipulate memory. And with that, possibly memory-mapped peripherals. So we’ll look into that next.

Thanks for reading, comments as always to @domipheus.

Designing a CPU in VHDL, Part 5: Pipeline and Control Unit

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

This is a disclaimer that the VHDL here is probably not the best you will see, but it gets the job done – in the simulator, at least. If you spot any serious errors, or woeful performance gotchas I’ve fallen for – please let me know at @domipheus. The aim of these posts is to get a very simple 16-bit CPU up and running, and then get stuck into some optimization opportunities later.

We now have our decoder, ALU, and registers. We can now create a test bench to simulate what happens when we connect those together in a fixed, controllable way. The testbench source is on github, so I won’t go too much into creating the boilerplate. Create an ALU test bench, and then manually add the decoder, register file, and associated signals to the file.

Joining functional units together

With our ALU, decoder and regs entities available, we can connect them together with signals like so.

uut_decoder: decode PORT MAP (
    I_clk => I_clk,
    I_dataInst => instruction,
    I_en => en,
    O_selA => selA,
    O_selB => selB,
    O_selD => selD,
    O_dataIMM => dataIMM,
    O_regDwe => dataDwe,
    O_aluop => aluop
  );

uut_alu: alu PORT MAP (
    I_clk => I_clk,
    I_en => en,
    I_dataA => dataA,
    I_dataB => dataB,
    I_dataDwe => dataDwe,
    I_aluop => aluop,
    I_PC => PC,
    I_dataIMM => dataIMM,
    O_dataResult => dataResult,
    O_dataWriteReg => dataWriteReg,
    O_shouldBranch => shouldBranch
  );

uut_reg: reg16_8 PORT MAP (
    I_clk => I_clk,
    I_en => '1',
    I_dataD => dataResult,
    O_dataA => dataA,
    O_dataB => dataB,
    I_selA => selA,
    I_selB => selB,
    I_selD => selD,
    I_we => dataWriteReg
  );

If we include the tpu_constants file, we can use the opcode definitions to write the instruction signal before waiting a cycle. We also need to remember to enable the units using the enable ‘en’ signal, which is connected to all units.

-- Stimulus process
stim_proc: process
begin
  -- hold reset state for 100 ns.
  wait for 100 ns;	

  wait for I_clk_period*10;
  en<='1';

  --load.h r0,0xfe
  instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
  wait for I_clk_period;

  --load.l r1, 0xed
  instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
  wait for I_clk_period;

  --or r2, r0, r1
  instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
  wait for I_clk_period;
  wait;
end process;

When simulating this, you get the wrong result. It’s fairly understandable why – we’ve designed these units to be part of a pipeline – and we’re only giving a single cycle of latency for a result at the end of it, when we need at least two cycles.

first_run_1cycleIf we edit our test to add two cycles of latency, we get a much better result.

  --load.h r0,0xfe
  instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
  wait for I_clk_period;
  wait for I_clk_period;

  --load.l r1, 0xed
  instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
  wait for I_clk_period;
  wait for I_clk_period;

  --or r2, r0, r1
  instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
  wait for I_clk_period;
  wait for I_clk_period;

2_first_run_2cycleLooking at the memory view, for our uut_reg object, we can see r0, r1 and r2 have the correct data.

2_first_run_2cycle_regmemSuccess! But what about the following example?

--load.l r3, 1
instruction <= OPCODE_LOAD & "011" & '1' & X"01";
wait for I_clk_period;
wait for I_clk_period;

--load.l r4, 2
instruction <= OPCODE_LOAD & "100" & '1' & X"02";
wait for I_clk_period;
wait for I_clk_period;

--add.u r3, r3, r4
instruction <= OPCODE_ADD & "011" & '0' & "011" & "100" & "00";
wait for I_clk_period;
wait for I_clk_period;

You will see one item of interest here. The add instruction uses the same register as a source and destination. We tested the register file before, as we know reading from/writing to the same register is technically fine. But what when it’s part of a bigger system? It doesn’t go so well.

wrong_write_depI’ve annotated the simulation timeline with the instructions and two points of interest. Point A is the result of the add – it’s delayed by a cycle. This offsets the result for any potential next instruction – which you can see at point B.

--or r5, r0, r3
instruction <= OPCODE_OR & "101" & '0' & "000" & "011" & "00";
wait for I_clk_period;
wait for I_clk_period;

If we add the above instruction to the mix we can see how serious of an issue this can be.

wrong_write_dep_proofThe correct write into r3 doesn’t happen in time for the OR operation, so the result in r5 will be incorrect. And we can see this by looking at the uut_reg unit in memory view.

wrong_or_reg_r3We can see that with the units designed as they are, 3 cycles will be needed for safe decode – execute – writeback. Adding a third wait for I_clk_period; after the add instruction, which allows time for the writeback into the register, allows for correct operation.

right_r3_extra_writeback_cycleControl Unit

What we need for all of this, isn’t to add cycles of latency like above, but to add a unit responsible for telling other units what to do, and when. To figure out what we want, we’re going to need to lay out what we think our pipeline looks like, and what else we need.

pipeI’m going all out simple here, giving an independent cycle for every part of the pipeline. We don’t execute the next instruction until the current one has traversed every stage. We still don’t know whats happening in the fetch and memory stages, but fairly certain of the others. We add a read stage to be sure our data is ready for the ALU when needed – we can always remove states later when we optimize the completed design.

Since all of the units we’ve created so far have enable ports, we can get a control unit to synchronize everything up and drive those enable bits. The control unit will have one output, a bitmask showing the pipeline state currently active, as well as a reset and clock input. Each clock cycle the state will increment by one bit, and if reset is high it will reset to initial state. The control unit is technically a state machine, but for now it’s simple enough we can just classify it as a counter. You can do this with integers or other types, I’m just a fan of raw bit vectors.

entity controlsimple is
  Port ( I_clk : in  STD_LOGIC;
         I_reset : in  STD_LOGIC;
         O_state : out  STD_LOGIC_VECTOR (3 downto 0)
         );
end controlsimple;

architecture Behavioral of controlsimple is
  signal s_state: STD_LOGIC_VECTOR(3 downto 0) := "0001";
begin
  process(I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        s_state <= "0001";
      else
        case s_state is
          when "0001" =>
            s_state <= "0010";
          when "0010" =>
            s_state <= "0100";
          when "0100" =>
            s_state <= "1000";
          when "1000" =>
            s_state <= "0001";
          when others =>
            s_state <= "0001";
        end case;
      end if;
    end if;
  end process;

  O_state <= s_state;
end Behavioral;

We can create a new testbench based off of out current decode, alu, register one, and add in the simplecontrol component. I won’t post all the code to that here, but you can see it on github. We attach the enable lines of the components to the relevant bits of our state output:

en_decode <= state(0);
en_regread <= state(1);
en_alu <= state(2);
en_regwrite <= state(3);

The register file is a bit different in terms of the enable bit hookup, as to accommodate two states in the pipeline. The enable bit is tied to both read and write states, but we ensure writes only happen at the correct stage by using it in the write enable input.

uut_reg: reg16_8 PORT MAP (
  I_clk => I_clk,
  I_en => en_regread or en_regwrite,
  I_dataD => dataResult,
  O_dataA => dataA,
  O_dataB => dataB,
  I_selA => selA,
  I_selB => selB,
  I_selD => selD,
  I_we => dataWriteReg and en_regwrite
);

Now we need slight changes to the test bench process itself. We populate our first instruction whilst the control unit is in a reset state, before enabling it. Then, instead of waiting after each instruction is assigned by a specific number of clock cycles, we wait until the register writeback state is reached. This is the last part of the pipeline, and is the MSB of our state output from the control unit – signal en_regwrite in our test. At this point, it’s safe to assign the next instruction to the decoder input. We repeat this for all instructions like so:

reset <= '1'; -- reset control unit
--load.h r0,0xfe
instruction <= OPCODE_LOAD & "000" & '0' & X"fe";
reset <= '0'; -- enable/start control unit
wait until en_regwrite = '1';

--load.l r1, 0xed
instruction <= OPCODE_LOAD & "001" & '1' & X"ed";
wait until en_regwrite = '1';

--or r2, r0, r1
instruction <= OPCODE_OR & "010" & '0' & "000" & "001" & "00";
wait until en_regwrite = '1';

--load.l r3, 1
instruction <= OPCODE_LOAD & "011" & '1' & X"01";
wait until en_regwrite = '1';

--load.l r4, 2
instruction <= OPCODE_LOAD & "100" & '1' & X"02";
wait until en_regwrite = '1';

--add.u r3, r3, r4
instruction <= OPCODE_ADD & "011" & '0' & "011" & "100" & "00";
wait until en_regwrite = '1';

--or r5, r0, r3
instruction <= OPCODE_OR & "101" & '0' & "000" & "011" & "00";
wait until en_regwrite = '1';

control_1Running this in the simulator, we get a nice output waveform. All instructions executed correctly, and ran perfectly in sync. The registers at the end of the simulation are listed in the image above, with a red line linking it to when that particular register was written. I’ve annotated the pipeline stages too. You can see that the next instruction gets assigned to the decoder input in the writeback stage, so there is a little bit of overlap in the pipeline, but that’s to be expected. It’s not real concurrent operation within the pipeline, it’s just ensuring the data is available when it’s needed.

Wrapping Up

Hopefully this has shown the importance of the control unit, and how it conducts the process of execution in this CPU. It has many more functions to manage – especially when we start implementing memory operations, and dealing with that shouldBranch ALU output – more on that later!

Thanks for reading, comments as always to @domipheus.

Designing a CPU in VHDL, Part 4: The ALU, Comparisons and Branching

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

This is a little disclaimer that the VHDL here is probably not the best you will see, but it gets the job done – in the simulator, at least. If you spot any serious errors, or woeful performance gotchas I’ve fallen for – please let me know at @domipheus.

The ALU is the next component to implement. It should take the opcode given by the decoder, along with input data read from the register file, and output a result that can then be written to the register file.

aluIn TPU, we are only doing single cycle operations. Whilst a whole instruction will take multiple cycles to complete, the ALU itself will output a result in a single clock cycle. The other cycles are due to decoding, fetching and any writebacks of results which are required.

Because of this fact, you can implement an ALU very simply by performing every operation possible and then select and output only the result we require, given the ‘aluop’ forwarded by the decoder. Another way (the way I have done) is to have a case statement and assign to an internal register the operation of choice, before then writing the contents of that register to the output ports. Underneath I think they would be implemented similarly, with a great many multiplexers.

A result isn’t the only thing the ALU should provide. Some of the operations implemented in TPU are branch/jump instructions, some conditional. So, we also need a signal stating whether or not we should take a branch. We also have our own copy of the register file write enable bit – so we can stop the result being written based on a condition.

The VHDL Module

We create a new VHDL module named ALU, with the following ports:

entity alu is Port (
    I_clk : in  STD_LOGIC;
    I_en :  in  STD_LOGIC;
    I_dataA :   in STD_LOGIC_VECTOR (15 downto 0);
    I_dataB :   in STD_LOGIC_VECTOR (15 downto 0);
    I_dataDwe : in STD_LOGIC;
    I_aluop :   in STD_LOGIC_VECTOR (4 downto 0);
    I_PC :      in STD_LOGIC_VECTOR (15 downto 0);
    I_dataIMM : in STD_LOGIC_VECTOR (15 downto 0);
    O_dataResult :   out  STD_LOGIC_VECTOR (15 downto 0);
    O_dataWriteReg : out STD_LOGIC;
    O_shouldBranch : out std_logic
  );
end alu;

Our ALU process which runs on a rising clock edge, when I_en is active, will immediately enter a case statement dependent on the alu operation forwarded from the decoder. Each when block within the case statement will write to elements of an internal register s_result, which is 18 bits wide, to accommodate carry/overflow status. There is also an internal signal for the shouldBranch output.

The following output is the ALU behavior, minus all operations except add.

architecture Behavioral of alu is
  -- The internal register for results of operations.
  -- 16 bit + carry/overflow
  signal s_result: STD_LOGIC_VECTOR(17 downto 0) := (others => '0');
  signal s_shouldBranch: STD_LOGIC := '0';
begin
  process (I_clk, I_en)
  begin
    if rising_edge(I_clk) and I_en = '1' then
      O_dataWriteReg <= I_dataDwe;
      case I_aluop(4 downto 1) is
        when OPCODE_ADD =>
          if I_aluop(0) = '0' then
            s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned( '0' & I_dataB));
          else
            s_result(16 downto 0) <= std_logic_vector(signed(I_dataA(15) & I_dataA) + signed( I_dataB(15) & I_dataB));
          end if;
          s_shouldBranch <= '0';

        -- ...Other OPCODEs go here...
        when others =>
        s_result <= "00" & X"FEFE";
      end case;
    end if;
  end process;

  O_dataResult <= s_result(15 downto 0);
  O_shouldBranch <= s_shouldBranch;

end Behavioral;

Operations

The Add operation is the first opcode. It has signed and unsigned variants, dependent on the instruction flag which is bit 0 of the aluop input. If the flag is 0, the register values are resized to 17-bit, cast to integer data and an addition performed. If the flag bit is set to indicate signed operands, sign-extension needs to be performed instead of a resize – meaning we concatenate the last bit of the register data to become the 17th bit of our operand. The Addition is performed, and result written back to 17 bits of s_result. this provides data needed for a carry bit in unsigned mode and an overflow bit in signed mode, but we do not use them yet (and overflow needs some more logic). The last thing which is done is that we set the shouldBranch signal to ‘0’, as we should not branch at this time.

The other logical operations are simple, implemented in the following fashion:

when OPCODE_OR =>
  s_result(15 downto 0) <= I_dataA or I_dataB;
  s_shouldBranch <= '0';

The LOAD operation uses the flag in bit 8 to indicate whether an 8-bit immediate value should be placed in the high or low half of the destination register:

when OPCODE_LOAD =>
  if I_aluop(0) = '0' then
    s_result(15 downto 0) <= I_dataIMM(7 downto 0) & X"00";
  else
    s_result(15 downto 0) <= X"00" & I_dataIMM(7 downto 0);
  end if;
  s_shouldBranch <= '0';

The other bits are zero’d out to form our 16-bit result.

The Compare Instruction

Most typical, older 16bit CPUs tend to set status bits in the ALU when certian events occur. On Z80 the status bits change when a large defined set of operations are executed, others have an explicit compare instruction which sets some status bits. Those bits can then be queried via special instructions which read direct from the internal status register. TPU does things a bit differently in that it has very little internal status, and writes the results of compares to a register via an explicit instruction. The register contains a series of bits which indicate the result of a boolean operation. The instruction (from the ISA document) is as follows: cmp_inst_isaThe compare instruction in TPU is of form RRR, taking a destination register and two sources. The flag bit is to do either signed or unsigned comparisons. The instruction calculates rA == rB, rA > rB, rA < rB, rA == 0 and rB ==0 and writes the rersult into specific bit locations of the output register rD. We don’t care for the other bits, so they should be zero’d.

There isn’t much reason for why it’s set up like this, apart from that I wanted to do multiple conditions at once, and get a bitmask out to do further logic on the result if need be. With so few registers, there is a fair argument that this could be a stupid path to take.

The CMP instruction has some extra constants we need to define – the bit locations for the output of those comparisons.

-- cmp output bits
constant CMP_BIT_EQ:  integer := 14;
constant CMP_BIT_AGB: integer := 13;
constant CMP_BIT_ALB: integer := 12;
constant CMP_BIT_AZ:  integer := 11;
constant CMP_BIT_BZ:  integer := 10;

I’ve created a vhdl package file, where I place all constants for TPU. You can include the constants file using the library and use statements, like

library work;
use work.tpu_constants.all;

With output bit locations defined per the ISA, we can start writing into the intermediate s_result.

when OPCODE_CMP =>
  if I_dataA = I_dataB then
    s_result(CMP_BIT_EQ) <= '1';
  else
    s_result(CMP_BIT_EQ) <= '0';
  end if;

  if I_dataA = X"0000" then
    s_result(CMP_BIT_AZ) <= '1';
  else
    s_result(CMP_BIT_AZ) <= '0';
  end if;

  if I_dataB = X"0000" then
    s_result(CMP_BIT_BZ) <= '1';
  else
    s_result(CMP_BIT_BZ) <= '0';
  end if;

These operations don’t care whether the comparisons are signed or not, so we can just do them and write out the correct bit. I tried doing a conditional move here, instead of the big chunk of annoying if/else/assign1or0 logic but it seems that’s only in a newer version of VHDL. So I’ll just keep it at this for now. The final two comparisons are the greater/less than, which do depend on sign. So we check the flag bit for signed mode before writing out the results of the comparison.

if I_aluop(0) = '0' then
  if unsigned(I_dataA) > unsigned(I_dataB) then
    s_result(CMP_BIT_AGB) <= '1';
  else
    s_result(CMP_BIT_AGB) <= '0';
  end if;
  if unsigned(I_dataA) < unsigned(I_dataB) then
    s_result(CMP_BIT_ALB) <= '1';
  else
    s_result(CMP_BIT_ALB) <= '0';
  end if;
else
  if signed(I_dataA) > signed(I_dataB) then
    s_result(CMP_BIT_AGB) <= '1';
  else
    s_result(CMP_BIT_AGB) <= '0';
  end if;
  if signed(I_dataA) < signed(I_dataB) then
    s_result(CMP_BIT_ALB) <= '1';
  else
    s_result(CMP_BIT_ALB) <= '0';
  end if;
end if;

Finally, we zero out the remaining, unused bits – and set our shouldbranch output to 0.

s_result(15) <= '0';
s_result(9 downto 0) <= "0000000000";
s_shouldBranch <= '0';

And that’s it done!

Shifts

Shifts are cumbersome. Currently, I’ve only defined logical, unsigned shifts – by a value stored in a register. That will need to change, but there are enough flag bits for that to work. Originally I used a fixed-length loop with an early out, which was synthesizable, but upon testing the result just never seemed correct. So now, I’m left with the following, which seems horrendous but works.

when OPCODE_SHL =>
  case I_dataB(3 downto 0) is
    when "0001" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 1));
    when "0010" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 2));
    when "0011" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 3));
    when "0100" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 4));
    when "0101" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 5));
    when "0110" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 6));
    when "0111" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 7));
    when "1000" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 8));
    when "1001" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 9));
    when "1010" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 10));
    when "1011" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 11));
    when "1100" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 12));
    when "1101" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 13));
    when "1110" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 14));
    when "1111" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 15));
    when others =>
      s_result(15 downto 0) <= I_dataA;
  end case;
  s_shouldBranch <= '0';

Euugh. It passes the tests. Ship it.

Everybody Jump! Jump! Jump!

Branching, Jumping, Changing the program counter. It’s something that needs to happen at some point. We have a few instructions available for this. A jump to an 8-bit immediate value, and a jump to a 16-bit register value. Those are pretty simple, and set the s_shouldBranch flag to ‘1’, and fill out the s_result with our branch target.

Conditional jumps are the same, really – but the s_shouldBranch flag is dependant on some condition. Looking at the ISA we can see the various conditions available. jumpeq_inst_isaThe C bits are quite a mess here, grouped from the signed flag and the two unused bits at the end of the instruction. I’ll probably change this, allowing for the rD selection bits to be forwarded on to the ALU in some way, possibly as a second ‘function flag’ immediate, to allow their use. That would enable some variant of jump conditionally to relative offset instructions – which are going to be needed when working in nested loops. Without relative immediate conditional jumps, there will be severe pressure on the registers, leaving very few available for actual computation.

Implementation wise, this function is pretty simple – it sets the shouldBranch output to the correct bit in the CMP input.

when OPCODE_JUMPEQ =>
  -- set branch target regardless
  s_result(15 downto 0) <= I_dataB;

   -- the condition to jump is based on aluop(0) and dataimm(1 downto 0);
  case (I_aluop(0) & I_dataIMM(1 downto 0)) is
    when CJF_EQ =>
      s_shouldBranch <= I_dataA(CMP_BIT_EQ);
    when CJF_AZ =>
      s_shouldBranch <= I_dataA(CMP_BIT_Az);
    when CJF_BZ =>
      s_shouldBranch <= I_dataA(CMP_BIT_Bz);
    when CJF_ANZ =>
      s_shouldBranch <= not I_dataA(CMP_BIT_AZ);
    when CJF_BNZ =>
      s_shouldBranch <= not I_dataA(CMP_BIT_Bz);
    when CJF_AGB =>
      s_shouldBranch <= I_dataA(CMP_BIT_AGB);
    when CJF_ALB =>
      s_shouldBranch <= I_dataA(CMP_BIT_ALB);
    when others =>
        s_shouldBranch <= '0';
  end case;

That’s all the ALU needs to do, the actual branching logic happens elsewhere, and we’ll talk about that another time.

Testing

Testing the ALU is easy, but there are so many combinations of input to output that getting good coverage would take a considerable amount of time. Testing the ALU in isolation is easy – simply set the various inputs and expect the correct output, making sure it takes only a single clock cycle.

I_dataA <= X"0001";
I_dataB <= X"0002";
I_aluop <= OPCODE_ADD & '0';
I_dataIMM <= X"F1FA";

wait for I_clk_period;
I_dataA <= X"0005";
I_dataB <= X"0003";
I_aluop <= OPCODE_SUB & '0';

wait for I_clk_period;

I_dataA <= X"FEEE";
I_dataB <= X"0000";
I_aluop <= OPCODE_CMP & '0';

wait for I_clk_period;

Above we have tests for unsigned addition, subtraction, and compare. We can simulate this, and view the data output to verify operation. A larger test is on the github repo, but again, coverage isn’t great. It’s enough to progress, we can validate later (*cough*).

test_simWrapping up

And there we have it, a simple ALU. I’ve finally got the TPU repo up on github, it’s fairly sparse just now, as the ISE project isn’t included. Only the sources which have been previously mentioned are there. The ISA is also there, although obviously incomplete.

Thanks for reading, comments as always to @domipheus, and the next part should be finally starting to look into how these various units work together.

 

Designing a CPU in VHDL, Part 3: Instruction Set Architecture, Decoder, RAM

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing!

Instruction Set Architecture

The Instruction Set Architecture (ISA) of a CPU defines the set of operations that can be performed, and on what data types. It explains timing, restrictions, and sometimes any hazards or hardware bugs that can present during normal operation. The operations are defined along with the machine language opcodes, instruction forms, register set and execution latencies. It goes into quite some detail – these documents are what compiler engineers use to implement compilers, and also what those writing operating systems need. It’s the ‘bare metal’ interface to the operation of the processor.

For our CPU, we need to define foundations of our ISA before we progress. It has direct consequences for the instruction decoder, as well as ALU and as explained in the last part, the register file.

Requirements

This is a 16 bit machine, and I want to keep things simple. To that end, we use a fixed instruction length. Each instruction will always be 16 bits long, regardless of what it does. This allows our instruction decoder to be much simpler. However, fitting all the information we need into an instruction may be difficult. In the last part, I mentioned we have a register set comprising of 8 16-bit registers, meaning we need 3 bits to address one. I’ve not decided to have any of the registers special, like r0 always being zero. All are addressable.

If you take an add operation C = A + B, to do this we will need to dedicate 9 bits of our 16 just to addressing registers – so it’s quite a lot. The core operations we want from our CPU are basic integer operations and other functions such as compare and memory manipulation.

Operation Function
Add D = A + B
Subtract D = A – B
Bitwise Or D = A or B
Bitwise Xor D = A xor B
Bitwise And D = A and B
Bitwise Not D = not A
Read D = Memory[A]
Write Memory[A] = B
Load D = 8-bit Immediate Value
Compare D = cmp(A, B)
Shift Left D = A << B
Shift Right D = A >> B
Jump/Branch PC = A Register or Immediate Value
Jump/Branch conditionally PC = Register if (condition) else nop

This list of 14 operations is probably enough for our needs at the moment. That means we’d need 4 bits for an opcode to represent these operations. However, we should look at the forms these operations take, and the inputs they require – to see what we can fit into any instruction in terms of additional information.

Instruction Forms

Forms are the patterns instructions take. The Add instruction above requires 3 registers be supplied, the destination, and two sources. However, Bitwise Not only requires two registers. The Load instruction needs a destination register, and also some kind of immediate value. The Branch/jump instruction only takes a single source register, it doesn’t write a register. The list of forms for the operations above are:

Form Overview Example Instruction
RRR OpCode rD, rA, rB Add
RRd OpCode rD, rA Not
RRs OpCode rA, rB Memory Write
R OpCode rA Branch to Register Value
RImm OpCode rD, Immediate Load
Imm OpCode Immediate Branch to Immediate Value

From the above, we can work out what our greatest instruction length needs to be. If we have any immediate value fixed to 8-bits, which seems sensible, the longest instruction form is the OpCode, rD, Immediate instruction – which adds to 4 + 3 + 8 = 15 bits. This leaves us with only a single bit out of our 16-bit instruction length. This instruction form is for the Load instruction, and interestingly, we can use that last single bit very well in this instruction – it can indicate whether you want the immediate value in the high or low 8 bits of the 16 bit register. With that instruction up to 16 bits, it pretty much fixes our OpCode to be 4 bits in length, with additional function modifier bits – like the high/low flag, being elsewhere.

Given these forms, we can try to organize our instruction bits in such a way that common operands line up. This reduces complexity in our decoder – if, for instance, the destination register is always in the same place for instructions which actually perform a register write, there is no need for any conditional logic to select different bits out of the instruction dependent on opcode. We can simply fix it in place, and the bit selection becomes static.

forms_bits_organizationYou can see from the above table that we can line things up very nicely. There are few different options you could pursue – I use the bit organization as above. For example, you could have moved the last 2 unused ‘flag’ bits in the RRR form forward, just after the first flag, and moved the rA and rB registers back in all other instructions. Op00 rD0 FFF rA0 rB0. This allows for an RRRR form, for things like multiply-add. It’s just personal preference that I thought the 2 flag bits were better at the end. Anyway, we don’t even have a multiplier, so a multiply-add is out of the question. If we wanted to add some sort of instruction which required an RRRR form, we could always take the single flag bit and concatenate the 2 end flag bits to form a register select in the decoder stage. Regardless of how the bits are micromanaged, the main thing we want is for there to be as much overlap of bit positions as possible, it makes things very easy to create the decoder. Despite flags being in multiple places we get good overlap, so we progress.

The Decoder

The decoder is super simple for this CPU. Due to sharing bit ranges of instruction forms, we can statically pull bits out from the instruction stream and forward them on to the other units we connect to: The Register file, ALU, and Control unit(s).

The Decoder requires very little: a clock, an enable bit, and the instruction it’s to decode. It’s output is all of the selection lines for the register file (rD, rA and rB), the alu operation (alu_op) to be performed, any immediate value from the instruction, and whether the instruction performs a register write. The alu_op can include flags from the instruction, as well as the opcode, as these are there to change ALU behavior.

decoderWe can get straight into creating some VHDL for the decoder now. We know the widths of the signals we need, and use the methods in the last part to create our module with boilerplate automatically generated. For many of the outputs, we can simply use some assignments of input ranges of the instruction to outputs.

O_selA <= I_dataInst(7 downto 5);
O_selB <= I_dataInst(4 downto 2);
O_selD <= I_dataInst(11 downto 9);
O_dataIMM <= I_dataInst(7 downto 0) & I_dataInst(7 downto 0);
O_aluop <= I_dataInst(15 downto 12) & I_dataInst(8);

You can see some interesting points from that block of assignments. The immediate value is concatenated with itself to form a 16 bit output which the ALU will use (possibly masking some of that out). The ALU operation output O_aluop includes the 4 bit opcode, but also the first flag – as it’s in every instruction form. For the register write enable output we need to assign opcodes to operations. For this I simply assigned numbers from 0-13 to the list of operations above, nothing special. We get the following table, with the register write enable given it’s own column:

OpCode Operation Form WritesRegister? Comments
0000 ADD RRR Yes
0001 SUB RRR Yes
0010 OR RRR Yes
0011 XOR RRR Yes
0100 AND RRR Yes
0101 NOT RRd Yes
0110 READ RRd Yes
0111 WRITE RRs No
1000 LOAD RImm Yes Flag bit indicates high or low load
1001 CMP RRR Yes Flag bit indicates comparison signedness
1010 SHL RRR Yes
1011 SHR RRR Yes
1100 JUMP R, Imm No Flag bit indicates a jump to register or jump to immediate
1101 JUMPEQ RRs No
1110 RESERVED
1111 RESERVED

I’ve added comments about special forms or uses of the flag to switch form. Some of these instructions require additional information, for example, load. With only 16 different opcodes available, there is no point using one of them for differentiating functionality. We will have a single compare instruction, but the flag will define signedness of the operands. Similarly, the flag bit is used in the jump instruction to say whether it’s jump to a register or jump to an immediate. I could have moved the flag bit into the opcode, and made it 5 bits, but I kept my original design for the rest of the CPU implementation. I’ll leave any possible optimizations like that to version 2 of the cpu!

With op-codes and bit locations defined, in the decoder process we can define a statement which assigns ‘0’ to the register write enable output if the opcode is Write, Jump or JumpEq – and ‘1’ otherwise. This can be done in various ways, I did it originally in a case statement, making our decoder module and process the following:

entity decode is
    Port ( I_clk : in  STD_LOGIC;
           I_dataInst : in  STD_LOGIC_VECTOR (15 downto 0);
           I_en : in  STD_LOGIC;
           O_selA : out  STD_LOGIC_VECTOR (2 downto 0);
           O_selB : out  STD_LOGIC_VECTOR (2 downto 0);
           O_selD : out  STD_LOGIC_VECTOR (2 downto 0);
           O_dataIMM : out  STD_LOGIC_VECTOR (15 downto 0);
           O_regDwe : out  STD_LOGIC;
           O_aluop : out  STD_LOGIC_VECTOR (4 downto 0));
end decode;

architecture Behavioral of decode is

begin

  process (I_clk)
  begin
    if rising_edge(I_clk) and I_en = '1' then

      O_selA <= I_dataInst(7 downto 5);
      O_selB <= I_dataInst(4 downto 2);
      O_selD <= I_dataInst(11 downto 9);
      O_dataIMM <= I_dataInst(7 downto 0) & I_dataInst(7 downto 0);
      O_aluop <= I_dataInst(15 downto 12) & I_dataInst(8);

      case I_dataInst(15 downto 12) is
        when "0111" => 	-- WRITE
          O_regDwe <= '0';
        when "1100" => 	-- JUMP
          O_regDwe <= '0';
        when "1101" => 	-- JUMPEQ
          O_regDwe <= '0';
        when others =>
          O_regDwe <= '1';
      end case;
    end if;
  end process;

end Behavioral;

The opcodes and instruction bit locations should be placed in constants so we can change them easily later. Despite the writes to O_selA, O_aluop, etc, being non-conditional, I still placed them in the clock process. This is so I can see the output latch in the simulator easily when the pipeline transition occurs (more on that in a later post).

That is actually the decoder done. I did say it was super simple!

Additional ISA detail

We have our instruction opcodes, a preliminary instruction set, register set, and instruction forms. We also define (again, for simplicity) that ram is 16 bit addressable. I.e, each address holds 2 bytes, and we will always read/write 16 bits at a time from memory. We can change to byte addressing fairly easily later on if need be, but I’ll keep the restriction to only read/write 16 bits at a time. We’ll keep expanding more on the ISA as we move onto the ALU and discuss implementing the various operations – for instance, we can’t talk about latencies and hazards just yet.

RAM

The RAM, for now, is similar to the register set that was implemented before.

entity ram16 is
    Port ( I_clk : in  STD_LOGIC;
           I_we : in  STD_LOGIC;
           I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
           I_data : in  STD_LOGIC_VECTOR (15 downto 0);
           O_data : out  STD_LOGIC_VECTOR (15 downto 0));
end ram16;

architecture Behavioral of ram16 is
	type store_t is array (0 to 31) of std_logic_vector(15 downto 0);
   signal ram_16: store_t := (others => X"0000");
begin

	process (I_clk)
	begin
		if rising_edge(I_clk) then
			if (I_we = '1') then
				ram_16(to_integer(unsigned(I_addr(5 downto 0)))) <= I_data;
			else
				O_data <= ram_16(to_integer(unsigned(I_addr(5 downto 0))));
			end if;
		end if;
	end process;

end Behavioral;

The RAM as implemented above is 64 bytes, in 32 addressable locations. This can be changed easily by editing the store_t type array bounds, and also the read/write statements in the process – making sure the I_addr bits we need are used. At the moment, the memory space will wrap with locations greater than 31. This simple ram definition gets us very far whilst the rest of the CPU is designed and built. Eventually, I want this CPU using the DRAM on the miniSpartan6+ board itself. A note of caution – start off with a small ram if using the above code. It can take a very long time to build/simulate otherwise.

Wrapping up

That’s about enough for this part. It’s worth noting that currently as I write this (with the CPU actually complete – using this fake ram), I’ve got instruction mnemonics and an assembler written in c# which I use for test case generations using the above instruction forms and opcodes. A good few more parts are needed before we get to talk about that, though!

Thanks for reading, comments as always to @domipheus, and the next part should be looking into the ALU. Update: next part available now.