Designing a CPU in VHDL, Part 4: The ALU, Comparisons and Branching

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

This is a little disclaimer that the VHDL here is probably not the best you will see, but it gets the job done – in the simulator, at least. If you spot any serious errors, or woeful performance gotchas I’ve fallen for – please let me know at @domipheus.

The ALU is the next component to implement. It should take the opcode given by the decoder, along with input data read from the register file, and output a result that can then be written to the register file.

aluIn TPU, we are only doing single cycle operations. Whilst a whole instruction will take multiple cycles to complete, the ALU itself will output a result in a single clock cycle. The other cycles are due to decoding, fetching and any writebacks of results which are required.

Because of this fact, you can implement an ALU very simply by performing every operation possible and then select and output only the result we require, given the ‘aluop’ forwarded by the decoder. Another way (the way I have done) is to have a case statement and assign to an internal register the operation of choice, before then writing the contents of that register to the output ports. Underneath I think they would be implemented similarly, with a great many multiplexers.

A result isn’t the only thing the ALU should provide. Some of the operations implemented in TPU are branch/jump instructions, some conditional. So, we also need a signal stating whether or not we should take a branch. We also have our own copy of the register file write enable bit – so we can stop the result being written based on a condition.

The VHDL Module

We create a new VHDL module named ALU, with the following ports:

entity alu is Port (
    I_clk : in  STD_LOGIC;
    I_en :  in  STD_LOGIC;
    I_dataA :   in STD_LOGIC_VECTOR (15 downto 0);
    I_dataB :   in STD_LOGIC_VECTOR (15 downto 0);
    I_dataDwe : in STD_LOGIC;
    I_aluop :   in STD_LOGIC_VECTOR (4 downto 0);
    I_PC :      in STD_LOGIC_VECTOR (15 downto 0);
    I_dataIMM : in STD_LOGIC_VECTOR (15 downto 0);
    O_dataResult :   out  STD_LOGIC_VECTOR (15 downto 0);
    O_dataWriteReg : out STD_LOGIC;
    O_shouldBranch : out std_logic
end alu;

Our ALU process which runs on a rising clock edge, when I_en is active, will immediately enter a case statement dependent on the alu operation forwarded from the decoder. Each when block within the case statement will write to elements of an internal register s_result, which is 18 bits wide, to accommodate carry/overflow status. There is also an internal signal for the shouldBranch output.

The following output is the ALU behavior, minus all operations except add.

architecture Behavioral of alu is
  -- The internal register for results of operations.
  -- 16 bit + carry/overflow
  signal s_result: STD_LOGIC_VECTOR(17 downto 0) := (others => '0');
  signal s_shouldBranch: STD_LOGIC := '0';
  process (I_clk, I_en)
    if rising_edge(I_clk) and I_en = '1' then
      O_dataWriteReg <= I_dataDwe;
      case I_aluop(4 downto 1) is
        when OPCODE_ADD =>
          if I_aluop(0) = '0' then
            s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned( '0' & I_dataB));
            s_result(16 downto 0) <= std_logic_vector(signed(I_dataA(15) & I_dataA) + signed( I_dataB(15) & I_dataB));
          end if;
          s_shouldBranch <= '0';

        -- ...Other OPCODEs go here...
        when others =>
        s_result <= "00" & X"FEFE";
      end case;
    end if;
  end process;

  O_dataResult <= s_result(15 downto 0);
  O_shouldBranch <= s_shouldBranch;

end Behavioral;


The Add operation is the first opcode. It has signed and unsigned variants, dependent on the instruction flag which is bit 0 of the aluop input. If the flag is 0, the register values are resized to 17-bit, cast to integer data and an addition performed. If the flag bit is set to indicate signed operands, sign-extension needs to be performed instead of a resize – meaning we concatenate the last bit of the register data to become the 17th bit of our operand. The Addition is performed, and result written back to 17 bits of s_result. this provides data needed for a carry bit in unsigned mode and an overflow bit in signed mode, but we do not use them yet (and overflow needs some more logic). The last thing which is done is that we set the shouldBranch signal to ‘0’, as we should not branch at this time.

The other logical operations are simple, implemented in the following fashion:

when OPCODE_OR =>
  s_result(15 downto 0) <= I_dataA or I_dataB;
  s_shouldBranch <= '0';

The LOAD operation uses the flag in bit 8 to indicate whether an 8-bit immediate value should be placed in the high or low half of the destination register:

  if I_aluop(0) = '0' then
    s_result(15 downto 0) <= I_dataIMM(7 downto 0) & X"00";
    s_result(15 downto 0) <= X"00" & I_dataIMM(7 downto 0);
  end if;
  s_shouldBranch <= '0';

The other bits are zero’d out to form our 16-bit result.

The Compare Instruction

Most typical, older 16bit CPUs tend to set status bits in the ALU when certian events occur. On Z80 the status bits change when a large defined set of operations are executed, others have an explicit compare instruction which sets some status bits. Those bits can then be queried via special instructions which read direct from the internal status register. TPU does things a bit differently in that it has very little internal status, and writes the results of compares to a register via an explicit instruction. The register contains a series of bits which indicate the result of a boolean operation. The instruction (from the ISA document) is as follows: cmp_inst_isaThe compare instruction in TPU is of form RRR, taking a destination register and two sources. The flag bit is to do either signed or unsigned comparisons. The instruction calculates rA == rB, rA > rB, rA < rB, rA == 0 and rB ==0 and writes the rersult into specific bit locations of the output register rD. We don’t care for the other bits, so they should be zero’d.

There isn’t much reason for why it’s set up like this, apart from that I wanted to do multiple conditions at once, and get a bitmask out to do further logic on the result if need be. With so few registers, there is a fair argument that this could be a stupid path to take.

The CMP instruction has some extra constants we need to define – the bit locations for the output of those comparisons.

-- cmp output bits
constant CMP_BIT_EQ:  integer := 14;
constant CMP_BIT_AGB: integer := 13;
constant CMP_BIT_ALB: integer := 12;
constant CMP_BIT_AZ:  integer := 11;
constant CMP_BIT_BZ:  integer := 10;

I’ve created a vhdl package file, where I place all constants for TPU. You can include the constants file using the library and use statements, like

library work;
use work.tpu_constants.all;

With output bit locations defined per the ISA, we can start writing into the intermediate s_result.

when OPCODE_CMP =>
  if I_dataA = I_dataB then
    s_result(CMP_BIT_EQ) <= '1';
    s_result(CMP_BIT_EQ) <= '0';
  end if;

  if I_dataA = X"0000" then
    s_result(CMP_BIT_AZ) <= '1';
    s_result(CMP_BIT_AZ) <= '0';
  end if;

  if I_dataB = X"0000" then
    s_result(CMP_BIT_BZ) <= '1';
    s_result(CMP_BIT_BZ) <= '0';
  end if;

These operations don’t care whether the comparisons are signed or not, so we can just do them and write out the correct bit. I tried doing a conditional move here, instead of the big chunk of annoying if/else/assign1or0 logic but it seems that’s only in a newer version of VHDL. So I’ll just keep it at this for now. The final two comparisons are the greater/less than, which do depend on sign. So we check the flag bit for signed mode before writing out the results of the comparison.

if I_aluop(0) = '0' then
  if unsigned(I_dataA) > unsigned(I_dataB) then
    s_result(CMP_BIT_AGB) <= '1';
    s_result(CMP_BIT_AGB) <= '0';
  end if;
  if unsigned(I_dataA) < unsigned(I_dataB) then
    s_result(CMP_BIT_ALB) <= '1';
    s_result(CMP_BIT_ALB) <= '0';
  end if;
  if signed(I_dataA) > signed(I_dataB) then
    s_result(CMP_BIT_AGB) <= '1';
    s_result(CMP_BIT_AGB) <= '0';
  end if;
  if signed(I_dataA) < signed(I_dataB) then
    s_result(CMP_BIT_ALB) <= '1';
    s_result(CMP_BIT_ALB) <= '0';
  end if;
end if;

Finally, we zero out the remaining, unused bits – and set our shouldbranch output to 0.

s_result(15) <= '0';
s_result(9 downto 0) <= "0000000000";
s_shouldBranch <= '0';

And that’s it done!


Shifts are cumbersome. Currently, I’ve only defined logical, unsigned shifts – by a value stored in a register. That will need to change, but there are enough flag bits for that to work. Originally I used a fixed-length loop with an early out, which was synthesizable, but upon testing the result just never seemed correct. So now, I’m left with the following, which seems horrendous but works.

when OPCODE_SHL =>
  case I_dataB(3 downto 0) is
    when "0001" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 1));
    when "0010" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 2));
    when "0011" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 3));
    when "0100" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 4));
    when "0101" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 5));
    when "0110" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 6));
    when "0111" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 7));
    when "1000" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 8));
    when "1001" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 9));
    when "1010" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 10));
    when "1011" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 11));
    when "1100" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 12));
    when "1101" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 13));
    when "1110" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 14));
    when "1111" =>
      s_result(15 downto 0) <= std_logic_vector(shift_left(unsigned(I_dataA), 15));
    when others =>
      s_result(15 downto 0) <= I_dataA;
  end case;
  s_shouldBranch <= '0';

Euugh. It passes the tests. Ship it.

Everybody Jump! Jump! Jump!

Branching, Jumping, Changing the program counter. It’s something that needs to happen at some point. We have a few instructions available for this. A jump to an 8-bit immediate value, and a jump to a 16-bit register value. Those are pretty simple, and set the s_shouldBranch flag to ‘1’, and fill out the s_result with our branch target.

Conditional jumps are the same, really – but the s_shouldBranch flag is dependant on some condition. Looking at the ISA we can see the various conditions available. jumpeq_inst_isaThe C bits are quite a mess here, grouped from the signed flag and the two unused bits at the end of the instruction. I’ll probably change this, allowing for the rD selection bits to be forwarded on to the ALU in some way, possibly as a second ‘function flag’ immediate, to allow their use. That would enable some variant of jump conditionally to relative offset instructions – which are going to be needed when working in nested loops. Without relative immediate conditional jumps, there will be severe pressure on the registers, leaving very few available for actual computation.

Implementation wise, this function is pretty simple – it sets the shouldBranch output to the correct bit in the CMP input.

  -- set branch target regardless
  s_result(15 downto 0) <= I_dataB;

   -- the condition to jump is based on aluop(0) and dataimm(1 downto 0);
  case (I_aluop(0) & I_dataIMM(1 downto 0)) is
    when CJF_EQ =>
      s_shouldBranch <= I_dataA(CMP_BIT_EQ);
    when CJF_AZ =>
      s_shouldBranch <= I_dataA(CMP_BIT_Az);
    when CJF_BZ =>
      s_shouldBranch <= I_dataA(CMP_BIT_Bz);
    when CJF_ANZ =>
      s_shouldBranch <= not I_dataA(CMP_BIT_AZ);
    when CJF_BNZ =>
      s_shouldBranch <= not I_dataA(CMP_BIT_Bz);
    when CJF_AGB =>
      s_shouldBranch <= I_dataA(CMP_BIT_AGB);
    when CJF_ALB =>
      s_shouldBranch <= I_dataA(CMP_BIT_ALB);
    when others =>
        s_shouldBranch <= '0';
  end case;

That’s all the ALU needs to do, the actual branching logic happens elsewhere, and we’ll talk about that another time.


Testing the ALU is easy, but there are so many combinations of input to output that getting good coverage would take a considerable amount of time. Testing the ALU in isolation is easy – simply set the various inputs and expect the correct output, making sure it takes only a single clock cycle.

I_dataA <= X"0001";
I_dataB <= X"0002";
I_aluop <= OPCODE_ADD & '0';
I_dataIMM <= X"F1FA";

wait for I_clk_period;
I_dataA <= X"0005";
I_dataB <= X"0003";
I_aluop <= OPCODE_SUB & '0';

wait for I_clk_period;

I_dataA <= X"FEEE";
I_dataB <= X"0000";
I_aluop <= OPCODE_CMP & '0';

wait for I_clk_period;

Above we have tests for unsigned addition, subtraction, and compare. We can simulate this, and view the data output to verify operation. A larger test is on the github repo, but again, coverage isn’t great. It’s enough to progress, we can validate later (*cough*).

test_simWrapping up

And there we have it, a simple ALU. I’ve finally got the TPU repo up on github, it’s fairly sparse just now, as the ISE project isn’t included. Only the sources which have been previously mentioned are there. The ISA is also there, although obviously incomplete.

Thanks for reading, comments as always to @domipheus, and the next part should be finally starting to look into how these various units work together.


Comments are closed.