This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing!
Instruction Set Architecture
The Instruction Set Architecture (ISA) of a CPU defines the set of operations that can be performed, and on what data types. It explains timing, restrictions, and sometimes any hazards or hardware bugs that can present during normal operation. The operations are defined along with the machine language opcodes, instruction forms, register set and execution latencies. It goes into quite some detail – these documents are what compiler engineers use to implement compilers, and also what those writing operating systems need. It’s the ‘bare metal’ interface to the operation of the processor.
For our CPU, we need to define foundations of our ISA before we progress. It has direct consequences for the instruction decoder, as well as ALU and as explained in the last part, the register file.
This is a 16 bit machine, and I want to keep things simple. To that end, we use a fixed instruction length. Each instruction will always be 16 bits long, regardless of what it does. This allows our instruction decoder to be much simpler. However, fitting all the information we need into an instruction may be difficult. In the last part, I mentioned we have a register set comprising of 8 16-bit registers, meaning we need 3 bits to address one. I’ve not decided to have any of the registers special, like r0 always being zero. All are addressable.
If you take an add operation C = A + B, to do this we will need to dedicate 9 bits of our 16 just to addressing registers – so it’s quite a lot. The core operations we want from our CPU are basic integer operations and other functions such as compare and memory manipulation.
|Add||D = A + B|
|Subtract||D = A – B|
|Bitwise Or||D = A or B|
|Bitwise Xor||D = A xor B|
|Bitwise And||D = A and B|
|Bitwise Not||D = not A|
|Read||D = Memory[A]|
|Write||Memory[A] = B|
|Load||D = 8-bit Immediate Value|
|Compare||D = cmp(A, B)|
|Shift Left||D = A << B|
|Shift Right||D = A >> B|
|Jump/Branch||PC = A Register or Immediate Value|
|Jump/Branch conditionally||PC = Register if (condition) else nop|
This list of 14 operations is probably enough for our needs at the moment. That means we’d need 4 bits for an opcode to represent these operations. However, we should look at the forms these operations take, and the inputs they require – to see what we can fit into any instruction in terms of additional information.
Forms are the patterns instructions take. The Add instruction above requires 3 registers be supplied, the destination, and two sources. However, Bitwise Not only requires two registers. The Load instruction needs a destination register, and also some kind of immediate value. The Branch/jump instruction only takes a single source register, it doesn’t write a register. The list of forms for the operations above are:
|RRR||OpCode rD, rA, rB||Add|
|RRd||OpCode rD, rA||Not|
|RRs||OpCode rA, rB||Memory Write|
|R||OpCode rA||Branch to Register Value|
|RImm||OpCode rD, Immediate||Load|
|Imm||OpCode Immediate||Branch to Immediate Value|
From the above, we can work out what our greatest instruction length needs to be. If we have any immediate value fixed to 8-bits, which seems sensible, the longest instruction form is the OpCode, rD, Immediate instruction – which adds to 4 + 3 + 8 = 15 bits. This leaves us with only a single bit out of our 16-bit instruction length. This instruction form is for the Load instruction, and interestingly, we can use that last single bit very well in this instruction – it can indicate whether you want the immediate value in the high or low 8 bits of the 16 bit register. With that instruction up to 16 bits, it pretty much fixes our OpCode to be 4 bits in length, with additional function modifier bits – like the high/low flag, being elsewhere.
Given these forms, we can try to organize our instruction bits in such a way that common operands line up. This reduces complexity in our decoder – if, for instance, the destination register is always in the same place for instructions which actually perform a register write, there is no need for any conditional logic to select different bits out of the instruction dependent on opcode. We can simply fix it in place, and the bit selection becomes static.
You can see from the above table that we can line things up very nicely. There are few different options you could pursue – I use the bit organization as above. For example, you could have moved the last 2 unused ‘flag’ bits in the RRR form forward, just after the first flag, and moved the rA and rB registers back in all other instructions. Op00 rD0 FFF rA0 rB0. This allows for an RRRR form, for things like multiply-add. It’s just personal preference that I thought the 2 flag bits were better at the end. Anyway, we don’t even have a multiplier, so a multiply-add is out of the question. If we wanted to add some sort of instruction which required an RRRR form, we could always take the single flag bit and concatenate the 2 end flag bits to form a register select in the decoder stage. Regardless of how the bits are micromanaged, the main thing we want is for there to be as much overlap of bit positions as possible, it makes things very easy to create the decoder. Despite flags being in multiple places we get good overlap, so we progress.
The decoder is super simple for this CPU. Due to sharing bit ranges of instruction forms, we can statically pull bits out from the instruction stream and forward them on to the other units we connect to: The Register file, ALU, and Control unit(s).
The Decoder requires very little: a clock, an enable bit, and the instruction it’s to decode. It’s output is all of the selection lines for the register file (rD, rA and rB), the alu operation (alu_op) to be performed, any immediate value from the instruction, and whether the instruction performs a register write. The alu_op can include flags from the instruction, as well as the opcode, as these are there to change ALU behavior.
We can get straight into creating some VHDL for the decoder now. We know the widths of the signals we need, and use the methods in the last part to create our module with boilerplate automatically generated. For many of the outputs, we can simply use some assignments of input ranges of the instruction to outputs.
O_selA <= I_dataInst(7 downto 5); O_selB <= I_dataInst(4 downto 2); O_selD <= I_dataInst(11 downto 9); O_dataIMM <= I_dataInst(7 downto 0) & I_dataInst(7 downto 0); O_aluop <= I_dataInst(15 downto 12) & I_dataInst(8);
You can see some interesting points from that block of assignments. The immediate value is concatenated with itself to form a 16 bit output which the ALU will use (possibly masking some of that out). The ALU operation output O_aluop includes the 4 bit opcode, but also the first flag – as it’s in every instruction form. For the register write enable output we need to assign opcodes to operations. For this I simply assigned numbers from 0-13 to the list of operations above, nothing special. We get the following table, with the register write enable given it’s own column:
|1000||LOAD||RImm||Yes||Flag bit indicates high or low load|
|1001||CMP||RRR||Yes||Flag bit indicates comparison signedness|
|1100||JUMP||R, Imm||No||Flag bit indicates a jump to register or jump to immediate|
I’ve added comments about special forms or uses of the flag to switch form. Some of these instructions require additional information, for example, load. With only 16 different opcodes available, there is no point using one of them for differentiating functionality. We will have a single compare instruction, but the flag will define signedness of the operands. Similarly, the flag bit is used in the jump instruction to say whether it’s jump to a register or jump to an immediate. I could have moved the flag bit into the opcode, and made it 5 bits, but I kept my original design for the rest of the CPU implementation. I’ll leave any possible optimizations like that to version 2 of the cpu!
With op-codes and bit locations defined, in the decoder process we can define a statement which assigns ‘0’ to the register write enable output if the opcode is Write, Jump or JumpEq – and ‘1’ otherwise. This can be done in various ways, I did it originally in a case statement, making our decoder module and process the following:
entity decode is Port ( I_clk : in STD_LOGIC; I_dataInst : in STD_LOGIC_VECTOR (15 downto 0); I_en : in STD_LOGIC; O_selA : out STD_LOGIC_VECTOR (2 downto 0); O_selB : out STD_LOGIC_VECTOR (2 downto 0); O_selD : out STD_LOGIC_VECTOR (2 downto 0); O_dataIMM : out STD_LOGIC_VECTOR (15 downto 0); O_regDwe : out STD_LOGIC; O_aluop : out STD_LOGIC_VECTOR (4 downto 0)); end decode; architecture Behavioral of decode is begin process (I_clk) begin if rising_edge(I_clk) and I_en = '1' then O_selA <= I_dataInst(7 downto 5); O_selB <= I_dataInst(4 downto 2); O_selD <= I_dataInst(11 downto 9); O_dataIMM <= I_dataInst(7 downto 0) & I_dataInst(7 downto 0); O_aluop <= I_dataInst(15 downto 12) & I_dataInst(8); case I_dataInst(15 downto 12) is when "0111" => -- WRITE O_regDwe <= '0'; when "1100" => -- JUMP O_regDwe <= '0'; when "1101" => -- JUMPEQ O_regDwe <= '0'; when others => O_regDwe <= '1'; end case; end if; end process; end Behavioral;
The opcodes and instruction bit locations should be placed in constants so we can change them easily later. Despite the writes to O_selA, O_aluop, etc, being non-conditional, I still placed them in the clock process. This is so I can see the output latch in the simulator easily when the pipeline transition occurs (more on that in a later post).
That is actually the decoder done. I did say it was super simple!
Additional ISA detail
We have our instruction opcodes, a preliminary instruction set, register set, and instruction forms. We also define (again, for simplicity) that ram is 16 bit addressable. I.e, each address holds 2 bytes, and we will always read/write 16 bits at a time from memory. We can change to byte addressing fairly easily later on if need be, but I’ll keep the restriction to only read/write 16 bits at a time. We’ll keep expanding more on the ISA as we move onto the ALU and discuss implementing the various operations – for instance, we can’t talk about latencies and hazards just yet.
The RAM, for now, is similar to the register set that was implemented before.
entity ram16 is Port ( I_clk : in STD_LOGIC; I_we : in STD_LOGIC; I_addr : in STD_LOGIC_VECTOR (15 downto 0); I_data : in STD_LOGIC_VECTOR (15 downto 0); O_data : out STD_LOGIC_VECTOR (15 downto 0)); end ram16; architecture Behavioral of ram16 is type store_t is array (0 to 31) of std_logic_vector(15 downto 0); signal ram_16: store_t := (others => X"0000"); begin process (I_clk) begin if rising_edge(I_clk) then if (I_we = '1') then ram_16(to_integer(unsigned(I_addr(5 downto 0)))) <= I_data; else O_data <= ram_16(to_integer(unsigned(I_addr(5 downto 0)))); end if; end if; end process; end Behavioral;
The RAM as implemented above is 64 bytes, in 32 addressable locations. This can be changed easily by editing the store_t type array bounds, and also the read/write statements in the process – making sure the I_addr bits we need are used. At the moment, the memory space will wrap with locations greater than 31. This simple ram definition gets us very far whilst the rest of the CPU is designed and built. Eventually, I want this CPU using the DRAM on the miniSpartan6+ board itself. A note of caution – start off with a small ram if using the above code. It can take a very long time to build/simulate otherwise.
That’s about enough for this part. It’s worth noting that currently as I write this (with the CPU actually complete – using this fake ram), I’ve got instruction mnemonics and an assembler written in c# which I use for test case generations using the above instruction forms and opcodes. A good few more parts are needed before we get to talk about that, though!