Designing a CPU in VHDL, Part 3: Instruction Set Architecture, Decoder, RAM

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing!

Instruction Set Architecture

The Instruction Set Architecture (ISA) of a CPU defines the set of operations that can be performed, and on what data types. It explains timing, restrictions, and sometimes any hazards or hardware bugs that can present during normal operation. The operations are defined along with the machine language opcodes, instruction forms, register set and execution latencies. It goes into quite some detail – these documents are what compiler engineers use to implement compilers, and also what those writing operating systems need. It’s the ‘bare metal’ interface to the operation of the processor.

For our CPU, we need to define the foundations of our ISA before we progress. It has direct consequences for the instruction decoder, as well as the ALU and, as explained in the last part, the register file.

Requirements

This is a 16-bit machine, and I want to keep things simple. To that end, we use a fixed instruction length. Each instruction will always be 16 bits long, regardless of what it does. This allows our instruction decoder to be much simpler. However, fitting all the information we need into an instruction may be difficult. In the last part, I mentioned we have a register set comprising eight 16-bit registers, meaning we need 3 bits to address one. I’ve decided not to make any of the registers special, like r0 always being zero. All are addressable.

If you take an add operation C = A + B, to do this we will need to dedicate 9 bits of our 16 just to addressing registers – so it’s quite a lot. The core operations we want from our CPU are basic integer operations and other functions such as compare and memory manipulation.

Operation                   Function
Add                         D = A + B
Subtract                    D = A - B
Bitwise Or                  D = A or B
Bitwise Xor                 D = A xor B
Bitwise And                 D = A and B
Bitwise Not                 D = not A
Read                        D = Memory[A]
Write                       Memory[A] = B
Load                        D = 8-bit Immediate Value
Compare                     D = cmp(A, B)
Shift Left                  D = A << B
Shift Right                 D = A >> B
Jump/Branch                 PC = A Register or Immediate Value
Jump/Branch conditionally   PC = Register if (condition) else nop

This list of 14 operations is probably enough for our needs at the moment. That means we’d need 4 bits for an opcode to represent these operations. However, we should look at the forms these operations take, and the inputs they require – to see what we can fit into any instruction in terms of additional information.

Instruction Forms

Forms are the patterns instructions take. The Add instruction above requires 3 registers to be supplied: the destination and two sources. However, Bitwise Not only requires two registers. The Load instruction needs a destination register, and also some kind of immediate value. The Branch/Jump instruction only takes a single source register; it doesn’t write a register. The list of forms for the operations above is:

Form   Overview               Example Instruction
RRR    OpCode rD, rA, rB      Add
RRd    OpCode rD, rA          Not
RRs    OpCode rA, rB          Memory Write
R      OpCode rA              Branch to Register Value
RImm   OpCode rD, Immediate   Load
Imm    OpCode Immediate       Branch to Immediate Value

From the above, we can work out what our greatest instruction length needs to be. If we have any immediate value fixed to 8-bits, which seems sensible, the longest instruction form is the OpCode, rD, Immediate instruction – which adds to 4 + 3 + 8 = 15 bits. This leaves us with only a single bit out of our 16-bit instruction length. This instruction form is for the Load instruction, and interestingly, we can use that last single bit very well in this instruction – it can indicate whether you want the immediate value in the high or low 8 bits of the 16 bit register. With that instruction up to 16 bits, it pretty much fixes our OpCode to be 4 bits in length, with additional function modifier bits – like the high/low flag, being elsewhere.

Given these forms, we can try to organize our instruction bits in such a way that common operands line up. This reduces complexity in our decoder – if, for instance, the destination register is always in the same place for instructions which actually perform a register write, there is no need for any conditional logic to select different bits out of the instruction dependent on opcode. We can simply fix it in place, and the bit selection becomes static.

[Image: forms_bits_organization]

You can see from the above table that we can line things up very nicely. There are a few different options you could pursue; I use the bit organization as above. For example, you could have moved the last 2 unused ‘flag’ bits in the RRR form forward, just after the first flag, and moved the rA and rB registers back in all other instructions, giving a layout of Op00 rD0 FFF rA0 rB0. This allows for an RRRR form, for things like multiply-add. It’s just personal preference that I thought the 2 flag bits were better at the end. Anyway, we don’t even have a multiplier, so a multiply-add is out of the question. If we wanted to add some sort of instruction which required an RRRR form, we could always take the single flag bit and concatenate the 2 end flag bits to form a register select in the decoder stage. Regardless of how the bits are micromanaged, the main thing we want is for there to be as much overlap of bit positions as possible, because that makes the decoder very easy to create. Despite flags being in multiple places we get good overlap, so we progress.

The Decoder

The decoder is super simple for this CPU. Due to sharing bit ranges of instruction forms, we can statically pull bits out from the instruction stream and forward them on to the other units we connect to: The Register file, ALU, and Control unit(s).

The Decoder requires very little: a clock, an enable bit, and the instruction it’s to decode. Its output is all of the selection lines for the register file (rD, rA and rB), the ALU operation (alu_op) to be performed, any immediate value from the instruction, and whether the instruction performs a register write. The alu_op can include flags from the instruction, as well as the opcode, as these are there to change ALU behavior.

[Image: decoder]

We can get straight into creating some VHDL for the decoder now. We know the widths of the signals we need, and use the methods in the last part to create our module with boilerplate automatically generated. For many of the outputs, we can simply use some assignments of input ranges of the instruction to outputs.

O_selA <= I_dataInst(7 downto 5);
O_selB <= I_dataInst(4 downto 2);
O_selD <= I_dataInst(11 downto 9);
O_dataIMM <= I_dataInst(7 downto 0) & I_dataInst(7 downto 0);
O_aluop <= I_dataInst(15 downto 12) & I_dataInst(8);

You can see some interesting points from that block of assignments. The immediate value is concatenated with itself to form a 16-bit output which the ALU will use (possibly masking some of that out). The ALU operation output O_aluop includes the 4-bit opcode, but also the first flag, as it’s in every instruction form. For the register write enable output we need to assign opcodes to operations. For this I simply assigned numbers from 0 to 13 to the list of operations above, nothing special. We get the following table, with the register write enable given its own column:

OpCode   Operation   Form     Writes Register?   Comments
0000     ADD         RRR      Yes
0001     SUB         RRR      Yes
0010     OR          RRR      Yes
0011     XOR         RRR      Yes
0100     AND         RRR      Yes
0101     NOT         RRd      Yes
0110     READ        RRd      Yes
0111     WRITE       RRs      No
1000     LOAD        RImm     Yes                Flag bit indicates high or low load
1001     CMP         RRR      Yes                Flag bit indicates comparison signedness
1010     SHL         RRR      Yes
1011     SHR         RRR      Yes
1100     JUMP        R, Imm   No                 Flag bit indicates a jump to register or jump to immediate
1101     JUMPEQ      RRs      No
1110     RESERVED
1111     RESERVED

I’ve added comments about special forms or uses of the flag to switch form. Some of these instructions require additional information, for example, load. With only 16 different opcodes available, there is no point using one of them for differentiating functionality. We will have a single compare instruction, but the flag will define signedness of the operands. Similarly, the flag bit is used in the jump instruction to say whether it’s jump to a register or jump to an immediate. I could have moved the flag bit into the opcode, and made it 5 bits, but I kept my original design for the rest of the CPU implementation. I’ll leave any possible optimizations like that to version 2 of the cpu!

With opcodes and bit locations defined, in the decoder process we can define a statement which assigns ‘0’ to the register write enable output if the opcode is Write, Jump or JumpEq, and ‘1’ otherwise. This can be done in various ways; I did it originally in a case statement, making our decoder module and process the following:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity decode is
    Port ( I_clk : in  STD_LOGIC;
           I_dataInst : in  STD_LOGIC_VECTOR (15 downto 0);
           I_en : in  STD_LOGIC;
           O_selA : out  STD_LOGIC_VECTOR (2 downto 0);
           O_selB : out  STD_LOGIC_VECTOR (2 downto 0);
           O_selD : out  STD_LOGIC_VECTOR (2 downto 0);
           O_dataIMM : out  STD_LOGIC_VECTOR (15 downto 0);
           O_regDwe : out  STD_LOGIC;
           O_aluop : out  STD_LOGIC_VECTOR (4 downto 0));
end decode;

architecture Behavioral of decode is

begin

  process (I_clk)
  begin
    if rising_edge(I_clk) and I_en = '1' then

      O_selA <= I_dataInst(7 downto 5);
      O_selB <= I_dataInst(4 downto 2);
      O_selD <= I_dataInst(11 downto 9);
      O_dataIMM <= I_dataInst(7 downto 0) & I_dataInst(7 downto 0);
      O_aluop <= I_dataInst(15 downto 12) & I_dataInst(8);

      case I_dataInst(15 downto 12) is
        when "0111" => 	-- WRITE
          O_regDwe <= '0';
        when "1100" => 	-- JUMP
          O_regDwe <= '0';
        when "1101" => 	-- JUMPEQ
          O_regDwe <= '0';
        when others =>
          O_regDwe <= '1';
      end case;
    end if;
  end process;

end Behavioral;

The opcodes and instruction bit locations should be placed in constants so we can change them easily later. Despite the writes to O_selA, O_aluop, etc, being non-conditional, I still placed them in the clock process. This is so I can see the output latch in the simulator easily when the pipeline transition occurs (more on that in a later post).
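
As a rough sketch of what that could look like, the package below collects the field positions from the decoder assignments and the non-writing opcodes from the table into named constants. The package name tpu_constants, the constant names and the writes_register helper are all made up for illustration; they are not taken from the actual TPU source.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical package: every name in here is illustrative only.
package tpu_constants is

    -- Bit positions of each instruction field, as used by the decoder above
    constant OPCODE_HI : integer := 15;
    constant OPCODE_LO : integer := 12;
    constant SELD_HI   : integer := 11;
    constant SELD_LO   : integer := 9;
    constant FLAG_BIT  : integer := 8;
    constant SELA_HI   : integer := 7;
    constant SELA_LO   : integer := 5;
    constant SELB_HI   : integer := 4;
    constant SELB_LO   : integer := 2;
    constant IMM_HI    : integer := 7;
    constant IMM_LO    : integer := 0;

    -- The three opcodes that do not write a register (values from the opcode table)
    constant OPCODE_WRITE  : std_logic_vector(3 downto 0) := "0111";
    constant OPCODE_JUMP   : std_logic_vector(3 downto 0) := "1100";
    constant OPCODE_JUMPEQ : std_logic_vector(3 downto 0) := "1101";

    -- Returns '0' for WRITE, JUMP and JUMPEQ, and '1' for every other opcode
    function writes_register(opcode : std_logic_vector(3 downto 0))
        return std_logic;

end package tpu_constants;

package body tpu_constants is

    function writes_register(opcode : std_logic_vector(3 downto 0))
        return std_logic is
    begin
        if opcode = OPCODE_WRITE or opcode = OPCODE_JUMP or opcode = OPCODE_JUMPEQ then
            return '0';
        else
            return '1';
        end if;
    end function writes_register;

end package body tpu_constants;
```

With use work.tpu_constants.all; added to the decoder, an assignment such as O_selA <= I_dataInst(SELA_HI downto SELA_LO); replaces the hard-coded range, and the case statement could shrink to O_regDwe <= writes_register(I_dataInst(OPCODE_HI downto OPCODE_LO));. Moving a field then means editing one constant rather than hunting through the process.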

That is actually the decoder done. I did say it was super simple!
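
A quick way to convince yourself the bit ranges are right is to push a couple of hand-encoded instructions through the decoder. The test bench below is a minimal sketch: the name decode_tb and the two test words are my own choices, and it assumes the decode entity above is compiled into the work library. ADD r1, r2, r3 encodes as 0000 001 0 010 011 00 (0x024C) and WRITE r2, r3 as 0111 000 0 010 011 00 (0x704C).

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical test bench for the decode entity shown above.
entity decode_tb is
end decode_tb;

architecture behavior of decode_tb is
    signal I_clk      : std_logic := '0';
    signal I_en       : std_logic := '0';
    signal I_dataInst : std_logic_vector(15 downto 0) := (others => '0');
    signal O_selA     : std_logic_vector(2 downto 0);
    signal O_selB     : std_logic_vector(2 downto 0);
    signal O_selD     : std_logic_vector(2 downto 0);
    signal O_dataIMM  : std_logic_vector(15 downto 0);
    signal O_regDwe   : std_logic;
    signal O_aluop    : std_logic_vector(4 downto 0);

    constant I_clk_period : time := 10 ns;
begin

    uut: entity work.decode port map (
        I_clk => I_clk, I_dataInst => I_dataInst, I_en => I_en,
        O_selA => O_selA, O_selB => O_selB, O_selD => O_selD,
        O_dataIMM => O_dataIMM, O_regDwe => O_regDwe, O_aluop => O_aluop);

    -- Free-running clock, rising edges at 5 ns, 15 ns, ...
    clk_proc: process
    begin
        I_clk <= '0';
        wait for I_clk_period/2;
        I_clk <= '1';
        wait for I_clk_period/2;
    end process;

    stim_proc: process
    begin
        I_en <= '1';

        -- ADD r1, r2, r3: opcode 0000, rD = 001, flag = 0, rA = 010, rB = 011
        I_dataInst <= X"024C";
        wait for I_clk_period;
        assert O_selD = "001" and O_selA = "010" and O_selB = "011"
            report "ADD: register selects decoded incorrectly" severity error;
        assert O_aluop = "00000"
            report "ADD: unexpected alu_op" severity error;
        assert O_regDwe = '1'
            report "ADD: should write a register" severity error;

        -- WRITE r2, r3: opcode 0111, rA = 010, rB = 011
        I_dataInst <= X"704C";
        wait for I_clk_period;
        assert O_aluop = "01110"
            report "WRITE: unexpected alu_op" severity error;
        assert O_regDwe = '0'
            report "WRITE: should not write a register" severity error;

        wait;
    end process;

end behavior;
```

Because the outputs are registered, each check happens one full clock period after the instruction word is applied, by which point the rising edge in the middle of that period has latched the new values.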

Additional ISA detail

We have our instruction opcodes, a preliminary instruction set, register set, and instruction forms. We also define (again, for simplicity) that RAM is 16-bit addressable, i.e. each address holds 2 bytes, and we will always read/write 16 bits at a time from memory. We can change to byte addressing fairly easily later on if need be, but I’ll keep the restriction to only read/write 16 bits at a time. We’ll keep expanding on the ISA as we move onto the ALU and discuss implementing the various operations; for instance, we can’t talk about latencies and hazards just yet.

RAM

The RAM, for now, is similar to the register set that was implemented before.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity ram16 is
    Port ( I_clk : in  STD_LOGIC;
           I_we : in  STD_LOGIC;
           I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
           I_data : in  STD_LOGIC_VECTOR (15 downto 0);
           O_data : out  STD_LOGIC_VECTOR (15 downto 0));
end ram16;

architecture Behavioral of ram16 is
	type store_t is array (0 to 31) of std_logic_vector(15 downto 0);
   signal ram_16: store_t := (others => X"0000");
begin

	process (I_clk)
	begin
		if rising_edge(I_clk) then
			if (I_we = '1') then
				-- 5 address bits index the 32 words, so higher addresses wrap
				ram_16(to_integer(unsigned(I_addr(4 downto 0)))) <= I_data;
			else
				O_data <= ram_16(to_integer(unsigned(I_addr(4 downto 0))));
			end if;
		end if;
	end process;

end Behavioral;

The RAM as implemented above is 64 bytes, in 32 addressable locations. This can be changed easily by editing the store_t type array bounds, and also the read/write statements in the process – making sure the I_addr bits we need are used. At the moment, the memory space will wrap with locations greater than 31. This simple ram definition gets us very far whilst the rest of the CPU is designed and built. Eventually, I want this CPU using the DRAM on the miniSpartan6+ board itself. A note of caution – start off with a small ram if using the above code. It can take a very long time to build/simulate otherwise.
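
One way to avoid editing the array bounds and the address slices by hand is to drive both from a single generic. This is only a sketch of that idea, not the RAM used in the project: the entity name ram16_generic and the generic ADDR_BITS are hypothetical. The store holds 2**ADDR_BITS words, and only the low ADDR_BITS bits of I_addr are used, so larger addresses wrap just as described above.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical resizable variant of ram16; the default gives 32 words (64 bytes).
entity ram16_generic is
    Generic ( ADDR_BITS : positive range 1 to 16 := 5 );
    Port ( I_clk  : in  STD_LOGIC;
           I_we   : in  STD_LOGIC;
           I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
           I_data : in  STD_LOGIC_VECTOR (15 downto 0);
           O_data : out STD_LOGIC_VECTOR (15 downto 0));
end ram16_generic;

architecture Behavioral of ram16_generic is
    -- The number of words follows the generic, so the bounds never go stale
    type store_t is array (0 to 2**ADDR_BITS - 1) of std_logic_vector(15 downto 0);
    signal ram_16 : store_t := (others => X"0000");
begin

    process (I_clk)
    begin
        if rising_edge(I_clk) then
            if I_we = '1' then
                -- Only the low ADDR_BITS bits select a word; upper bits are ignored
                ram_16(to_integer(unsigned(I_addr(ADDR_BITS - 1 downto 0)))) <= I_data;
            else
                O_data <= ram_16(to_integer(unsigned(I_addr(ADDR_BITS - 1 downto 0))));
            end if;
        end if;
    end process;

end Behavioral;
```

Instantiating it with generic map (ADDR_BITS => 8) would give 256 words. The earlier caution still applies: keep the value small while iterating, as big stores slow down build and simulation.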

Wrapping up

That’s about enough for this part. It’s worth noting that currently as I write this (with the CPU actually complete, using this fake RAM), I’ve got instruction mnemonics and an assembler written in C# which I use for test case generation using the above instruction forms and opcodes. A good few more parts are needed before we get to talk about that, though!

Thanks for reading, comments as always to @domipheus, and the next part should be looking into the ALU.

Designing a CPU in VHDL, Part 2: Xilinx ISE Suite, register file, testing

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing!

ISE WebPACK Design Software

I’m using the Xilinx ISE WebPack suite of tools for this project. It’s available here for Windows and Linux, for free. Once installed and set up, you can run the project navigator and create a new project. I’ll go through some basic steps here, just for clarity – however this should not be taken as a VHDL primer, I’m assuming some basic knowledge. There are a variety of good VHDL tutorials that will lay down those basics for you.

[Image: proj_settings]

You get presented with settings to choose from when you initially create a new project, after giving it a name and description. I’ve set my project up for VHDL and the same chip on my miniSpartan6+ board. Once we accept these settings, we have the IDE ready for getting some VHDL modules added.

Making our VHDL module

In this part we want to concern ourselves with the register file. The registers, for this CPU, will come in the form of 8 16-bit values. I’ll go into more detail in the next part about Instruction Set Architecture choices, but for now we assume that we have a destination register value, and two sources for ALU operations.

[Image: reg_file]

With the register file as a black box, we know it must have 3 inputs to it for indicating what registers we want for the destination and two sources. We also need the data output from those sources, and an input for the destination. Finally, we want a control line for the write enable functionality of the register file. Sometimes we will not want the destination value to be actually written, so it needs an input to enable the writing of data. The inputs which select the registers will address one of 8, so we need a 3-bit signal for the selection lines. For the data, we know they are 16 bits wide. We will also need a clock input, and an enable bit. With this information fixed, we can create a new VHDL module within our project and detail the ports it exposes.

[Image: new_source_steps]

ISE creates the skeleton for the source automatically. The empty module is as follows.

entity reg16_8 is
Port ( I_clk : in  STD_LOGIC;
       I_en : in  STD_LOGIC;
       I_dataD : in  STD_LOGIC_VECTOR (15 downto 0);
       O_dataA : out  STD_LOGIC_VECTOR (15 downto 0);
       O_dataB : out  STD_LOGIC_VECTOR (15 downto 0);
       I_selA : in  STD_LOGIC_VECTOR (2 downto 0);
       I_selB : in  STD_LOGIC_VECTOR (2 downto 0);
       I_selD : in  STD_LOGIC_VECTOR (2 downto 0);
       I_we : in  STD_LOGIC);
end reg16_8;

architecture Behavioral of reg16_8 is
begin

end Behavioral;

It’s worth noting at this point that I’ve used STD_LOGIC_VECTOR (SLV) types everywhere. Upon posting Part 1 of this series, I had a few folk tell me where possible to use the integer types. A quick Google for more information does show many folk using those instead of the raw SLV types for various reasons. I may go into those later, and revisit the code to re-implement with less SLV usage. For now, however, I intend to just crack on and continue regardless.

Register file logic

The register file is very simple. We need it to do the following:

  1. On each clock cycle and if enabled, update the source A and B register outputs given the selection inputs for A and B
  2. On each clock cycle and if enabled, and if the write enable is active, set the internal value of the register D selected to that passed into the dataD input

For this, we will add a process block.

architecture Behavioral of reg16_8 is
begin
  process(I_clk)
  begin
    if rising_edge(I_clk) and I_en='1' then
      -- do our things!
    end if;
  end process;
end Behavioral;

We add I_clk into the process sensitivity list – the parameters after the process keyword. This means this process gets re-evaluated when the state of I_clk changes, which is exactly what you’d expect. The next thing we need to do is define our actual data store, the registers themselves. This is fairly easy and we just define a type, followed by a signal of that type.

architecture Behavioral of reg16_8 is
  type store_t is array (0 to 7) of std_logic_vector(15 downto 0);
  signal regs: store_t := (others => X"0000");
begin
  ..

regs is now an array of 8 SLVs containing 16 bits each, representing our registers. The others aggregate initializes every register to 0. Now we just add our logic as per the two main use cases above. We cast our std_logic_vector inputs to unsigned integers so that they can index the regs signal array.

process(I_clk)
begin
  if rising_edge(I_clk) and I_en='1' then
    O_dataA <= regs(to_integer(unsigned(I_selA)));
    O_dataB <= regs(to_integer(unsigned(I_selB)));
    if (I_we = '1') then
      regs(to_integer(unsigned(I_selD))) <= I_dataD;
    end if;
  end if;
end process;

At this point it’s worth checking our syntax using the command listed under Synthesize when the module is selected in the hierarchy within ISE. This will show a few errors, as we are using functionality that you need to include a library for. Thankfully, the generated module has inserted comments near the top of the file indicating all we need is to include the statement ‘use IEEE.NUMERIC_STD.ALL;’ to use these functions. Doing this will allow our register file to pass syntax checking.
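
For reference, here are the pieces from this post assembled into one file, including the numeric_std use clause. Nothing new is introduced; the only assumption is that the ISE-generated header looks like the standard library and use lines shown, so your generated file may differ slightly in comments and ordering.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;  -- provides unsigned and to_integer

entity reg16_8 is
    Port ( I_clk   : in  STD_LOGIC;
           I_en    : in  STD_LOGIC;
           I_dataD : in  STD_LOGIC_VECTOR (15 downto 0);
           O_dataA : out STD_LOGIC_VECTOR (15 downto 0);
           O_dataB : out STD_LOGIC_VECTOR (15 downto 0);
           I_selA  : in  STD_LOGIC_VECTOR (2 downto 0);
           I_selB  : in  STD_LOGIC_VECTOR (2 downto 0);
           I_selD  : in  STD_LOGIC_VECTOR (2 downto 0);
           I_we    : in  STD_LOGIC);
end reg16_8;

architecture Behavioral of reg16_8 is
    -- Eight 16-bit registers, all initialised to zero
    type store_t is array (0 to 7) of std_logic_vector(15 downto 0);
    signal regs : store_t := (others => X"0000");
begin

    process(I_clk)
    begin
        if rising_edge(I_clk) and I_en = '1' then
            -- Both read ports update on every enabled clock edge
            O_dataA <= regs(to_integer(unsigned(I_selA)));
            O_dataB <= regs(to_integer(unsigned(I_selB)));
            -- The destination register only changes when write enable is set
            if (I_we = '1') then
                regs(to_integer(unsigned(I_selD))) <= I_dataD;
            end if;
        end if;
    end process;

end Behavioral;
```

Note that because regs is a signal, a read of the register being written in the same cycle returns the old value; the new value shows up on the read port one clock edge later.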

[Images: reg_syntax_error, numeric_std_syntax_good]

Testing

To test our module, we will create a test bench for it within ISE. Right click the hierarchy, and add new source, of type VHDL Test Bench. I name my tests after what I’m testing, with ‘_tb’ appended, but you can call it anything. Associate the test with the register file module, and a new VHDL file containing a test harness will be created based on its definition, ready for you to add in some detail.

[Image: reg_tb]

When simulating, you can assign values to the inputs and issue wait statements to account for cycle latencies. ISE automatically generates a clock process for you, so that input is taken care of. Suppose we assign some values to the inputs and write a value to a register D which is also selected as one of the register A or B outputs. If we then wait a clock cycle, we should see the output for register A change. This test looks as follows:

-- insert stimulus here
I_en <= '1';
I_selA <= "000";
I_selB <= "001";
I_selD <= "000";
I_dataD <= X"FAB5";
I_we <= '1';
wait for I_clk_period;

Now by running the ‘Simulate Behavioural Model’ process when in Simulation view (which you should be in now after editing the test bench source file) you get an ISim instance showing a waveform timeline of all the signals. Here we can validate our expected output, and we do get what was expected. After a single clock cycle, the output from register A (the O_dataA signal) becomes 0xFAB5.

[Image: sim_reg_tb]

You can view the contents of the regs array using the memory pane.

[Image: sim_reg_tb_memoryview]

We can extend the test a bit more, to cover some basic situations we know we will get to. Test further writes, make sure we do not write when the write enable is not asserted, writing multiple times to the same register, and finally reading the same register twice on the same clock cycle – useful for nops (using or), doubling values with add and also clearing with xor. The output from the simulator is shown below, with some notes annotated at respective cycles. The full test bench source is at the end of this post.

[Image: test_reg_annotated]

  1. read r0 and r1, write 0xfab5 to r0
  2. Ensure 0xfab5 appears on data out line, write 0x2222 to r2
  3. Write 0x3333 to r2, testing multiple writes to same location
  4. Set up as though writing 0xfeed to r0, but don’t signal write enable
  5. Read r2, ensuring it is 0x3333 not 0x2222. Ensure r1 is 0x0. Ensure r3 and r4 are 0x0.
  6. Write 0x4444 to r4. Ensure 0xfeed was not written to r0.
  7. After several cycles, read r4 on both A and B outputs, to test we can read the same register on both ports. Ensure the output is the 0x4444 we wrote earlier, on both outputs.

This test shows us the register file should be suitable for our needs!

Moving on

A further test to add would be to also test the enable bit works (as in, when disabled nothing updates) – you’ll have to trust me when I say it does! It’s worth noting there are VHDL asserts, but the truth is despite them compiling fine I’ve not found where any pass/fail/errors appear – even when I’ve forced an assert condition to fail. Maybe someone could help me out in that regard (message @domipheus).
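
For anyone wanting to try asserts, the usual shape is assert condition report "message" severity level;. The cut-down test bench below is a sketch of a self-checking version of the first write test. The name reg16_8_assert_tb is made up, and it assumes the reg16_8 module from this post is compiled into the work library. Typically a failing assert prints its report message, along with the simulation time, to the simulator's console output, though exactly where that shows up depends on the tool.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical self-checking test bench for reg16_8.
entity reg16_8_assert_tb is
end reg16_8_assert_tb;

architecture behavior of reg16_8_assert_tb is
    signal I_clk   : std_logic := '0';
    signal I_en    : std_logic := '0';
    signal I_dataD : std_logic_vector(15 downto 0) := (others => '0');
    signal I_selA  : std_logic_vector(2 downto 0) := (others => '0');
    signal I_selB  : std_logic_vector(2 downto 0) := (others => '0');
    signal I_selD  : std_logic_vector(2 downto 0) := (others => '0');
    signal I_we    : std_logic := '0';
    signal O_dataA : std_logic_vector(15 downto 0);
    signal O_dataB : std_logic_vector(15 downto 0);

    constant I_clk_period : time := 10 ns;
begin

    uut: entity work.reg16_8 port map (
        I_clk => I_clk, I_en => I_en, I_dataD => I_dataD,
        O_dataA => O_dataA, O_dataB => O_dataB,
        I_selA => I_selA, I_selB => I_selB, I_selD => I_selD,
        I_we => I_we);

    -- Free-running clock, rising edges at 5 ns, 15 ns, ...
    clk_proc: process
    begin
        I_clk <= '0';
        wait for I_clk_period/2;
        I_clk <= '1';
        wait for I_clk_period/2;
    end process;

    stim_proc: process
    begin
        -- Cycle 1: write 0xFAB5 to r0 while reading r0 on port A
        I_en    <= '1';
        I_selA  <= "000";
        I_selD  <= "000";
        I_dataD <= X"FAB5";
        I_we    <= '1';
        wait for I_clk_period;

        -- Cycle 2: present 0xFEED with write enable low; r0 must not change
        I_dataD <= X"FEED";
        I_we    <= '0';
        wait for I_clk_period;
        assert O_dataA = X"FAB5"
            report "r0 should read back 0xFAB5 after the write" severity error;

        -- Cycle 3: port A must still show the original value
        wait for I_clk_period;
        assert O_dataA = X"FAB5"
            report "r0 changed although write enable was low" severity error;

        report "register file checks finished" severity note;
        wait;
    end process;

end behavior;
```

Deliberately changing one of the expected values is a quick way to find out where your simulator surfaces the failure message.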

That’s it for this part. Thanks for reading, comments as always to @domipheus, and the next part will be about the ISA and decoder!

The next part in the series is now available.

Source for the test bench used in the annotated simulation is below.

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Uncomment the following library declaration if using
-- arithmetic functions with Signed or Unsigned values
--USE ieee.numeric_std.ALL;

ENTITY reg16_8_tb IS
END reg16_8_tb;

ARCHITECTURE behavior OF reg16_8_tb IS 

    -- Component Declaration for the Unit Under Test (UUT)

    COMPONENT reg16_8
    PORT(
         I_clk : IN  std_logic;
         I_en : IN  std_logic;
         I_dataD : IN  std_logic_vector(15 downto 0);
         O_dataA : OUT  std_logic_vector(15 downto 0);
         O_dataB : OUT  std_logic_vector(15 downto 0);
         I_selA : IN  std_logic_vector(2 downto 0);
         I_selB : IN  std_logic_vector(2 downto 0);
         I_selD : IN  std_logic_vector(2 downto 0);
         I_we : IN  std_logic
        );
    END COMPONENT;

   --Inputs
   signal I_clk : std_logic := '0';
   signal I_en : std_logic := '0';
   signal I_dataD : std_logic_vector(15 downto 0) := (others => '0');
   signal I_selA : std_logic_vector(2 downto 0) := (others => '0');
   signal I_selB : std_logic_vector(2 downto 0) := (others => '0');
   signal I_selD : std_logic_vector(2 downto 0) := (others => '0');
   signal I_we : std_logic := '0';

  --Outputs
   signal O_dataA : std_logic_vector(15 downto 0);
   signal O_dataB : std_logic_vector(15 downto 0);

   -- Clock period definitions
   constant I_clk_period : time := 10 ns;

BEGIN

  -- Instantiate the Unit Under Test (UUT)
   uut: reg16_8 PORT MAP (
          I_clk => I_clk,
          I_en => I_en,
          I_dataD => I_dataD,
          O_dataA => O_dataA,
          O_dataB => O_dataB,
          I_selA => I_selA,
          I_selB => I_selB,
          I_selD => I_selD,
          I_we => I_we
        );

   -- Clock process definitions
   I_clk_process :process
   begin
    I_clk <= '0';
    wait for I_clk_period/2;
    I_clk <= '1';
    wait for I_clk_period/2;
   end process;

   -- Stimulus process
   stim_proc: process
   begin
      -- hold reset state for 100 ns.
      wait for 100 ns;	

      wait for I_clk_period*10;

      -- insert stimulus here 

    I_en <= '1';

    -- test for writing.
    -- r0 = 0xfab5
    I_selA <= "000";
    I_selB <= "001";
    I_selD <= "000";
    I_dataD <= X"FAB5";
    I_we <= '1';
      wait for I_clk_period;

    -- r2 = 0x2222
    I_selA <= "000";
    I_selB <= "001";
    I_selD <= "010";
    I_dataD <= X"2222";
    I_we <= '1';
      wait for I_clk_period;

    -- r2 = 0x3333 (second write to the same register)
    I_selA <= "000";
    I_selB <= "001";
    I_selD <= "010";
    I_dataD <= X"3333";
    I_we <= '1';
      wait for I_clk_period;

    --test just reading, with no write
    I_selA <= "000";
    I_selB <= "001";
    I_selD <= "000";
    I_dataD <= X"FEED";
    I_we <= '0';
      wait for I_clk_period;

    --at this point dataA should not be 'feed'

    I_selA <= "001";
    I_selB <= "010";
      wait for I_clk_period;

    I_selA <= "011";
    I_selB <= "100";
      wait for I_clk_period;

    I_selA <= "000";
    I_selB <= "001";
    I_selD <= "100";
    I_dataD <= X"4444";
    I_we <= '1';
      wait for I_clk_period;

    I_we <= '0';
      wait for I_clk_period;

    -- nop
      wait for I_clk_period;

    I_selA <= "100";
    I_selB <= "100";
      wait for I_clk_period;

      wait;
   end process;

END;

Designing a CPU in VHDL, Part 1: Rationale, tools, method

Why design my own CPU, with associated ISA, assembler and other tools?

Because, I can! Why not? I’ll learn a load of stuff!

The above is the fundamental reason for this series of posts. As a software developer, and in particular, a compiler/debugger engineer, you are exposed to low level architectural details, latencies, hazards and of course, hardware bugs. In the past I’ve been part of teams who have been able to feed back details of architectural quirks that, if modified, can improve throughput in certain workloads. Sometimes, completely new features have been added to hardware due to feedback. However, as a software engineer, you are limited in exposure to what that actually boils down to at the hardware level. It’s an area of computer science which fascinates me and which I’d very much like to get more involved in. So, a few years ago I downloaded the Xilinx ISE WebPACK software and started to learn VHDL, a hardware description language (HDL).

VHDL, really, is simple. It’s the safer choice when it comes to HDL – going by what I’ve read. Verilog is the C99 of the HDL world, and you can get in quite a mess as a beginner if you don’t understand it well enough. So, starting as a novice to HDL concepts, VHDL was the obvious choice.

[Image: minispartan6]

Last year, I backed the miniSpartan6+ FPGA Kickstarter project. I now have the end product at home, based on the Xilinx LX25 Spartan6 FPGA. It’s a nice little board, and I’ve managed to flash a small hello world style LED blinker to it successfully. You can get many other types of board (even from Amazon), and entry level ones are pretty affordable.

Over the past two months, I’ve spent my train journey into work designing and implementing a very small 16-bit CPU. I’ve codenamed it TPU, for Test Processing Unit. In the next series of posts, I will be explaining how I’ve gone from an empty VHDL source file to a project which runs code processed through my C# assembler within the Xilinx ISim simulator. As I write this now, the project is running code under simulation with basic arithmetic operations, branching and memory access. The end goal is to fix some issues identified during simulation, and get it on the miniSpartan6+ hardware.

I hope to learn much along the way whilst writing these articles. If you are an experienced hardware engineer and see me doing something a) stupid, b) inefficient, c) unwise or d) stupid, please do tell me by ways of twitter at @domipheus. Efficiency isn’t a goal of this, but I’d still like to know!

[Image: arrrrghinterconnectspaghetti]

In many ways, the TPU can be classed the Terrible Processing Unit – but it needs to be this way. It is worth noting I’ve tried making a CPU design before, but always got into a spaghetti mess by trying to do too much, and not knowing the underlying gotchas of how to link all the internal components together.

This is not a superscalar processor. This will be a CPU that takes multiple clock cycles to execute the simplest of instructions. It’s aimed at a level to educate myself, and hopefully at a level others can gain knowledge from.

So, what we want from/in the TPU:

  • A 16-bit CPU core
  • At the start using synthesized ram but hopefully later using the SDRAM on the miniSpartan6+ board
  • An in-order CPU with no real pipelining as such
  • Basic arithmetic operations, shifts, additions
  • Branching, including conditional branches
  • A register file
  • A control unit to keep everything in lock-step
  • Design an ISA and create a small assembler
  • Ultimately, be an educational project to learn from

The tools I have used are the free (yes, I find it quite insane too) Xilinx WebPack tools. It comes with a very nice IDE, and associated toolchain – including a simulator. The other tools I will use are those for creating the assembler (Visual Studio C# – as with most of my hobby coding these days) and the tools to load files onto the miniSpartan6+ board.

As for a method, I’m just winging it. If problems appear, they will be solved.

The next part will be about implementing the RAM and register file, and testing it in the simulator. Then I’ll discuss the ISA, in preparation for looking at the decoder. For now, here is a spoiler of the TPU in some simulator action; bonus points go to the people who realise there is an odd thing about the form of the baz (branch if Ra is zero) instruction.

[Image: mul_sim]

Thanks for reading, send all comments to @domipheus on twitter!

Ohh, and before I forget, this will be completely open source. I’ll throw it up on GitHub soon!

Part 2 in this series is now available.

Teensy Z80 Homebrew Computer – Part 6 – Asynchronous Clocking Fail

This is the sixth part of a series of posts detailing steps required to get a simple Z80 based computer running, facilitated by a Teensy microcontroller. It’s a bit of fun, fusing old and new hobbyist technologies. See Part 1, Part 2, Part 3, Part 4, and Part 5, if you’ve missed them.

Attempt 1

Making TeensyZ80 run with a faster, asynchronous clock seems a simple change at first, but it’s proving tricky. The high level plan is:

  1. The Clock signal is provided by another source (an Arduino Nano at present)
  2. The MREQ and IOREQ lines are used to latch the WAIT line of the Z80 to allow the Teensy to respond to the request.
  3. The Teensy senses the WAIT line, performs any actions, and then resets the latch to bring the WAIT line high again (it’s active low).
  4. The Z80 continues as normal.

So with some simple 74 series logic, the MREQ and IOREQ pins are NAND’d together, producing a rising signal edge if either Z80 output goes active low. This is fed into a 74HC74 flip-flop as its clock, with the data pin tied logic high. This allows us to connect the Z80 WAIT input to the notQ output. The clear pin of the D-type flip-flop is connected to the Teensy so it can reset it and allow the WAIT line to return high, letting the Z80 continue.
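
Since the other project on the go is VHDL, here is the same first-attempt latch written as a small behavioural VHDL model, purely to illustrate the logic. This is not code from the project: the entity and signal names are invented, and it assumes the clear input is active low, as it is on a 74HC74.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical behavioural model of the first-attempt WAIT latch.
entity wait_latch is
    Port ( n_mreq  : in  STD_LOGIC;   -- Z80 MREQ, active low
           n_ioreq : in  STD_LOGIC;   -- Z80 IOREQ, active low
           n_clear : in  STD_LOGIC;   -- from the Teensy, active low (assumed)
           n_wait  : out STD_LOGIC);  -- to the Z80 WAIT input, active low
end wait_latch;

architecture Behavioral of wait_latch is
    signal req : std_logic;
    signal q   : std_logic := '0';
begin

    -- NAND gate: output goes high as soon as either request line goes low
    req <= n_mreq nand n_ioreq;

    -- D-type flip-flop with D tied high and an asynchronous clear
    process (req, n_clear)
    begin
        if n_clear = '0' then
            q <= '0';                 -- Teensy releases the Z80
        elsif rising_edge(req) then
            q <= '1';                 -- a request started: latch it
        end if;
    end process;

    -- WAIT is driven from notQ, so it is pulled low while the latch is set
    n_wait <= not q;

end Behavioral;
```

The model makes the sequence easy to follow: a falling MREQ or IOREQ sets the latch and holds WAIT low until the Teensy pulses the clear line.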

I had the Teensy set up to perform an interrupt routine on a falling edge of WAIT. Sadly, this didn’t seem to work. In fact, I could not confirm the interrupt was being called at all. I’ll have to look into this in detail but using interrupts really is an optimization in this case, so I soldiered on.

Teensy Rant

I’ve had several problems with Teensy microcontrollers during these posts. I had two units; one has completely bricked, and the other is very unstable. It seems to be due to the fact that if Pin 33 is low and an input when a program is uploaded to the Arm then the Mini54 chip can fail in some way. The Mini54 chip controls the bootloading process of uploading new code, so it effectively bricks the device. It is an issue that should really be given more prominence: if there had been an announcement stating that Pin 33 should never be used in certain ways, I’d have two fully working Teensy devices. But sadly, all the documentation still states it as a fully configurable digital pin capable of input and output.

End Teensy Rant

Instead of using an interrupt, to try to get something working I created a tight loop() function that didn’t do anything while WAIT was high. As soon as it detected a low signal, it would perform the actions required. I disabled Z80 mode-2 interrupts for now, and removed the I/O debounce code. A very simple example seemed to work, but it was still quite slow, despite an Arduino Nano driving a clock at around 200 kHz, which is faster than what the Teensy was providing when running in synchronous mode.

I tried a larger example, one which printed text to the console, and it was obvious something was not right – the output slowly became corrupted. However, there were signs of promise. I was able to input a 4MHz clock and things were failing/corrupting in a somewhat similar way. Still corrupt, but it was the same behavior.

[Image: 4mhz]

The problem

The issue was that I failed to include the RD/WR lines from the Z80 in my latching circuit. You can see from the timing diagram that, especially in the write cases, we need to WAIT when those are active too, not just MREQ or IOREQ.

[Image: timing_memory]

Attempt 2

I redesigned the latch circuit.

[Image: wait_line_latch]

This worked a lot better! I could only use the I/O port which put characters to the screen, but it was running well, and my simple test program, which printed “Welcome to TeensyZ80!” in an infinite loop, was stable even at 1 MHz. I’d love to break the MHz barrier for this, but given we’re still on a breadboard and I don’t have a scope capable of inspecting this to the detail some of the issues require, I may need to settle for much less. So this simple test at 1 MHz is very encouraging. I tried clocking it at 1.5 MHz, but some artifacts in the printing arose.

[Image: latchcircuit]

The previous design with I/O ports

When implementing my serial, display and filesystem devices which are accessed as I/O reads/writes, I created a system which relied on implied state behind the scenes on the Teensy. To set the colour of the characters being printed to the screen, two writes to the same port would write high and low values. It’s even worse for the serial device, where you had to write command packets to the I/O ports followed by a variable amount of data. I think I’m going to need to redesign all of the previous work to operate on separate ports. For example, there will be a ColourHi and a ColourLow port which together define the 16-bit colour of the console. It’s not much work, but it is something I’d overlooked and will take time.

This is a very quick update on the Teensy Z80 work; it’s still very much ongoing. I’m also working on another project involving the miniSpartan6+ FPGA board. That’s another bit of fun – who doesn’t want to design their own processor?

Let me know any thoughts, as always, via twitter @domipheus !