Designing a CPU in VHDL, Part 7: Memory Operations, Running on FPGA

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Memory Operations

We already have a small RAM which holds our instruction stream, but our TPU ISA defines memory read and write instructions, and we should get those instructions working.

It’s the last major functional implementation we need to complete.

The fetch stage is simply a memory read with the PC on our address bus. It gives a cycle of latency to allow our instruction to appear on the data out bus of the RAM, ready for decoding. When we encounter a memory operation, we need the control unit to activate the memory stage of the pipeline, which sits after Execute and before Writeback. For a memory read, the ALU calculates the memory address during Execute, that address is read during the memory stage, and the data is passed to the register file during Writeback. For a memory write, the ALU calculates the address, and the data we want to write is always on the dataB output bus of the register file, so we connect that up to the memory input bus.

The control unit is modified to add in the memory stage, and it now also takes the ALU operation as an input so it can check whether that stage is needed. You can see the new unit here.
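
As a rough idea of what that change involves, here is a minimal sketch of a control unit with the extra memory stage. It is not the actual TPU unit linked above: the entity name control_sketch, the one-hot state encoding, the state constant names and the opcode values in the generics are all assumptions made for illustration. It also assumes that both reads and writes still pass through Writeback after the memory stage.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity control_sketch is
  generic (
    -- Placeholder encodings (assumed): use the values from your own constants.
    OPCODE_READ  : std_logic_vector(3 downto 0) := "0110";
    OPCODE_WRITE : std_logic_vector(3 downto 0) := "0111"
  );
  port (
    I_clk   : in  std_logic;
    I_reset : in  std_logic;
    I_aluop : in  std_logic_vector(4 downto 0);
    O_state : out std_logic_vector(5 downto 0)
  );
end control_sketch;

architecture Behavioral of control_sketch is
  -- Hypothetical one-hot states, one bit per pipeline stage.
  constant S_FETCH     : std_logic_vector(5 downto 0) := "000001";
  constant S_DECODE    : std_logic_vector(5 downto 0) := "000010";
  constant S_REGREAD   : std_logic_vector(5 downto 0) := "000100";
  constant S_EXECUTE   : std_logic_vector(5 downto 0) := "001000";
  constant S_MEMORY    : std_logic_vector(5 downto 0) := "010000";
  constant S_WRITEBACK : std_logic_vector(5 downto 0) := "100000";
  signal s_state : std_logic_vector(5 downto 0) := S_FETCH;
begin

  process(I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        s_state <= S_FETCH;
      else
        case s_state is
          when S_FETCH   => s_state <= S_DECODE;
          when S_DECODE  => s_state <= S_REGREAD;
          when S_REGREAD => s_state <= S_EXECUTE;
          when S_EXECUTE =>
            -- Only memory instructions visit the memory stage.
            if I_aluop(4 downto 1) = OPCODE_READ
               or I_aluop(4 downto 1) = OPCODE_WRITE then
              s_state <= S_MEMORY;
            else
              s_state <= S_WRITEBACK;
            end if;
          when S_MEMORY  => s_state <= S_WRITEBACK;
          when others    => s_state <= S_FETCH;  -- writeback, or recovery
        end case;
      end if;
    end if;
  end process;

  O_state <= s_state;

end Behavioral;

With an encoding like this, an enable such as en_memory would simply be bit 4 of O_state.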

The Memory Subsystem

Because we now touch memory in multiple pipeline stages, we need to start routing our signals and selecting destinations depending on the current control state. There are various signal inputs that now come from multiple sources:

  1. The register file data input needs to be either dataResult from the ALU or, for a memory read, dataReadOutput (ramRData) from memory.
  2. The instruction decoder needs to be connected to dataReadOutput (ramRData) from memory. As the decoder only decodes during the correct pipeline stage, we don’t care that the input may hold something else at other times, as long as the instruction data is correct at the decode stage.
  3. The memory write bit needs to be set only when we are performing a memory write instruction, and not a read.
  4. Memory writes also need to drive the dataWriteInput (ramWData) port with the data we need: the contents of the rB register.
  5. The address sent to the memory needs to be the current PC during fetch, and dataResult during a memory operation.

We can try this without making another functional unit, by just doing some assignments in our test bench source.

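-- Address: the ALU result during the memory stage, otherwise the PC (for fetch).
ramAddr <= dataResult when en_memory = '1' else PC;
-- Write data is always the rB register output.
ramWData <= dataB;
-- Only assert write enable for a write instruction in the memory stage.
ramWE <= '1' when en_memory = '1' and aluop(4 downto 1) = OPCODE_WRITE else '0';

-- Register file input: RAM data for a read instruction, otherwise the ALU result.
registerWriteData <= ramRData when en_regwrite = '1' and aluop(4 downto 1) = OPCODE_READ else dataResult;
-- The decoder always sees the RAM data bus, but only uses it during decode.
instruction <= ramRData;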

Simulation

We use our existing test bench, with our additional memory system signals. We have loaded a new test instruction stream into the memory, which looks like this:

signal ram: store_t := (
  OPCODE_XOR & "000" & '0' & "000" & "000" & "00",
  OPCODE_LOAD & "001" & '1' & X"0f",
  OPCODE_LOAD & "010" & '1' & X"0e",
  OPCODE_LOAD & "110" & '1' & X"0b",
  OPCODE_READ & "100" & '0' & "010" & "100" & "00",
  OPCODE_READ & "101" & '0' & "001" & "100" & "00",
  OPCODE_SUB & "101" & '0' & "101" & "100" & "00",
  OPCODE_WRITE & "000" & '0' & "001" & "101" & "00",
  OPCODE_CMP & "111" & '0' & "101" & "101" & "00",
  OPCODE_JUMPEQ & "000" & '0' & "111" & "110" & "01",
  OPCODE_JUMP & "000" & '1' & X"05",
  OPCODE_JUMP & "000" & '1' & X"0b",
  X"0000",
  X"0000",
  X"0001",
  X"0006"
);

In TPU assembly, this resembles:

  xor r0, r0, r0
  load.l r1, 0x0f
  load.l r2, 0x0e
  load.l r6, $fin
  read r4, r2
loop:
  read r5, r1
  sub.u r5, r5, r4
  write r1, r5
  cmp.u r7, r5, r5
  jaz r7, r6
  jump $loop
fin:
  jump $fin

  .loc 0x0e
  data 0x0001
  .loc 0x0f
  data 0x0006

Each pass of the loop subtracts the value read from 0x0e (1) from the value stored at 0x0f (initially 6), so we expect to see 0x0000 in memory location 0x0f after 6 iterations of the loop. From the waveform we can see computation finishes within the simulation time. We can go into the memory view of ISim and see that the result is in the correct place.

This simulation works with one cycle of memory latency, when using our embedded RAM. If we wanted to go to an external RAM such as the DRAM on miniSpartan6+, we’d need to introduce multiple cycles of latency. For this, we should stall the pipeline whilst memory operations complete. We won’t go into that just now, as I think we need to take a step back, look at the top level view of TPU, and try to get what we have onto an FPGA.

Top level view

With everything built to date, we can see a pretty general outline of a CPU, with the various control lines, data lines, selects, etc. With this implemented as a black box ‘core’, we can try to implement our CPU in such a way that we can view a working test on actual miniSpartan6+ hardware.

Creating a top level block for FPGA hardware

The miniSpartan6+ board has 4 switches and 8 LEDs. The top-level block I created has the clock input, the 4 switch inputs and the 8 LED outputs. I still used the embedded RAM. The code within this block resembles the test bench, except there is a process for detecting when the RAM address line is 0x1000 and writing the data to the LED output pins. I use one of the switch inputs to drive the reset line, which actually doesn’t reset the whole CPU; it simply resets the control unit. As our registers do not get reset, execution continues once reset is deactivated, with some existing state still present.

The top level entity definition looks like the following:

entity leds_switch_test_expand is
  Port ( I_clk : in  STD_LOGIC;
         I_switch : in  STD_LOGIC_VECTOR (3 downto 0);
         O_leds : out  STD_LOGIC_VECTOR (7 downto 0));
end leds_switch_test_expand;

And pretty much everything remains the same as the simulation test bench, except we no longer use the simulated clock, and we hack in our LED memory mapping:

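-- Latch the low byte of dataB whenever the address bus holds 0x1000.
-- A clocked process only needs the clock in its sensitivity list.
process(I_clk)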
begin
  if rising_edge(I_clk) then
    if (O_address = X"1000") then
      leds <= dataB(7 downto 0);
    end if;
  end if;
end process;

O_leds <= leds(7 downto 1) & I_reset;
I_reset <= I_switch(0);

As you can see, I use the first LED to indicate the state of the reset line, which is useful.

With this new top level entity, we can create a test bench and write a very small code example to write a counter to the LED memory location. The code example below simulates and we see the LED output change. I force initialize the LEDs signal to a known good value as a debugging aid.

  load.l r0, 0x01
  load.l r1, 0x01
  load.h r6, 0x10
loop:
  write r6, r0
  add.u r0, r0, r1
  jump $loop
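
For reference, a test bench for this top level entity could look something like the sketch below. The entity and port names come from the definition above. The test bench name, the 20 ns clock period (matching a 50 MHz board clock) and the assumption that reset is active high are illustrative, so adjust to suit.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity leds_switch_test_expand_tb is
end leds_switch_test_expand_tb;

architecture behavior of leds_switch_test_expand_tb is
  constant I_clk_period : time := 20 ns;  -- assumed 50 MHz clock
  signal I_clk    : std_logic := '0';
  signal I_switch : std_logic_vector(3 downto 0) := (others => '0');
  signal O_leds   : std_logic_vector(7 downto 0);
begin

  -- Unit under test: the top level entity defined earlier.
  uut: entity work.leds_switch_test_expand
    port map (
      I_clk    => I_clk,
      I_switch => I_switch,
      O_leds   => O_leds
    );

  -- Free-running simulated clock.
  clk_process: process
  begin
    I_clk <= '0';
    wait for I_clk_period / 2;
    I_clk <= '1';
    wait for I_clk_period / 2;
  end process;

  -- Hold reset (switch 0) for a few cycles, then let the CPU run.
  -- Assumes the reset line is active high.
  stim_proc: process
  begin
    I_switch(0) <= '1';
    wait for 5 * I_clk_period;
    I_switch(0) <= '0';
    wait;
  end process;

end behavior;

Run it for long enough to cover a few passes of the loop and watch O_leds(7 downto 1) change; bit 0 just mirrors the reset switch.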

Now we need to look at how we get this VHDL design actually onto the hardware.

Using the miniSpartan6+ board from Windows

There is a great guide for getting the board running from Michael Field who runs the hamsterworks.co.nz wiki. You should give it a visit! The page in question is the miniSpartan6+ bringup.

I use the exact same method to get the .bit programming files onto the FPGA. This needs to be done every time you power up the FPGA, as it doesn’t write the flash, which would allow the FPGA design to persist across power cycles. Getting that working is for another day.

As explained in the bringup guide, we need to create a ‘User Constraints File’ which at a simple level maps the input and outputs of our entity to real pins on the board. Looking at the miniSpartan6+ schematic we can see what pins are connected where, for example LED6 is connected to the ‘location’ P7.

There is a full UCF available for the miniSpartan6+ here [https://github.com/scarabhardware/miniSpartan6-plus/blob/master/projects/miniSpartan6-plus.ucf], and we can use a subset of it for our purposes.

NET "I_clk" PERIOD = 20 ns | LOC = "K3";

NET "O_LEDS<0>" LOC="P11" | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<1>" LOC="N9"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<2>" LOC="M9"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<3>" LOC="P9"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<4>" LOC="T8"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<5>" LOC="N8"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<6>" LOC="P8"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
NET "O_LEDS<7>" LOC="P7"  | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;

NET "I_SWITCH<0>"   LOC="L1" | IOSTANDARD=LVTTL | PULLUP;
NET "I_SWITCH<1>"   LOC="L3" | IOSTANDARD=LVTTL | PULLUP;
NET "I_SWITCH<2>"   LOC="L4" | IOSTANDARD=LVTTL | PULLUP;
NET "I_SWITCH<3>"   LOC="L5" | IOSTANDARD=LVTTL | PULLUP;

The PULLUP parts of the I_SWITCH definitions are very important. My first try at creating this file (before I found the full UCF file on github) omitted the PULLUP, which was never going to work.

Without the PULLUP, regardless of the switch position, we’ll never get logic ‘1’ at the input. The hatched box is inside the FPGA, pulling the value to ‘1’ when the switch is not connected to ground. Which is what you want!

Now we have our UCF file done, we want to build our ‘Programming File’, which gets uploaded to our FPGA. We make our entity the top module by right-clicking it within Implementation mode and selecting the option. This unlocks the synthesis options, and we run the ‘Generate Programming File’ option. This can take some time and will raise warnings, but it completes without error. The steps taken to generate the file are below (taken from Xilinx tutorials):

Synthesis – ‘compiles’ the HDL into netlists and other structures
Translate – merges the incoming netlists and constraints into a Xilinx® design file.
Map – fits the design into the available resources on the target device, and optionally, places the design.
Place and Route – places and routes the design to the timing constraints.
Generate Programming File – creates a bitstream file that can be downloaded to the device.

First Flash

The first time I flashed the FPGA, I was stumped as to why the LEDs were remaining on (apart from the reset LED). Then it became obvious. The clock input is 50MHz. There is no way, with the CPU running that fast, we can see the LEDs change!

Frequency Divider

I solved this by adding a frequency divider into the VHDL. The 50MHz I_clk from the ‘outside world’ is slowed down using a very simple module, which basically counts and uses a bit high up the counter as an output clock. This clock output is then what’s fed into the TPU functional units such as the decoder, as the core_clock in the design. The frequency divider is as follows:

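library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity clock_divider is
port (
  clk: in std_logic;
  reset: in std_logic;       -- currently unused
  clock_out: out std_logic);
end clock_divider;

architecture Behavioral of clock_divider is
  -- Free-running counter; its upper bits toggle progressively slower.
  signal scaler : std_logic_vector(23 downto 0) := (others => '0');
begin

  process(clk)
  begin
    if rising_edge(clk) then   -- rising clock edge
      scaler <= std_logic_vector( unsigned(scaler) + 1);
    end if;
  end process;

  -- Bit 16 toggles every 2**16 input cycles, giving the slow core clock.
  clock_out <= scaler(16);

end Behavioral;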

Using that divider, it works, and we get counting LEDs!
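
To put a number on it: bit 16 of a free-running counter toggles every 2^16 input cycles, so a full period of clock_out is 2^17 = 131072 cycles of the input clock, and 50 MHz / 131072 is roughly 381 Hz. A small self-checking test bench can confirm that period in simulation. This is only a sketch: the test bench name and the 20 ns input period are assumptions, and it needs around 4 ms of simulated time before the check completes.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity clock_divider_tb is
end clock_divider_tb;

architecture behavior of clock_divider_tb is
  constant CLK_PERIOD : time := 20 ns;  -- assumed 50 MHz input clock
  signal clk       : std_logic := '0';
  signal reset     : std_logic := '0';
  signal clock_out : std_logic;
begin

  uut: entity work.clock_divider
    port map (
      clk       => clk,
      reset     => reset,
      clock_out => clock_out
    );

  -- Free-running input clock; stop the simulation by bounding the run time.
  clk <= not clk after CLK_PERIOD / 2;

  -- Measure the time between two rising edges of the divided clock.
  check: process
    variable t_first : time;
  begin
    wait until rising_edge(clock_out);
    t_first := now;
    wait until rising_edge(clock_out);
    assert (now - t_first) = 131072 * CLK_PERIOD
      report "clock_out period is not 2**17 input cycles"
      severity failure;
    report "clock_out period is 2**17 input cycles";
    wait;
  end process;

end behavior;

Picking a different bit of scaler halves or doubles the output frequency for each step down or up.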

Wrapping Up

I’ll put the full example top module on github (soon!), but there is more work to be done to make it a bit more robust and to make the LED memory mapping a real mapping (at the moment, a write to the LED address still happens in the RAM as well; we just don’t care or break on it).
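
One possible shape for that, as a sketch only: a tiny address decoder that routes a write either to the RAM or to the LED register, but never both. The entity, its port names and the 16-bit address width are assumptions for illustration; only the 0x1000 LED address comes from the design above.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity addr_decode_sketch is
  port (
    I_addr   : in  std_logic_vector(15 downto 0);  -- address from the core
    I_we     : in  std_logic;                      -- write request from the core
    O_ram_we : out std_logic;                      -- write enable for the RAM
    O_led_we : out std_logic                       -- write enable for the LED register
  );
end addr_decode_sketch;

architecture Behavioral of addr_decode_sketch is
  constant LED_ADDR : std_logic_vector(15 downto 0) := X"1000";
begin
  -- A write goes to exactly one destination, chosen by address.
  O_led_we <= I_we when I_addr = LED_ADDR else '0';
  O_ram_we <= I_we when I_addr /= LED_ADDR else '0';
end Behavioral;

The LED process would then test O_led_we instead of comparing the address itself, and the RAM would take O_ram_we as its write enable.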

For now, it’s pretty cool to see code actually running on a TPU on the FPGA hardware. Additionally, it only uses 3% of the slice resources of the LX25 Spartan6 FPGA, so lots more space to do other things with!

Thanks for reading, comments as always to @domipheus.
