This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.
We already have a small RAM which holds our instruction stream, but our TPU ISA defines memory read and write instructions, and we should get those instructions working.
It’s the last major functional implementation we need to complete.
The fetch stage is simply a memory read with the PC on our address bus. It takes one cycle of latency for the instruction to appear on the data-out bus of the RAM, ready for decoding. When we encounter a memory operation (a read or a write), we need the control unit to activate the memory stage of the pipeline, which sits after Execute and before Writeback. The way we want this implemented is that the ALU calculates the memory address during execute, that address is read during the memory stage, and the data is passed to the register file during writeback. For a memory write, the ALU calculates the address in the same way, and the data we want to write is always on the dataB bus output from the register file, so we connect that up to the memory input bus.
The control unit is modified to add in the memory stage, and also take the ALU operation as an input to do that check. You can see the new unit here.
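As a rough illustration of that change, a control unit with the extra memory stage might look something like the sketch below. This is not the actual TPU unit linked above. The entity name, the port names, the enumerated state type and the two opcode values are all assumptions made for the example; only the idea of checking aluop(4 downto 1) against the read and write opcodes comes from this post.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical sketch: names and opcode values are assumptions, not the real TPU source.
entity control_unit_sketch is
  port (
    I_clk        : in  std_logic;
    I_reset      : in  std_logic;
    I_aluop      : in  std_logic_vector(4 downto 0);  -- opcode assumed in bits 4 downto 1
    O_en_fetch   : out std_logic;
    O_en_decode  : out std_logic;
    O_en_execute : out std_logic;
    O_en_memory  : out std_logic;
    O_en_regwrite: out std_logic
  );
end control_unit_sketch;

architecture Behavioral of control_unit_sketch is
  -- Placeholder encodings; substitute the values from your own constants package.
  constant OPCODE_READ  : std_logic_vector(3 downto 0) := "0110";
  constant OPCODE_WRITE : std_logic_vector(3 downto 0) := "0111";

  type stage_t is (FETCH, DECODE, EXECUTE, MEMORY, WRITEBACK);
  signal stage : stage_t := FETCH;
begin
  process(I_clk)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        stage <= FETCH;
      else
        case stage is
          when FETCH   => stage <= DECODE;
          when DECODE  => stage <= EXECUTE;
          when EXECUTE =>
            -- Only memory instructions visit the memory stage.
            if I_aluop(4 downto 1) = OPCODE_READ or
               I_aluop(4 downto 1) = OPCODE_WRITE then
              stage <= MEMORY;
            else
              stage <= WRITEBACK;
            end if;
          when MEMORY    => stage <= WRITEBACK;
          when WRITEBACK => stage <= FETCH;
        end case;
      end if;
    end if;
  end process;

  -- One enable per pipeline stage.
  O_en_fetch    <= '1' when stage = FETCH     else '0';
  O_en_decode   <= '1' when stage = DECODE    else '0';
  O_en_execute  <= '1' when stage = EXECUTE   else '0';
  O_en_memory   <= '1' when stage = MEMORY    else '0';
  O_en_regwrite <= '1' when stage = WRITEBACK else '0';
end Behavioral;
```

Non-memory instructions skip straight from execute to writeback here, so they take one cycle fewer than reads and writes. That is a choice made in this sketch, not necessarily what the real unit does.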
The Memory Subsystem
Because we now touch memory in multiple pipeline stages, we need to start routing our signals and selecting destinations depending on the current control state. There are various signal inputs that now come from multiple sources:
- Register File data input needs to be either dataResult from the ALU or, for a memory read, dataReadOutput(ramRData) from memory.
- The Instruction Decoder needs to be connected to dataReadOutput(ramRData) from memory. As the decoder only decodes during the correct pipeline stage, we don’t care that the input may be different at other times, as long as the instruction data is correct at the decode stage.
- The memory write bit needs to know when we are performing a memory write instruction, and not a read.
- Memory writes also need to assign the dataWriteInput(ramWData) port with the data we need – contents of the rB register.
- The Address sent to the memory needs to be the current PC during fetch, and dataResult when a memory operation.
We can try this without making another functional unit, by just doing some assignments in our test bench source.
    ramAddr <= dataResult when en_memory = '1' else PC;
    ramWData <= dataB;
    ramWE <= '1' when en_memory = '1' and aluop(4 downto 1) = OPCODE_WRITE else '0';
    registerWriteData <= ramRData when en_regwrite = '1' and aluop(4 downto 1) = OPCODE_READ else dataResult;
    instruction <= ramRData;

We use our existing test bench, with our additional memory system signals. We have a new test instruction stream which we have loaded into the memory which looks like this:
    signal ram: store_t := (
        OPCODE_XOR & "000" & '0' & "000" & "000" & "00",
        OPCODE_LOAD & "001" & '1' & X"0f",
        OPCODE_LOAD & "010" & '1' & X"0e",
        OPCODE_LOAD & "110" & '1' & X"0b",
        OPCODE_READ & "100" & '0' & "010" & "100" & "00",
        OPCODE_READ & "101" & '0' & "001" & "100" & "00",
        OPCODE_SUB & "101" & '0' & "101" & "100" & "00",
        OPCODE_WRITE & "000" & '0' & "001" & "101" & "00",
        OPCODE_CMP & "111" & '0' & "101" & "101" & "00",
        OPCODE_JUMPEQ & "000" & '0' & "111" & "110" & "01",
        OPCODE_JUMP & "000" & '1' & X"05",
        OPCODE_JUMP & "000" & '1' & X"0b",
        X"0000",
        X"0000",
        X"0001",
        X"0006"
    );
Which, in TPU assembly resembles:
        xor     r0, r0, r0
        load.l  r1, 0x0f
        load.l  r2, 0x0e
        load.l  r6, $fin
        read    r4, r2
    loop:
        read    r5, r1
        sub.u   r5, r5, r4
        write   r1, r5
        cmp.u   r7, r5, r5
        jaz     r7, r6
        jump    $loop
    fin:
        jump    $fin

    .loc 0x0e
        data    0x0001
    .loc 0x0f
        data    0x0006
This means we expect to see 0x0000 in the memory location 0x0f after 6 iterations of the loop. From the waveform we can see computation finishes within the simulation time. We can go into the memory view of ISim and we see the result is in the correct place.
This simulation works with one cycle of memory latency, when using our embedded RAM. If we wanted to go to an external RAM such as the DRAM on miniSpartan6+, we’d need to introduce multiple cycles of latency. For this, we should stall the pipeline whilst memory operations complete. We won’t go into that just now, as I think we need to take a step back, look at the top level view of TPU, and try to get what we have onto an FPGA.
Top level view
With everything built to date, we can see a pretty general outline of a CPU, with the various control lines, data lines, selects, etc. With this implemented as a black box ‘core’, we can try to implement our CPU in such a way that we can view a working test on actual miniSpartan6+ hardware.
Creating a top level block for FPGA hardware
The miniSpartan6+ board has 4 switches and 8 LEDs. The top-level block I created has the clock input, the 4 switch inputs and the 8 LED outputs. I still used the embedded RAM. The code within this block resembles the test bench, except there is a process for detecting when the RAM address line is 0x1000 and writing the data to the LED output pins. I use one of the switch inputs to drive the reset line, which actually doesn’t reset the CPU – it simply resets the control unit. As our registers do not get reset, execution continues once reset is deactivated with some existing state present.
The top level entity definition looks like the following:
    entity leds_switch_test_expand is
        Port ( I_clk    : in  STD_LOGIC;
               I_switch : in  STD_LOGIC_VECTOR (3 downto 0);
               O_leds   : out STD_LOGIC_VECTOR (7 downto 0));
    end leds_switch_test_expand;
And pretty much everything remains the same as the simulation test bench, except we no longer use the simulated clock, and we hack in our LED memory mapping:
    process(I_clk, O_address)
    begin
        if rising_edge(I_clk) then
            if (O_address = X"1000") then
                leds <= dataB(7 downto 0);
            end if;
        end if;
    end process;

    O_leds <= leds(7 downto 1) & I_reset;
    I_reset <= I_switch(0);
As you can see, I use the first led to indicate the state of the reset line, which is useful.
With this new top level entity, we can create a test bench and write a very small code example to write a counter to the LED memory location. The code example below simulates and we see the LED output change. I force initialize the LEDs signal to a known good value as a debugging aid.
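A minimal test bench for the top level entity could be sketched as below. The test bench name, the signal names, the run length and the assumption that the reset is active high are mine. Only the entity name and its three ports come from the definition above.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical test bench sketch for the top level entity.
entity leds_switch_test_expand_tb is
end leds_switch_test_expand_tb;

architecture sim of leds_switch_test_expand_tb is
  constant CLK_PERIOD : time := 20 ns;  -- matches the 50MHz board clock

  signal clk      : std_logic := '0';
  signal switches : std_logic_vector(3 downto 0) := "0001";  -- switch 0 holds reset
  signal leds     : std_logic_vector(7 downto 0);
  signal done     : boolean := false;
begin
  uut: entity work.leds_switch_test_expand
    port map (
      I_clk    => clk,
      I_switch => switches,
      O_leds   => leds
    );

  -- Free-running clock that stops once the stimulus process is finished.
  clk <= not clk after CLK_PERIOD / 2 when not done else '0';

  stimulus: process
  begin
    -- Assumes an active-high reset on switch 0: hold it, then release.
    wait for 5 * CLK_PERIOD;
    switches <= "0000";

    -- Let the counter program run for a while; inspect leds in the waveform.
    wait for 500 * CLK_PERIOD;

    done <= true;
    wait;
  end process;
end sim;
```

Nothing is asserted here. The LED values are simply watched in the waveform view, in the same way as the earlier simulations.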
        load.l  r0, 0x01
        load.l  r1, 0x01
        load.h  r6, 0x10
    loop:
        write   r6, r0
        add.u   r0, r0, r1
        jump    $loop
Using the miniSpartan6+ board from Windows
I use the exact same method to get the .bit programming files onto the FPGA. This needs to be done every time you power the FPGA, as it doesn’t write the flash, which would allow the FPGA design to persist across power cycles. Getting that working is for another day.
As explained in the bringup guide, we need to create a ‘User Constraints File’ which at a simple level maps the input and outputs of our entity to real pins on the board. Looking at the miniSpartan6+ schematic we can see what pins are connected where, for example LED6 is connected to the ‘location’ P7.
    NET "I_clk" PERIOD = 20 ns | LOC = "K3";

    NET "O_LEDS<0>" LOC="P11" | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
    NET "O_LEDS<1>" LOC="N9" | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
    NET "O_LEDS<2>" LOC="M9" | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
    NET "O_LEDS<3>" LOC="P9" | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
    NET "O_LEDS<4>" LOC="T8" | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
    NET "O_LEDS<5>" LOC="N8" | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
    NET "O_LEDS<6>" LOC="P8" | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;
    NET "O_LEDS<7>" LOC="P7" | IOSTANDARD=LVTTL | DRIVE=8 | SLEW=SLOW;

    NET "I_SWITCH<0>" LOC="L1" | IOSTANDARD=LVTTL | PULLUP;
    NET "I_SWITCH<1>" LOC="L3" | IOSTANDARD=LVTTL | PULLUP;
    NET "I_SWITCH<2>" LOC="L4" | IOSTANDARD=LVTTL | PULLUP;
    NET "I_SWITCH<3>" LOC="L5" | IOSTANDARD=LVTTL | PULLUP;

The PULLUP part of each I_SWITCH definition is very important. My first try at creating this file (before I found the full UCF file on github) omitted the PULLUP, which was never going to work.
Without the PULLUP, regardless of the switch position, we’ll never get logic ‘1’ at the input. The hatched box happens inside the FPGA, pulling the value to ‘1’ when the switch is not connected to ground. Which is what you want!
Now we have our UCF file done, we want to build our ‘Programming File’ which gets uploaded to our FPGA. We make our entity the top module by right-clicking it within Implementation mode and selecting the option. This unlocks the synthesis options, and we run the ‘Generate Programming File’ option. This can take some time, and will raise warnings, but it completes without error. The steps taken to generate the file are below (taken from Xilinx tutorials):
- Synthesis – ‘compiles’ the HDL into netlists and other structures.
- Translate – merges the incoming netlists and constraints into a Xilinx® design file.
- Map – fits the design into the available resources on the target device, and optionally, places the design.
- Place and Route – places and routes the design to the timing constraints.
- Generate Programming File – creates a bitstream file that can be downloaded to the device.
The first time I flashed the FPGA, I was stumped as to why the LEDs were remaining on (apart from the reset LED). Then it became obvious. The clock input is 50MHz. There is no way, with the CPU running that fast, we can see the LEDs change!
Screw telling the difference between 30 and 60fps. LEDs blinking at ~10MHz is a pretty stupid visual test
— Colin Riley (@domipheus) July 29, 2015
I solved this by adding a frequency divider into the VHDL. The 50MHz I_clk from the ‘outside world’ is slowed down using a very simple module, which basically counts and uses a bit high up the counter as an output clock. This clock output is then what’s fed into the TPU functional units such as the decoder, as the core_clock in the design. The frequency divider is as follows:
    library IEEE;
    use IEEE.STD_LOGIC_1164.ALL;
    use IEEE.NUMERIC_STD.ALL;  -- needed for unsigned arithmetic on the counter

    entity clock_divider is
        port (
            clk:       in  std_logic;
            reset:     in  std_logic;  -- currently unused by the architecture
            clock_out: out std_logic);
    end clock_divider;

    architecture Behavioral of clock_divider is
        signal scaler : std_logic_vector(23 downto 0) := (others => '0');
    begin
        process(clk)
        begin
            if rising_edge(clk) then -- rising clock edge
                scaler <= std_logic_vector( unsigned(scaler) + 1);
            end if;
        end process;

        -- a bit high up the free-running counter becomes the slow clock
        clock_out <= scaler(16);
    end Behavioral;

Using that divider, it works, and we get counting LEDs!
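As a quick sanity check on the divider, here is the resulting core clock rate, assuming the 50MHz input and the scaler(16) tap shown above. Bit n of a free-running counter toggles every 2^n input cycles, so one full output period takes 2^(n+1) input cycles:

```latex
f_{\text{core}} = \frac{f_{\text{in}}}{2^{\,16+1}}
                = \frac{50\,000\,000\ \text{Hz}}{131\,072}
                \approx 381.5\ \text{Hz}
```

That is a few hundred core clock cycles per second. How quickly the LED value changes then depends on how many cycles each instruction of the counter loop takes.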
I’ll put the full top module on github (soon!) as an example, but there is more work to be done to make it a bit more robust and to make the memory mapping genuinely mapped. At the moment, a write to the LED address still happens in the RAM as well; we just don’t care about it or break on it.
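One way the mapping could be made ‘really mapped’ is to decode the address before the write enable reaches the RAM. The sketch below is only an illustration of that idea. The entity, its ports and the LED_ADDR constant are hypothetical; only the 0x1000 address comes from this post.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical write decoder: steers a write to either the RAM or the LED register.
entity mmio_write_decode is
  port (
    I_addr   : in  std_logic_vector(15 downto 0);  -- address from the core
    I_we     : in  std_logic;                      -- write enable from the core
    O_ram_we : out std_logic;                      -- write enable seen by the RAM
    O_led_we : out std_logic                       -- write enable for the LED register
  );
end mmio_write_decode;

architecture Behavioral of mmio_write_decode is
  constant LED_ADDR : std_logic_vector(15 downto 0) := X"1000";
  signal led_sel : std_logic;
begin
  -- High only when the core is addressing the LED location.
  led_sel <= '1' when I_addr = LED_ADDR else '0';

  -- A write goes to exactly one destination.
  O_led_we <= I_we and led_sel;
  O_ram_we <= I_we and not led_sel;
end Behavioral;
```

With something like this in place, the LED process would latch dataB only when O_led_we is high, and the RAM would no longer see writes aimed at 0x1000.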
For now, it’s pretty cool to see code actually running on a TPU on the FPGA hardware. Additionally, it only uses 3% of the slice resources of the LX25 Spartan6 FPGA, so lots more space to do other things with!
Thanks for reading, comments as always to @domipheus.