This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.
It’s finally that time! I have committed the latest TPU VHDL, assembler and ISA to my github repository. Fair warning: The assembler is _horrid_. The VHDL contains a ISE project for the LX25 miniSpartan6+ variant, but the font ROM is empty. The font ROM I’ve been testing with doesn’t have a license attached and I don’t want to blindly add it here. You can, however simulate with ISim, and directly inspect the Text Ram blocks in ASCII mode to see any text displayed. I will explain that more later in the part.
The ‘BIOS’ code
So just now we have a TPU program which prints a splash message, checks for the size of the contiguous memory area from 0x0000 (slowly), and then waits for a command from the input UART. This UART is connected to the FTDI chip on port B, so appears as a COM port to a computer when the USB cable is connected to the miniSpartan6+. It only accepts a single character command just now, mainly because I have chosen the path of progress here rather than the path of lovely command-line words, which involves several string functions that honestly nobody should need to waste time writing (yes, looking at you, tech interviewers).
Getting to this point, without even writing significant code to handle a command, I realized that the code was so big (~1.5KB) that I’d have trouble fitting it into a single block ram. TASM, the Tpu ASseMbler, currently only outputs a single block ram initializer, and the VDHL duplicates that single intializer across each memory bank, so it would be a lot of work to fix all of that. I instead wanted to look at why exactly there is so much code, for such little functionality.
I edited TASM to output instruction metrics, simply a count for each operation. I then checked what was the biggest, ignoring the define word (dw) operation, as string constants were significant but not really relevant. This was the result:
So you look at that and scan for counts that don’t make any sense, there are a few that jump right out. For instance, Branch Immediate (bi) only used twice in all this code? 1400 lines of assembly and only 2 branch immediates?
When investigating the code, I realized why. The 16 bit op codes don’t leave a lot of room for immediate values, even if they are shifted to account for 2-byte alignment of instructions. To play it safe, I was instead loading into the high and low bytes of registers, and branching to a register. So 4 instructions (load.l, load.h, or, br) instead of a single branch immediate. There were also lots of function calls, and those branches were performed using Branch Register.
More puzzling was the differences in write.w and read.w count. 111 write words, versus 38 read words? Given most of the reads and writes in the code was stack manipulation for the various function calls, it seemed like there was significant overspill saving registers that were then not required to be read (basically, myself being over cautious).
So, I decided to perform two changes to the ISA:
First, I would remove the Branch Immediate instruction, replacing it with a Branch Immediate Relative Offset. This will allow for the largest possible range, and help with eventual Position Independent Code, as the immediate is a signed offset from the current program counter. Perfect for use when jumping around inside functions.
Second, I will introduce an interrupt instruction, int. This will allow for the software-defined invocation of the interrupt handler, with an immediate value being provided to the Interrupt Event Field. Then I will be able to replicate the old IBM/DOS BIOS routines – where interrupts were signaled to perform low-level tasks.
So, here are the two new instructions. They are part of ISA 1.5, which is on github:
Currently, my code sets up a stack at a known high address. This could then be reset to another location after the memory test, but isn’t done just yet. The stack is pointed to by r7, and currently I’ve defined all other registers r0-r6 to be volatile for the purpose of calling conventions. This simply means that if you need the value of a register preserved across a function call, you must save the value onto the stack before you call. I mentioned function calling and the instructions used in a previous part, but here is a reminder of how it’s done:
# Call setcursor (9, 2) subi r7, r7, 2 #reserve stack for preserve write.w r7, r0, 0 #we want to preserve r0 subi r7, r7, 6 #reserve stack for call load.l r1, 9 #2 word arguments and a return address = 6 bytes load.l r2, 2 write.w r7, r1, 2 #arg0 x=9 write.w r7, r2, 4 #arg1 y=2 spc r6 #get the current PC value addi r6, r6, 14 #offset that value past the branch below write.w r7, r6 #save the return pc load.h r0, $setcursor.h load.l r1, $setcursor.l or r0, r0, r1 #build a register containing function location br r0 #branch to the function label addi r7, r7, 6 #restore stack read.w r0, r7, 0 #read the old r0 back addi r7, r7, 2 #restore stack (preserve)
So lots of writes and bloat for every call, but it works well enough. I have return values passed back in registers. By implementing BIOS routines for these I/O functions using software interrupts, we should reduce the code size considerably.
The new int instruction allows user code to jump into the interrupt handler, with a user-defined Interrupt Event Field. The space in the immediate area of the opcode allows for 64 values, so we define Interrupt Event Field values less than 64 to be only applicable to software routines.
In the interrupt handler, all registers except r7 are immediately saved to the stack. Of course this means the stack must always have enough space for this, but assume that is true for now. If the Event Field is < 64, we use this value to get a function address from a table. We then setup a function call and branch. On return, we save r0 back to the stack, directly where it will be restored from. This allows for data to be returned from BIOS function calls. Currently, the BIOS table is as follows:
So the basics are there just now. Enough to make a test with input from UART and output to text-mode display. Note the division by zero entry – there is no hardware divide. I have a naive division function, and throws int 0‘s when division by zero is encountered.
There are obviously restrictions to this; for example, you really should not use the int instruction when inside of an interrupt handler, but it can happen, nothing stops you from doing it. Keeping to these restrictions will be tricky. I have started looking at how the hardware can disable/restore interrupt enable states and also handle when an invalid int instruction is encountered.
I went and moved all the code of my simple bios example over to this new system based on software interrupts. I also changed Branch Immediates to Branch Immediate Relative Offset, but didn’t fish out other opportunities for it significantly.
The output on screen was corrupt. data was being clobbered somewhere, so I needed to debug..
Debugging the issue
TPU does not have debugging functionality for the software running on it, but, you can use the ISIM simulator, and edit the assembly code to at least allow for some ‘what are the registers at point X’ information.
It is incredibly time consuming, but invaluable when things go wrong, which inevitably did when I tried out using the int instruction. I’ve mentioned in previous parts than in ISim you can inspect the contents of the internal Block Ram memories and other signals. This allows you to see register values, and the contents of the text ram. You can set up the data view to show ASCII, and get a representation of the characters you’d want displayed.
What I tend to do is place a branch to self such as “biro 0” where I want to inspect registers, then run the code in the simulator. I could also use the TPU emulator, but I’ve yet to implement the full set of opcodes in it yet. Doing this around various places led me to discover that the r0 register was being overwritten on execution of int instructions. The value written was always 0x0008 – which to those who have read all previous TPU parts may recognize as the interrupt vector.
The way the ALU works is that various signals are always calculated and made available, but then the relevant one is selected, usually by a signal from the decoder.
when OPCODE_SPEC_F2_INT => s_result(15 downto 0) <= ADDR_INTVEC; s_interrupt_rpc <= std_logic_vector(unsigned(I_PC) + 2); s_interrupt_register <= X"00" & "00" & I_dataIMM(7 downto 2); s_prev_interrupt_enable <= s_interrupt_enable; s_interrupt_enable <= '0'; s_shouldBranch <= '1';
The instruction is implemented in the ALU simply – like a standard branch. If s_shouldBranch is ‘1’, the PC unit takes the next PC from the output of the ALU, which is s_result(15 downto 0) – the Interrupt Vector ADDR_INTVEC (0x0008).
Looking back at the int instruction layout:
You can see that the bits where a destination register should be (11-8) in an Rd-form instruction are all 0. The issue we have is that register r0 is having 8 written to it after an int instruction, and in the same way a branch uses the output of the ALU, if a register write is requested, the same value – the output of the ALU – will be used as the source data.
Looking at the decoder, I saw that I’d forgot to add the new instruction. I’d only added it to the ALU block. The decoder sets the write enable signal for the destination register:
case I_dataInst(IFO_F2_BEGIN downto IFO_F2_END) is when OPCODE_SPEC_F2_GETPC => O_regDwe <= '1'; when OPCODE_SPEC_F2_GETSTATUS => O_regDwe <= '1'; when others => end case;
With O_regDwe not being set in the event of the OPCODE_SPEC_F2_INT case, it would preserve it’s previous state. Before the int instruction as usually an immediate load into a register – meaning the regDwe signal would be left enabled, and produce the exact incorrect behavior I saw.
case I_dataInst(IFO_F2_BEGIN downto IFO_F2_END) is when OPCODE_SPEC_F2_GETPC => O_regDwe <= '1'; when OPCODE_SPEC_F2_GETSTATUS => O_regDwe <= '1'; when OPCODE_SPEC_F2_INT => O_regDwe <= '0'; when others => end case;
Quickly adding the case for our new INT instruction, things worked perfectly.
New instruction stats
Now that we have the int instruction, I collated the number of instructions for the same functions, and got the expected reduction in code size.
Finding more bugs
It’s expected that as the amount of TPU code I write grows, the more issues I find in various parts of it’s implementation. Some are obvious bugs, whilst others may be hazards in the CPU itself. One such bug was in the Memory Interrupt Process, only introduced in the last part. It only ever fired once, due to a check for the state-machine state being in the wrong place:
... if rising_edge(cEng_clk_core) and MEM_access_int_state = 0 then if MEM_Access_error = '1' then I_int <= '1'; ...
The check for MEM_access_int_state should have been after the clock edge check:
... if rising_edge(cEng_clk_core) then if MEM_Access_error = '1' and MEM_access_int_state = 0 then I_int <= '1'; ...
With this edit in place, multiple memory violation interrupt requests can take place. This is required for some code that I wrote which displays the ‘memory map’ of the system. A read is performed at each 2KB boundary, and if an interrupt violation is encountered it is assumed the whole bank/block is unmapped. This continues until all 32 banks are checked – and displays this to the screen as it happens.
- 0x0000 – 0x3FFF: RAM
- 0x9000 – 0x97FF: Memory mapped I/O
- 0xA000 – 0xAFFF: Font Ram
- 0xB000 – 0xBFFF: Text Ram
- 0xC000 – 0xEFFF: VRAM
Getting back to the start of this post, where we discussed simple command execution, this memory map prints out if the ‘m’ command is entered through the UART. There are two other commands currently, ‘l’ for lighting all LEDs and ‘d’ for darkening them.
There is also a current issue where UART input seems to generate multiple instances of the same character from my read function. I’ve yet to look at that problem in detail just yet.
So in this part, we’ve investigated some ‘BIOS’ code, modified the branch immediate instruction, added a software interrupt instruction, and fixed some issues in their implementation.
At the moment, I’m finding the hardest part of TPU now bad tooling. TASM for assembling code was good and served its purpose when I needed a quick fix for generating small tests. However, now that I’m thinking of sending program code over UART which is ‘compiled’ as position independent code, TASM is showing significant weaknesses. I’ll need to get around that somehow.
For now that’s the end of this part. Many thanks for reading. As always, if you have any comments of queries, please grab me on twitter @domipheus.