This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend reading them before continuing.
Part 10 was supposed to be a very big part, with a special surprise of TPU working with a cool peripheral device, but that work is still ongoing. It’s taking a long time, mostly because I’ve been busy over the past few weeks. In this update, however, I’ll look at bringing interrupts to TPU, as well as fixing an issue with the embedded RAM that was bloating the synthesized design.
Interrupts are needed on a CPU which is expected to work with multiple asynchronous devices whilst also doing some other computation. You can always have the CPU poll, but that isn’t always wise or suitable given other constraints. It’s also good for keeping time with something – vsync, for example. This is where interrupts come in – a signal fed to the CPU externally can “interrupt” what the CPU is currently executing, and perform some other computation before returning to its previous task.
The way I have implemented the interrupts is similar to the Z80 maskable interrupts, with an external interrupt input and an interrupt acknowledge output. The system is simplified and doesn’t have the different modes and non-maskable interrupts available on the Z80, but it should be enough for the needs of TPU. You can only handle a single request at a time, and there is only one mode to work with – but it’s powerful enough for most situations.
An overview of how the interrupts work is as follows:
- At some point during execution, the system will make the interrupt input to TPU high, indicating it wants the interrupt handler run.
- At the next writeback stage of the pipeline, just before migrating to the fetch stage, the interrupt input is sampled.
- If an interrupt is requested, the control unit will then make the interrupt acknowledge output from TPU active.
- Once the interrupt ACK signal is seen externally to TPU, 16-bits of data can be placed on the data input to TPU.
- After a predetermined number of cycles, the bits on the data in bus are stored.
- The ACK is de-asserted, and the PC of TPU is set to the interrupt handler.
- The handler can retrieve the data from the data bus via a new instruction, and also return to the previous PC before the interrupt was acknowledged.
- The external interrupt input is latched: holding it active will not invoke the handler a second time until the line has gone inactive for at least one cycle.
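The latching rule in that last step can be modelled with a few lines of Python. This is only a sketch of the behaviour described above, not the actual VHDL; the class and method names are made up for illustration.

```python
class InterruptLatch:
    """Models the latched interrupt input: a held-high request line
    fires the handler once, and must be seen inactive for at least
    one cycle before it can fire again."""

    def __init__(self):
        self.armed = True  # re-armed only after an inactive cycle

    def sample(self, irq_line):
        """Called once per writeback stage; returns True if the
        handler should be invoked this time around."""
        fire = irq_line and self.armed
        if irq_line:
            self.armed = False   # latched: ignore until the line drops
        else:
            self.armed = True    # an inactive cycle re-arms the latch
        return fire
```

For example, sampling the line over the pattern high-high-high-low-high fires the handler only on the first and last samples.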
It’s very important that the interrupt input is only acted upon during the end of the writeback stage. Doing it at any other point can result in an inconsistent execution state, whereby we do not know if the current instruction has executed to completion. Doing the interrupt at the end of a writeback means:
- the PC we save (to return to later) is already the ‘next’ PC, be that prev_pc+2, or a branch target;
- memory reads have had time to complete successfully; and
- any registers have had time to see and act upon write enable signals to store data.
The items that are needed, therefore, are:
- Internal registers for the stored PC (to return to after interrupt handler), the interrupt data field passed on the data in bus, and an interrupt enable bit
- Various connections between the parts of the sub-modules for handling storing of the PC and interrupt data
- Control unit additions for the interrupt handler step
- New instructions for getting interrupt data and returning from an interrupt
Internal registers & Connections
I added a 16-bit register for the ‘next PC’ and also the ‘interrupt data’ to the ALU itself, rather than adding it to the register file. There are individual set/write control lines and also data lines for them into the ALU. It’s a bit messy and adds a lot of ports to the ALU and control unit, but it worked and I can change this later if I want to tidy things up. Having the registers part of the ALU makes the instructions that access them incredibly simple and self contained.
Control unit additions
The control unit now has an interrupt state, all of the control signals for setting the registers in the ALU, and the logic for managing the phases of calling into the interrupt handler. If interrupts are enabled, the interrupt input is active, and it is the end of the writeback phase, the following occurs:
- Interrupt_ack is activated
- A cycle of latency is provided
- The bits on the data in bus are sampled and the ALU instructed to store this value
- The current PC (which is, at this point, the next instruction to execute) is saved by the ALU
- The PC unit sets the current PC to the interrupt vector, currently fixed at 0x0008.
- The control unit resets its interrupt state, and proceeds to the fetch stage of the pipeline.
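The sequence above can be sketched as a small cycle-by-cycle trace. This is a Python pseudo-model, not the real control unit: the signal names and the `device_bus` callback (standing in for the external device driving the bus once it sees ACK) are illustrative.

```python
def interrupt_entry_trace(next_pc, device_bus, vector=0x0008):
    """Trace the control unit's interrupt phases: raise ACK, wait a
    cycle of latency, sample the data-in bus, save the 'next' PC,
    and redirect the PC to the fixed interrupt vector."""
    trace = []
    trace.append({"ack": 1})                       # interrupt_ack activated
    trace.append({"ack": 1, "note": "latency"})    # one cycle for the device
    ief = device_bus(True)                         # sample the data-in bus
    trace.append({"ack": 1, "ief": ief})           # ALU stores the event field
    trace.append({"ack": 0,                        # ACK dropped
                  "saved_pc": next_pc,             # PC already points at the
                  "pc": vector})                   # next instruction; jump to
    return trace, ief                              # the vector (0x0008)
```

Running it with a device that drives 0x014F once ACK is high yields a four-entry trace ending with the PC at 0x0008 and the original next-PC saved.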
At the moment, interrupts are not disabled automatically when the handler is invoked, so the first instruction must be a disable interrupt instruction.
There are four new instructions used to manage and handle interrupts: di (disable interrupts), ei (enable interrupts), gief (get interrupt event field) and bbi (branch back from interrupt).
The Get Interrupt Event Field instruction transfers the value that was on the data bus at the time of an interrupt acknowledge into a register for further use. Using this value, we can work out what caused the interrupt and perform further actions from that point. As an example, when used with a UART, the interrupt data field could contain the UART identifier in the high 8 bits, and the byte of data which was received in the lower 8 bits.
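Under that convention (a layout I’m assuming here for illustration: UART id in the high byte, received data in the low byte), decoding the event field is just a shift and a mask:

```python
def decode_uart_event(ief):
    """Split a 16-bit interrupt event field into (uart_id, data_byte),
    assuming the id occupies the high 8 bits and the received byte
    the low 8 bits."""
    return (ief >> 8) & 0xFF, ief & 0xFF
```

For the event seen later in the simulator walkthrough, 0x014F decodes to UART 1 and the byte 0x4F.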
The interrupt vector
The interrupt vector is fixed at address 0x0008. The shape of the interrupt handler should be something like the following:
- Disable interrupts
- Save all registers
- Get the interrupt event data field
- Perform an action according to the interrupt event field, or add the field data to a queue for later processing
- Restore all registers
- Enable interrupts
- Branch back to ‘normal’ code
Saving the registers can be done by pushing them to the current stack and restoring them before returning from the handler. I’ve been using r7 as a ‘standard’ stack pointer in our very ad-hoc ABI spec, so this can be done. This does use the user stack, though, so it needs to be taken into account if stack space is a particular concern.
There are a few issues that could occur, mainly in the timing between disabling and enabling interrupts. A new interrupt could be pending when the enable interrupts instruction is processed, and that interrupt would then be accepted before the bbi instruction branches back. This would destroy the PC value saved when the original interrupt was raised, so I will probably change things around. There are a few solutions to this, one being that interrupts are by definition disabled when the branch to the interrupt vector occurs, and a bbi instruction then implicitly turns interrupts on again. I’ll need to have a think about the best course of action for this.
The makeup of the test interrupt routines I’ve used is like the following (snipped for clarity):
entry:
    load.h r7, 0x08
    subi r7, r7, 4
    bi $start
    dw 0x0000

intvec:                 # interrupt vector 0x8
    di
    # save the registers
    gief r0
    # inspect r0 for interrupt type
    # branch to some other work
    # restore the registers
    ei
    bbi

start:
    load.l r0, 0
    ...
The interrupt handler, whilst a bit messy in its implementation, works well in simulation. I’ve yet to use it when TPU is running on the FPGA with an external source, but I do not foresee many issues other than the one stated above.
A Look in the simulator
The above waveform is showing an interrupt being flagged on a UART receive event, the event field containing the UART ID (1) and the byte value received (0x4f). Walking through the waveform, we get the following:
- The UART has received a byte and signaled this.
- An interrupt is immediately raised.
- Several cycles later, the ACK is signaled by the CPU.
- The interrupt event field (IEF) data is placed on the data in bus after a cycle of delay.
- The ACK is de-asserted, and the IEF is removed from the data in bus and saved internally (to be used later via the gief instruction).
- The CPU branches to the interrupt vector 0x0008, requesting the instruction from memory
The internal RAM
I mentioned previously that the design resources had shot up, and it turns out this is mainly due to the internal RAM not being synthesized as a block RAM. I was getting an internal compiler error in the Xilinx toolchain when building the existing RAM with a larger capacity (I think it was 512 bytes at this point), and to counter this I re-implemented the RAM in another way. The way I did it, though, added an asynchronous element which in turn forced the toolchain to implement the RAM via look-up tables instead of utilizing the block RAM. This is why there was a jump in resource requirements when using the Spartan6.
I could not get around the internal compiler error without an async element, so off to the documentation for the Spartan6 I went. It turns out there is a document specifically on the block RAMs available on the device I have.
The block RAMs are used by initializing a generic object in VHDL with various constants, and then interfacing with the ports that object exposes. There are two kinds of block RAM available, but I decided to use the 18-kilobit, dual-port one: RAMB16BWER. It is made up of 16 Kb for data and 2 Kb for parity. ISE has a nice template library for instantiation of primitives, and the block RAM I use is included. It can be found within Edit->Language Templates, and then within VHDL->Device Primitives->Spartan6->RAM/ROM.
Despite having the existing integrated RAM address bytes explicitly, I decided against that with the block RAM and instead addressed 16-bit values. To the TPU programmer, it still addresses bytes, but internally, it’s really stored as 16-bit, 2-byte blocks. The main reason for this was latency and complexity. By addressing 16-bit values internally in the block RAM, I can implement both 16-bit reads/writes and 8-bit reads/writes using a single port. The RAMB16BWER has a byte-wise write enable, so I can write either the high or low 8 bits of a memory location internal to the block RAM, leaving the other half untouched. There is one issue that arises from this method – an unaligned 16-bit read/write (i.e., the address being odd) will result in incorrect behavior. At the moment nothing happens if you try this, but I intend to add a trap/exception. I could maybe invoke the interrupt handler with a known interrupt event field value to specify an unaligned memory operation.
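The byte-lane scheme just described can be sketched in Python. This is a simplified model of the behaviour, not the VHDL itself: which lane holds which byte (endianness) is abstracted away here, whereas the real implementation further down swaps lanes on the way in and out, and the class/method names are made up.

```python
class BlockRamModel:
    """Byte-addressable interface over 16-bit-wide storage: each word
    holds two byte lanes, and a byte write touches only one lane,
    leaving the other untouched. 16-bit accesses must be aligned."""

    def __init__(self, words=1024):
        self.mem = [[0, 0] for _ in range(words)]  # [low lane, high lane]

    def write_byte(self, addr, value):
        word, lane = addr >> 1, addr & 1           # lane = address LSB
        self.mem[word][lane] = value & 0xFF        # other lane untouched

    def read_byte(self, addr):
        word, lane = addr >> 1, addr & 1
        return self.mem[word][lane]

    def write_word(self, addr, value):
        assert addr & 1 == 0, "unaligned 16-bit access"  # would trap on TPU
        self.mem[addr >> 1] = [value & 0xFF, (value >> 8) & 0xFF]

    def read_word(self, addr):
        assert addr & 1 == 0, "unaligned 16-bit access"
        lo, hi = self.mem[addr >> 1]
        return (hi << 8) | lo
```

Writing a word and then overwriting only its odd byte shows the half-word write enable at work: the even byte survives.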
There were several gotchas I encountered whilst trying the block RAM with a testbench. The addressing scheme, first of all, was confusing. As the generic component was initialized for 16-bit addressing (18-bit when you include parity), I assumed it would transform the address itself into the correct form. This did not seem to be the case after running the testbench. The documentation has a table of mappings and also a formula, but in the end it only took a few minutes of inspection in the simulator to work out what was happening.
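For reference, the mapping that came out of that inspection, as wired up in the VHDL later in the post (ADDRA <= I_addr(10 downto 1) & "0000" for the 18-bit data-width configuration), amounts to this one-liner:

```python
def tpu_addr_to_addra(byte_addr):
    """Translate a TPU byte address into the 14-bit ADDRA value for an
    18-bit-wide RAMB16BWER port: the 16-bit word index (byte address
    divided by 2) sits in the top 10 bits, with the low 4 bits zero."""
    return ((byte_addr >> 1) & 0x3FF) << 4
```

So consecutive 16-bit words land 0x10 apart on the address port, which is exactly the sort of thing that is easier to spot in a waveform than to infer from the formula.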
The next issue was a rather silly affair! The initialization attributes for the block RAM are in most-significant to least-significant order. Due to this, 16-bit instructions need to be byte-flipped when read in the code, and they also run from right to left along the initialization attribute.
-- BEGIN TASM RAMB16BWER INIT OUTPUT INIT_00 => X"06831180E27F00300000004F4C4C454801E102E100EF03E100000CC1E91E088E",
Maps to the instruction forms (only first 3 instructions shown):
X"8E", X"08", -- 0000: load.h r7 0x08
X"1E", X"E9", -- 0002: subi r7 r7 4
X"C1", X"0C", -- 0004: bi 0x0018
...snip...
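The flipping can be reproduced with a small helper along the lines of what TASM does (a Python sketch, not the actual C# code): each 32-byte row of assembled output is emitted in reverse byte order, two hex digits per byte, so the byte at the lowest address ends up as the last two digits of the attribute.

```python
def make_init_attribute(row):
    """Build one 64-hex-digit INIT_xx attribute value from a 32-byte
    row of assembled output, in most-significant-first order: the
    byte at the lowest address becomes the final two hex digits."""
    assert len(row) == 32, "each INIT_xx attribute covers 32 bytes"
    return "".join(format(b, "02X") for b in reversed(row))
```

Feeding it the first six bytes of the listing above (8E 08 1E E9 C1 0C, zero-padded to 32 bytes) produces a string ending in "0CC1E91E088E", matching the tail of the INIT_00 attribute shown earlier.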
I will not admit the amount of time spent trying to figure out the issue of byte flipping in the initialization attribute 😉
The least significant digit of the address, specifying the high/low byte of the 16-bit memory location, is managed in the VHDL process. I’ve put that process (and other relevant signal operations) below for clarity. It’s a large block of text even without some of the less important generic attributes/initializations, which I have omitted.
RAMB16BWER_inst : RAMB16BWER
generic map (
   -- DATA_WIDTH_A/DATA_WIDTH_B: 0, 1, 2, 4, 9, 18, or 36
   DATA_WIDTH_A => 18,
   DATA_WIDTH_B => 18,
   ...snip...
   -- SIM_COLLISION_CHECK: Collision check enable "ALL", "WARNING_ONLY", "GENERATE_X_ONLY" or "NONE"
   SIM_COLLISION_CHECK => "ALL",
   -- SIM_DEVICE: Must be set to "SPARTAN6" for proper simulation behavior
   SIM_DEVICE => "SPARTAN6",
   ...snip...
)
port map (
   -- Port A Data: 32-bit (each) output: Port A data
   DOA => DOA,       -- 32-bit output: A port data output
   DOPA => DOPA,     -- 4-bit output: A port parity output
   -- Port B Data: 32-bit (each) output: Port B data
   DOB => DOB,       -- 32-bit output: B port data output
   DOPB => DOPB,     -- 4-bit output: B port parity output
   -- Port A Address/Control Signals: 14-bit (each) input: Port A address and control signals
   ADDRA => ADDRA,   -- 14-bit input: A port address input
   CLKA => CLKA,     -- 1-bit input: A port clock input
   ENA => ENA,       -- 1-bit input: A port enable input
   REGCEA => REGCEA, -- 1-bit input: A port register clock enable input
   RSTA => RSTA,     -- 1-bit input: A port register set/reset input
   WEA => WEA,       -- 4-bit input: Port A byte-wide write enable input
   -- Port A Data: 32-bit (each) input: Port A data
   DIA => DIA,       -- 32-bit input: A port data input
   DIPA => DIPA,     -- 4-bit input: A port parity input
   -- Port B Address/Control Signals: 14-bit (each) input: Port B address and control signals
   ADDRB => ADDRB,   -- 14-bit input: B port address input
   CLKB => CLKB,     -- 1-bit input: B port clock input
   ENB => ENB,       -- 1-bit input: B port enable input
   REGCEB => REGCEB, -- 1-bit input: B port register clock enable input
   RSTB => RSTB,     -- 1-bit input: B port register set/reset input
   WEB => WEB,       -- 4-bit input: Port B byte-wide write enable input
   -- Port B Data: 32-bit (each) input: Port B data
   DIB => DIB,       -- 32-bit input: B port data input
   DIPB => DIPB      -- 4-bit input: B port parity input
);
-- End of RAMB16BWER_inst instantiation

-- todo: assertion on non-aligned 16b read?
CLKA <= I_clk;
CLKB <= I_clk;
ENA  <= I_cs;
ENB  <= '0'; -- port B unused
ADDRA <= I_addr(10 downto 1) & "0000";

process (I_clk, I_cs)
begin
   if rising_edge(I_clk) and I_cs = '1' then
      if (I_we = '1') then
         if I_size = '1' then -- 1 byte
            if I_addr(0) = '1' then
               WEA <= "0010";
               DIA <= X"0000" & I_data(7 downto 0) & X"00";
            else
               WEA <= "0001";
               DIA <= X"000000" & I_data(7 downto 0);
            end if;
         else
            WEA <= "0011";
            DIA <= X"0000" & I_data(7 downto 0) & I_data(15 downto 8);
         end if;
      else
         WEA <= "0000";
         WEB <= "0000";
         if I_size = '1' then
            if I_addr(0) = '0' then
               data(15 downto 8) <= X"00";
               data(7 downto 0)  <= DOA(7 downto 0);
            else
               data(15 downto 8) <= X"00";
               data(7 downto 0)  <= DOA(15 downto 8);
            end if;
         else
            data(15 downto 8) <= DOA(7 downto 0);
            data(7 downto 0)  <= DOA(15 downto 8);
         end if;
      end if;
   end if;
end process;

O_data <= data when I_cs = '1' else "ZZZZZZZZZZZZZZZZ";
The last thing to do was to add another output file generator to TASM, my C# TPU assembler. This simply outputs the whole 2KB initialization table for the input assembly, which is then copy/pasted into the VHDL at the appropriate attribute location.
That’s it for this part. I really hope to have the next part with TPU talking to a peripheral device (and some changes to the ISA) in the next week or two. Fingers crossed!
Thanks for reading, comments as always to @domipheus.