Porting my VHDL Character Generator to Spartan3: Reducing clock speeds and pipelining

This is an article on porting my VHDL character generator from a Xilinx Spartan6 device to one with a Spartan3. It starts off as a simple port, analyzing device primitive differences and accounting for them in the design. Along the way, there were considerations on how clocks were generated, characteristics of block ram timing, and general algorithmic design. I’ll assume you’ve read the sections of my Designing a CPU in VHDL series specifically detailing the implementation of the character generator.

Reading time: 10 minutes

When I first attempted to synthesize my TPU CPU Core design on to the miniSpartan3 developer board (made by the great folks at Scarab Hardware), the bulk of the code went without a hitch. The processor core itself contains no primitive parts specific to a single vendor. However, the rest – Block Rams, Clock Generators – used instantiations of specific device primitives. These are different from family to family of FPGAs and those are where the most thought and investigation is needed, as changes can have knock-on impacts to operations further along the device path.

High level device differences are fairly minimal. On the board itself, we have a 32MHz clock input on the miniSpartan3 board instead of the 50MHZ input on the miniSpartan6 setup. So we will need to change ratios of how we generate pixel and ram clocks for the DVI-D/HDMI video output. The FTDI chip for serial communication is similar and a communications channel is connected to the FPGA. We will need to change the constraints file for the pin definitions as well, but that’s always expected.

Clocks

The Spartan6 TPU design utilizes multiple clocks:

  • Base Clock 50MHz
  • CPU Core 100MHz
  • Pixel 25MHz
  • 5x Pixel 125MHz
  • 5x Pixel Inverted 125 MHz
  • Char/Text Clock 250MHz

The CPU Core and Read/Write Port A’s of all Block Rams use the CPU Core clock. The UART Baud Generator uses the Base Clock. The VGA/Graphics signal generator uses the Pixel clock. The TMDS/DVI-D/HDMI encoders and output buffers use the 5x Pixel and 5x Pixel Inverted clocks. The Character generator, and relevant Port B block rams (Text and Font Rams) use the Char/Text Clock.

When porting to Spartan3, we need to use Digital Clock Managers (DCMs) instead of the Phase Locked Loops (PLLs) on Spartan6. The interface to DCMs is considerably different, but the base terms remain and you can understand what needs to change in the design without much thought.

s6_pll

One of the main issues is that DCMs have much less outputs than the PLLs. On the Spartan6 implementation, a single PLL primitive is used to drive all of the different clocks require. On Spartan3, we will need a DCM for each frequency.

s3_dcm

Due to this, we will require 3 DCM objects. Our Spartan3 chip XC3S200A only has 4 in total, so we are using a significant amount of resources to generate these clocks. However, we do have the available DCMs to get started immediately.

The DCMs themselves have multiple configurations to set up. We use the clock synthesizer(DFS) to get our 25MHz pixel clock from our 32MHz input. The maximum rangers for the DFS is outlined in the Spartan-3A datasheet.

dfs_limits

To generate the pixel clock, we multiply our 32MHz input by 15 to 480MHz then divide by 19 to get 25.2MHz.

-- 32MHz -> ~25MHz
DCM_SP_inst : DCM_SP
generic map (
  CLKDV_DIVIDE => 2.0, --  Divide by: 1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,6.0,6.5
                       --     7.0,7.5,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0 or 16.0
  CLKFX_DIVIDE => 19,         --  Can be any interger from 1 to 32
  CLKFX_MULTIPLY => 15,       --  Can be any integer from 1 to 32
  CLKIN_DIVIDE_BY_2 => FALSE, --  TRUE/FALSE to enable CLKIN divide by two feature
  CLKIN_PERIOD => 32.0,       --  Specify period of input clock
  CLKOUT_PHASE_SHIFT => "NONE", --  Specify phase shift of "NONE", "FIXED" or "VARIABLE" 
  CLK_FEEDBACK => "1X",         --  Specify clock feedback of "NONE", "1X" or "2X" 
  DESKEW_ADJUST => "SYSTEM_SYNCHRONOUS", -- "SOURCE_SYNCHRONOUS", "SYSTEM_SYNCHRONOUS" or
                                         --     an integer from 0 to 15
  DLL_FREQUENCY_MODE => "LOW",     -- "HIGH" or "LOW" frequency mode for DLL
  DUTY_CYCLE_CORRECTION => TRUE,   --  Duty cycle correction, TRUE or FALSE
  PHASE_SHIFT => 0,        --  Amount of fixed phase shift from -255 to 255
  STARTUP_WAIT => FALSE)   --  Delay configuration DONE until DCM_SP LOCK, TRUE/FALSE
port map (
  CLK0 => CLK0,     -- 0 degree DCM CLK ouptput
  CLK180 => CLK180, -- 180 degree DCM CLK output
  CLK270 => CLK270, -- 270 degree DCM CLK output
  CLK2X => open,    -- 2X DCM CLK output
  CLK2X180 => open, -- 2X, 180 degree DCM CLK out
  CLK90 => open,    -- 90 degree DCM CLK output
  CLKDV => open,    -- Divided DCM CLK out (CLKDV_DIVIDE)
  CLKFX => clock_pixel_unbuffered,   -- DCM CLK synthesis out (M/D)
  CLKFX180 => CLKFX180, -- 180 degree CLK synthesis out
  LOCKED => LOCKED, -- DCM LOCK status output
  PSDONE => PSDONE, -- Dynamic phase adjust done output
  STATUS => open,   -- 8-bit DCM status bits output
  CLKFB => CLKFB,   -- DCM clock feedback
  CLKIN => clk32_buffered,   -- Clock input (from IBUFG, BUFG or DCM)
  PSCLK => open,    -- Dynamic phase adjust clock input
  PSEN => open,     -- Dynamic phase adjust enable input
  PSINCDEC => open, -- Dynamic phase adjust increment/decrement
  RST => '0'        -- DCM asynchronous reset input
);
	

Block Rams

The block rams on Spartan3 are very similar to the Spartan6 counterparts. They do have different characteristics in terms of timings and therefore maximum operating frequency. The Block Ram primitives on my Spartan3 are not rated for the ~260MHz that those on the Spartan6 run at – so there will be changes required to the Character generator as to account for additional latency in the memory operations.

UART

Thankfully, Xilinx provide the PicoBlaze UART objects for Spartan3 a they do for Spartan6, so there was very little work required in porting these over, apart from using different library objects. The Baud clock routine was changed to strobe correctly using the 32MHz base clock instead of 50MHz on the miniSpartan6. That was the only significant change here.

Differential Signalling Buffers

The OBUFDS output buffers used before can be used on Spartan3, along with the ODDR2 Double Data Rate registers for generating the 10x HDMI signalling.

Character Generator

Most work was on the Character Generator. This was due to the base algorithm of the system needing slight amendments to account for the increased memory latencies and slower clocks. However, I think it’s useful to see what things can happen if we ignore all of that for a second, and simply ‘blind port’ the system ignoring the rated maximum frequencies, just to see what happens.

In fact, the blind port was my first attempt. And this was the result:

There are a few points of interest that you can take from this footage. I’ve singled out a frame to identify them easier.

corruption

  1. The Colour are correct for the areas they should be
  2. The Glyphs seem to be correct
  3. The corruption while random occurs in X directions, as there are bars which are consistent across character locations.

If we look again at the state diagram for the character generator:

text_mode_diagram

As we can tell that the character along with the colour is correct, we know it’s not data corruption in transfer. However, the vertical banding is occurring at the start of the character, indicating the glyph row data is not getting to the system in time.

blockrammhz

In this situation I went to the datasheets and application notes to find maximum frequency ratings for the clock synthesizers and block rams. The 250MHz char/pixel clock is well within specification for generation, but the block rams are only rated for 200MHz. Instead of attempting to redesign the character generator to run off a new slightly slower clock (200MHz), I started modifying it so that it would operate correctly at the 5x pixel serialization clock – as this would free up another DCM object and reduce our utilization from 3 to 2.

The way I started with this problem was to delay the character generator by a single pixel, allowing to pipeline the memory requests up over two pixels instead of one. This would then give us the 10 sub-cycles per pixel.

pipeline_diagram

Table A) shows how 10 sub-cycles are required, table B) shows how they would fit together into a pipelined 5 sub-cycle state machine and C) shows that optimized, as certain stages need only occur when you crossover into a new character. The latching and fetching of the glyph data is idempotent and does not incur additional costs, as the tram data which derives the addresses for glyph rows is only fetched on each 8-pixel character transition.

My first implementation of this seemed to work well enough, apart from there being a duplicated pixel in the glyph.

lastorfirstduplicated

It is harder than you’d think to tell whether this duplicated pixel was at the start or end of glyph processing, so I forced the background colour of the screen to flip-flop between character locations as they were output, allowing you to see the specific zone that a glyph should reside within.

bars_debug

From here, I could tell it was the first pixel which was causing the problem. The first one encapsulates all memory requests – with the further 7 pixels in a glyph row only utilizing a cached version of the data, so from here it was time to go into the simulator and look at some internal signals.

In the Simulator

The first thing to notice was there was disparity between the pixel x coordinate and the actual pixel/5x pixel clocks. Due to the pixel operations being driven from the 5x clock, there could be instances where mid-request there was changes in coordinates. The way to fix this was to have a process driven off of the pixel clock, which then latched the X and Y coordinates, which then the various other logic driven from the 5x clock could utilize.

latching_coords_2

You can see in the above waveform the issue clearly. At (a) we kick off a new initial glyph row request (State 1 is only ever entered on the first pixel of a character row). If we do not latch the coordinate, half way though the request at (b) we could have the coordinate flip.

Since we already had a process running from the pixel clock to manage the blinker flags, this was a simple addition.

  -- This process latches the X and Y on the pixel clock
  -- Also manages the blinker.
  process(I_clk_pixel)
  begin
    if rising_edge(I_clk_pixel) then
      blinker_count <= blinker_count + 1;
      x_latched <= I_x;
      y_latched <= I_y;
    end if;
  end process;

The last issue was a really irritating one. Irritating as it was a very basic bug in my code. A simple state < 4 check should have been <= 4, meaning the 4 state prolonged an additional cycle, throwing the first pixel off. Easily fixed, and easily spotted in the simulator.

good

The last thing to do was to also try it on my miniSpartan6+ project, and it worked first time – which is great 🙂

Wrap Up

We now have the character generator running off of a 5x pixel clock, with the font/text ram read ports also running at that slower clock. As well as allowing us to run to the Spartan3 FPGA specs of the device I have, it will additionally allow for higher resolutions in the future – especially on the Spartan6 variant.

Thanks for reading, let me know what you think on Twitter @domipheus.

Designing a CPU in VHDL, Part 14: ISA changes, software interrupts and bugfixing that BIOS code

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

It’s finally that time! I have committed the latest TPU VHDL, assembler and ISA to my github repository. Fair warning: The assembler is _horrid_. The VHDL contains a ISE project for the LX25 miniSpartan6+ variant, but the font ROM is empty. The font ROM I’ve been testing with doesn’t have a license attached and I don’t want to blindly add it here. You can, however simulate with ISim, and directly inspect the Text Ram blocks in ASCII mode to see any text displayed. I will explain that more later in the part.

The ‘BIOS’ code

So just now we have a TPU program which prints a splash message, checks for the size of the contiguous memory area from 0x0000 (slowly), and then waits for a command from the input UART. This UART is connected to the FTDI chip on port B, so appears as a COM port to a computer when the USB cable is connected to the miniSpartan6+. It only accepts a single character command just now, mainly because I have chosen the path of progress here rather than the path of lovely command-line words, which involves several string functions that honestly nobody should need to waste time writing (yes, looking at you, tech interviewers).

Getting to this point, without even writing significant code to handle a command, I realized that the code was so big (~1.5KB) that I’d have trouble fitting it into a single block ram. TASM, the Tpu ASseMbler, currently only outputs a single block ram initializer, and the VDHL duplicates that single intializer across each memory bank, so it would be a lot of work to fix all of that. I instead wanted to look at why exactly there is so much code, for such little functionality.

I edited TASM to output instruction metrics, simply a count for each operation. I then checked what was the biggest, ignoring the define word (dw) operation, as string constants were significant but not really relevant. This was the result:

before_instcount_pie

So you look at that and scan for counts that don’t make any sense, there are a few that jump right out. For instance, Branch Immediate (bi) only used twice in all this code? 1400 lines of assembly and only 2 branch immediates?

When investigating the code, I realized why. The 16 bit op codes don’t leave a lot of room for immediate values, even if they are shifted to account for 2-byte alignment of instructions. To play it safe, I was instead loading into the high and low bytes of registers, and branching to a register. So 4 instructions (load.l, load.h, or, br) instead of a single branch immediate. There were also lots of function calls, and those branches were performed using Branch Register.

More puzzling was the differences in write.w and read.w count. 111 write words, versus 38 read words? Given most of the reads and writes in the code was stack manipulation for the various function calls, it seemed like there was significant overspill saving registers that were then not required to be read (basically, myself being over cautious).

So, I decided to perform two changes to the ISA:

First, I would remove the Branch Immediate instruction, replacing it with a Branch Immediate Relative Offset. This will allow for the largest possible range, and help with eventual Position Independent Code, as the immediate is a signed offset from the current program counter. Perfect for use when jumping around inside functions.

Second, I will introduce an interrupt instruction, int. This will allow for the software-defined invocation of the interrupt handler, with an immediate value being provided to the Interrupt Event Field. Then I will be able to replicate the old IBM/DOS BIOS routines – where interrupts were signaled to perform low-level tasks.

So, here are the two new instructions. They are part of ISA 1.5, which is on github:

int_biro_defns

Calling conventions

Currently, my code sets up a stack at a known high address. This could then be reset to another location after the memory test, but isn’t done just yet. The stack is pointed to by r7, and currently I’ve defined all other registers r0-r6 to be volatile for the purpose of calling conventions. This simply means that if you need the value of a register preserved across a function call, you must save the value onto the stack before you call. I mentioned function calling and the instructions used in a previous part, but here is a reminder of how it’s done:

# Call setcursor (9, 2)

subi    r7, r7, 2             #reserve stack for preserve
write.w r7, r0, 0             #we want to preserve r0
  
subi    r7, r7, 6             #reserve stack for call
load.l  r1, 9                 #2 word arguments and a return address = 6 bytes
load.l  r2, 2
write.w r7, r1, 2             #arg0 x=9
write.w r7, r2, 4             #arg1 y=2

spc     r6                    #get the current PC value
addi    r6, r6, 14            #offset that value past the branch below
write.w r7, r6                #save the return pc

load.h  r0, $setcursor.h      
load.l  r1, $setcursor.l
or      r0, r0, r1            #build a register containing function location
br      r0                    #branch to the function label
   
addi    r7, r7, 6             #restore stack 

read.w  r0, r7, 0             #read the old r0 back
addi    r7, r7, 2             #restore stack (preserve)

So lots of writes and bloat for every call, but it works well enough. I have return values passed back in registers. By implementing BIOS routines for these I/O functions using software interrupts, we should reduce the code size considerably.

BIOS Routines

The new int instruction allows user code to jump into the interrupt handler, with a user-defined Interrupt Event Field. The space in the immediate area of the opcode allows for 64 values, so we define Interrupt Event Field values less than 64 to be only applicable to software routines.

In the interrupt handler, all registers except r7 are immediately saved to the stack. Of course this means the stack must always have enough space for this, but assume that is true for now. If the Event Field is < 64, we use this value to get a function address from a table. We then setup a function call and branch. On return, we save r0 back to the stack, directly where it will be restored from. This allows for data to be returned from BIOS function calls. Currently, the BIOS table is as follows: interrupts

So the basics are there just now. Enough to make a test with input from UART and output to text-mode display. Note the division by zero entry – there is no hardware divide. I have a naive division function, and throws int 0‘s when division by zero is encountered.

There are obviously restrictions to this; for example, you really should not use the int instruction when inside of an interrupt handler, but it can happen, nothing stops you from doing it. Keeping to these restrictions will be tricky. I have started looking at how the hardware can disable/restore interrupt enable states and also handle when an invalid int instruction is encountered.

I went and moved all the code of my simple bios example over to this new system based on software interrupts. I also changed Branch Immediates to Branch Immediate Relative Offset, but didn’t fish out other opportunities for it significantly.

The output on screen was corrupt. data was being clobbered somewhere, so I needed to debug..

Debugging the issue

TPU does not have debugging functionality for the software running on it, but, you can use the ISIM simulator, and edit the assembly code to at least allow for some ‘what are the registers at point X’ information.

memory__regs

It is incredibly time consuming, but invaluable when things go wrong, which inevitably did when I tried out using the int instruction. I’ve mentioned in previous parts than in ISim you can inspect the contents of the internal Block Ram memories and other signals. This allows you to see register values, and the contents of the text ram. You can set up the data view to show ASCII, and get a representation of the characters you’d want displayed.

memory_tram

What I tend to do is place a branch to self such as “biro 0” where I want to inspect registers, then run the code in the simulator. I could also use the TPU emulator, but I’ve yet to implement the full set of opcodes in it yet. Doing this around various places led me to discover that the r0 register was being overwritten on execution of int instructions. The value written was always 0x0008 – which to those who have read all previous TPU parts may recognize as the interrupt vector.

The way the ALU works is that various signals are always calculated and made available, but then the relevant one is selected, usually by a signal from the decoder.

when OPCODE_SPEC_F2_INT =>
    s_result(15 downto 0) <= ADDR_INTVEC;
    s_interrupt_rpc <= std_logic_vector(unsigned(I_PC) + 2);
    s_interrupt_register <= X"00" & "00" & I_dataIMM(7 downto 2);
    
    s_prev_interrupt_enable <= s_interrupt_enable;
    s_interrupt_enable <= '0';
    s_shouldBranch <= '1';	

The instruction is implemented in the ALU simply – like a standard branch. If s_shouldBranch is ‘1’, the PC unit takes the next PC from the output of the ALU, which is s_result(15 downto 0) – the Interrupt Vector ADDR_INTVEC (0x0008).

Looking back at the int instruction layout:

int_inst

You can see that the bits where a destination register should be (11-8) in an Rd-form instruction are all 0. The issue we have is that register r0 is having 8 written to it after an int instruction, and in the same way a branch uses the output of the ALU, if a register write is requested, the same value – the output of the ALU – will be used as the source data.

Looking at the decoder, I saw that I’d forgot to add the new instruction. I’d only added it to the ALU block. The decoder sets the write enable signal for the destination register:

case I_dataInst(IFO_F2_BEGIN downto IFO_F2_END) is
  when OPCODE_SPEC_F2_GETPC =>
    O_regDwe <= '1';
  when OPCODE_SPEC_F2_GETSTATUS =>
    O_regDwe <= '1';
  when others =>
end case;

With O_regDwe not being set in the event of the OPCODE_SPEC_F2_INT case, it would preserve it’s previous state. Before the int instruction as usually an immediate load into a register – meaning the regDwe signal would be left enabled, and produce the exact incorrect behavior I saw.

case I_dataInst(IFO_F2_BEGIN downto IFO_F2_END) is
  when OPCODE_SPEC_F2_GETPC =>
    O_regDwe <= '1';
  when OPCODE_SPEC_F2_GETSTATUS =>
    O_regDwe <= '1';
  when OPCODE_SPEC_F2_INT =>
    O_regDwe <= '0';
  when others =>
end case;

Quickly adding the case for our new INT instruction, things worked perfectly.

New instruction stats

Now that we have the int instruction, I collated the number of instructions for the same functions, and got the expected reduction in code size.

after_instcount_pie

instruction_count

Finding more bugs

It’s expected that as the amount of TPU code I write grows, the more issues I find in various parts of it’s implementation. Some are obvious bugs, whilst others may be hazards in the CPU itself. One such bug was in the Memory Interrupt Process, only introduced in the last part. It only ever fired once, due to a check for the state-machine state being in the wrong place:

...
if rising_edge(cEng_clk_core) and MEM_access_int_state = 0 then
    if MEM_Access_error = '1' then
      I_int <= '1';
...

The check for MEM_access_int_state should have been after the clock edge check:

...
if rising_edge(cEng_clk_core) then
			if MEM_Access_error = '1' and MEM_access_int_state = 0 then
				I_int <= '1';
...

With this edit in place, multiple memory violation interrupt requests can take place. This is required for some code that I wrote which displays the ‘memory map’ of the system. A read is performed at each 2KB boundary, and if an interrupt violation is encountered it is assumed the whole bank/block is unmapped. This continues until all 32 banks are checked – and displays this to the screen as it happens.

memmap

Explanation:

  • 0x0000 – 0x3FFF: RAM
  • 0x9000 – 0x97FF: Memory mapped I/O
  • 0xA000 – 0xAFFF: Font Ram
  • 0xB000 – 0xBFFF: Text Ram
  • 0xC000 – 0xEFFF: VRAM

Getting back to the start of this post, where we discussed simple command execution, this memory map prints out if the ‘m’ command is entered through the UART. There are two other commands currently, ‘l’ for lighting all LEDs and ‘d’ for darkening them.

There is also a current issue where UART input seems to generate multiple instances of the same character from my read function. I’ve yet to look at that problem in detail just yet.

Wrap Up

So in this part, we’ve investigated some ‘BIOS’ code, modified the branch immediate instruction, added a software interrupt instruction, and fixed some issues in their implementation.

At the moment, I’m finding the hardest part of TPU now bad tooling. TASM for assembling code was good and served its purpose when I needed a quick fix for generating small tests. However, now that I’m thinking of sending program code over UART which is ‘compiled’ as position independent code, TASM is showing significant weaknesses. I’ll need to get around that somehow.

For now that’s the end of this part. Many thanks for reading. As always, if you have any comments of queries, please grab me on twitter @domipheus.

Designing a CPU in VHDL, Part 13: Memory system and BIOS beginnings

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Now we have text-mode HDMI/DVI-D output, it’s about time we started writing more code for TPU. However, we’ve not delved into too much detail yet about the memory subsystem – the part of the puzzle which reinterprets the various busses from the TPU module in VHDL and manages how data flows between different memories and/or mapped ‘registers’.

TPU memory interface

TPU has an address bus output, a data input bus and a data output bus. Generally CPUs have a single data bus and it’s bidirectional, but I opted for this current setup early on and have stuck with it.

For the most part, memory on the TPU ‘System on Chip’ is made up of Xilinx Block Rams. These are 2KB in size, and dual-ported, allowing them to be used as VRAM and TRAM (for an explanation of TRAM see the previous part in this series of posts). The rest of the memory subsystem is addressing logic for memory mapping the UART, switches, LEDs, and other peripheral I/O.

Due to most memory blocks being 2KB, memory is divided up into 2KB blocks/banks. The address bus selects a bank, which has it’s own chip select line.

	-- Embedded ram
	MEM_CS_ERAM_1 <= '1' when (MEM_BANK_ID = X"0"&'0') else '0'; -- 0x00 bank
	MEM_CS_ERAM_2 <= '1' when (MEM_BANK_ID = X"0"&'1') else '0'; -- 0x08 bank
	MEM_CS_ERAM_3 <= '1' when (MEM_BANK_ID = X"1"&'0') else '0'; -- 0x10 bank
	MEM_CS_ERAM_4 <= '1' when (MEM_BANK_ID = X"1"&'1') else '0'; -- 0x18 bank
	
	-- system i/o maps
	MEM_CS_SYSTEM <= '1' when (MEM_BANK_ID = X"9"&'0') else '0'; -- 0x90 bank
	
	-- 4KB of font bitmap ram
	MEM_CS_FRAM_1 <= '1' when (MEM_BANK_ID = X"A"&'0') else '0'; -- 0xA0 bank
	MEM_CS_FRAM_2 <= '1' when (MEM_BANK_ID = X"A"&'1') else '0'; -- 0xA8 bank
	
	-- 4KB of text character ram
	MEM_CS_TRAM_1 <= '1' when (MEM_BANK_ID = X"B"&'0') else '0'; -- 0xB0 bank
	MEM_CS_TRAM_2 <= '1' when (MEM_BANK_ID = X"B"&'1') else '0'; -- 0xB8 bank

We have signals for the masked off address within any given bank, and the bank ID.

	-- mem brams banks are 2KB. The following addresses within a BRAM
	MEM_2KB_ADDR <= MEM_O_addr and X"07FF";
	MEM_BANK_ID <= MEM_O_addr(15 downto 11);

Then the block rams “ebram” entities are connected as so:

	ebram_1: ebram Port map ( 
    I_clk => cEng_clk_core,
    I_cs => MEM_CS_ERAM_1,
    I_we => MEM_WE,
    I_addr => MEM_2KB_ADDR,
    I_data => MEM_O_data,
    I_size => MEM_REQ_SIZE,
    O_data => MEM_DATA_OUT_ERAM_1
	);

You’ll notice the data output from the ebram is to it’s own signal, MEM_DATA_OUT_ERAM_1. The actual signal that gets selected for input into the TPU core is chosen via a big asynchronous conditional:

	-- select the correct data to send to tpu
	MEM_I_data <= INT_DATA when O_int_ack = '1' 	
	              else  MEM_DATA_OUT_ERAM_1 when MEM_CS_ERAM_1 = '1' 
	              else  MEM_DATA_OUT_ERAM_2 when MEM_CS_ERAM_2 = '1' 
	              else  MEM_DATA_OUT_ERAM_3 when MEM_CS_ERAM_3 = '1' 
	              else  MEM_DATA_OUT_ERAM_4 when MEM_CS_ERAM_4 = '1' 
	              else  MEM_DATA_OUT_ERAM_5 when MEM_CS_ERAM_5 = '1' 
	              else  MEM_DATA_OUT_ERAM_6 when MEM_CS_ERAM_6 = '1' 
	              else  MEM_DATA_OUT_ERAM_7 when MEM_CS_ERAM_7 = '1' 
	              else  MEM_DATA_OUT_ERAM_8 when MEM_CS_ERAM_8 = '1'
					  
	              else  MEM_DATA_OUT_FRAM_1 when MEM_CS_FRAM_1 = '1' 
	              else  MEM_DATA_OUT_FRAM_2 when MEM_CS_FRAM_2 = '1'
					  
	              else  MEM_DATA_OUT_TRAM_1 when MEM_CS_TRAM_1 = '1' 
	              else  MEM_DATA_OUT_TRAM_2 when MEM_CS_TRAM_2 = '1'
					  
	              else  MEM_DATA_OUT_VRAM_1 when MEM_CS_VRAM_1 = '1' 
				        else IO_DATA ;

We could implement all of this with bidirectional/tristate signals, but maybe that’s a discussion for another post. I’ve intentionally kept bidirectional communication to the minimum, as it can easily cause confusing situations.

So you can see it’s fairly easy to move things around and see how to attach block rams into the TPU ‘memory map’. But we also have I/O!

Memory mapped I/O

Part 9 showed how memory mapped I/O was handled, and that does not change at all. There is a process in the top-level module which monitors for memory requests at certain addresses, and manipulates the IO_DATA signal in the case of any memory reads. You can see above that the data into TPU selects IO_DATA when no memory selects or interrupts are active. If we add another peripheral, we simply edit the process hanling this part of memory to update the relevant signals for any given address.

  if MEM_O_addr = X"9000" and MEM_O_we = '1' then
    -- onboard leds
    IO_LEDS <= MEM_O_data(7 downto 0);
  end if;
  
  if MEM_O_addr = X"9001" and MEM_O_we = '0' then
    -- onboard switches
    IO_DATA <= X"000" & IO_SWITCH;
  end if;

Access Violations

You’ll notice that the chip selects above don’t map all banks to block rams just now. It would be useful to know about memory locations that are not currently mapped, and that should be a simple case of making sure a chip select line is active on a memory command. If a chip select is not active, we can assume the address is unmapped, and request an interrupt to the TPU core.

First, we need to OR together all those chip select lines into one allseeing MEM_ANY_CS signal. Then, we check that signal in the I/O handler process – and if it is ever inactive during a memory operation, we know that we’re accessing unmapped memory.

MEM_proc: process(cEng_clk_core)
begin
  if rising_edge(cEng_clk_core) then
  
    if MEM_readyState = 0 then
      if MEM_O_cmd = '1' then
        
        if MEM_ANY_CS = '0' then
          -- a memory command with unmapped memory
          -- throw interrupt
          MEM_Access_error <= '1';
          MEM_Access_error_bank <= MEM_O_addr(15 downto 8);
        end if;
      
        ...snip... 

In the code above, whats missing is that at the end of the memory command, we de-assert the MEM_Access_error signal. This means that if another process sees this MEM_Access_error signal as active, we can use that to request an interrupt.

Memory Interrupt Process

This is what the memory interrupt process checks for, and acts upon.

exception_notifier: process (cEng_clk_core, MEM_Access_error)
begin
  if rising_edge(cEng_clk_core) and MEM_access_int_state = 0 then
    if MEM_Access_error = '1' then
      I_int <= '1';
      MEM_access_int_state <= 1;
      INT_DATA <= X"80"  & MEM_Access_error_bank;
    elsif MEM_access_int_state = 1 and I_int = '1' and O_int_ack = '1' then
      I_int <= '0';
      MEM_access_int_state <= 2; 
    elsif MEM_access_int_state = 2 then
      MEM_access_int_state <= 3; 
    elsif MEM_access_int_state = 3 then
      MEM_access_int_state <= 0;
    end if;
  end if;
  
end process;

This process checks each clock cycle for a memory access error, and if it notices one, it requests an interrupt, and saves the current memory bank into the lower half of the INT_DATA signal. This signal is what becomes the Interrupt Event Field, which is accessible in user code. We set he high byte of this to 0x80 – to identify the interrupt type to the interrupt handler. The rest of the code in this process is simply following the exception workflow – it waits for the interrupt acknowledge, and then waits for cycles of latency before completing.

With these changes, it now means if an un-mapped memory access occurs, our interrupt vector code is called, and when we issue the gief instruction to obtain the Event Field into a register, we’ll be able to know it’s a memory violation – and what bank the attempted access came from.

The beginnings of a BIOS

So, now that we have our text output, our UART, and a decent memory system, it’s time to start implementing a BIOS which we can leverage when building real TPU programs. So far, my bootloader contains several functions:

  • Reset vector function (which just jumps to main)
  • Interrupt vector function handler
  • Main function
  • mul16 – 16 bit multiply
  • div16 – 16 bit divide with remainder
  • putc – put character
  • puts – put string
  • setcursor – set the current cursor location in the textual screen
  • setcolour – set the current color of any characters (Set the attribute byte)
  • cls – clear screen
  • uitoa – unsigned integer to ascii
  • getc – get char from UART

At the moment, when printing the bios header (using a custom glyph set for a TPU logo), the size of this code amounts to around 1.1KB – which is pretty massive when you think about it.

bios_memory

You can see in that above image a memory figure – and this is checked at runtime. It iteratively searches through memory, reading every byte location, until an unmapped memory violation occurs. The memory test assumes that the first contiguous block of ram is the usable memory. With 8 2KB block rams connected to those first bank addresses, we have 16KB to play with.

This memory test really brought back old memories of how long you sometimes had to wait for the memory test to complete. The uitoa function relies heavily on divide/mod operations, and with software-only divide, things are slow. It’s a few seconds to work through that 16KB window. But, I quite like the fact it is slow enough that you can see the searches happen in real time.

With that, I was tempted to integrate a startup beep like old times. And, well, I’m going to do what @mmalex tells me to do in this instance!

The audio is a simple square wave through the headphone jack which I’d forgotten existed on the miniSpartan6+ board. I now have a memory-mapped register which controls it’s operation, allowing you to activate the left or right output channels. I’ll add the ability to control the tone later – for now, it’s just a cool bios beep!

Wrap Up

That pretty much explains the current memory subsystem in a bit more detail, and hopefully shows that TPU is now starting to really behave like an old vintage computer. I aim to develop the BIOS more, and have another post up my sleeve talking about a new instruction, and further BIOS progress.

Thanks for reading, as always let me know your thoughts via twitter @domipheus.

Designing a CPU in VHDL, Part 12: Text mode video output

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Whilst having a pixel-based video output on TPU is great, there is fundamental limitations with regard to resolutions and memory. It’s very hard to convey real information with such a resolution, and really what I need is the old style text modes of past. Think 80×25 characters, DOS/BIOS post screens. What is needed to implement that sort of output?

First of all, we need to fix down on our ‘text resolutions’. That is, the number of columns/rows, and the size in pixels of the glyphs we will draw. For this, I’m going to continue with 80 columns by 25 rows. This means, if our glyphs are 8×16 pixels, a screen resolution of 640×400 is needed. That fits nicely into 640×480, if you don’t mind a border on the bottom edge – 640x400x70Hz is an option too.

In addition to this, I want to be able to set colours for the text – foreground and background. I’d also want to make blinking of specific characters possible.

Text RAM

The areas of memory where the ASCII characters to render are stored is called TRAM in my design, standing for text ram (not to be confused with the .text executable sections in binaries!). For each character tile on our 80×25 character grid, we will have two bytes – the ASCII character, along with an attribute byte. This attribute byte will define the foreground and background colours for this glyph tile – along with whether or not the tile should be blinking.

80×25 2-byte characters comes out as 4,000 bytes. That will nearly fill two 2KB Xilinx block rams. I could have used 80×30, perfectly filling the whole 640×480 screen resolution, but I couldn’t bring myself to add that third block ram. Despite that, we do have plenty of them available on the miniSpartan6+ board. My LX25 variant has 52 available, for a total of 104KB storage. These block rams are integrated into my top-level TPU design in the same way as the existing VRAM, so they are both readable and writable by TPU, and readable (at a differing clock rate) for use by our new VHDL module which will generate the pixel stream required to represent our text characters.

Font RAM

The glyphs themselves are stored as 16 bytes, with 1 bit corresponding to a pixel in the output. A 1 value indicates foreground shading, whilst 0 is unsurprisingly background.

glyph_layoutWith the glyphs organized linearly as a packed array of 16-byte elements, for the full 256 range of characters, we’ll need exactly two 2KB block rams. This storage could also be implemented as a ROM, but I’m going to go ahead and use the same module I use or the text ram (and VRAM) so that the user can edit the storage to implement custom glyphs.

The character generator – text_gen

In the last part, I introduced a VGA signal generator. This module takes a pixel clock, and generates sync, blanking and an x and y coordinate for the pixel that is being output. This X and Y information is used to generate a memory address, at which VRAM contains the 16-bit 565 colour to output for that pixel. The RGB value then goes to encoders, and serialized out as DVI-D.

With this signal generator, we will first change the timings to output a 640x480x60Hz set of signals. The x and Y output will no longer form an address into vram, but will be passed into a text_gen module. This new module, for a given X and Y, will generate addresses into the text and font rams, manage the operation of the data from those rams, and eventually output a pixel value. This text_gen module needs to operate at a faster clock, as for any pixel, there could be two dependent memory reads issued which need serviced before output is provided.

For each pixel value, the 8×16 text ’tile’ index is calculated. From this, the location in tram is known – a basic tile_y*80+tile_x calculation. In the VHDL, we use the unsigned type which has the multiplication operation defined.

tram_addr_us <= (unsigned( I_y(11 downto 4)) * 80) + unsigned(I_x(11 downto 3));

This synthesizes to a DSP48A1. There are timing considerations here that I need to take into account – more on that later.

dsp48a1The 16-bit data word from TRAM is captured after several cycles of latency. This data is latched within text_gen, and the ASCII character code part of this used to calculate a further address into the font ram. This calculation is easier due to the 16-byte layout, so can be implemented with shifts. After a further few cycles of latency to allow the external memory to respond, we get a single byte equating to a row within the glyph. Using out input X pixel coordinate, we look up the relevant bit in the glyph row – which is then used to select a foreground or background colour. The colours themselves are selected using the other byte obtained from text ram – the attribute byte.

Attribute Byte

The attribute byte layout is the same as other text modes. A single blink bit, 3 bits of background colour and 4 bits of foreground. These could be interpreted in other ways (for example disabling blinking can allow for more background colours) but at the moment they simply index into one of 8 available background colours or 16 available foreground colours. I’ve fixed the colours themselves but there is no reason as to why these colours could not be memory mapped so that the palette can be changed programmatically.

attribute_byteBlinking is achieved by checking an internal counter, along with the blink attribute bit. If the blink bit is set, and the counter is in a non-blink state, the background color is chosen regardless of the glyph properties.

text_gen states


entity text_gen is
  Port ( 
    I_clk_pixel : in  STD_LOGIC;
    I_clk_pixel10x : in  STD_LOGIC;

    -- Inputs from VGA signal generator
    -- defines the 'next pixel' 
    I_blank : in  STD_LOGIC;
    I_x : in  STD_LOGIC_VECTOR (11 downto 0);
    I_y : in  STD_LOGIC_VECTOR (11 downto 0);

    -- Request data for a glyph row from FRAM
    O_FRAM_ADDR : out STD_LOGIC_VECTOR (15 downto 0);
    I_FRAM_DATA : in STD_LOGIC_VECTOR (15 downto 0);

    -- Request data from textual memory TRAM
    O_TRAM_ADDR : out STD_LOGIC_VECTOR (15 downto 0);
    I_TRAM_DATA : in STD_LOGIC_VECTOR (15 downto 0);

    -- The data for the relevant requested pixel
    O_R : out STD_LOGIC_VECTOR (7 downto 0);
    O_G : out STD_LOGIC_VECTOR (7 downto 0);
    O_B : out STD_LOGIC_VECTOR (7 downto 0)
  );
end text_gen;

text_mode_diagramI have the text generator currently running at 10x pixel clock. This is probably being too safe, and I could bring it down to 5x. I’ll have to check the timing constraints more thoroughly.

The module assumes the rows are scanned across the rows just like VGA. Each time a pixel X offset is input which we know is the start of a new glyph, a 2-byte TRAM fetch is initiated. The result of that fetch is used to latch colours from the attribute byte, and then fetch a 1-byte glyph row. That row is latched, and used by the next 8 pixels which are input to the generator. The states are short-circuited to the last stage.

I’ve attached full source of the module below.

Issues

offsetThe first run of the text_gen module was actually very successful. I initialized the text block rams with some characters, and used a font ROM that I found which implemented an ASCII character set. The display worked, albeit with characters in the wrong place. The character I expected to be at position 0,0 was actually in 2,0.

I think there is an issue with timing in terms of how much latency the DSP48 slice needs to perform the multiplication required for calculating the TRAM location. One of the things that I needed to do from the previous part is that we need the next pixel locations to be used, rather than the current pixel which is what is used now. To get around this, I implemented a simple FIFO in the VGA signal generator.

The length of the FIFO can be changed, and the module now outputs a set of signals for the current pixel, which is sent to the TMDS encoders, as well as a set of prefetch signals, which are currently 8 pixels early. These prefetch signals are sent to the text_gen and allow for plenty of time for memory and other latencies to be accounted for. With this change, the output was correct. The expected character in 0,0 was rendered at that location.

Another issue was that the colour of any character was incorrect. For example, the character at position 2,0 had the colour of character 1,0. Moving around the point where the attribute byte and colours were latched in the state machine fixed this. I had been doing lots of asynchronous operations, but performing a latching operation on the RGB pixel data made it much more stable.

Testing out custom glyphs

customglyphOne of the things I wanted was the ability to edit the font ram, and you can do that. Above you will see an image with some odd icon the the right, made up of 4 characters. I don’t really know what it is supposed to look like 🙂

Blinking in action

Wrap up

So text mode works, fairly well. This was a lot easier to get working than I thought it would. I hope to get a small demo together where input from the UART is echoed to this command prompt, and get some simple test commands working.

Thanks for reading, as always let me know your thoughts via twitter @domipheus.

 

----------------------------------------------------------------------------------
-- Company: Domipheus Labs
-- Engineer: Colin Riley
-- 
-- Create Date:    16:27:52 05/01/2016 
-- Design Name:    Text-mode output generator
-- Module Name:    text_gen - Behavioral 
-- Project Name:   
-- Target Devices: Tested on Spartan6
-- Tool versions: 
-- Description: 
--
--   For a 640x480 resolution set of input pixel locations an 80x25 text-mode 
--   representation is generated. It is assumed the x direction pixels are
--   scanned linearly.
--
--   Glyphs are stored in a font ram as 16 bytes, each bit selecting a foreground
--   or background colour to display for a given pizel in an 8x16 glyph.
--
--   A clock faster than the pixel clock is needed to account for latency from 
--   worse-case two dependant memory reads per pixel. It is adviced that pixel 
--   locations are inputted early to the text_gen so data can be prefetched.
--   
--
-- Dependencies: 
--
-- Revision: 
-- Revision 0.01 - File Created
-- Additional Comments: 
--
----------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity text_gen is
   Port ( 
     I_clk_pixel : in  STD_LOGIC;
     I_clk_pixel10x : in  STD_LOGIC;
     
     -- Inputs from VGA signal generator
     -- defines the 'next pixel' 
     I_blank : in  STD_LOGIC;
     I_x : in  STD_LOGIC_VECTOR (11 downto 0);
     I_y : in  STD_LOGIC_VECTOR (11 downto 0);
     
     -- Request data for a glyph row from FRAM
     O_FRAM_ADDR : out STD_LOGIC_VECTOR (15 downto 0);
     I_FRAM_DATA : in STD_LOGIC_VECTOR (15 downto 0);
     
     -- Request data from textual memory TRAM
     O_TRAM_ADDR : out STD_LOGIC_VECTOR (15 downto 0);
     I_TRAM_DATA : in STD_LOGIC_VECTOR (15 downto 0);
     
     -- The data for the relevant requested pixel
     O_R : out STD_LOGIC_VECTOR (7 downto 0);
     O_G : out STD_LOGIC_VECTOR (7 downto 0);
     O_B : out STD_LOGIC_VECTOR (7 downto 0)
     );
end text_gen;

architecture Behavioral of text_gen is
   -- state tracks the location in our state machine
   signal state: integer := 0;

   -- The blinking speed of characters is controlled by loctions 
   -- in this counter
   signal blinker_count: unsigned(31 downto 0) := X"00000000";

   -- _us is the result of the address computation,
   -- whereas the logic_vector is the latched output to memory
   signal fram_addr_us: unsigned(15 downto 0):= X"0000";
   signal fram_addr: std_logic_vector( 15 downto 0) := X"0000";
   signal fram_data_latched: std_logic_vector(15 downto 0);

   -- Font ram addresses for glyphs above, text ram for ascii and
   -- attributes below.
   signal tram_addr_us: unsigned(15 downto 0):= X"0000";
   signal tram_addr: std_logic_vector( 15 downto 0) := X"0000";
   signal tram_data_latched: std_logic_vector(15 downto 0);

   -- the latched current_x value we are computing
   signal current_x: std_logic_vector( 11 downto 0) := X"FFF";

   -- Current fg and bg colours
   signal colour_fg: std_logic_vector(23 downto 0) := X"FFFFFF"; 
   signal colour_bg: std_logic_vector(23 downto 0) := X"FFFFFF"; 
   signal blink: std_logic := '1';

   -- outputs for our pixel colour
   signal r: std_logic_vector(7 downto 0) := X"00";
   signal g: std_logic_vector(7 downto 0) := X"00";
   signal b: std_logic_vector(7 downto 0) := X"00";

   type colour_rom_t is array (0 to 15) of std_logic_vector(23 downto 0);
   -- ROM definition
   constant colours: colour_rom_t:=(  
   X"000000", -- 0 Black
   X"0000AA", -- 1 Blue
   X"00AA00", -- 2 Green
   X"00AAAA", -- 3 Cyan
   X"AA0000", -- 4 Red
   X"AA00AA", -- 5 Magenta
   X"AA5500", -- 6 Brown
   X"AAAAAA", -- 7 Light Gray
   X"555555", -- 8 Dark Gray
   X"5555FF", -- 9 Light Blue
   X"55FF55", -- a Light Green
   X"55FFFF", -- b Light Cyan
   X"FF5555", -- c Light Red
   X"FF55FF", -- d Light Magenta
   X"FFFF00", -- e Yellow
   X"FFFFFF"  -- f White
   );

begin


   tram_addr <= std_logic_vector(tram_addr_us);
   O_TRAM_ADDR <= tram_addr(14 downto 0) & '0';
   
   
   fram_addr <= std_logic_vector(fram_addr_us);
   O_FRAM_ADDR <= fram_addr(15 downto 0);
   
   process(I_clk_pixel)
   begin
      if rising_edge(I_clk_pixel) then
         blinker_count <= blinker_count + 1;
      end if;
   end process;
               
   process(I_clk_pixel10x)
   begin
      if rising_edge(I_clk_pixel10x) then
         if state < 8 then
            -- each clock either stay in a state, or move to the next one
            state <= state + 1;
         end if;
         
         if state = 3 then
            -- latch the data from TRAM and kick off FRAM read
            tram_data_latched <= I_TRAM_DATA;
            fram_addr_us <= (unsigned(tram_data_latched(7 downto 0)) * 16 ) + unsigned(I_y(3 downto 0));
            blink <= tram_data_latched(15);
            colour_fg <= colours( to_integer(unsigned( tram_data_latched(11 downto 8))));
            colour_bg <= colours( to_integer(unsigned( tram_data_latched(14 downto 12))));
            
         elsif state = 6 then	
            -- latch the data from FRAM
            fram_data_latched <= I_FRAM_DATA;
            state <= 8;
         
         elsif current_x /= I_x then
            if (I_x(2 downto 0) = "000") then
               
               -- Each 8-byte pixel start, set the state and kick off TRAM fetch
               state <= 1;
               -- this multiply becomes a DSP slice
               tram_addr_us <= (unsigned( I_y(11 downto 4)) * 80) + unsigned(I_x(11 downto 3));
            else
               -- short circuit straight to shade state
               state <= 7;
            end if;
            current_x <= I_x;
         
         elsif state >= 8 then
            -- shade a pixel
            
            -- If the curret pixel should be foreground, and is not in a blink state, shade it foreground
            if (fram_data_latched(7 - to_integer(unsigned(I_x(2 downto 0)))) = '1')
               and (blinker_count(24) = '1' or (blink = '0')) then
              
              r <= colour_fg(23 downto 16); 
              g <= colour_fg(15 downto 8);
              b <= colour_fg(7 downto 0);
            else
              r <= colour_bg(23 downto 16);
              g <= colour_bg(15 downto 8);
              b <= colour_bg(7 downto 0);
            end if;
         
         end if;
         
      end if;
   end process;
   
   -- When we are outside of our text area, have black pixels
   O_r <= r when unsigned(I_y) < 400 else X"00";
   O_g <= g when unsigned(I_y) < 400 else X"00";
   O_b <= b when unsigned(I_y) < 400 else X"00";

end Behavioral;