Getting Started with the miniSpartan3 FPGA board

The folks over at Scarab Hardware, who make the miniSpartan6+ board I do most of my FPGA tinkering on, kindly provided me with one of their other devices – the miniSpartan3.

miniSpartan3 is a smaller board, with less features and a Spartan3 Xilinx FPGA instead of the newer generation Spartan6. However, it is very competitively priced, with the board I received costing only $39 – which is a bargain for a small dev board with HDMI out, really.

I thought I’d write a small post about ho to get this board set up and running some “hello world” test. To do this, we need a few things:

  • Xilinx ISE 14 WebPack
  • A HDL design to run
  • A UCF constraints file for the miniSpartan3 board
  • xc3sprog to program the board

Lets get started!

ISE WebPack

This is the official Xilinx development environment for the Spartan3 FPGA family. It has a free version, but is a significant download – you can find it here. The instructions on my “Designing a CPU in VHDL” part are still correct, so you can follow them to obtain the installer.

It’s worth noting that ISE has problems on Windows 10 systems. It installs, but crashes when open file dialogs present. There is a patched set of DLLs which aim to solve this, and a few videos you can find on youtube explaining the steps. I’ve used these and it seems to run fine now.

ise_start

On starting the ISE Project Navigator (I run the 64-bit version) you can start a new project. Throw in a name, location, and optional description. Our top-level source is HDL.

ise_newproj

Now we must specify the settings for the project. The miniSpartan3 board is (unsurprisingly) a Spartan3A family FPGA part. Specifically my board is the XC3S200A part, in the VQ100 package. If you are following me in VHDL, set that as your preferred language and progress to the next screen.

ise_projsettings

The next screen is simply a summary. Ensure here the device and package is correct 🙂

ise_summary

Once the project is created we are presented with an empty project hierarchy. We shall add some new source to it.

ise_new_source

The source we want is a VHDL module. I’ve called mine led_top, as it is going to use the LEDs and is out top-level module.

ise_new_module

ise_define_module

The module will have 2 inputs and one output port. The inputs are the clock, and our two on-board switches. The output is the three LEDs present on the miniSpartan3 board. We can use the new source wizard to define the interface boilerplate for us – the 3 LEDs would be defined as an output bus, with 3 bits – MSB 2 down to LSB 0. Completing the wizard creates source for us and we can then add our additional logic to it.
ise_first_source

Our Hello World project

The end goal for our hello world project is simple:

  • One switch will enable our LED counter output, or fix them all as lit
  • The second switch will select a fast binary count, or slow binary count

The LEDs will count in binary from 0 to 7. Internally, there will be a 32-bit counter, which increments every cycle of our input 32MHz clock. The switch will simply select a different sub-range of bits to view into this 32-bit number. Due to it incrementing every cycle, the windows will be in the high bit range, say bits 24, 25 and 26. This should allow you to see the LEDs move, instead of them being a blur.

To achieve this, we need to introduce our counter. It is a signal within the behavioral led_top architecture definition.

architecture Behavioral of led_top is
  signal count: unsigned(31 downto 0) := X"00000000";
  
  ...snip...

This count signal is of type unsigned, which is compatible with the STD_LOGIC_VECTOR type of our LED output. Unlike STD_LOGIC_VECTOR, it has arithmetic operations defined – which you can use in projects by uncommenting the arithmetic package line near the top of the autogenerated boilerplate.

-- Uncomment the following library declaration if using
-- arithmetic functions with Signed or Unsigned values
use IEEE.NUMERIC_STD.ALL;

The LEDs are a vector of 3 std_logic bits. We will also define a signal to hold their state, which will capture the relevant values of bits in the counter each cycle.

	signal threeBits: std_logic_vector(2 downto 0) := "000";

We then define a process block, which is ‘run’ every time the state of I_clk changes.

  process(I_clk)
  begin
    if rising_edge(I_clk) then
      count <= count + 1;

      if I_switches(1) = '1' then
        threeBits <= std_logic_vector(count(25 downto 23));
      else
        threeBits <= std_logic_vector(count(28 downto 26));
      end if;
    end if;
  end process;

Following this code, each rising edge of the input clock cycle, we will increment the unsigned count signal by 1. Then, depending on the state of our second switch, we will capture three bits of data from that counter, and place them in our ‘threeBits’ signal.

Outside of a process, we have an asynchronous assign to the output LEDs.

O_leds <= threeBits when I_switches(0) = '1' else "111";

This will output the three bits of our counter to the LEDs when our first switch is active, otherwise output a state in which all LEDs are on.

The full source for this small sample, for clarity, is below.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity led_top is
    Port ( I_clk : in  STD_LOGIC;
           O_leds : out  STD_LOGIC_VECTOR (2 downto 0);
        I_switches : in STD_LOGIC_VECTOR (1 downto 0));
end led_top;

architecture Behavioral of led_top is
  signal count: unsigned(31 downto 0) := X"00000000";
  signal threeBits: std_logic_vector(2 downto 0) := "000";
begin

   process(I_clk)
   begin
    if rising_edge(I_clk) then
      count <= count + 1;
     
     if I_switches(1) = '1' then
      threeBits <= std_logic_vector(count(25 downto 23));
     else
      threeBits <= std_logic_vector(count(28 downto 26));
     end if;
    end if;
   end process;

   O_leds <= threeBits when I_switches(0) = '1' else "111";

end Behavioral;

Now that we have a top-level VHDL module for the behavior we require, we need to now ensure those inputs and outputs are mapped to the correct pins on the miniSpartan3 board.

Creating the constraints file

The constraints file contains mappings from net names to pins on our board, along with relevant standards for what kind of I/O is used. We create a User Constraints File (.ucf) using the New Source option as previously used for our VHDL module.

ise_ucf

We need to define mappings for our I_clk, I_switches and O_leds nets to correct pins on the FPGA package. For that, we need to inspect the schematic of the miniSpartan3 board, which is provided on the github project from Scarab Hardware.

pins

We can see from the schematic that our LEDS, switches and clock input are clearly defined. The UCFfile is a list of NET names to LOC pin locations, along with modifiers. For example, we can see the I_clk input is on pin P85. So we define that as so:

# 32 MHz clock
NET "I_clk" LOC = "P85"; 

We can also follow the lines from the LEDs to the FPGA inputs, and get their mappings.

# Leds
NET "O_leds<2>" LOC = "P16";  
NET "O_leds<1>" LOC = "P19";  
NET "O_leds<0>" LOC = "P20"; 

When it comes to the switches however, we need to give some more information. If you look at the schematic, you can see the switches connect directly from the fpga, though the switch, to ground. This means that we need the input to be ‘pulled up’ via an internal resister, to the high logic level. This basically means that unless the input pin is forced to ground, it will be pulled high. In this case, we state thee net is also PULLUP.

# Switches
NET "I_switches<0>" LOC = "P98" | PULLUP;
NET "I_switches<1>" LOC = "P99" | PULLUP;

with those 6 NET definitions in our UCF file, we can now go ahead and create a programming file.

generate

This will create a led_top.bit file in our project folder. We will flash this file onto the miniSpartan3 board.

Flashing the miniSpartan3

Normally you flash a small SPI memory chip which FPGAs read from on power-up, but the easiest way to get your design working is to just flash the FPGA on a 1 time basis. This means you need to re-flash each power cycle and you cannot reset it, but is fine for this Hello World example. To do this, you can use a program called xc3sprog. More information is available here.

“xc3sprog is a suite of utilities for programming Xilinx FPGAs, CPLDs, and EEPROMs with the Xilinx Parallel Cable and other JTAG adapters under Linux.”

Despite the blurb above, it also works under Windows, which is my primary platform just now. You can find a built windows binary on sourceforge.

xc3sprog -c ftdi led_top.bit

xc3sprog

One-time flashes the device, and we get a working Hello World example! Note there is a failure message there, but it works.

Wrap Up

I hope this is useful to folks! If you have any questions, you can find me on twitter @domipheus. This project is available on github.

Designing a CPU in VHDL, Part 13: Memory system and BIOS beginnings

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Now we have text-mode HDMI/DVI-D output, it’s about time we started writing more code for TPU. However, we’ve not delved into too much detail yet about the memory subsystem – the part of the puzzle which reinterprets the various busses from the TPU module in VHDL and manages how data flows between different memories and/or mapped ‘registers’.

TPU memory interface

TPU has an address bus output, a data input bus and a data output bus. Generally CPUs have a single data bus and it’s bidirectional, but I opted for this current setup early on and have stuck with it.

For the most part, memory on the TPU ‘System on Chip’ is made up of Xilinx Block Rams. These are 2KB in size, and dual-ported, allowing them to be used as VRAM and TRAM (for an explanation of TRAM see the previous part in this series of posts). The rest of the memory subsystem is addressing logic for memory mapping the UART, switches, LEDs, and other peripheral I/O.

Due to most memory blocks being 2KB, memory is divided up into 2KB blocks/banks. The address bus selects a bank, which has it’s own chip select line.

	-- Embedded ram
	MEM_CS_ERAM_1 <= '1' when (MEM_BANK_ID = X"0"&'0') else '0'; -- 0x00 bank
	MEM_CS_ERAM_2 <= '1' when (MEM_BANK_ID = X"0"&'1') else '0'; -- 0x08 bank
	MEM_CS_ERAM_3 <= '1' when (MEM_BANK_ID = X"1"&'0') else '0'; -- 0x10 bank
	MEM_CS_ERAM_4 <= '1' when (MEM_BANK_ID = X"1"&'1') else '0'; -- 0x18 bank
	
	-- system i/o maps
	MEM_CS_SYSTEM <= '1' when (MEM_BANK_ID = X"9"&'0') else '0'; -- 0x90 bank
	
	-- 4KB of font bitmap ram
	MEM_CS_FRAM_1 <= '1' when (MEM_BANK_ID = X"A"&'0') else '0'; -- 0xA0 bank
	MEM_CS_FRAM_2 <= '1' when (MEM_BANK_ID = X"A"&'1') else '0'; -- 0xA8 bank
	
	-- 4KB of text character ram
	MEM_CS_TRAM_1 <= '1' when (MEM_BANK_ID = X"B"&'0') else '0'; -- 0xB0 bank
	MEM_CS_TRAM_2 <= '1' when (MEM_BANK_ID = X"B"&'1') else '0'; -- 0xB8 bank

We have signals for the masked off address within any given bank, and the bank ID.

	-- mem brams banks are 2KB. The following addresses within a BRAM
	MEM_2KB_ADDR <= MEM_O_addr and X"07FF";
	MEM_BANK_ID <= MEM_O_addr(15 downto 11);

Then the block rams “ebram” entities are connected as so:

	ebram_1: ebram Port map ( 
    I_clk => cEng_clk_core,
    I_cs => MEM_CS_ERAM_1,
    I_we => MEM_WE,
    I_addr => MEM_2KB_ADDR,
    I_data => MEM_O_data,
    I_size => MEM_REQ_SIZE,
    O_data => MEM_DATA_OUT_ERAM_1
	);

You’ll notice the data output from the ebram is to it’s own signal, MEM_DATA_OUT_ERAM_1. The actual signal that gets selected for input into the TPU core is chosen via a big asynchronous conditional:

	-- select the correct data to send to tpu
	MEM_I_data <= INT_DATA when O_int_ack = '1' 	
	              else  MEM_DATA_OUT_ERAM_1 when MEM_CS_ERAM_1 = '1' 
	              else  MEM_DATA_OUT_ERAM_2 when MEM_CS_ERAM_2 = '1' 
	              else  MEM_DATA_OUT_ERAM_3 when MEM_CS_ERAM_3 = '1' 
	              else  MEM_DATA_OUT_ERAM_4 when MEM_CS_ERAM_4 = '1' 
	              else  MEM_DATA_OUT_ERAM_5 when MEM_CS_ERAM_5 = '1' 
	              else  MEM_DATA_OUT_ERAM_6 when MEM_CS_ERAM_6 = '1' 
	              else  MEM_DATA_OUT_ERAM_7 when MEM_CS_ERAM_7 = '1' 
	              else  MEM_DATA_OUT_ERAM_8 when MEM_CS_ERAM_8 = '1'
					  
	              else  MEM_DATA_OUT_FRAM_1 when MEM_CS_FRAM_1 = '1' 
	              else  MEM_DATA_OUT_FRAM_2 when MEM_CS_FRAM_2 = '1'
					  
	              else  MEM_DATA_OUT_TRAM_1 when MEM_CS_TRAM_1 = '1' 
	              else  MEM_DATA_OUT_TRAM_2 when MEM_CS_TRAM_2 = '1'
					  
	              else  MEM_DATA_OUT_VRAM_1 when MEM_CS_VRAM_1 = '1' 
				        else IO_DATA ;

We could implement all of this with bidirectional/tristate signals, but maybe that’s a discussion for another post. I’ve intentionally kept bidirectional communication to the minimum, as it can easily cause confusing situations.

So you can see it’s fairly easy to move things around and see how to attach block rams into the TPU ‘memory map’. But we also have I/O!

Memory mapped I/O

Part 9 showed how memory mapped I/O was handled, and that does not change at all. There is a process in the top-level module which monitors for memory requests at certain addresses, and manipulates the IO_DATA signal in the case of any memory reads. You can see above that the data into TPU selects IO_DATA when no memory selects or interrupts are active. If we add another peripheral, we simply edit the process hanling this part of memory to update the relevant signals for any given address.

  if MEM_O_addr = X"9000" and MEM_O_we = '1' then
    -- onboard leds
    IO_LEDS <= MEM_O_data(7 downto 0);
  end if;
  
  if MEM_O_addr = X"9001" and MEM_O_we = '0' then
    -- onboard switches
    IO_DATA <= X"000" & IO_SWITCH;
  end if;

Access Violations

You’ll notice that the chip selects above don’t map all banks to block rams just now. It would be useful to know about memory locations that are not currently mapped, and that should be a simple case of making sure a chip select line is active on a memory command. If a chip select is not active, we can assume the address is unmapped, and request an interrupt to the TPU core.

First, we need to OR together all those chip select lines into one allseeing MEM_ANY_CS signal. Then, we check that signal in the I/O handler process – and if it is ever inactive during a memory operation, we know that we’re accessing unmapped memory.

MEM_proc: process(cEng_clk_core)
begin
  if rising_edge(cEng_clk_core) then
  
    if MEM_readyState = 0 then
      if MEM_O_cmd = '1' then
        
        if MEM_ANY_CS = '0' then
          -- a memory command with unmapped memory
          -- throw interrupt
          MEM_Access_error <= '1';
          MEM_Access_error_bank <= MEM_O_addr(15 downto 8);
        end if;
      
        ...snip... 

In the code above, whats missing is that at the end of the memory command, we de-assert the MEM_Access_error signal. This means that if another process sees this MEM_Access_error signal as active, we can use that to request an interrupt.

Memory Interrupt Process

This is what the memory interrupt process checks for, and acts upon.

exception_notifier: process (cEng_clk_core, MEM_Access_error)
begin
  if rising_edge(cEng_clk_core) and MEM_access_int_state = 0 then
    if MEM_Access_error = '1' then
      I_int <= '1';
      MEM_access_int_state <= 1;
      INT_DATA <= X"80"  & MEM_Access_error_bank;
    elsif MEM_access_int_state = 1 and I_int = '1' and O_int_ack = '1' then
      I_int <= '0';
      MEM_access_int_state <= 2; 
    elsif MEM_access_int_state = 2 then
      MEM_access_int_state <= 3; 
    elsif MEM_access_int_state = 3 then
      MEM_access_int_state <= 0;
    end if;
  end if;
  
end process;

This process checks each clock cycle for a memory access error, and if it notices one, it requests an interrupt, and saves the current memory bank into the lower half of the INT_DATA signal. This signal is what becomes the Interrupt Event Field, which is accessible in user code. We set he high byte of this to 0x80 – to identify the interrupt type to the interrupt handler. The rest of the code in this process is simply following the exception workflow – it waits for the interrupt acknowledge, and then waits for cycles of latency before completing.

With these changes, it now means if an un-mapped memory access occurs, our interrupt vector code is called, and when we issue the gief instruction to obtain the Event Field into a register, we’ll be able to know it’s a memory violation – and what bank the attempted access came from.

The beginnings of a BIOS

So, now that we have our text output, our UART, and a decent memory system, it’s time to start implementing a BIOS which we can leverage when building real TPU programs. So far, my bootloader contains several functions:

  • Reset vector function (which just jumps to main)
  • Interrupt vector function handler
  • Main function
  • mul16 – 16 bit multiply
  • div16 – 16 bit divide with remainder
  • putc – put character
  • puts – put string
  • setcursor – set the current cursor location in the textual screen
  • setcolour – set the current color of any characters (Set the attribute byte)
  • cls – clear screen
  • uitoa – unsigned integer to ascii
  • getc – get char from UART

At the moment, when printing the bios header (using a custom glyph set for a TPU logo), the size of this code amounts to around 1.1KB – which is pretty massive when you think about it.

bios_memory

You can see in that above image a memory figure – and this is checked at runtime. It iteratively searches through memory, reading every byte location, until an unmapped memory violation occurs. The memory test assumes that the first contiguous block of ram is the usable memory. With 8 2KB block rams connected to those first bank addresses, we have 16KB to play with.

This memory test really brought back old memories of how long you sometimes had to wait for the memory test to complete. The uitoa function relies heavily on divide/mod operations, and with software-only divide, things are slow. It’s a few seconds to work through that 16KB window. But, I quite like the fact it is slow enough that you can see the searches happen in real time.

With that, I was tempted to integrate a startup beep like old times. And, well, I’m going to do what @mmalex tells me to do in this instance!

The audio is a simple square wave through the headphone jack which I’d forgotten existed on the miniSpartan6+ board. I now have a memory-mapped register which controls it’s operation, allowing you to activate the left or right output channels. I’ll add the ability to control the tone later – for now, it’s just a cool bios beep!

Wrap Up

That pretty much explains the current memory subsystem in a bit more detail, and hopefully shows that TPU is now starting to really behave like an old vintage computer. I aim to develop the BIOS more, and have another post up my sleeve talking about a new instruction, and further BIOS progress.

Thanks for reading, as always let me know your thoughts via twitter @domipheus.

Designing a CPU in VHDL, Part 12: Text mode video output

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Whilst having a pixel-based video output on TPU is great, there is fundamental limitations with regard to resolutions and memory. It’s very hard to convey real information with such a resolution, and really what I need is the old style text modes of past. Think 80×25 characters, DOS/BIOS post screens. What is needed to implement that sort of output?

First of all, we need to fix down on our ‘text resolutions’. That is, the number of columns/rows, and the size in pixels of the glyphs we will draw. For this, I’m going to continue with 80 columns by 25 rows. This means, if our glyphs are 8×16 pixels, a screen resolution of 640×400 is needed. That fits nicely into 640×480, if you don’t mind a border on the bottom edge – 640x400x70Hz is an option too.

In addition to this, I want to be able to set colours for the text – foreground and background. I’d also want to make blinking of specific characters possible.

Text RAM

The areas of memory where the ASCII characters to render are stored is called TRAM in my design, standing for text ram (not to be confused with the .text executable sections in binaries!). For each character tile on our 80×25 character grid, we will have two bytes – the ASCII character, along with an attribute byte. This attribute byte will define the foreground and background colours for this glyph tile – along with whether or not the tile should be blinking.

80×25 2-byte characters comes out as 4,000 bytes. That will nearly fill two 2KB Xilinx block rams. I could have used 80×30, perfectly filling the whole 640×480 screen resolution, but I couldn’t bring myself to add that third block ram. Despite that, we do have plenty of them available on the miniSpartan6+ board. My LX25 variant has 52 available, for a total of 104KB storage. These block rams are integrated into my top-level TPU design in the same way as the existing VRAM, so they are both readable and writable by TPU, and readable (at a differing clock rate) for use by our new VHDL module which will generate the pixel stream required to represent our text characters.

Font RAM

The glyphs themselves are stored as 16 bytes, with 1 bit corresponding to a pixel in the output. A 1 value indicates foreground shading, whilst 0 is unsurprisingly background.

glyph_layoutWith the glyphs organized linearly as a packed array of 16-byte elements, for the full 256 range of characters, we’ll need exactly two 2KB block rams. This storage could also be implemented as a ROM, but I’m going to go ahead and use the same module I use or the text ram (and VRAM) so that the user can edit the storage to implement custom glyphs.

The character generator – text_gen

In the last part, I introduced a VGA signal generator. This module takes a pixel clock, and generates sync, blanking and an x and y coordinate for the pixel that is being output. This X and Y information is used to generate a memory address, at which VRAM contains the 16-bit 565 colour to output for that pixel. The RGB value then goes to encoders, and serialized out as DVI-D.

With this signal generator, we will first change the timings to output a 640x480x60Hz set of signals. The x and Y output will no longer form an address into vram, but will be passed into a text_gen module. This new module, for a given X and Y, will generate addresses into the text and font rams, manage the operation of the data from those rams, and eventually output a pixel value. This text_gen module needs to operate at a faster clock, as for any pixel, there could be two dependent memory reads issued which need serviced before output is provided.

For each pixel value, the 8×16 text ’tile’ index is calculated. From this, the location in tram is known – a basic tile_y*80+tile_x calculation. In the VHDL, we use the unsigned type which has the multiplication operation defined.

tram_addr_us <= (unsigned( I_y(11 downto 4)) * 80) + unsigned(I_x(11 downto 3));

This synthesizes to a DSP48A1. There are timing considerations here that I need to take into account – more on that later.

dsp48a1The 16-bit data word from TRAM is captured after several cycles of latency. This data is latched within text_gen, and the ASCII character code part of this used to calculate a further address into the font ram. This calculation is easier due to the 16-byte layout, so can be implemented with shifts. After a further few cycles of latency to allow the external memory to respond, we get a single byte equating to a row within the glyph. Using out input X pixel coordinate, we look up the relevant bit in the glyph row – which is then used to select a foreground or background colour. The colours themselves are selected using the other byte obtained from text ram – the attribute byte.

Attribute Byte

The attribute byte layout is the same as other text modes. A single blink bit, 3 bits of background colour and 4 bits of foreground. These could be interpreted in other ways (for example disabling blinking can allow for more background colours) but at the moment they simply index into one of 8 available background colours or 16 available foreground colours. I’ve fixed the colours themselves but there is no reason as to why these colours could not be memory mapped so that the palette can be changed programmatically.

attribute_byteBlinking is achieved by checking an internal counter, along with the blink attribute bit. If the blink bit is set, and the counter is in a non-blink state, the background color is chosen regardless of the glyph properties.

text_gen states


entity text_gen is
  Port ( 
    I_clk_pixel : in  STD_LOGIC;
    I_clk_pixel10x : in  STD_LOGIC;

    -- Inputs from VGA signal generator
    -- defines the 'next pixel' 
    I_blank : in  STD_LOGIC;
    I_x : in  STD_LOGIC_VECTOR (11 downto 0);
    I_y : in  STD_LOGIC_VECTOR (11 downto 0);

    -- Request data for a glyph row from FRAM
    O_FRAM_ADDR : out STD_LOGIC_VECTOR (15 downto 0);
    I_FRAM_DATA : in STD_LOGIC_VECTOR (15 downto 0);

    -- Request data from textual memory TRAM
    O_TRAM_ADDR : out STD_LOGIC_VECTOR (15 downto 0);
    I_TRAM_DATA : in STD_LOGIC_VECTOR (15 downto 0);

    -- The data for the relevant requested pixel
    O_R : out STD_LOGIC_VECTOR (7 downto 0);
    O_G : out STD_LOGIC_VECTOR (7 downto 0);
    O_B : out STD_LOGIC_VECTOR (7 downto 0)
  );
end text_gen;

text_mode_diagramI have the text generator currently running at 10x pixel clock. This is probably being too safe, and I could bring it down to 5x. I’ll have to check the timing constraints more thoroughly.

The module assumes the rows are scanned across the rows just like VGA. Each time a pixel X offset is input which we know is the start of a new glyph, a 2-byte TRAM fetch is initiated. The result of that fetch is used to latch colours from the attribute byte, and then fetch a 1-byte glyph row. That row is latched, and used by the next 8 pixels which are input to the generator. The states are short-circuited to the last stage.

I’ve attached full source of the module below.

Issues

offsetThe first run of the text_gen module was actually very successful. I initialized the text block rams with some characters, and used a font ROM that I found which implemented an ASCII character set. The display worked, albeit with characters in the wrong place. The character I expected to be at position 0,0 was actually in 2,0.

I think there is an issue with timing in terms of how much latency the DSP48 slice needs to perform the multiplication required for calculating the TRAM location. One of the things that I needed to do from the previous part is that we need the next pixel locations to be used, rather than the current pixel which is what is used now. To get around this, I implemented a simple FIFO in the VGA signal generator.

The length of the FIFO can be changed, and the module now outputs a set of signals for the current pixel, which is sent to the TMDS encoders, as well as a set of prefetch signals, which are currently 8 pixels early. These prefetch signals are sent to the text_gen and allow for plenty of time for memory and other latencies to be accounted for. With this change, the output was correct. The expected character in 0,0 was rendered at that location.

Another issue was that the colour of any character was incorrect. For example, the character at position 2,0 had the colour of character 1,0. Moving around the point where the attribute byte and colours were latched in the state machine fixed this. I had been doing lots of asynchronous operations, but performing a latching operation on the RGB pixel data made it much more stable.

Testing out custom glyphs

customglyphOne of the things I wanted was the ability to edit the font ram, and you can do that. Above you will see an image with some odd icon the the right, made up of 4 characters. I don’t really know what it is supposed to look like 🙂

Blinking in action

Wrap up

So text mode works, fairly well. This was a lot easier to get working than I thought it would. I hope to get a small demo together where input from the UART is echoed to this command prompt, and get some simple test commands working.

Thanks for reading, as always let me know your thoughts via twitter @domipheus.

 

----------------------------------------------------------------------------------
-- Company: Domipheus Labs
-- Engineer: Colin Riley
-- 
-- Create Date:    16:27:52 05/01/2016 
-- Design Name:    Text-mode output generator
-- Module Name:    text_gen - Behavioral 
-- Project Name:   
-- Target Devices: Tested on Spartan6
-- Tool versions: 
-- Description: 
--
--   For a 640x480 resolution set of input pixel locations an 80x25 text-mode 
--   representation is generated. It is assumed the x direction pixels are
--   scanned linearly.
--
--   Glyphs are stored in a font ram as 16 bytes, each bit selecting a foreground
--   or background colour to display for a given pizel in an 8x16 glyph.
--
--   A clock faster than the pixel clock is needed to account for latency from 
--   worse-case two dependant memory reads per pixel. It is adviced that pixel 
--   locations are inputted early to the text_gen so data can be prefetched.
--   
--
-- Dependencies: 
--
-- Revision: 
-- Revision 0.01 - File Created
-- Additional Comments: 
--
----------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity text_gen is
   Port ( 
     I_clk_pixel : in  STD_LOGIC;
     I_clk_pixel10x : in  STD_LOGIC;
     
     -- Inputs from VGA signal generator
     -- defines the 'next pixel' 
     I_blank : in  STD_LOGIC;
     I_x : in  STD_LOGIC_VECTOR (11 downto 0);
     I_y : in  STD_LOGIC_VECTOR (11 downto 0);
     
     -- Request data for a glyph row from FRAM
     O_FRAM_ADDR : out STD_LOGIC_VECTOR (15 downto 0);
     I_FRAM_DATA : in STD_LOGIC_VECTOR (15 downto 0);
     
     -- Request data from textual memory TRAM
     O_TRAM_ADDR : out STD_LOGIC_VECTOR (15 downto 0);
     I_TRAM_DATA : in STD_LOGIC_VECTOR (15 downto 0);
     
     -- The data for the relevant requested pixel
     O_R : out STD_LOGIC_VECTOR (7 downto 0);
     O_G : out STD_LOGIC_VECTOR (7 downto 0);
     O_B : out STD_LOGIC_VECTOR (7 downto 0)
     );
end text_gen;

architecture Behavioral of text_gen is
   -- state tracks the location in our state machine
   signal state: integer := 0;

   -- The blinking speed of characters is controlled by loctions 
   -- in this counter
   signal blinker_count: unsigned(31 downto 0) := X"00000000";

   -- _us is the result of the address computation,
   -- whereas the logic_vector is the latched output to memory
   signal fram_addr_us: unsigned(15 downto 0):= X"0000";
   signal fram_addr: std_logic_vector( 15 downto 0) := X"0000";
   signal fram_data_latched: std_logic_vector(15 downto 0);

   -- Font ram addresses for glyphs above, text ram for ascii and
   -- attributes below.
   signal tram_addr_us: unsigned(15 downto 0):= X"0000";
   signal tram_addr: std_logic_vector( 15 downto 0) := X"0000";
   signal tram_data_latched: std_logic_vector(15 downto 0);

   -- the latched current_x value we are computing
   signal current_x: std_logic_vector( 11 downto 0) := X"FFF";

   -- Current fg and bg colours
   signal colour_fg: std_logic_vector(23 downto 0) := X"FFFFFF"; 
   signal colour_bg: std_logic_vector(23 downto 0) := X"FFFFFF"; 
   signal blink: std_logic := '1';

   -- outputs for our pixel colour
   signal r: std_logic_vector(7 downto 0) := X"00";
   signal g: std_logic_vector(7 downto 0) := X"00";
   signal b: std_logic_vector(7 downto 0) := X"00";

   type colour_rom_t is array (0 to 15) of std_logic_vector(23 downto 0);
   -- ROM definition
   constant colours: colour_rom_t:=(  
   X"000000", -- 0 Black
   X"0000AA", -- 1 Blue
   X"00AA00", -- 2 Green
   X"00AAAA", -- 3 Cyan
   X"AA0000", -- 4 Red
   X"AA00AA", -- 5 Magenta
   X"AA5500", -- 6 Brown
   X"AAAAAA", -- 7 Light Gray
   X"555555", -- 8 Dark Gray
   X"5555FF", -- 9 Light Blue
   X"55FF55", -- a Light Green
   X"55FFFF", -- b Light Cyan
   X"FF5555", -- c Light Red
   X"FF55FF", -- d Light Magenta
   X"FFFF00", -- e Yellow
   X"FFFFFF"  -- f White
   );

begin


   tram_addr <= std_logic_vector(tram_addr_us);
   O_TRAM_ADDR <= tram_addr(14 downto 0) & '0';
   
   
   fram_addr <= std_logic_vector(fram_addr_us);
   O_FRAM_ADDR <= fram_addr(15 downto 0);
   
   process(I_clk_pixel)
   begin
      if rising_edge(I_clk_pixel) then
         blinker_count <= blinker_count + 1;
      end if;
   end process;
               
   process(I_clk_pixel10x)
   begin
      if rising_edge(I_clk_pixel10x) then
         if state < 8 then
            -- each clock either stay in a state, or move to the next one
            state <= state + 1;
         end if;
         
         if state = 3 then
            -- latch the data from TRAM and kick off FRAM read
            tram_data_latched <= I_TRAM_DATA;
            fram_addr_us <= (unsigned(tram_data_latched(7 downto 0)) * 16 ) + unsigned(I_y(3 downto 0));
            blink <= tram_data_latched(15);
            colour_fg <= colours( to_integer(unsigned( tram_data_latched(11 downto 8))));
            colour_bg <= colours( to_integer(unsigned( tram_data_latched(14 downto 12))));
            
         elsif state = 6 then	
            -- latch the data from FRAM
            fram_data_latched <= I_FRAM_DATA;
            state <= 8;
         
         elsif current_x /= I_x then
            if (I_x(2 downto 0) = "000") then
               
               -- Each 8-byte pixel start, set the state and kick off TRAM fetch
               state <= 1;
               -- this multiply becomes a DSP slice
               tram_addr_us <= (unsigned( I_y(11 downto 4)) * 80) + unsigned(I_x(11 downto 3));
            else
               -- short circuit straight to shade state
               state <= 7;
            end if;
            current_x <= I_x;
         
         elsif state >= 8 then
            -- shade a pixel
            
            -- If the curret pixel should be foreground, and is not in a blink state, shade it foreground
            if (fram_data_latched(7 - to_integer(unsigned(I_x(2 downto 0)))) = '1')
               and (blinker_count(24) = '1' or (blink = '0')) then
              
              r <= colour_fg(23 downto 16); 
              g <= colour_fg(15 downto 8);
              b <= colour_fg(7 downto 0);
            else
              r <= colour_bg(23 downto 16);
              g <= colour_bg(15 downto 8);
              b <= colour_bg(7 downto 0);
            end if;
         
         end if;
         
      end if;
   end process;
   
   -- When we are outside of our text area, have black pixels
   O_r <= r when unsigned(I_y) < 400 else X"00";
   O_g <= g when unsigned(I_y) < 400 else X"00";
   O_b <= b when unsigned(I_y) < 400 else X"00";

end Behavioral;


Designing a CPU in VHDL, Part 11: VRAM and HDMI output

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

I’ve been working towards HDMI output on my TPU SOC, and this week I managed to get enough of something to get pixels (very large pixels!) output to the screen.

The plan was to map an area of memory to a VRAM block, which could be read and written to form the TPU, and also read for the graphics subsystem that would generate the video signals that are to be output.

VRAM

The current ram used in TPU is a block ram primitive entity on Spartan6 – RAMB16BWER. This 16Kbit ram has two ports, which can be run at different clock rates. At the moment, we map this primitive into an ‘ebram’ component, which disables the second port, and services the block ram via the bus signals on TPU. I made a new component, handily named ‘ebram2port’ to expose the second port of the RAMB16BWER instance for read-only use.

entity ebram2port is
  Port (
    // existing 'ebram' TPU interface
    I_clk : in  STD_LOGIC;
    I_cs : in STD_LOGIC;
    I_we : in  STD_LOGIC;
    I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
    I_data : in  STD_LOGIC_VECTOR (15 downto 0);
    I_size : in STD_LOGIC;
    O_data : out  STD_LOGIC_VECTOR (15 downto 0);

    // new read-only secondary port interface
    I_p2_clk : in STD_LOGIC;
    I_p2_cs : in STD_LOGIC;
    I_p2_addr : in  STD_LOGIC_VECTOR (15 downto 0);
    O_p2_data : out  STD_LOGIC_VECTOR (15 downto 0)
  );
end ebram2port;

An instance of eram2port is created in my TPU top-level design. The chip select signals (I_cs and I_p2_cs) are driven via some conditionals which check for address ranges on the TPU output bus.

CS_VRAM <= '1' when ((MEM_O_addr >= X"C000") and (MEM_O_addr < X"C200") and (O_int_ack = '0')) else '0';
CS_ERAM <= '1' when ((MEM_O_addr < X"1000") and (O_int_ack = '0')) else '0';

CS_ERAM being the chip select for the actual embedded ram instance, with the TPU bootloader code.

The input to TPU from our data sources, such as RAM and peripherals, also needs to change.

MEM_I_data <= INT_DATA when O_int_ack = '1'
else ram_output_data when CS_ERAM = '1'
else vram_output_data when CS_VRAM = '1'
else IO_DATA ;

INT_DATA and IO_DATA busses are controlled by other external processes, and thus don’t matter much here. This code is what I’d like auto-generated from my emulator, temu – as it’s the sort of code which when duplicated to the extent I’ll need (for tens of block rams integrated) that human error comes into play. Everyone makes copy and paste errors. Everyone.

The last real item is the address, which is fed into I_addr. This must be modified from the 0xc000 – 0xc200 address that TPU sees to 0x0000 – 0x0200. This is done as you’d expect, by  simply chopping off the high 4 bits.

Now we have a VRAM block integrated into the TPU top-level module, which TPU programs can read and write to via standard memory instructions to our mapped area of memory, but which also has a second port, which can read the same memory but at a different clock rate. The difference in clock rate is the important part in this.

Graphics Output

I have been using Michael Field’s excellent DVID projects for quite a while now, trying to get my head around how HDMI signals are formed. There are four main aspects:

  • DVI is a subset of HDMI.
  • The pixel signals and timing is essentially the same as VGA.
  • The data is encoded as TMDS serial.
  • The data is then sent along 3 differential signal pairs, with a 4th pair for a clock.

My code uses the DVID test project from Michaels Hamsterworks Wiki. I’ve edited some areas of the code, for my own requirements.

VGA timing

VGA timing generally works along the lines of a pixel clock, which is set specifically to allow for the number of pixels required for your resolution to be transmitted within tolerances, along with horizontal and vertical sync signals, and a blanking flag. The pixel data itself can be thought of as a sub-image of a larger set of data which is transmitted, origin in the top left hand corner. The area to the right and bottom which is not part of the original data is ‘blank’.

vga_800x600x60The timings and the durations of these blanking periods all depends on set figures defined by standards. For example, for an 800×600, 60Hz image, the pixel clock is 40MHz. Essentially, each row can be thought of as having around 1056 pixels, with the additional pixels accounting for blanking and sync periods, where the actual pixel value doesn’t matter – it exists only for timing. An example for the resolution above lays out the exact number of pixels in each are, along with time representations.

I have a VGA signal generator, which takes the pixel clocks and counts through the pixels, outputting pixel offsets, sync and blanking bits. Within this VHDL module, the constants for our 800×600 image are as follows:

constant h_rez        : natural := 800;
constant h_sync_start : natural := 800+40;
constant h_sync_end   : natural := 800+40+128;
constant h_max        : natural := 1056;

constant v_rez        : natural := 600;
constant v_sync_start : natural := 600+1;
constant v_sync_end   : natural := 600+1+4;
constant v_max        : natural := 628;

The VGA signal generator is exposed in my TPU design as the following entity.

entity vga_gen is
Port (
pixel_clock     : in STD_LOGIC;

pixel_h : out STD_LOGIC_VECTOR(11 downto 0);
pixel_v : out STD_LOGIC_VECTOR(11 downto 0);

blank   : out STD_LOGIC := '0';
hsync   : out STD_LOGIC := '0';
vsync   : out STD_LOGIC := '0'
);
end vga_gen;

The pixel_h and pixel_v offsets then combine to form an address which can be looked up in VRAM, which holds the pixel data.

Generating the TMDS data

The actual image data we’ll send over the HDMI cable is actually DVI. The Way HDMI and DVI send image data can be pretty much the same. HDMI can carry more varied data, such as sound – but thats really just hidden in the blanking periods of the communicated image.

TMDS (or Transition-minimized differential signalling if you want the full name!) is a method for transmitting serial data at high clock rates over varying length cables. It has methods for reducing the effects of electromagnetic interference. You can read more about it over at Wikipedia.

The main understanding required is that it’s a form of 8b/10b encoding. 8 bits of data are encoded as 10 bits in such a way that the number of transitions to 1 or 0 states are balanced. This allows the DC voltage to be at a sustained average level – which has various benefits.

Michael has a few TMDS encoder modules available on his various projects, going from basic ones which match low-end 3-bit per pixel input to fixed outputs, to a real encoder capable of the full range of 8bit per pixel RGB. I use the full encoder without modifications. A simple flow of how it works is as follows (again, from Wikipedia):

A two-stage process converts an input of 8 bits into a 10 bit code.

    1. In the first stage, the first bit is transformed and each subsequent bit is either XOR or XNOR transformed against the previous bit.
      The encoder chooses between XOR and XNOR by determining which will result in the fewest transitions. The ninth bit encodes which operation was used.
    2. In the second stage, the first eight bits are optionally inverted to even out the balance of ones and zeros and therefore the sustained average DC level; the tenth bit encodes whether this inversion took place.

With this encoder, we can get the 10 bits we then need to serialize across the cable to our monitor.

Serializing the TMDS data

To serialize the TMDS data to our differential output pairs, we use Double Data rate Registers (ODDR2). These registers are implemented as primitives in the VHDL. Using these DDR registers, we only need a serialization clock 5x that of the pixel clock, rather than 10x. There are ‘true’ serialization primitives available on Spartan6, which I may look at later (there is a SERDES example on Hamsterworks for those interested).

oddr2

ODDR2_red   : ODDR2
generic map (
DDR_ALIGNMENT => "C0",
INIT => '0',
SRTYPE => "ASYNC"
)
port map (
Q => red_s,
D0 => shift_red(0),
D1 => shift_red(1),
C0 => clk,
C1 => clk_n,
CE => '1',
R => '0',
S => '0'
);

Each pixel clock, the 10-bit TMDS value for each pixel is latched. Each subsequent cycle of the 5x pixel clock, the TMDS value is shifted 2 bits to the right. The low 2 bits are then fed into the D0 and D1 inputs of our DDR2 register. The clock inputs C0 and C1 are both 5x pixel, so 200MHz, but the C1 clock input is 180 degrees out of phase.

oddr2_waveThe output of this register, red_s, is then fed into an OBUFDS primitive which drives the TMDS pair output, which is connected to the HDMI socket pins on the miniSpartan6+ board.

obufds

OBUFDS_red   : OBUFDS port map (
O  => hdmi_out_p(2),
OB => hdmi_out_n(2),
I  => red_s   );

There is similar for the other 3 channels. It goes in the order 0:Blue, 1:Green, 2:Red, 3:Clock.

Clocking

At the moment my clocking system needs work, but it’s fixed just now to my needs for 800x600x60Hz. For this, the 50MHz miniSpartan6+ input clock is buffered, then input into a PLL which multiplies it by 20 to 800MHz, before dividing it to 40MHz for the pixel clock, and 200MHz for the serial drivers. There is also a second 200MHz output, 180 degrees out of phase, used in the ODDR registers as clk_n.

PLL_BASE_inst : PLL_BASE
generic map (
CLKFBOUT_MULT => 16,           --800MHz
CLKOUT0_DIVIDE => 20,          --40MHz
CLKOUT0_PHASE => 0.0,

CLKOUT1_DIVIDE => 4,           --200MHz
CLKOUT1_PHASE => 0.0,

CLKOUT2_DIVIDE => 4,           --200MHz
CLKOUT2_PHASE => 180.0,

CLK_FEEDBACK => "CLKFBOUT",    -- Clock source to drive CLKFBIN 
CLKIN_PERIOD => 20.0,          -- IMPORTANT! 20.00 = 50MHz
DIVCLK_DIVIDE => 1             -- Division value for all output clocks (1-52)
)
port map (
CLKFBOUT => clk_feedback,
CLKOUT0  => clock_x1_unbuffered,
CLKOUT1  => clock_x5_unbuffered,
CLKOUT2  => clock_x5_180_unbuffered,
CLKOUT3  => open,
CLKOUT4  => open,
CLKOUT5  => open,
LOCKED   => pll_locked,
CLKFBIN  => clk_feedback,
CLKIN    => clk50_buffered,
RST      => '0'
);

As with the 50MHz input, the 3 clock outputs are buffered before being used in the various subsystems. For this the BUFG primitive is used.

bufg

VRAM Interface

At the moment I have the second port on my ‘vram’ instance clocked at 200MHz. The first port, which TPU uses, is clocked at 50MHz. 200MHz within the allowable operating range for the device I’m using, and it seems to work well. At the moment, I’m pretty sure that I am 1 pixel out of phase, but I can fix that later. The address that the VRAM sees is the following

-- generate the vram scan address, forcing reads at 2 byte boundaries
vram_addr <= X"0" &"000" & pixel_v(8 downto 5) & pixel_h(8 downto 5) & '0';

-- Only show 512x512 of the display with our expanded virtual pixels
vram_data <= vram_output_2 when ((pixel_h(11 downto 9) = "000") and (pixel_v(11 downto 9) = "000"))
else X"0000";

The ‘VRAM’ is currently set up to contain a 16×16 image. Tiny, but perfect for what I need just now. The 16-bit pixels are in 565 format, and I trivially expand that to 8-bit for the TMDS encoders.

Now we have an integrated graphics subsystem, albeit one that is very rigid (for now).
tpu_gfx_toplevel

Issues

I currently need to have the following definition in my constraints file for the clock:

CLOCK_DEDICATED_ROUTE = FALSE;

Without it, the VHDL doesn’t route. It compiles and works fine (seemingly) with that included, but I’ve still to nail down exactly what it means, an how to fix it. Currently, my understanding is that when my VHDL is built, the compilers can’t generate clock placement which satisfies all the rules set. It’s something I want to understand further. It could be as simple as missing out some buffers.

There is also a line of pixels to the far right of the displayed screen, which suggests I’m out of phase by one pixel with the memory read results and the VGA signals. This isn’t too bad, so I’ll look at fixing that along when I increase the VRAM size for higher resolution.

Conclusion

This brings this part to a close. We have HDMI output which is the representation of a small VRAM that TPU controls. It’s pretty neat. I hope to increase the resolution of the image from it’s current 16×16 to something more manageable.

The emulator was very useful during this, as it validated my output for me. Ignore the 2 bright green pixels in the superimposed emulator output 😉

hdmioutput

(The HDMI output to the left is actually 16×16, combination of lighting and a bad camera seems to give the impression of 8×8)

Thanks for reading, as always let me know your thoughts via twitter @domipheus. Also, many thanks to Michael Field and his DVID project of which the bulk of this post is derived from.