Designing a CPU in VHDL, Part 11: VRAM and HDMI output

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

I’ve been working towards HDMI output on my TPU SOC, and this week I managed to get enough of something to get pixels (very large pixels!) output to the screen.

The plan was to map an area of memory to a VRAM block, which could be read and written to form the TPU, and also read for the graphics subsystem that would generate the video signals that are to be output.

VRAM

The current ram used in TPU is a block ram primitive entity on Spartan6 – RAMB16BWER. This 16Kbit ram has two ports, which can be run at different clock rates. At the moment, we map this primitive into an ‘ebram’ component, which disables the second port, and services the block ram via the bus signals on TPU. I made a new component, handily named ‘ebram2port’ to expose the second port of the RAMB16BWER instance for read-only use.

entity ebram2port is
  Port (
    // existing 'ebram' TPU interface
    I_clk : in  STD_LOGIC;
    I_cs : in STD_LOGIC;
    I_we : in  STD_LOGIC;
    I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
    I_data : in  STD_LOGIC_VECTOR (15 downto 0);
    I_size : in STD_LOGIC;
    O_data : out  STD_LOGIC_VECTOR (15 downto 0);

    // new read-only secondary port interface
    I_p2_clk : in STD_LOGIC;
    I_p2_cs : in STD_LOGIC;
    I_p2_addr : in  STD_LOGIC_VECTOR (15 downto 0);
    O_p2_data : out  STD_LOGIC_VECTOR (15 downto 0)
  );
end ebram2port;

An instance of eram2port is created in my TPU top-level design. The chip select signals (I_cs and I_p2_cs) are driven via some conditionals which check for address ranges on the TPU output bus.

CS_VRAM <= '1' when ((MEM_O_addr >= X"C000") and (MEM_O_addr < X"C200") and (O_int_ack = '0')) else '0';
CS_ERAM <= '1' when ((MEM_O_addr < X"1000") and (O_int_ack = '0')) else '0';

CS_ERAM being the chip select for the actual embedded ram instance, with the TPU bootloader code.

The input to TPU from our data sources, such as RAM and peripherals, also needs to change.

MEM_I_data <= INT_DATA when O_int_ack = '1'
else ram_output_data when CS_ERAM = '1'
else vram_output_data when CS_VRAM = '1'
else IO_DATA ;

INT_DATA and IO_DATA busses are controlled by other external processes, and thus don’t matter much here. This code is what I’d like auto-generated from my emulator, temu – as it’s the sort of code which when duplicated to the extent I’ll need (for tens of block rams integrated) that human error comes into play. Everyone makes copy and paste errors. Everyone.

The last real item is the address, which is fed into I_addr. This must be modified from the 0xc000 – 0xc200 address that TPU sees to 0x0000 – 0x0200. This is done as you’d expect, by  simply chopping off the high 4 bits.

Now we have a VRAM block integrated into the TPU top-level module, which TPU programs can read and write to via standard memory instructions to our mapped area of memory, but which also has a second port, which can read the same memory but at a different clock rate. The difference in clock rate is the important part in this.

Graphics Output

I have been using Michael Field’s excellent DVID projects for quite a while now, trying to get my head around how HDMI signals are formed. There are four main aspects:

  • DVI is a subset of HDMI.
  • The pixel signals and timing is essentially the same as VGA.
  • The data is encoded as TMDS serial.
  • The data is then sent along 3 differential signal pairs, with a 4th pair for a clock.

My code uses the DVID test project from Michaels Hamsterworks Wiki. I’ve edited some areas of the code, for my own requirements.

VGA timing

VGA timing generally works along the lines of a pixel clock, which is set specifically to allow for the number of pixels required for your resolution to be transmitted within tolerances, along with horizontal and vertical sync signals, and a blanking flag. The pixel data itself can be thought of as a sub-image of a larger set of data which is transmitted, origin in the top left hand corner. The area to the right and bottom which is not part of the original data is ‘blank’.

vga_800x600x60The timings and the durations of these blanking periods all depends on set figures defined by standards. For example, for an 800×600, 60Hz image, the pixel clock is 40MHz. Essentially, each row can be thought of as having around 1056 pixels, with the additional pixels accounting for blanking and sync periods, where the actual pixel value doesn’t matter – it exists only for timing. An example for the resolution above lays out the exact number of pixels in each are, along with time representations.

I have a VGA signal generator, which takes the pixel clocks and counts through the pixels, outputting pixel offsets, sync and blanking bits. Within this VHDL module, the constants for our 800×600 image are as follows:

constant h_rez        : natural := 800;
constant h_sync_start : natural := 800+40;
constant h_sync_end   : natural := 800+40+128;
constant h_max        : natural := 1056;

constant v_rez        : natural := 600;
constant v_sync_start : natural := 600+1;
constant v_sync_end   : natural := 600+1+4;
constant v_max        : natural := 628;

The VGA signal generator is exposed in my TPU design as the following entity.

entity vga_gen is
Port (
pixel_clock     : in STD_LOGIC;

pixel_h : out STD_LOGIC_VECTOR(11 downto 0);
pixel_v : out STD_LOGIC_VECTOR(11 downto 0);

blank   : out STD_LOGIC := '0';
hsync   : out STD_LOGIC := '0';
vsync   : out STD_LOGIC := '0'
);
end vga_gen;

The pixel_h and pixel_v offsets then combine to form an address which can be looked up in VRAM, which holds the pixel data.

Generating the TMDS data

The actual image data we’ll send over the HDMI cable is actually DVI. The Way HDMI and DVI send image data can be pretty much the same. HDMI can carry more varied data, such as sound – but thats really just hidden in the blanking periods of the communicated image.

TMDS (or Transition-minimized differential signalling if you want the full name!) is a method for transmitting serial data at high clock rates over varying length cables. It has methods for reducing the effects of electromagnetic interference. You can read more about it over at Wikipedia.

The main understanding required is that it’s a form of 8b/10b encoding. 8 bits of data are encoded as 10 bits in such a way that the number of transitions to 1 or 0 states are balanced. This allows the DC voltage to be at a sustained average level – which has various benefits.

Michael has a few TMDS encoder modules available on his various projects, going from basic ones which match low-end 3-bit per pixel input to fixed outputs, to a real encoder capable of the full range of 8bit per pixel RGB. I use the full encoder without modifications. A simple flow of how it works is as follows (again, from Wikipedia):

A two-stage process converts an input of 8 bits into a 10 bit code.

    1. In the first stage, the first bit is transformed and each subsequent bit is either XOR or XNOR transformed against the previous bit.
      The encoder chooses between XOR and XNOR by determining which will result in the fewest transitions. The ninth bit encodes which operation was used.
    2. In the second stage, the first eight bits are optionally inverted to even out the balance of ones and zeros and therefore the sustained average DC level; the tenth bit encodes whether this inversion took place.

With this encoder, we can get the 10 bits we then need to serialize across the cable to our monitor.

Serializing the TMDS data

To serialize the TMDS data to our differential output pairs, we use Double Data rate Registers (ODDR2). These registers are implemented as primitives in the VHDL. Using these DDR registers, we only need a serialization clock 5x that of the pixel clock, rather than 10x. There are ‘true’ serialization primitives available on Spartan6, which I may look at later (there is a SERDES example on Hamsterworks for those interested).

oddr2

ODDR2_red   : ODDR2
generic map (
DDR_ALIGNMENT => "C0",
INIT => '0',
SRTYPE => "ASYNC"
)
port map (
Q => red_s,
D0 => shift_red(0),
D1 => shift_red(1),
C0 => clk,
C1 => clk_n,
CE => '1',
R => '0',
S => '0'
);

Each pixel clock, the 10-bit TMDS value for each pixel is latched. Each subsequent cycle of the 5x pixel clock, the TMDS value is shifted 2 bits to the right. The low 2 bits are then fed into the D0 and D1 inputs of our DDR2 register. The clock inputs C0 and C1 are both 5x pixel, so 200MHz, but the C1 clock input is 180 degrees out of phase.

oddr2_waveThe output of this register, red_s, is then fed into an OBUFDS primitive which drives the TMDS pair output, which is connected to the HDMI socket pins on the miniSpartan6+ board.

obufds

OBUFDS_red   : OBUFDS port map (
O  => hdmi_out_p(2),
OB => hdmi_out_n(2),
I  => red_s   );

There is similar for the other 3 channels. It goes in the order 0:Blue, 1:Green, 2:Red, 3:Clock.

Clocking

At the moment my clocking system needs work, but it’s fixed just now to my needs for 800x600x60Hz. For this, the 50MHz miniSpartan6+ input clock is buffered, then input into a PLL which multiplies it by 20 to 800MHz, before dividing it to 40MHz for the pixel clock, and 200MHz for the serial drivers. There is also a second 200MHz output, 180 degrees out of phase, used in the ODDR registers as clk_n.

PLL_BASE_inst : PLL_BASE
generic map (
CLKFBOUT_MULT => 16,           --800MHz
CLKOUT0_DIVIDE => 20,          --40MHz
CLKOUT0_PHASE => 0.0,

CLKOUT1_DIVIDE => 4,           --200MHz
CLKOUT1_PHASE => 0.0,

CLKOUT2_DIVIDE => 4,           --200MHz
CLKOUT2_PHASE => 180.0,

CLK_FEEDBACK => "CLKFBOUT",    -- Clock source to drive CLKFBIN 
CLKIN_PERIOD => 20.0,          -- IMPORTANT! 20.00 = 50MHz
DIVCLK_DIVIDE => 1             -- Division value for all output clocks (1-52)
)
port map (
CLKFBOUT => clk_feedback,
CLKOUT0  => clock_x1_unbuffered,
CLKOUT1  => clock_x5_unbuffered,
CLKOUT2  => clock_x5_180_unbuffered,
CLKOUT3  => open,
CLKOUT4  => open,
CLKOUT5  => open,
LOCKED   => pll_locked,
CLKFBIN  => clk_feedback,
CLKIN    => clk50_buffered,
RST      => '0'
);

As with the 50MHz input, the 3 clock outputs are buffered before being used in the various subsystems. For this the BUFG primitive is used.

bufg

VRAM Interface

At the moment I have the second port on my ‘vram’ instance clocked at 200MHz. The first port, which TPU uses, is clocked at 50MHz. 200MHz within the allowable operating range for the device I’m using, and it seems to work well. At the moment, I’m pretty sure that I am 1 pixel out of phase, but I can fix that later. The address that the VRAM sees is the following

-- generate the vram scan address, forcing reads at 2 byte boundaries
vram_addr <= X"0" &"000" & pixel_v(8 downto 5) & pixel_h(8 downto 5) & '0';

-- Only show 512x512 of the display with our expanded virtual pixels
vram_data <= vram_output_2 when ((pixel_h(11 downto 9) = "000") and (pixel_v(11 downto 9) = "000"))
else X"0000";

The ‘VRAM’ is currently set up to contain a 16×16 image. Tiny, but perfect for what I need just now. The 16-bit pixels are in 565 format, and I trivially expand that to 8-bit for the TMDS encoders.

Now we have an integrated graphics subsystem, albeit one that is very rigid (for now).
tpu_gfx_toplevel

Issues

I currently need to have the following definition in my constraints file for the clock:

CLOCK_DEDICATED_ROUTE = FALSE;

Without it, the VHDL doesn’t route. It compiles and works fine (seemingly) with that included, but I’ve still to nail down exactly what it means, an how to fix it. Currently, my understanding is that when my VHDL is built, the compilers can’t generate clock placement which satisfies all the rules set. It’s something I want to understand further. It could be as simple as missing out some buffers.

There is also a line of pixels to the far right of the displayed screen, which suggests I’m out of phase by one pixel with the memory read results and the VGA signals. This isn’t too bad, so I’ll look at fixing that along when I increase the VRAM size for higher resolution.

Conclusion

This brings this part to a close. We have HDMI output which is the representation of a small VRAM that TPU controls. It’s pretty neat. I hope to increase the resolution of the image from it’s current 16×16 to something more manageable.

The emulator was very useful during this, as it validated my output for me. Ignore the 2 bright green pixels in the superimposed emulator output 😉

hdmioutput

(The HDMI output to the left is actually 16×16, combination of lighting and a bad camera seems to give the impression of 8×8)

Thanks for reading, as always let me know your thoughts via twitter @domipheus. Also, many thanks to Michael Field and his DVID project of which the bulk of this post is derived from.

 

Dear ImGui, Thanks, From TEMU – The TPU Emulator

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

A few weeks ago I was in San Francisco for the Game Developers Conference (GDC). I decided not to take my MiniSpartan6+ board with me, despite wanting to get more work on TPU completed. Bare circuit boards don’t look good in luggage, etc.

I did however have an idea on the flight over from London Heathrow, so created a new Visual Studio project: temu, the TPU Emulator. I’ve been working on HDMI output of a small framebuffer in the VHDL, so thought I could make a compact emulator which would draw the framebuffer to a window whilst executing some TPU code. This would allow me to get some demos up and running quickly, so that when I went back to the VHDL, I could hit the ground running.

consoleoutAnnoyingly, I forgot to bring the latest TPU ISA document with me on my trip, and the version on GitHub is hilariously out of date. I did, however, have the latest HDL and the TPU assembler, tasm, so I worked backwards from there. The emulator itself is basic – I implemented the most important instructions, ignoring some of the smaller flags (such as signed addition, I don’t need any of the status flags of that for now). The emulator used stdout and just printed the current PC with the instruction it emulated, and in the case of memory operations output the values written along with the locations. I used SDL to open a window, and each frame processed the data in my ‘vram’ memory, writing pixels directly to the surface.

The main function simply had the standard SDL loop, with a WriteSurface() call which trivially wrote pixel values to the window, converting from my 565 16-bit pixel format to what is required of the SDL surface. The emulation happened in another thread. That other thread just loops until exit performing the emulation of current pc, then setting the next pc. There is no synchronization required, as the main thread simply reads the fake vram (simply a char array). The thread is simply spawned, and allowed to emulate alongside the SDL loop.

justvramIt worked well, until it didn’t. It wasn’t a threading issue, it worked great. The issue is I wanted more. I needed a tool, more than simply an emulator. I wanted single step, memory inspection. I wanted a debugger inside my emulator.

Dear ImGui

I had the emulator working by the time GDC began proper, and was wondering what the next steps would be to make it more usable. I knew I needed some sort of real-time interaction, but didn’t want all the hassle of implementing that myself. I was at a GDC breakfast gathering of developers, and Omar (@ocornut) was there, who created a library I’d heard of but never used before: Dear ImGui. It is a cross-platform Immediate Mode GUI library. The answer I was looking for was staring me in the face. Switch SDL to OpenGL mode, and use Dear ImGUI to add an interactive interface to my emulator.

Later in the day when I had some spare time I tried to get things working. Within 15 minutes (and I really do mean only 15 minutes) I’d downloaded the sources required, integrated them into my temu project, and got a window showing FPS and the representation of my VRAM. Most of those 15 minutes, may I add, was looking up how to create and re-upload texture data in OpenGL – its been so long since I’d done any OpenGL coding I’d forgot the basics.

The architecture of the emulator (if you can call it that) is still as before: The SDL OpenGL loop and ‘vram’ texture update remains in the main thread, with the emulation happening in another. I keep to the single producer single consumer with no timing or order constraints, so no synchronization is needed.

vramThe code for my VRAM window is trivial. It’s ever so slightly changed from a sample, used to show drawing an image. It fits my requirements perfectly.

void DrawWindow_Vram()
{
  if (window_vram) {
    ImGui::Begin("VRAM Representation", &window_vram);
    ImGui::Image(ImTextureID(textureID), ImVec2(VIRTUAL_SCREEN_HEIGHT, VIRTUAL_SCREEN_WIDTH));
    ImGui::Text("Application average %.3f ms/frame (%.1f FPS)", 1000.0f / ImGui::GetIO().Framerate, ImGui::GetIO().Framerate);
    ImGui::End();
  }
}

This code is called each frame. All I need to do is set window_vram to true, and the window is shown. Pressing the close window button will set the bool pointed to in the second argument of Begin() to false, which closes the window. A button elsewhere can set window_vram back to true to show the window again. All ‘widgets’ within the Begin and End calls are ordered sequentially, and there are constructs for ordering things in columns, frames, lines etc. It’s all very straightforward. There is a large demo here which shows the basics and advanced layouts and how they fit together.

This library is brilliant. As someone who has written GUI and UI system code in the past, this has the concept nailed.

Once I realized how easy it was to add different windows, a control window popped up, allowing me to stop execution, resume it, and single-step. It also controlled an arbitrary millisecond delay before each instruction was emulated.

control


void DrawWindow_Control()
{
  if (window_control)
  {
    ImGui::Begin("TPU control", &window_control, ImGuiWindowFlags_MenuBar);
    ImGui::BeginMenuBar();
    ImGui::Text("Status: %s", (gStatePaused) ? ((gStateSingleStepping) ? "Single Stepping":"Stopped") : "Running...");
    ImGui::EndMenuBar();
    		
    if (!gStatePaused)
    {
      if (ImGui::Button("Stop")) 
      {
        gStatePaused = true;
      }
    }
    else
    {
      if (ImGui::Button("Continue")) 
      {
        gStatePaused = false;
      }
    }
    ImGui::SameLine();
    if (ImGui::Button("Single Step"))
    {
      gStateSingleStepping = true;
    }

    ImGui::SliderInt("Cycle Delay (ms)", &gSeepTime, 0, 100);
    ImGui::Checkbox("Status Prints (stdout)", &gStatusPrints);
    ImGui::End();
  }
}

This rocks.

startAfter this, I thought of what else to do. So I added quite a lot of extra functionality. Im just going to go through the windows that now exist in the TPU emulator, giving a bit of blurb about each.

Registers

registersThe registers window is pretty obvious. I’d like to add to it, including some internal registers – like status, which tracks overflow and carry style information.

Memory Map

memory_mapThe memory map/layout is essential information when developing TPU programs, since it can change so often when the top-level VHDL is edited. Things like block rams and memory-mapped registers/signals, like what is used for button inputs and LED output, need to be placed in the correct memory-mapped offsets. This window serves this purpose. At the moment, it’s fixed, but it’s designed to be extensible, in that you can add another block ram at a certain place, define whether its read/write or read and write, and then view how it fits into the whole scheme of things. A further feature I’d like to implement is a generation button which outputs top-level vhdl for a memory management unit, which will take the memory bus signals from TPU and map that to the various areas, providing the relevant clock enables and other state.

Memory Viewer

memoy_viewFrom the Memory Layout window, clicking view on any mapped element opens a memory viewer. This widget is available on the ImGui wiki on github.

Interactive Assembler

assemblerThe interactive assembler allows the TPU code to be updated within the emulator, at runtime. There is a small multi-line text box, containing the contents of an assembly file. There is an assemble button which when clicked invokes TASM on the file, opting to output as a binary blob, rather than the VHDL block-ram initializer format I usually use. This binary is then loaded into ram starting at address 0x0000. This address (at present) is fixed, more due to limitations in TASM than anything else.

This allows some really easy development. I can change constants and swap instructions on the fly without even resetting the emulator state, so in the case of drawing colours to the screen the results are instantly visible. For other changes it’s wise to either pause execution, or reset the TPU (Set the PC to 0 and re-initialize all memories), but it’s still miles away from my previous workflow which was insanely tedious, involving VHDL synthesis steps which take minutes.

Buttons

buttons Buttons! There needed to be input somewhere, and there are 4 input switches on my miniSpartan6+ board. I decided to expand that for the emulator, simply due to the fact that I can add more I/O to the hardware device anyway. The buttons are memory mapped to a single 16-bit word, and there is a window showing a checkbox for each bit.

Leds

ledsAs with the buttons, LEDs provide quick and easy visible status output on my development board, and so they have a window in the emulator. 8 LEDS map to a single byte in the memory space. The code for it, again, is trivial.


void DrawWindow_OutputLeds()
{
  // Storage for the led status values, ImGui expects bools
  static bool ledvals[8];

  if (window_outputleds) {
    // Expand the bit values in the memory_leds mapped area to booleans
    for (uint32_t i = 0; i < 8; i++)
    {
      ledvals[i] = (memory_leds & (1 << i))?true:false;
    }
  
    ImGui::Begin("Output Leds", &window_outputleds);
    ImGui::Text("Mapping: 0x%04X", memory_mapping_leds);
  
    // Arrange the checkboxes and text labels in 8 columns
    ImGui::Columns(8, 0, false);
    // Make the check colour green, a much better LED colour
    ImGui::PushStyleColor(ImGuiCol_CheckMark, ImVec4(0, 1, 0, 1));
  
    for (int i = 0; i < 8; i++) {
      ImGui::PushID(i);
      ImGui::Text("%d", 7-i);
      ImGui::Checkbox("", &ledvals[7-i]);
      ImGui::NextColumn();
      ImGui::PopID();
    }
  
    // undo the change to the checkmark colour
    ImGui::PopStyleColor();
    ImGui::End();
  }
}

Interrupt Firer

interrupt_firerThe next window is for testing interrupt routines. You can set what the Interrupt Event Field value will be (what is read externally off the data bus in the hardware on a interrupt ACK) and then signal to the emulator that an interrupt is waiting to be requested. If interrupts are disabled in the current emulator state, this can be forced to enabled from here, and a small status console shows the stage of the interrupt, such as ‘Waiting for ACK’.

Breakpoint Set

breakpointI mentioned debuggers earlier, and this window allows the emulator to get closer to that goal. A single breakpoint can be set by PC value, and enabled. There is nothing more than that just now – although it can be easily expanded to multiple breakpoints. An issue in the emulator at the moment is one which crops up in real debuggers, in that to continue from the breakpoint you need to disable the breakpoint, single step, re-enable the breakpoint, and only then continue. Something to fix later, when it causes me more of a headache than it does now.

PC Hit Counts

histogramAnother window came about from me looking over the ImGui demo, and trying to see what built-in widgets could add some simple functionality. The histogram seemed a winner almost instantly, and it proved itself when I came to try out the interrupt firer feature. The emulator, when executing instructions, grows an std::vector to accommodate the location of the PC. It then increments the value in the bin located at that PC. With this, the histogram is implemented with a single library function:

if (pc_hitcounts.size() > 0) {
  ImGui::PlotHistogram("##histo", &pc_hitcounts[0], pc_hitcounts.size(), 0, nullptr, 0.0f, pc_maxhits / hist_zoom, ImVec2(0, 140));
}

It’s basically a PC hit counter, so is a very simple profiler. Your hot code is the high bars, but you can also use it to identify code coverage issues. For instance, ensuring that the exception handler code you’ve written actually gets executed when you fire an interrupt.

I made a simple video of me using TEMU, showing some of the features and how they work and help things. It’s really helped speed up how quickly I can write TPU assembly, which in turn means I can develop new hardware features quicker. I aim to fix the emulator up (things like project files don’t really exist just no) and upload it to github eventually.

Thanks for reading! Let me know what you think by messaging me on twitter @domipheus, and generally pour praise in the direction of @ocornut for his wonderful library – you can support him via Patreon 🙂

Designing a CPU in VHDL, Part 10b: A very irritating issue, resolved.

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

It’s been a significant amount of time between this post and my last TPU article. A variety of things caused this – mainly working on a few other projects – but also due to an issue I had with TPU itself.

I had been hoping to interface TPU with an ESP8266 Wifi module, using the UART. For those not aware, the ESP8266 is a nifty little device comprising of a chipset containing the wifi radio but also a microcontroller handling the full TCP/IP stack. They are very simple to interface over a UART, working with simple AT commands. You can connect to a wifi network and spawn a server listening for requests in a few string operations – very powerful.

I started writing in TPU assembly the various code areas I’d need to interface with ESP8266. They were mainly string operations over the UART – so, things like SendString, RecieveString, and ExpectString, where the function waits until it matches a string sequence received from the UART. The code and data strings needed for a very simple test is well over 1KB which was a lot of code for this. I created various test benches and the code eventually passed those tests in the simulator.

At this point, I thought things were working well. However, on flashing the miniSpartan6+ board with my new programming file, nothing worked. I could tell the CPU was running something, but it was not the behavior I expected and integrated into the embedded block ram for execution.

When this happens, I usually expect that I’ve done something stupid in the VHDL and so I create post-translate simulation models of my design. This basically spits out further VHDL source which represents TPUs design but using general FPGA constructs, after compilation steps. In the software sense, imagine compiling C++ to C source – you’d see the internals of how virtual function calls work, and how the code increases in size with various operations. You can then simulate these post-translate models, which (in theory) give more accurate results.

The simulation completed (and took a lot longer than normal simulation), and the odd behavior persisted. So standard behavioral simulation worked, and post-translate model simulation failed – just like when on device. This is good, we can reproduce in the simulator.

Looking at the waveforms in the simulator, I could see what was going on: my code in the block ram was being corrupted somehow. When simulating my normal code, the waveform was as follows:

good_sim_run_8012The important part of the waveform is the block of 8012 on mem_i_data, bottom right of the image. That is the value of location 0x001c, as set in the block ram. However, when running my post-translate model, the following result occurred:

bad_sim_run_0012The 0x8012 data is now 0x0012. The high byte has been reset/written/removed. The code itself was setting the high-order memory mapped UART address, so that failing explains why the TPU never worked with the ESP8266 chip.

  ...
  write.w r0, r1, 5
   
  ##Send command to uart 0x12
  load.h  r0, 0x12
  ...

The code itself above performs a write, before loading 0x12 into the high byte of r0. you can see from the simulation waveform that the write enable goes high – this is from the write.w instruction. The instructions following a write were having any memory address destination location overwritten.

It took far too long to realise what was causing it. Looking at the simulation waveforms for both behavioral and post-translate the actions leading up to the corruption seemed identical. The memory bus signals were all changing at the same cycle and being sampled at the same time, but, as the issue always happened with a write operation, attention was drawn to the write enable line. I tried various things at this point, spending an embarrassing amount of time on it.

The issue was staring me in the face, and in both waveforms – albeit not completely surfacing. When I had my previous embedded RAM implementation (before I moved to RAMB16BWER block ram) I sampled all inputs differently, only on the CMD line going active. The block rams were connected directly. The chip select signals are generated from the address which is active longer than needs be, and also, more importantly, the write enable remains active for a whole memory ‘cycle’. The remedy was to feed the block ram write enable with (WE or CMD) from the TPU output. This means WE only goes active for a single cycle. as with CMD.

Don’t keep your write enables active longer than needed, folks!

TPU is continuing very slowly, but progress is being made. Whilst trying to fix this issue I also integrated the UARTs that Xilinx provide for use. They have 16 entry FIFO buffers on both receive and transmit. This may help when interfacing with various other items, like the ESP8266. I’m still interested as to why the simulator didn’t show any differences in signal timing in the post-translate mode, despite being able to reproduce the issue. If anyone knows hints and tips for figuring out issues in this area, please let me know! This issue should really have been noticed by me sooner, though.

Thanks for reading, as always.

Designing a CPU in VHDL, Part 10: Interrupts and Xilinx block RAMs

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Part 10 was supposed to be a very big part, with a special surprise of TPU working with a cool peripheral device, but that work is still ongoing. It’s taking a long time to do, mostly due to being busy myself over the past few weeks. However, in this update, I’ll look at bringing interrupts to TPU, as well as fixing an issue with the embedded ram that was causing bloating of the synthesized design.

Interrupts

Interrupts are needed on a CPU which is expected to work with multiple asynchronous devices whilst also doing some other computation. You can always have the CPU poll, but sometimes that isn’t wise and/or suitable given other constraints. It’s also good for keeping time with something – vsync, for example. This is where interrupts come in – where a signal fed to the CPU externally can “interrupt” what the CPU is currently executing, and perform some other computation before returning to it’s previous task.

The way I have implemented the interrupts is similar to the Z80 maskable interrupts, with an external interrupt input and an interrupt acknowledge output. The system is simplified and doesn’t have the different types of modes and non-maskable interrupts available on the Z80 but it should be enough for the needs of TPU. You can only handle a single request at a time, and there is only one mode to work with – but it’s powerful enough for most situations.

An overview of how the interrupts will work are as follows:

  • At some point during execution, the system will make the interrupt input to TPU high, indicating they want the interrupt handler run.
  • At the next writeback stage of the pipeline, just before migrating to the fetch stage, the interrupt input is sampled.
  • If an interrupt is requested, the control unit will then make the interrupt acknowledge output from TPU active.
  • Once the interrupt ACK signal is seen externally to TPU, 16-bits of data can be placed on the data input to TPU.
  • After a predetermined number of cycles, the bits on the data in bus are stored.
  • The ACK is de-asserted, and the PC of TPU is set to the interrupt handler.
  • The handler can retrieve the data from the data bus via a new instruction, and also return to the previous PC before the interrupt was acknowledged.
  • The external interrupt input is latched, so until it goes inactive for a cycle, remaining active will not invoke another interrupt handler invocation.

It’s very important that the interrupt input is only acted upon during the end of the writeback stage. Doing it at any other point can result in an inconsistent execution state, whereby we do not know if the current instruction has executed to completion. Doing the interrupt at the end of a writeback means:

  1. the PC we save (to return to later) is already the ‘next’ PC, be that prev_pc+2, or a branch target;
  2. memory reads have had time to complete successfully; and
  3. any registers have had time to see and act upon write enable signals to store data.

The items that are needed, therefore, are:

  • Internal registers for the stored PC (to return to after interrupt handler), the interrupt data field passed on the data in bus, and an interrupt enable bit
  • Various connections between the parts of the sub-modules for handling storing of the PC and interrupt data
  • Control unit additions for the interrupt handler step
  • New instructions for getting interrupt data and returning from an interrupt

Internal registers & Connections

I added a 16-bit register for the ‘next PC’ and also the ‘interrupt data’ to the ALU itself, rather than adding it to the register file. There are individual set/write control lines and also data lines for them into the ALU. It’s a bit messy and adds a lot of ports to the ALU and control unit, but it worked and I can change this later if I want to tidy things up. Having the registers part of the ALU makes the instructions that access them incredibly simple and self contained.

Control unit additions

The control unit now has an interrupt state, all of the control signals for setting the registers in the ALU and also the logic for managing the phases of calling into the interrupt handler. If interrupts are enabled, the interrupt input is active and it’s the end of the writeback phase, the following occurs:

  1. Interrupt_ack is activated
  2. A cycle of latency is provided
  3. The bits on the data in bus are sampled and the ALU instructed to store this value
  4. The current PC (which is, at this point, the next instruction to execute) is saved by the ALU
  5. The PC unit sets the current PC to the interrupt vector, currently fixed at 0x0008.
  6. The control unit resets it’s interrupt state, and proceeds to the fetch stage of the pipeline.

At the moment, interrupts are not disabled automatically when the handler is invoked, so the first instruction must be a disable interrupt instruction.

New Instructions

There are four new instructions used to manage and handle interrupts.

giefThe Get Interrupt Event Field transfers the value on the data bus at the time after an interrupt acknowledge into a register for further use. Using this value, we can work out what caused the interrupt and perform further actions from that point. An example of this is using it with a UART, the interrupt data field could contain the uart identifier in the high 8 bits, and the byte of data which was received in the lower 8 bits.

bbiBranch back from Interrupt is similar to the reti instruction in the Z80. It branches back to the PC value which was due to be fetched next before the interrupt handler was invoked.

eiThe enable and disable interrupt instructions are fairly obvious.

The interrupt vector

The interrupt vector is fixed at address 0x0008. The shape of the interrupt handler should be something like the following:

  1. disable interrupts
  2. Save all registers
  3. get the interrupt event data field
  4. Perform action according to interrupt event field, or add the field data to a queue for later processing.
  5. restore all registers
  6. enable interrupts
  7. Branch back to ‘normal’ code.

Saving the registers can be done by saving to the current stack and then restoring before returning from the handler. I’ve been using r7 as a ‘standard’ stack pointer in our very ad-hoc ABI spec, so this can be done. This does use user stack, though, so it needs taken into account if stack space is a particular concern.

There are a few issues that could occur, mainly in timing between disabling and enabling the interrupts. There could be a new interrupt to be handled when the enable interrupts instruction is processed, and this interrupt will then be accepted before the bbi instruction to branch back. This will destroy the original PC value when the original interrupt was raised, so I will probably change things around. There are a few solutions to this, one being that interrupts are by definition disabled when the branch to the interrupt vector occurs, and then a bbi instruction implicitly turns interrupts on again. I’ll need to have a think about the best course of action for this.

The makeup of the test interrupt routines I’ve had are like the following (snipped for clarity)

entry:
  load.h  r7, 0x08
  subi    r7, r7, 4
  bi      $start
  dw      0x0000
intvec:   #interrupt vector 0x8
  di
  # save the registers
  gief    r0
  #    inspect r0 for interrupt type
  #    branch to some other work
  # restore the registers
  ei
  bbi
start:
  load.l  r0, 0
  ...

The interrupt handler, whilst a bit messy in it’s implementation, works well in simulation. I’ve yet to use it when TPU is running on the FPGA with an external source, but I do not foresee many issues other than the one stated above.

A Look in the simulator

interrupts_waveform_numberedThe above waveform is showing an interrupt being flagged on a UART receive event, the event field containing the UART ID (1) and the byte value received (0x4f). Walking through the waveform, we get the following:

  1. The UART has received a byte and signaled this.
  2. An interrupt is immediately raised.
  3. Several cycles later the ACK is signaled by the cpu
  4. The interrupt event field(IEF) data is placed on the data in bus after a cycle of delay
  5. The ACKis de-signaled, and the IEF is removed from data in bus and saved internally (to later be used via the gief instruction)
  6. The CPU branches to the interrupt vector 0x0008, requesting the instruction from memory

The internal RAM

I mentioned previously that the design resources had shot up, and it turns out this is due mainly to the internal ram not being synthesized as a block ram. I was getting an internal compiler error in the Xilinx toolchain when building the existing ram with a larger capacity (I think it was 512bytes at this point) and to counter this I re-implemented the ram in another way. The way I did it, though, added an asynchronous element which in turn forced the toolchain to implement the RAM via look up tables, instead of utilizing the block ram. This is why there was a jump in resource requirements when using the Spartan6.

Block Rams

I could not get around the internal compiler error without an async element, so off to the documentation for the spartan6 I went. Turns out there is a document specifically on the block rams available on the device I have.

The block rams are used by initializing a generic object in VHDL to various constants, and then interfacing with the ports that object exposes. There are two kinds of block rams available, but I decided to use the 18 kilobit, dual-port one: RAMB16BWER. It is made up of 16Kb for data and 2Kb for parity. ISE has a nice template library for instantiation of primitives, and the block ram I use is included. It can be found within Edit->Language Templates, and then within the VHDL->Device Primitives->Spartan6->RAM/ROM.

lang_templatesThis brings up a window with initialization code to copy and paste into your own design. I took it, and edited the relevant areas to configure it for a 16-bit addressed memory.

Despite having the existing integrated ram address bytes explicitly, I decided against that with the block ram and instead addressed 16-bit values. To the TPU programmer, it still addresses bytes, but internally, it’s really stored at 16-bit, 2 byte blocks. The main reason for this was latency and complexity. By addressing 16-bit values internally in the block ram, I can implement both 16-byte reads/writes and also 8-bit reads and writes using a single port. The RAMB16BWER has a byte-wise write enable, so I can write either the high or low 8bits of a memory location internal to the block ram, leaving the other half untouched. There is one issue that arises from this method – an unaligned 16-bit read/write (i.e, the address being odd) will result in incorrect behavior. At the moment nothing happens if you try this, but I intend to add a trap/exception. I could maybe invoke the interrupt handler with a known interrupt event field value to specify an unaligned memory operation.

Issues

There were several gotchas I encountered whilst trying the block ram with a testbench. The addressing scheme, first of all, was confusing. As the generic component was initialized with relevant 16-bit addressing (18bit when you include parity), I assumed it would transform the address itself into the correct form. This did not seem to be the case after running the test bench. the documentation has a table of mappings and also a formula, but in the end it only took a few minutes of inspection in the simulator to work out what was happening.

blockramaddressThe next issue was a rather silly affair! The initialization attributes for the block ram are from most-significant to least-significant order. Due to this, 16-bit instructions need byte-flipped when read in the code, and also, they go from right to left along the initialization attribute.

-- BEGIN TASM RAMB16BWER INIT OUTPUT                                         
INIT_00 => X"06831180E27F00300000004F4C4C454801E102E100EF03E100000CC1E91E088E",

Maps to the instruction forms (only first 3 instructions shown):

X"8E", X"08", -- 0000: load.h  r7 0x08
X"1E", X"E9", -- 0002: subi    r7 r7 4
X"C1", X"0C", -- 0004: bi      0x0018
...snip...

I will not admit the amount of time spent trying to figure out the issue of byte flipping in the initialization attribute 😉

The least significant digit of the address, specifying the high/low byte of the 16-bit memory location, is managed in the VHDL process. Ive put that process (and other relevant signal operations) below for clarity. It’s a large block of text even without some of the less important generic attributes/initializations, which I have omitted.

RAMB16BWER_inst : RAMB16BWER
 generic map (
    -- DATA_WIDTH_A/DATA_WIDTH_B: 0, 1, 2, 4, 9, 18, or 36
    DATA_WIDTH_A => 18,
    DATA_WIDTH_B => 18,
    ...snip...
    -- SIM_COLLISION_CHECK: Collision check enable "ALL", "WARNING_ONLY", "GENERATE_X_ONLY" or "NONE" 
    SIM_COLLISION_CHECK => "ALL",
    -- SIM_DEVICE: Must be set to "SPARTAN6" for proper simulation behavior
    SIM_DEVICE => "SPARTAN6",
    ...snip...
 )
 port map (
    -- Port A Data: 32-bit (each) output: Port A data
    DOA => DOA,       -- 32-bit output: A port data output
    DOPA => DOPA,     -- 4-bit output: A port parity output
    -- Port B Data: 32-bit (each) output: Port B data
    DOB => DOB,       -- 32-bit output: B port data output
    DOPB => DOPB,     -- 4-bit output: B port parity output
    -- Port A Address/Control Signals: 14-bit (each) input: Port A address and control signals
    ADDRA => ADDRA,   -- 14-bit input: A port address input
    CLKA => CLKA,     -- 1-bit input: A port clock input
    ENA => ENA,       -- 1-bit input: A port enable input
    REGCEA => REGCEA, -- 1-bit input: A port register clock enable input
    RSTA => RSTA,     -- 1-bit input: A port register set/reset input
    WEA => WEA,       -- 4-bit input: Port A byte-wide write enable input
    -- Port A Data: 32-bit (each) input: Port A data
    DIA => DIA,       -- 32-bit input: A port data input
    DIPA => DIPA,     -- 4-bit input: A port parity input
    -- Port B Address/Control Signals: 14-bit (each) input: Port B address and control signals
    ADDRB => ADDRB,   -- 14-bit input: B port address input
    CLKB => CLKB,     -- 1-bit input: B port clock input
    ENB => ENB,       -- 1-bit input: B port enable input
    REGCEB => REGCEB, -- 1-bit input: B port register clock enable input
    RSTB => RSTB,     -- 1-bit input: B port register set/reset input
    WEB => WEB,       -- 4-bit input: Port B byte-wide write enable input
    -- Port B Data: 32-bit (each) input: Port B data
    DIB => DIB,       -- 32-bit input: B port data input
    DIPB => DIPB      -- 4-bit input: B port parity input
 );

 -- End of RAMB16BWER_inst instantiation


--
--todo: assertion on non-aligned 16b read?
--

CLKA <= I_clk;
CLKB <= I_clk;

ENA <= I_cs;
ENB <= '0';--port B unused

ADDRA <= I_addr(10 downto 1) & "0000";

process (I_clk, I_cs)
begin
  if rising_edge(I_clk) and I_cs = '1' then
    if (I_we = '1') then
      if I_size = '1' then
        -- 1 byte
        if I_addr(0) = '1' then
          WEA <= "0010";
          DIA <= X"0000" & I_data(7 downto 0) & X"00";
        else
          WEA <= "0001";
          DIA <= X"000000" & I_data(7 downto 0);
        end if;
      else
        WEA <= "0011";
        DIA <= X"0000" & I_data(7 downto 0)& I_data(15 downto 8);
      end if;
    else
      WEA <= "0000";
      WEB <= "0000";
      if I_size = '1' then
        if I_addr(0) = '0' then
          data(15 downto 8) <= X"00";
          data(7 downto 0)  <= DOA(7 downto 0);
        else
          data(15 downto 8) <= X"00";
          data(7 downto 0)  <= DOA(15 downto 8);
        end if;
      else
        data(15 downto 8) <= DOA(7 downto 0);
        data(7 downto 0) <= DOA(15 downto 8);
      end if;
    end if;
  end if;
end process;

O_data <= data when I_cs = '1' else "ZZZZZZZZZZZZZZZZ";

Assembler Output

The last thing to do was to add another output file generator to TASM, my c# TPU assembler. This simply outputs the whole 2KB initialization table for the input assembly. It’s then just copy/pasted into the VHDL in the appropriate attribute location.

Wrapping up

That’s it for this part. I really hope to have the next part with TPU talking to a peripheral device (and some changes to the ISA) in the next week or two. Fingers crossed!

Thanks for reading, comments as always to @domipheus.