This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.
It’s been a significant amount of time between this post and my last TPU article. A variety of things caused this – mainly working on a few other projects – but also due to an issue I had with TPU itself.
I had been hoping to interface TPU with an ESP8266 Wifi module, using the UART. For those not aware, the ESP8266 is a nifty little device comprising a chipset with the wifi radio and a microcontroller that handles the full TCP/IP stack. It is very simple to interface with over a UART, using plain AT commands. You can connect to a wifi network and spawn a server listening for requests in a few string operations, which is very powerful.
I started writing in TPU assembly the various routines I’d need to interface with the ESP8266. They were mainly string operations over the UART: things like SendString, RecieveString, and ExpectString, where the function waits until it matches a string sequence received from the UART. The code and data strings needed for a very simple test came to well over 1KB, which is a lot of code for this. I created various test benches and the code eventually passed those tests in the simulator.
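As a rough illustration of the ExpectString idea, here is a minimal Python sketch. It is not the TPU assembly routine: the `expect_string` name, the `read_byte` callback, and the `max_bytes` limit are assumptions made for the example, and the UART is faked with a byte iterator.

```python
def expect_string(read_byte, expected, max_bytes=4096):
    """Consume bytes until the most recent ones equal `expected`.

    `read_byte` is a hypothetical callback returning the next received
    byte as an int, or None when no more data is available.
    Returns True on a match, False if the data or the byte budget runs out.
    """
    window = bytearray()
    for _ in range(max_bytes):
        value = read_byte()
        if value is None:
            return False
        window.append(value)
        # Keep only as many trailing bytes as the expected string holds.
        if len(window) > len(expected):
            del window[0]
        if bytes(window) == expected:
            return True
    return False


# Fake UART receive stream: an echoed "AT" command followed by "OK".
rx = iter(b"AT\r\r\nOK\r\n")
matched = expect_string(lambda: next(rx, None), b"OK\r\n")
```

A sliding window is used here because it needs only as much storage as the expected string, which suits a small CPU with limited RAM.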
At this point, I thought things were working well. However, on flashing the miniSpartan6+ board with my new programming file, nothing worked. I could tell the CPU was running something, but it was not behaving like the code I had written and integrated into the embedded block ram for execution.
When this happens, I usually expect that I’ve done something stupid in the VHDL and so I create post-translate simulation models of my design. This basically spits out further VHDL source which represents TPU’s design, but using general FPGA constructs, after the compilation steps. In the software sense, imagine compiling C++ to C source: you’d see the internals of how virtual function calls work, and how the code increases in size with various operations. You can then simulate these post-translate models, which (in theory) give more accurate results.
The simulation completed (taking a lot longer than a normal simulation), and the odd behavior persisted. So standard behavioral simulation worked, and post-translate model simulation failed, just like on the device. This is good: we can reproduce the issue in the simulator.
Looking at the waveforms in the simulator, I could see what was going on: my code in the block ram was being corrupted somehow. When simulating my normal code, the waveform was as follows:
The important part of the waveform is the block of 0x8012 on mem_i_data, bottom right of the image. That is the value of location 0x001c, as set in the block ram. However, when running my post-translate model, the following result occurred:
The 0x8012 data is now 0x0012. The high byte has been reset/written/removed. The code itself was setting the high-order memory mapped UART address, so that failing explains why the TPU never worked with the ESP8266 chip.
    ...
    write.w r0, r1, 5
    ##Send command to uart 0x12
    load.h r0, 0x12
    ...
The code above performs a write, before loading 0x12 into the high byte of r0. You can see from the simulation waveform that the write enable goes high; this is from the write.w instruction. The memory locations holding the instructions that followed a write were being overwritten.
It took far too long to realise what was causing it. Looking at the simulation waveforms for both behavioral and post-translate, the actions leading up to the corruption seemed identical. The memory bus signals were all changing on the same cycle and being sampled at the same time but, as the issue always happened with a write operation, attention was drawn to the write enable line. I tried various things at this point, spending an embarrassing amount of time on it.
The issue was staring me in the face in both waveforms, albeit not completely surfacing. When I had my previous embedded RAM implementation (before I moved to RAMB16BWER block ram) I sampled all inputs differently, only when the CMD line went active. The block rams, by contrast, were connected directly. The chip select signals are generated from the address, which is active longer than it needs to be, and, more importantly, the write enable remains active for a whole memory ‘cycle’. The remedy was to feed the block ram write enable with (WE and CMD) from the TPU output. This means WE only goes active for a single cycle, as with CMD.
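To make the failure mode concrete, here is a small Python model of it. This is a sketch under stated assumptions, not the TPU VHDL: the `replay` function, the three-cycle trace, the 0x0100 address, and the data values are all hypothetical, chosen so the result mirrors the 0x8012 to 0x0012 corruption seen in the waveform. Byte enables and real bus timing are not modelled. The only rule is that the RAM writes on any cycle where its write enable input is high.

```python
# The instruction word at 0x001c, as seen in the behavioral waveform.
INITIAL = {0x001C: 0x8012}


def replay(trace, gate_we_with_cmd):
    """Replay bus cycles against a simple synchronous RAM model.

    Each cycle is (addr, data, we, cmd). With gating enabled, the RAM
    only sees a write when WE and CMD are both high on that cycle.
    """
    ram = dict(INITIAL)
    for addr, data, we, cmd in trace:
        ram_we = bool(we and cmd) if gate_we_with_cmd else bool(we)
        if ram_we:
            ram[addr] = data
    return ram


# Hypothetical trace: WE is held high for one cycle longer than CMD.
TRACE = [
    (0x0100, 0x0041, 1, 1),  # intended write, CMD pulses for this cycle
    (0x001C, 0x0012, 1, 0),  # WE still high, bus already at the next fetch
    (0x001C, 0x0000, 0, 1),  # instruction fetch from 0x001c
]

ungated = replay(TRACE, gate_we_with_cmd=False)
gated = replay(TRACE, gate_we_with_cmd=True)
```

In this model the ungated run leaves 0x0012 at 0x001c, while the gated run keeps 0x8012 and still performs the intended write, because ANDing with CMD limits the write enable to the single cycle in which the command is issued.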
Don’t keep your write enables active longer than needed, folks!
TPU is continuing very slowly, but progress is being made. Whilst trying to fix this issue I also integrated the UARTs that Xilinx provide for use. They have 16-entry FIFO buffers on both receive and transmit. This may help when interfacing with various other items, like the ESP8266. I’m still interested as to why the simulator didn’t show any differences in signal timing in the post-translate model, despite being able to reproduce the issue. If anyone knows hints and tips for figuring out issues in this area, please let me know! I really should have noticed this issue sooner, though.
Thanks for reading, as always.