Highly optimized ILI9341 (320x240 TFT color display) library

Have been trying to use ILI9341_t3 library along with doing some other SPI stuff but don't get expected results, received bytes are delayed by one byte.
Here is code to reproduce problem. No hardware is connected except for a jumper from pin0 to pin12 to get some values into the SPI register.
I have taken a quick look. The issue is that the ILI9341_t3 library is using the FIFO on the SPI and the returned values from the drawPixel have not been poped off of the input fifo.

In the ILI9341_t3 library it needs a logical call to: waitFifoEmpty, which in the library header file is:
Code:
void waitFifoEmpty(void) {
		uint32_t sr;
		uint32_t tmp __attribute__((unused));
		do {
			sr = KINETISK_SPI0.SR;
			if (sr & 0xF0) tmp = KINETISK_SPI0.POPR;  // drain RX FIFO
		} while ((sr & 0xF0F0) > 0);             // wait both RX & TX empty
	}

I did a copy of that code into your test program (after the call to drawPixel) and I think it is getting better results.

The real question is when and where to do something like in the main code base. This happens because SPI directly uses the registers. In particular:
Code:
inline static uint8_t transfer(uint8_t data) {
		SPI0_SR = SPI_SR_TCF;
		SPI0_PUSHR = data;
		while (!(SPI0_SR & SPI_SR_TCF)) ; // wait
		return SPI0_POPR;
	}
It Pushes the command to write out and waits for it to complete and pops and returns the first item in the FIFO return queue. This does not properly handle the case that the FIFO may have something already in it...

I could hack something in, but Paul may want to think it through. Things like maybe BeginTransaction (if the settings don't match, clear the queues...)
 
I have taken a quick look. The issue is that the ILI9341_t3 library is using the FIFO on the SPI and the returned values from the drawPixel have not been poped off of the input fifo.

In the ILI9341_t3 library it needs a logical call to: waitFifoEmpty, which in the library header file is:
Code:
void waitFifoEmpty(void) {
		uint32_t sr;
		uint32_t tmp __attribute__((unused));
		do {
			sr = KINETISK_SPI0.SR;
			if (sr & 0xF0) tmp = KINETISK_SPI0.POPR;  // drain RX FIFO
		} while ((sr & 0xF0F0) > 0);             // wait both RX & TX empty
	}

I did a copy of that code into your test program (after the call to drawPixel) and I think it is getting better results.

The real question is when and where to do something like in the main code base. This happens because SPI directly uses the registers. In particular:
Code:
inline static uint8_t transfer(uint8_t data) {
		SPI0_SR = SPI_SR_TCF;
		SPI0_PUSHR = data;
		while (!(SPI0_SR & SPI_SR_TCF)) ; // wait
		return SPI0_POPR;
	}
It Pushes the command to write out and waits for it to complete and pops and returns the first item in the FIFO return queue. This does not properly handle the case that the FIFO may have something already in it...

I could hack something in, but Paul may want to think it through. Things like maybe BeginTransaction (if the settings don't match, clear the queues...)

I think that's the exact problem for my issue described in this thread some time ago: https://forum.pjrc.com/threads/2761...ptimized-ILI9341-library-and-other-SPI-device

Maybe Paul can have a look at this and fix the library, that would be really cool :)
 
I added a flush() method to SPI.h like so
Code:
	inline static void flush() {
		SPI0_MCR |= SPI_MCR_CLR_TXF | SPI_MCR_CLR_RXF;
	}

and calling in SPI.beginTransaction() by default. At least first tests showed no errors, it seems that this is working as expected.
 
I thought about doing something similar. Was not sure if I was going to do it on all calls to SPI.beginTransaction or only when something changes.

Also was curious about if it changed the performance. For example did you run the graphictest program before and after and did the numbers change?

the drawPixel call internally does a begin/end transaction and this function is internally called a lot to do things like draw lines, so I am leaning toward trying to minimize impact. In particular in the code block:
Code:
		if (SPI0_CTAR0 != settings.ctar) {
			SPI0_MCR = SPI_MCR_MDIS | SPI_MCR_HALT | SPI_MCR_PCSIS(0x1F);
			SPI0_CTAR0 = settings.ctar;
			SPI0_CTAR1 = settings.ctar| SPI_CTAR_FMSZ(8);
			SPI0_MCR = SPI_MCR_MSTR | SPI_MCR_PCSIS(0x1F);
		}
Probably part of the 2nd setting of SPIO_MCR. I know I have done it before like you mentioned, but was always wondering if it was correct as the PDF stated:
Contains bits to configure various attributes associated with the module operations. The
HALT and MDIS bits can be changed at any time, but the effect takes place only on the
next frame boundary. Only the HALT and MDIS bits in the MCR can be changed, while
the module is in the Running state.
 
I tried both ways, only doing when something changes or do every time. Timing did not appear to change much either way. So left always doing, and created a pull request to the SPI project

Sample runs, before change:
Code:
Benchmark                Time (microseconds)
Screen fill              224084
Text                     13680
Lines                    58589
Horiz/Vert Lines         18374
Rectangles (outline)     11686
Rectangles (filled)      465324
Circles (filled)         69758
Circles (outline)        57633
Triangles (outline)      14183
Triangles (filled)       153957
Rounded rects (outline)  26517
Rounded rects (filled)   507952
Done!

After Change:
Code:
Benchmark                Time (microseconds)
Screen fill              224086
Text                     13431
Lines                    58592
Horiz/Vert Lines         18376
Rectangles (outline)     11694
Rectangles (filled)      465345
Circles (filled)         70068
Circles (outline)        54920
Triangles (outline)      14190
Triangles (filled)       153936
Rounded rects (outline)  25826
Rounded rects (filled)   508085
Done!
 
Cool - Maybe Paul can try that in his new SD code (if needed) [or the old SD?] and comment. Could that abort any open pending transactions dumping the queue?

Glancing at that I saw +/- : doing a sum (1,618,549) the AFTER is actually -3188, doing & displaying that sum in the benchmark code would be a minor edit. A sum would be adding 'different sized apples' - but looking one by one at microseconds - seeing the net cumulative diff isn't easy at a glance.

Circles (outline) are -2713 [faster] and Rounded rects (outline) are -691 [faster] - so other slowdowns net only 216 mixed in the other 10 items.
 
I think it could make trouble in case of SPI DMA if the buffers are cleared in "beginTransaction". However I think we should find out the reason, since the RX FIFO buffer should always be empty after calling "transfer" method. If this is not so this is not the desired behaviour. Since per definition sending SPI 8 bytes also receives 8 bytes, so after sending we need always to get that bytes out of the RX FIFO regardless if it is used or not.

Apropos, "SPI0_CTAR1 = settings.ctar| SPI_CTAR_FMSZ(8);", why is the second SPI configuration register initialized with 9 bytes transfer size and not 16 by default?
 
Last edited:
After playing around some more I found I can use the ILI9341_t3 library along with other SPI stuff if run the teensy 3.1 at 48MHz (non optimized). Running at 48MHz optimized or faster causes problems.
 
@Paul, just commenting on this post about font compression. Sorry I was too lazy to read through all other posts.
A technique we used in the past is first take the EXOR between each pair of scanlines, since most font characters consist of identical rows. Then it perfoms run-length encoding. The following shows the progress. It is not particulary efficient on small fonts, since the redundancy is small. On big fonts it worked very good.
Code:
.XXXXX.   .XXXXX.
.X.....   ..XXXX.
.X.....   .......
.X.....   .......
.XXXXX.   ..XXXX.
.X.....   ..XXXX.
.X.....   .......
.X.....   .......
.XXXXX.   ..XXXX.

.XXXXX...XXXX.................XXXX...XXXX.................XXXX.
15    3  4   17               4   3  4   17               4   1
For big fonts, lots of EXORed lines contain mostly white with single black pixels. To allow effecient encoding, we used the two leading bits in each byte as an encoding and the other 6 as the length
Code:
eellllll
00: black
11: white
01: 1 black bit and lllll white bits
10: 1 white bit and lllll black bits.
Run lengths of > 64 bits are split into multiple 64 bit segments.
A big capital H can be compressed needing only a few bytes. The 7 example you gave is about the worst case, because of the diagonal line (it is also the worst case for your squares approach). With the run length encoding, you can compress it to 28 bytes, and still have a fast decompression algorithm.
 
Bugs ?

Code:
void ILI9341_t3::drawPixel(int16_t x, int16_t y, uint16_t color) {
...
[I][B]	writecommand_cont(ILI9341_RAMWR); [/B][/I]
	writedata16_last(color);
...

void ILI9341_t3::drawFastVLine(int16_t x, int16_t y, int16_t h, uint16_t color)
...
	[I][B]writecommand_cont(ILI9341_RAMWR);[/B][/I]
	while (h-- > 1) {
...
void ILI9341_t3::drawFastHLine(int16_t x, int16_t y, int16_t w, uint16_t color)
...
	[I][B]writecommand_cont(ILI9341_RAMWR);[/B][/I]
	while (w-- > 1) {
...
void ILI9341_t3::fillRect(int16_t x, int16_t y, int16_t w, int16_t h, uint16_t color)
...
	[I][B]writecommand_cont(ILI9341_RAMWR);[/B][/I]
	for(y=h; y>0; y--) {
...

Should'nt this be "writecommand_last(ILI9341_RAMWR);", or is this intended ?

Edit: There are several other places too. I tested this change, it works.
In addition, there is i.e. a line "writecommand_last(ILI9341_NOP);" in drawChar() which can be deleted. Several other "ILI9341_NOP" too.

Question: Could the original-version cause problems with other SPI-Devices ?
I was searching for a bug in my very complex project which has additional RAM, since i've done this change it works (?!)
But I need more testing, the bug occured only from time to time and after some minutes, so i'm not 100% sure.
 
Last edited:
Code:
Question: Could the original-version cause problems with other SPI-Devices ? 
I was searching for a bug in my very complex project which has additional RAM, since i've done this change it works (?!)
But I need more testing, the bug occured only from time to time and after some minutes, so i'm not 100% sure.[/QUOTE]

<strikethrough>no way.. my other bug is still existent.</strikethrough>

Edit:
hehe.. no, it's gone.. there where an additional problem..
 
Last edited:
I am pretty sure it is intended. You want to hold cs asserted until you output your last byte of the last pixel you are writing out...
 
I am pretty sure it is intended. You want to hold cs asserted until you output your last byte of the last pixel you are writing out...

No this is for commands only. They are sent before the pixels.
In the meantime i edited the whole lib(still works) and took a look at the datasheet. The datasheet says cs for cmd should be 1 for the first bit of pixeldata.
So it might was intended but not correct !
:)
Give it a try :)
Ill do a pullrequest later this day.
 
Last edited:
Datasheet (please noice the "1" for command-cs):
ILI9341_CSDC.png

preliminary updated lib (please test !) :
View attachment ILI9341_t3.zip

- it's a tiny little bit faster too, because of the deleted "NOP"s
Code:
I9341 Test!
Display Power Mode: 0xCE
MADCTL Mode: 0x24
Pixel Format: 0x2
Image Format: 0x0
Self Diagnostic: 0xE0
Benchmark                Time (microseconds)
Screen fill              224085
Text                     12547
Lines                    60461
Horiz/Vert Lines         18386
Rectangles (outline)     11638
Rectangles (filled)      465324
Circles (filled)         69718
Circles (outline)        53856
Triangles (outline)      14259
Triangles (filled)       154124
Rounded rects (outline)  24546
Rounded rects (filled)   507630
Done!
 
Last edited:
I have found that I can use the ILI9341_t3 library (at all speeds) along with other SPI stuff if I introduce a delay between popping KINETISK_SPI0.POPR and then reading KINETISK_SPI0.SR. I guess the RXCTR value takes a bit of time to actually decrement after doing a read of KINETISK_SPI0.POPR. The faster the speed the more time (nop's) needed to waste.

Code:
	//void waitFifoEmpty(void) __attribute__((always_inline)) {
	void waitFifoEmpty(void) {
		uint32_t sr;
		uint32_t tmp __attribute__((unused));
		do {
                        asm volatile("nop");//need to here 48MHz optimized
                        asm volatile("nop");
                        asm volatile("nop");
                        asm volatile("nop");

                        asm volatile("nop");
                        asm volatile("nop");//need to here 96 non opt
                        asm volatile("nop");//need to here 72 non opt
                        asm volatile("nop");

                        asm volatile("nop");//need to here 96 opt
                        asm volatile("nop");//need to here 72 opt,96 opt won't work if here(need 2 more nop)
                        asm volatile("nop");
                        asm volatile("nop");  
          
			sr = KINETISK_SPI0.SR;
			if (sr & 0xF0) tmp = KINETISK_SPI0.POPR;  // drain RX FIFO
		} while ((sr & 0xF0F0) > 0);             // wait both RX & TX empty
	}
 
Indeed waitTransmitComplete() was not working properly. It was returning too soon, leaving stuff in the FIFO, which interferes with other libraries or SPI usage.

Here's a (hopefully) fixed version.
 

Attachments

  • ILI9341_t3.h
    8.7 KB · Views: 296
drawBitmap ?

There's a function "void ILI9341_t3::drawBitmap(..."

It reads the bitmap from flash. Is there a converter-tool that converts a picture (bmp ?) ?

Currently i read from SD, wich is a bit slow, and has other disadvantages. I want to load the picture and do a kind of alpha-blending later - if i had the picture in the flash this would be great speedup.
 
This little T3.1 is unbelievable fast...
Alphablending 18-Bar 25Hz FFT256 Spectrum of software-decoded 128kBps WLAN webradio-MP3 Stream (2MBaud UART / ESP8266), 128KB SPI-RAM@20Mhz(more needed for more reliability, I will attach more soon), ILI9341@30MHz, IR-Remote-Receiver (via Timercapture-Interrupt), 20% processorusage (note, this is not the "audioprocessorusage", it's the average overall usage measured over 300ms intervals )
20150509_000925.jpg

Watch Video

..but I'm running out of flashmemory now :-(
 
Last edited:
I have parts of the background-picture (only the background of the bars) in the program flash.
Even if it's just the background of the bars : It takes about 50KB (2 Bytes per pixel)
Additionally, there are some more areas, it's not visible in the photo because it is not finished yet. The flash memory is running out... Perhaps i could use SPI-Flash, but this would take additional time for the transfers.
 
I have previously used the ILI9341_t3 library successfully with the Teensy Audio Board, using the pinout and code instructions on the ILI9341 store page.
But after some testing today, I realize that I am using an updated version of the library that needs to have MOSI and SCLK specified in the contructor instead of changing the pins with the SPI object directly.
With the updated ILI9341_t3 library the example code for using it with the Teensy Audio Board on the ILI9341 store page could be like this:

Code:
// For using with the Teensy Audio board
#define TFT_DC 20
#define TFT_CS 21
#define TFT_RST 255 // Set to 255 when reset is not used
#define TFT_MOSI 7
#define TFT_SCLK 14
#define TFT_MISO 12

ILI9341_t3 tft = ILI9341_t3(TFT_CS, TFT_DC, TFT_RST, TFT_MOSI, TFT_SCLK, TFT_MISO);

void setup() {
  tft.begin();

This finally solved my issues with the Teensy Audio Board and the optimized ILI9341_t3, thanks a lot for the hint!


I know that you are a busy man, Paul. But it would be nice for the next person to have the code up to date on your ILI9341 store page.

I can also make a pull request to the graphicstest.ino example commented out lines with the expanded defines and contructor. I feel this is the main display test code.

...yes, it would be great if this fix could be added to the examples and to the example code on the product page...


Moving over to learn more about how to actually use the graphics library now :) thanks for the great work!
 
Last edited:
You are very welcome. Such great work has gone into optimizing this library.
It would be nice if the example code is updated. That would make it more straightforward to start using the library without having to sift through the source code to find the correct constructor.
Paul Stoffregen, if you see this. I posted an updated example code for using the ILI9341_t3 library with the Audio Board.
 
Back
Top