Highly optimized ILI9341 (320x240 TFT color display) library

I'm probably misunderstanding what you mean by "backbuffering"...

You seemed to understand, but just in case: I mean drawing into bitmaps and then writing those bitmaps to the display. Critically, you use DMA to do so, so you get the 20%-30% throughput increase as well, and you combine that with a dirty rectangle algorithm so you're only writing the parts of the display that have changed.

You put all that together and you've got a great recipe for high performance graphics.
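
To make the dirty rectangle part concrete, here is a minimal sketch of the simplest variant: keep one bounding box that grows to cover everything drawn since the last flush. The `Rect` and `DirtyTracker` names are made up for illustration and are not from any library mentioned in this thread. Real implementations often keep several rectangles instead of one.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper types for illustration only.
struct Rect {
    int16_t x1, y1, x2, y2; // inclusive corners
};

struct DirtyTracker {
    bool dirty = false;
    Rect bounds{0, 0, 0, 0};

    // Grow the dirty region so that it also covers r.
    void add(const Rect& r) {
        if (!dirty) {
            bounds = r;
            dirty = true;
            return;
        }
        bounds.x1 = std::min(bounds.x1, r.x1);
        bounds.y1 = std::min(bounds.y1, r.y1);
        bounds.x2 = std::max(bounds.x2, r.x2);
        bounds.y2 = std::max(bounds.y2, r.y2);
    }

    // Number of pixels that would have to be sent for the current region.
    int32_t pixelCount() const {
        if (!dirty) return 0;
        return int32_t(bounds.x2 - bounds.x1 + 1) *
               int32_t(bounds.y2 - bounds.y1 + 1);
    }

    // Call after the region has been written to the display.
    void clear() { dirty = false; }
};
```

Every draw call adds its bounding box with `add()`, and the flush step sends only `bounds` from the backbuffer before calling `clear()`.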
 
Yeah I was doing some maths based on this idea. Except instead of buffering the whole screen I do small segments.

The issue with most of these libraries is that they draw immediately to the screen. This is inefficient for 3 reasons:

1) To write a single pixel at (x, y) you first need to transfer the column address range, then the page address range, then send one pixel of color info. Setting the column address takes 1 command byte, 2 bytes for the column start, and 2 bytes for the column end, so 5 bytes total. The page address takes the same, and then the color takes another 3 bytes (1 command byte + 2 color bytes). So that is 13 bytes per pixel.
Now imagine you draw a line that is 4 pixels long: that's 4 * 13 = 52 bytes to transfer. At worst the same line fits inside a 4x4 box, which can be sent in 10 + 1 + 2 * 4 * 4 = 43 bytes if the whole box goes out as one transfer.
2) If you buffer in blocks, lines drawn next to each other get wrapped up into one transfer, so you basically get them for free.
3) If you draw a line and then immediately draw a box over that line, you can skip sending the color information for the line entirely.
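
The byte counts in point 1 can be written down as a small sketch. The function and constant names are made up for illustration; the numbers are the ones from the post (5 bytes for the column range, 5 for the page range, 1 for the memory write command, 2 per pixel).

```cpp
#include <cstdint>

// Column address (1 + 2 + 2) plus page address (1 + 2 + 2).
constexpr uint32_t kWindowBytes = 10;
// The command byte that precedes the color data.
constexpr uint32_t kRamWriteCmdBytes = 1;
// 16-bit color.
constexpr uint32_t kBytesPerPixel = 2;

// Cost when every pixel is addressed and written on its own.
constexpr uint32_t bytesPixelByPixel(uint32_t pixels) {
    return pixels * (kWindowBytes + kRamWriteCmdBytes + kBytesPerPixel);
}

// Cost when a w x h box is sent as one address window plus one data stream.
constexpr uint32_t bytesAsBox(uint32_t w, uint32_t h) {
    return kWindowBytes + kRamWriteCmdBytes + kBytesPerPixel * w * h;
}
```

With these, `bytesPixelByPixel(4)` gives 52 and `bytesAsBox(4, 4)` gives 43, matching the figures above.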
 
That's essentially my point about the "data density" of transactions. Drawing bitmaps allows you to always fill your entire transaction window with data. You're dramatically cutting down on the total number of SPI transactions that you need, and that means less overhead, which means faster draws. Couple that with double buffering and DMA, which you can't do with direct writes, and you've got another 20%-30% increase in performance (raw frame rates).
 
Unfortunately I can't get the SPI thing working so I've had to go with the 8-bit parallel bus mode. So far I'm just bit-banging it by setting the pins manually. But I wonder if I could somehow leverage the DMA here....
 
OK, so for my corner-to-corner example, you'd wait for the dirty area to reach some critical size, then DMA it out while the next parts of the framebuffer are getting dirty. And so on. The Teensy is pretty darn fast, so dirtying the framebuffer is always going to go a lot quicker than streaming it out, but I can see from @Augs's logic that it makes sense.
 
I'm not sure how many pins the Teensy can control with DMA, but NXP chips are generally pretty capable, so my guess is that it is possible but non-trivial. The ESP32s have support for DMA with i8080 parallel, but I've not seen anything for the Teensy. Then again, I haven't looked that hard.
 

The best thing, as always, is to profile. The one thing to remember when you work out your figures, though, is that (a) to use DMA you MUST build a bitmap, even if it's all one color, and (b) to use DMA *effectively* you really should keep two backbuffers so that you can draw to one *while* the other is being sent to the display. That will give you maximum draw throughput. In practice, on a LilyGO TTGO T-Display 1.1, which is ESP32 based with a 240x135 color LCD, I get about 77 FPS doing fullscreen draws using this method, under both Arduino and the ESP-IDF. By my back-of-the-napkin estimation, that's roughly hitting the 40MHz transfer speed ceiling. I've achieved 50-60 FPS on a 320x240 80MHz SPI display using a similar approach. I don't have Teensy metrics, but I'd expect that you could hit the transfer ceiling even while doing fullscreen draws like I do on the ESP32. For the record, this is drawing the htcw_uix+htcw_gfx fire demo.
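
The back-of-the-napkin estimate can be reproduced with a one-line calculation. This sketch assumes 16 bits per pixel, one bit per SPI clock, and no command or transaction overhead, so it is an upper bound rather than a prediction. The function name is made up for illustration.

```cpp
#include <cstdint>

// Upper bound on full-screen frames per second for a given bus bit rate.
// Ignores address-window commands, gaps between transactions, and draw time.
constexpr double maxFullFramesPerSecond(uint32_t width, uint32_t height,
                                        uint32_t bitsPerPixel,
                                        double busBitsPerSecond) {
    return busBitsPerSecond /
           (static_cast<double>(width) * height * bitsPerPixel);
}
```

For 240x135 at 40MHz this comes out at a little over 77 frames per second, which is why the measured 77 FPS looks like the bus ceiling. For 320x240 at 80MHz the same formula gives roughly 65.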
 
Is this the most up to date benchmark? I'm writing my own similar library and trying to see how it compares
Bumping this question since it seems to have been buried. Can someone with the latest SPI library run the benchmark for me on the 320x240 ILI9341 screen? Thx
 
All interesting stuff. However, a lot of this may also depend on your use cases. For example, if you are always drawing full-screen images or filling the screen with a solid color, then using DMA could be slower if you cannot do anything else in between operations. That is, it takes time to fill your buffer, either with a solid color or with a piece of your image... And for example with something like writeRect:
Code:
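void ILI9341_t3::writeRect(int16_t x, int16_t y, int16_t w, int16_t h, const uint16_t *pcolors)
{
    // One SPI transaction covers the whole rectangle.
    beginSPITransaction(_clock);
    // Set the address window to the rectangle, then start the memory write.
    setAddr(x, y, x+w-1, y+h-1);
    writecommand_cont(ILI9341_RAMWR);
    // Stream the pixels row by row; the last pixel of each row
    // goes out through the _last variant.
    for(y=h; y>0; y--) {
        for(x=w; x>1; x--) {
            writedata16_cont(*pcolors++);
        }
        writedata16_last(*pcolors++);
    }
    endSPITransaction();
}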
There is only one transaction, and in almost all cases, will keep the SPI Fifo queue from becoming empty, so SPI should output at full speed.

However with fillRect,
Code:
void ILI9341_t3::fillRect(int16_t x, int16_t y, int16_t w, int16_t h, uint16_t color)
{
    // rudimentary clipping (drawChar w/big text requires this)
    if((x >= _width) || (y >= _height)) return;
    if(x < 0) {    w += x; x = 0;     }
    if(y < 0) {    h += y; y = 0;     }
    if((x + w - 1) >= _width)  w = _width  - x;
    if((y + h - 1) >= _height) h = _height - y;

    // TODO: this can result in a very long transaction time
    // should break this into multiple transactions, even though
    // it'll cost more overhead, so we don't stall other SPI libs
    beginSPITransaction(_clock);
    setAddr(x, y, x+w-1, y+h-1);
    writecommand_cont(ILI9341_RAMWR);
    for(y=h; y>0; y--) {
        for(x=w; x>1; x--) {
            writedata16_cont(color);
        }
        writedata16_last(color);
        if (y > 1 && (y & 1)) {
            endSPITransaction();
            beginSPITransaction(_clock);
        }
    }
    endSPITransaction();
}
They chose to split up the SPI transaction after every other line. Why? Mainly because the SPI code could cause the Audio code to sputter while trying to get its next chunk of output, even if it is coming from SDIO...

So if you do this with DMA and you don't break it up, then you can run into the same issue with Audio...

Always trade offs.
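
To get a feel for how much extra overhead that split costs, here is a small host-side sketch that replays only the row loop condition from the fillRect above and counts how many transactions it opens. The function name is made up, and it assumes h is at least 1 after clipping.

```cpp
#include <cassert>
#include <cstdint>

// Counts the SPI transactions opened by the fillRect row loop for h rows.
// Mirrors the `if (y > 1 && (y & 1))` test; sends no data.
inline uint32_t countFillTransactions(int16_t h) {
    assert(h >= 1);
    uint32_t transactions = 1; // the initial beginSPITransaction()
    for (int16_t y = h; y > 0; y--) {
        if (y > 1 && (y & 1)) {
            transactions++;    // endSPITransaction() + beginSPITransaction()
        }
    }
    return transactions;
}
```

A full-height fill of 240 rows ends up as 120 transactions of two rows each, so other SPI users get a chance to run roughly every two lines.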
 
Sure. But usually you have to make that choice during the design cycle, by which I mean you typically can't just pick and choose. You're either backbuffering or not, so generally you don't have the luxury of deciding for each individual draw scenario.

In my "real world ish" tests, dogfooding version 1.x (direct writes) vs 2.x (backbuffering) of my own libs, I virtually always get better raw frame rates for the latter, because you usually aren't just drawing simple rectangles or straight horizontal or vertical lines, which is the best-case scenario for direct writes. In my 1.x lib I went to great lengths to batch operations into rectangular buffers so I could reduce the transaction overhead where possible. Even with my best efforts, backbuffering gave me better frame rates pretty much across the board for apps I ported from 1.x to 2.x.

That's why I brought up real world scenarios initially: it comes down to what you're going to end up doing. Do you plan on rendering text, for example? That will kill direct write perf. And overall that's why LVGL is typically faster than, say, TFT_eSPI doing the same sorts of "real world" applications.

Where direct writes really shine is where you don't have the memory for backbuffering. That's just not the case on the Teensy.

Given that you generally will have to decide between back buffering and direct writes, the above matters a lot.


Also, regarding audio: my UIX user interface lib uses dirty rectangles to update the display. It can do partial updates, which is crucial here (I've noticed a lot of Teensy libs only do fullscreen DMA refresh, which frankly I don't find practical or performant for most scenarios). It basically uses a coroutine so that it never blocks for very long, and if you pass (false) to update it won't even update all of the dirty rectangles, just one. So you can interlace it with audio just fine, as I have and do, DMA or no. (Also, usually you have more than one DMA channel, on an ESP32 or a Teensy 4 at least.)
 
DMA channels do not run concurrently though, at least on the Teensy.
Fixing the incorrect I2S TX watermark levels used in the Audio library helps significantly, so that the hardware buffers more than one sample.
 
I've been doing some tests with my 8-bit interface version of this library (i.e. no SPI or DMA).
I use a backbuffer divided into sections of 16x16 pixels. So when you write to a pixel in the back buffer you dirty that section. Then you have to call another function to "commit" those dirty sections to the screen.
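
For readers who want the idea without opening the repository, this is a minimal host-side sketch of a backbuffer with 16x16 dirty sections. It is not the code from the linked repo: the class name, the `commit` callback signature, and the 320x240 size are assumptions made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: a 320x240 16-bit backbuffer split into 16x16 tiles.
class TiledBackbuffer {
public:
    static constexpr int kWidth = 320;
    static constexpr int kHeight = 240;
    static constexpr int kTile = 16;
    static constexpr int kTilesX = kWidth / kTile;  // 20
    static constexpr int kTilesY = kHeight / kTile; // 15

    TiledBackbuffer()
        : pixels_(kWidth * kHeight, 0), dirty_(kTilesX * kTilesY, 0) {}

    // Write one pixel and mark the tile that contains it as dirty.
    void setPixel(int x, int y, uint16_t color) {
        if (x < 0 || y < 0 || x >= kWidth || y >= kHeight) return;
        pixels_[y * kWidth + x] = color;
        dirty_[(y / kTile) * kTilesX + (x / kTile)] = 1;
    }

    // Call send(x, y, w, h, topLeftPixel, strideInPixels) for each dirty
    // tile, then mark it clean. Returns the number of tiles sent.
    template <typename SendTile>
    int commit(SendTile send) {
        int sent = 0;
        for (int ty = 0; ty < kTilesY; ty++) {
            for (int tx = 0; tx < kTilesX; tx++) {
                uint8_t& flag = dirty_[ty * kTilesX + tx];
                if (!flag) continue;
                const int x = tx * kTile;
                const int y = ty * kTile;
                send(x, y, kTile, kTile, &pixels_[y * kWidth + x], kWidth);
                flag = 0;
                sent++;
            }
        }
        return sent;
    }

private:
    std::vector<uint16_t> pixels_;
    std::vector<uint8_t> dirty_;
};
```

On real hardware the `send` callback would set the address window to the tile and stream its 16 rows; here it is left to the caller so the sketch can run anywhere.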

I've now got all the adafruit GFX functions added and I've run the same test suite that they have:

Times in milliseconds

Code:
Text.............. 8.944000
Lines............. 35.494999
FastLines......... 28.893999
Rects............. 23.980000
FilledRects....... 23.136000
FilledCircles..... 30.037001
Circles........... 28.430000
Triangles......... 11.554000
FilledTriangles... 17.839001
RoundRects........ 22.096001
FilledRoundRects.. 27.854000

In theory the 8-bit interface can go faster than SPI because the data bus can send 8 bits at a time. Right now I'm just using digitalWriteFast to send the data. I wonder if someone who knows Teensy better could advise whether there is a faster way to write to 8 GPIO pins at once. (I have seen the PORTD emulation, but that just uses digitalWriteFast.)
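
One general technique that might apply, if all eight data lines can be wired to pins on the same GPIO port, is to precompute a 256-entry table that maps each data byte to a port bit mask and then update the port with a single write. The sketch below runs on a desktop and uses a stand-in `FakePort` instead of a real register. The bit positions are invented, and none of this has been checked against the Teensy pin mapping.

```cpp
#include <array>
#include <cstdint>

// Hypothetical port bit positions for data lines D0..D7. Real values
// depend on the wiring and on the chip's pin-to-port mapping.
constexpr uint8_t kDataBit[8] = {16, 17, 18, 19, 22, 23, 24, 25};

// Port mask with a 1 at the port bit of every data line set in value.
inline uint32_t maskForByte(uint8_t value) {
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++) {
        if (value & (1u << i)) mask |= (1u << kDataBit[i]);
    }
    return mask;
}

// Lookup table built once so the per-byte cost is a single array read.
struct ByteToPortTable {
    std::array<uint32_t, 256> set;
    ByteToPortTable() {
        for (int v = 0; v < 256; v++) set[v] = maskForByte(uint8_t(v));
    }
};

// Stand-in for a memory-mapped GPIO output register (assumption).
struct FakePort {
    uint32_t value = 0;
};

// Put one byte on the data lines without disturbing other port bits.
inline void writeByte(FakePort& port, const ByteToPortTable& table,
                      uint8_t value) {
    const uint32_t allDataBits = table.set[0xFF];
    port.value = (port.value & ~allDataBits) | table.set[value];
}
```

On a real chip the read-modify-write in `writeByte` would typically be replaced by whatever set/clear or toggle registers the port provides, which needs checking against the processor reference manual.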

Code for anyone curious: https://github.com/AugsEU/ILI9341-Parallel-Augs/tree/master
 