ILI9341 with fullscreen DMA Buffer for Teensy 3.5 / Teensy 3.6 only

Not open for further replies.
Ok, was just making sure, still playing around with my CC Dummy Load project from time to time and started looking at code from various items I am going to use and noticed that the SPI was locked into the task. I will just have to route the Touch to another SPI. Luckily the T3.6 has a few spares.
Trying to figure out the best way ATM to put DMA for Teensy into my lib. The number of functions I need is quite small as IMO the simpler the better.

The key to a lot of my video processing is the ability to send one scanline while processing the next.

The standard SPI library does not have the option to keep going. Because of this, most of the time is actually used up waiting for "slow" hardware.

I have a lot of ...

bool do_wait = false;

... loop

uint8_t val = calcnext(...);

if (do_wait)        // won't wait 1st iteration
    gDevice.wait(); // wait call seems to have been lost from the post; the name is a guess
do_wait = true;     // will wait next time through loop

gDevice.send8(val, NO_WAIT);

... top of loop

The logic in the code above allows things like "free looping" and "free pixel conversions", as the time they take is inside the time it takes for the hardware to send the last byte. So basically it does the work it needs to and then just waits for the remainder of the byte to finish.

On the Due I realized a LOT of calculations could happen inside this "free" time frame.
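
To put rough numbers on that "free" time, here is a minimal desktop sketch of the idea. It is not the library code: FakeSpi, its tick counts and both run_ functions are made up for illustration, and time is counted in abstract ticks rather than measured on hardware.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an SPI device with a non-blocking send.
// Time is simulated in abstract ticks so this runs on a desktop.
struct FakeSpi {
    static constexpr uint32_t kByteTicks = 8;  // ticks the "hardware" needs per byte
    uint32_t now = 0;                          // current simulated time
    uint32_t busy_until = 0;                   // when the byte in flight finishes
    std::vector<uint8_t> sent;

    void compute(uint32_t ticks) { now += ticks; }              // CPU work costs time
    void wait() { if (now < busy_until) now = busy_until; }     // block until byte is out
    void send8(uint8_t v) {                                     // returns immediately
        sent.push_back(v);
        busy_until = now + kByteTicks;
    }
};

// Calculate the next value while the previous byte is still shifting out.
uint32_t run_overlapped(uint32_t n, uint32_t calc_ticks) {
    FakeSpi dev;
    bool do_wait = false;
    for (uint32_t i = 0; i < n; ++i) {
        dev.compute(calc_ticks);               // the "free" work
        if (do_wait) dev.wait();               // only the remainder of the byte
        do_wait = true;
        dev.send8(static_cast<uint8_t>(i));
    }
    dev.wait();                                // let the last byte finish
    return dev.now;
}

// Same work, but waiting for every byte before calculating the next one.
uint32_t run_blocking(uint32_t n, uint32_t calc_ticks) {
    FakeSpi dev;
    for (uint32_t i = 0; i < n; ++i) {
        dev.compute(calc_ticks);
        dev.send8(static_cast<uint8_t>(i));
        dev.wait();
    }
    return dev.now;
}
```

With 100 bytes, 6 ticks of calculation and 8 ticks per byte, run_overlapped(100, 6) comes to 806 ticks against 1400 for run_blocking(100, 6). As long as the calculation is shorter than one byte time, it costs nothing extra.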

ATM I'm going through the DMAChannel .h file and found a lot of functionality that looks similar to the DUE which is not surprising.

... I'll add some details a bit later... need to get a trailer load of rubble to the tip before it closes.
As I have mentioned in a couple of other threads, I have been experimenting with a version of the SPI library that added an async transfer capability. It is done by specifying an optional callback function. You can either use the callback or you can query to see if the transfer completed...

It is hard to know how much of it will be migrated into the official SPI, but if you are interested in taking a look, it is up at:

I have a version of the SSD1306 library that can use the new transfer functions including Async, up at:
Note: I may need to edit my version of SPI to define: #ifdef SPI_HAS_TRANSFER_ASYNC...

I also have a version of the ILI9341 library that I edited to only use these functions. I believe I posted a zip file with it, but if not I could, if you are interested.
Finally got DMA working on Teensy 3.1 with SPI. Looks very promising and the code is actually quite simple. The async part is also working so I should be able to run the SSD1351 128*128 demo tonight with some luck.

At the moment the benchmarks seem to be approx 5-10% better than on the Due. From what I can tell, this is because even though the Teensy runs at a lower MHz it can run the SPI at a higher rate.

On the Due it's only 16.8MHz off memory (div=5) whereas the Teensy is 18MHz (div=4), as the OLEDs are rated at 20MHz. Since the SPI transfer is still the bottleneck, the Teensy wins the SPI transfer race!
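
The divider arithmetic behind those two numbers can be sketched like this. The base clocks of 84MHz and 72MHz are assumptions inferred from the quoted figures (16.8 x 5 and 18 x 4), and the helper names are made up. Real hardware only offers certain divider values, so treat this as the arithmetic only.

```cpp
#include <cstdint>

// Smallest integer divider that keeps base_hz / div at or below max_hz
// (ceiling division). Hypothetical helper, not part of any SPI library.
constexpr uint32_t spi_divider(uint32_t base_hz, uint32_t max_hz) {
    return (base_hz + max_hz - 1) / max_hz;
}

// Resulting SPI clock for that divider.
constexpr uint32_t spi_clock_hz(uint32_t base_hz, uint32_t max_hz) {
    return base_hz / spi_divider(base_hz, max_hz);
}

// Assumed base clocks: 84MHz (Due) and 72MHz (Teensy), display limit 20MHz.
static_assert(spi_divider(84000000u, 20000000u) == 5, "Due divider");
static_assert(spi_clock_hz(84000000u, 20000000u) == 16800000u, "Due SPI clock");
static_assert(spi_divider(72000000u, 20000000u) == 4, "Teensy divider");
static_assert(spi_clock_hz(72000000u, 20000000u) == 18000000u, "Teensy SPI clock");
```

A divider of 4 on the faster base clock would give 21MHz, just over the 20MHz rating, which is why the slower part ends up with the faster SPI.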
I'm trying to run the code from the example but get the same error every time:
Desktop\ILI9341_t3DMA-master\ILI9341_T3_DMALIB_VIDEODEMO\ILI9341_T3_DMALIB_VIDEODEMO.ino:14:27: fatal error: SdFat.h: No such file or directory

The program recognizes the SD library, which can be found in the Teensy libraries. SdFat.h can be found under utility. I don't really know why the program doesn't see this file. Does anyone have a solution?
Frank B, is there an overhead to continuous DMA refresh to the display? In other words, is it better for me to manually call refreshOnce after I have made all my framebuffer changes, or is it fine to let DMA just go for it?
Maybe KurtE can add some informed feedback - FrankB has been offline for some weeks.

FrankB created/used the display DMA while throttling the OC'd T_3.6 to emulate the C64 - sound, video and processor instructions for real time gameplay of the native C64 code.

It will interact with RAM for the transfer - but generally be independent and low impact once set up for continuous DMA.
Thanks defragster. Any idea on this, KurtE?

I can see one possible problem with continuous DMA - if the framebuffer is modified partway through a DMA transfer, some tearing will occur and it could look like the display is flashing or glitching at times. Unfortunately none of the Teensy boards have enough RAM for double buffering. Would be nice to see 2MB or more of RAM on a future board!
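
For reference, the memory arithmetic behind the double buffering remark, as a small sketch. The 320x240 panel at 16 bits per pixel comes from this thread; the 256KB figure is the RAM of the T3.6, and the names are made up for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed for one RGB565 framebuffer (2 bytes per pixel).
constexpr std::size_t framebuffer_bytes(std::size_t w, std::size_t h) {
    return w * h * sizeof(uint16_t);
}

constexpr std::size_t kFullFrame = framebuffer_bytes(320, 240);  // 153600 bytes
constexpr std::size_t kT36Ram    = 256 * 1024;                   // 262144 bytes

static_assert(kFullFrame == 153600, "one full frame");
static_assert(kFullFrame < kT36Ram, "a single buffer fits");
static_assert(2 * kFullFrame > kT36Ram, "two buffers do not fit");
```

One buffer already takes well over half the RAM, and two would need 307200 bytes before the sketch itself has used anything.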
IIRC even the potential T_4 candidate doesn't break 1 MB?

Indeed changing the RAM while it is getting pushed out could be ugly - there are ways to track that and has been done successfully. FrankB's emulation is bright and smooth and tear free. I've run that code but not looked under the covers. KurtE has worked with it AFAIK. If it is a problem - and you don't need realtime updates use the refreshOnce.
Note: I have not looked recently to see if Frank's code has been updated to handle refreshOnce to actually only update once... More in a minute.

Continuous versus one shot. As for overhead, someone like Paul understands this a lot better than I do. There may be some delays in memory access due to multiple pieces of code trying to access RAM at the same time. I believe there are two ports? into memory that can be accessed at the same time. I think that is why adding the DMAMEM keyword to a buffer may help at times: it moves the buffer into higher memory, where it is maybe more likely to use the other port, which is less used... Probably did not explain very well.

For me, continuous vs single refresh is more interesting in terms of the usage cases. That is, if your usage is like a film, where you can update your data from the top down and time it with the refresh, then continuous works great. If however you have a usage case where you wish to update a portion of the screen, and maybe you do it in multiple steps (like fill a rect with the background and draw a new image, or restore a portion of an image and draw new stuff), then for me a refresh once works better.
That is, for example, if your code does something logically like: fillScreen(black), draw field 1, draw field 2...
Then if you have continuous updates turned on, you may very likely see partial updates and flashes on the screen, where for example you see part or all of the screen as it was left by the fillScreen... So in these cases (which is most of mine), I prefer to be in control of when the update happens and use the single update.

Now about this library RefreshOnce... At least the last time I looked, the RefreshOnce actually refreshed twice, the first time while still in the main refresh function and a second time using DMA, so it was actually slower and more overhead than just doing a logical drawRect...

The reason is that the setup code needs to logically set things up, where it may output something like:
<Set horizontal limit>,X1, X2, <set Vertical limit>, Y1, Y2, <write mem>, [here is where the screen data starts to output DMA]

With the above the <> fields are output with the DC asserted and the others are with the DC not asserted. In order to change the DC you need the data output in the PUSHR register to output the full 32 bits of data including the CS status, and then after that we output 16 bit values to PUSHR for the rest of the data...
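
As a rough illustration of the 32 bit versus 16 bit point, here is how such a PUSHR word can be packed. The bit positions follow the Kinetis SPI PUSHR layout as I understand it (CONT in bit 31, CTAS in bits 30-28, chip select mask from bit 16 up, data in the low 16 bits), so check them against the reference manual. The chip select masks and the function name are made up for the example.

```cpp
#include <cstdint>

// Field positions as I understand the Kinetis PUSHR layout (please verify).
constexpr uint32_t kPushrCont = 0x80000000u;  // keep chip selects asserted

// Hypothetical chip select masks: which PCS lines act as CS and DC
// depends on the wiring, these values are for illustration only.
constexpr uint8_t kPcsCommand = 0x03;  // CS and DC both driven
constexpr uint8_t kPcsData    = 0x01;  // CS only

// Pack a full 32 bit PUSHR word: command bits in the upper half,
// the value to transmit in the lower 16 bits.
constexpr uint32_t pushr_word(uint16_t data, uint8_t pcs_mask, uint8_t ctas, bool cont) {
    return (cont ? kPushrCont : 0u)
         | (static_cast<uint32_t>(ctas & 0x7u) << 28)
         | (static_cast<uint32_t>(pcs_mask & 0x3Fu) << 16)
         | data;
}

static_assert(pushr_word(0x2C, kPcsCommand, 0, true) == 0x8003002Cu, "command word");
static_assert(pushr_word(0xF800, kPcsData, 1, true) == 0x9001F800u, "data word");
```

The DMA stream only carries the low 16 bits of each word, which is why one full 32 bit write with the right upper half has to happen first.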

The problem was how to update the DC bits and still have our data in sync. Frank ran into this, and the simplest way was to output the first screen using normal non DMA PUSHRs and, when that finished, enable the DMA output, where the first data word in the first of the DMA chains starts at the first pixel...

In my version, I did not want to do that... So I had to hack up my DMAChain such that the first Item in the chain started at the second pixel and then I only had to manually do the PUSHR of the first pixel, before enabling DMA. But then I had to handle the continuous case, so I added another element at the end of the DMA chain with only one pixel in it (the first one)...
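
A small desktop model of that chain layout may make it easier to picture. This is only a sketch of the ordering described above, not the actual DMASetting code: Segment, build_chain and stream_order are made-up names, and a "segment" simply stands for one chain entry covering a run of pixels.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of one DMA chain entry: a run of pixels in the framebuffer.
struct Segment {
    std::size_t first;  // index of the first pixel in this entry
    std::size_t count;  // number of pixels it transfers
};

// Entries cover pixels 1..n-1 in chunks, followed by one extra entry that
// holds only pixel 0, which becomes the first pixel of the next frame.
std::vector<Segment> build_chain(std::size_t n_pixels, std::size_t max_per_segment) {
    std::vector<Segment> chain;
    for (std::size_t first = 1; first < n_pixels; first += max_per_segment) {
        std::size_t count = n_pixels - first;
        if (count > max_per_segment) count = max_per_segment;
        chain.push_back({first, count});
    }
    chain.push_back({0, 1});
    return chain;
}

// Pixel indices in the order they reach the display for `frames` frames:
// pixel 0 is pushed manually once, then the chain loops.
std::vector<std::size_t> stream_order(std::size_t n_pixels, std::size_t max_per_segment,
                                      std::size_t frames) {
    std::vector<std::size_t> out;
    const std::vector<Segment> chain = build_chain(n_pixels, max_per_segment);
    out.push_back(0);  // the manual full 32 bit PUSHR of the first pixel
    for (std::size_t f = 0; f < frames; ++f) {
        for (const Segment& s : chain) {
            // On the last frame, stop before the trailing one-pixel entry:
            // it belongs to the frame that would follow.
            if (f + 1 == frames && s.first == 0) break;
            for (std::size_t i = 0; i < s.count; ++i) out.push_back(s.first + i);
        }
    }
    return out;
}
```

For a toy 10 pixel "screen" with at most 4 pixels per entry the chain is {1,4}, {5,4}, {9,1}, {0,1}, and two frames come out as 0..9 followed by 0..9 with nothing repeated or skipped.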

Hope that makes sense?

Also maybe soon we can enable this code as well on T3.5? If we really do have more memory... And/or someday I might update my version where we can have dma backing for partial screens, that I can move around...
Thanks KurtE - will digest this later today.

I did see your commented out code for clipping the DMA rectangle for partial screens :) A great idea though! Would love to see that implemented for partial refresh.

What I am currently doing is taking the display setup and transfer code (i.e. backbuffer/DMA code) from ILI9341_t3/t3DMA/t3n and separating it into its own library Display.h that ONLY handles the hardware. It is very short.
All the graphic/drawing calls I am putting in a separate library Graphics.h that only operates on the backbuffer and has nothing to do with the display.

Unfortunately, because it is backbuffer based, it will only work on T_3.5 and T_3.6 which have enough memory to support that. But I am adding some really nice antialiased routines (like lines and curves) and fast alpha blending (using LUT) which I am lifting from some libraries I wrote for the GP32 open-source handheld more than a decade ago!
Hey KurtE - just an interesting idea - do you think it is possible to define a framebuffer at half the width and height (i.e. at 160x120) and then push each pixel and each line twice to upscale it 320 x 240 using DMA in an efficient way? That would result in a huge loss of resolution, but a smaller framebuffer at ~38k. Don't have any use case yet, but might be cool for some projects.

I saw this question go by before - not sure if it had resolution - perhaps you might find it with a lucky search engine?
Sorry I don't know the answer. There may be a couple things to look at to see how possible including

a) Can you configure the display to actually be the 160x120 and it takes care of blowing up each pixel...

b) DMA magic: I am out of my knowledge here on major and minor loops. Example can you somehow setup a minor loop of two which repeats the same item twice? for each pixel on a line? If so, you can maybe do that and then link another dma chain item to do the same thing on the next line, and then setup to increment to the next 160 words... Again not sure if you can setup a whole chain to do this, or if you would need to setup interrupts, that update the chain for each logical set of rows or ...

So the only advice I have is maybe start experimenting....
No worries. Nothing in the datasheet stands out about doing it in hardware. I'll have a play with DMA when my display(s) arrive :) I've also been reviewing your thread about reading data back from display, so I'm hopeful to also make a non-backbuffer version of my graphics library that supports alpha blending.
Here is my theory for upscaling a 1/4 size framebuffer using pure eDMA. Requires at least 2 DMA channels utilising the minor and major loops of each, but would likely work better with 4 channels as that would have no processor involvement.


Proposed method for upscaling a 1/4 size framebuffer using only DMA
  • DMA0 copies every line of the source to every second line and every second pixel of the display. The minor loop copies each pixel along the line. The trick is that the src is 16bit, but the dst is 32bit. The major loop covers the whole display, but only every second line.
  • DMA1 fills in the missing pixels in each line. The minor loop copies a 16bit pixel from src to the 'gap' in the dst left by the 32bit write, and then advances 2 pixels in the dst. The major loop covers only one line. DMA1 triggers once for every line of the display (every minor loop of DMA0).
  • This whole process then repeats for every other second line of the display, starting from the second line from the top, either using the same DMA channels again or by DMA0 triggering DMA2 once the major loop completes and using DMA2 and DMA3.
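
For checking any DMA setup like the one proposed above, it helps to have a plain CPU version of the result it should produce. The following is such a reference and not a DMA implementation: it just doubles every pixel and every line in software. The function name is made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for 2x upscaling: src is w*h RGB565 pixels, the result is
// (2*w)*(2*h) with every pixel and every line repeated twice.
std::vector<uint16_t> upscale2x(const std::vector<uint16_t>& src,
                                std::size_t w, std::size_t h) {
    const std::size_t dw = 2 * w;              // destination line length
    std::vector<uint16_t> dst(4 * w * h);
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < w; ++x) {
            const uint16_t p = src[y * w + x];
            for (std::size_t dy = 0; dy < 2; ++dy) {   // two destination lines
                const std::size_t row = (2 * y + dy) * dw;
                dst[row + 2 * x]     = p;              // the pixel itself
                dst[row + 2 * x + 1] = p;              // the "gap" pixel next to it
            }
        }
    }
    return dst;
}
```

A 2x2 source {1, 2, 3, 4} becomes the 4x4 image 1 1 2 2 / 1 1 2 2 / 3 3 4 4 / 3 3 4 4. For the real case the source would be 160x120, i.e. 38400 bytes, against 153600 bytes for the full 320x240 buffer.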
I downloaded the latest master from the github link (Link on #1 post).
In my project I would like to make all my changes to the framebuffer and then refresh the LCD once. I tried to use the refreshOnce function, but the program freezes as soon as refreshOnce is called. With continuous refresh everything works, but the output to the display suffers from tearing and flickering. I looked at the code briefly and found that
is kept at 0 and therefore the
function is an endless loop when using refreshOnce. If I comment out
it works but the gain in speed is zero compared to non DMA ☹.

  dmasettings[SCREEN_DMA_NUM_SETTINGS - 1].interruptAtCompletion();
  dmasettings[SCREEN_DMA_NUM_SETTINGS - 1].replaceSettingsOnCompletion(dmasettings[0]);
With ILI_DMASCATTER_GATHER the interrupt routine dmaInterrupt is not attached and dmaInterrupt function is the only place that sets rstop as far as I could find…

Am I using a broken version of the library?
Is there no gain in using refreshOnce?
Am I missing something or doing something wrong?
Is there a way of synchronising writes to the framebuffer so continuous refresh can be used?

Many thanks for your suggestions.
You might try my version of the library:

I have the ability to use frame buffer (so far T3.6... will try 3.5 when new beta comes out where 3.5 has more memory...)

Unless Frank updated his code, his update once actually updates twice: once without using DMA and once with. This was to get the display into a state where the DMA was happening with 16 bit outputs. A full 32 bit PUSHR had to be output first to get the other things into the right state, such as which CS pins should be selected (in this case CS selected, DC not).

In my version I worked around this, by outputting the first pixel using PUSHR and then start the DMA on the 2nd pixel... Needing to then have an extra item in chain which outputs the first pixel again at the end (for continuous output case)...
Hi KurtE
The only thing I found which bothers me is the time it needs to update the lcd.
I use the following test sketch
#include <ili9341_t3n_font_Arial.h>
#include <ili9341_t3n_font_ArialBold.h>
#include <ILI9341_t3n.h>

#include <SPIN.h>
#include "SPI.h"

#define	TFT_MOSI			11										// LCD Display SPI Data Out
#define	TFT_MISO			12										// LCD Display SPI Data In
#define	TFT_SCK				13										// LCD Display SPI Clock
#define	TFT_DC				9										// LCD Display Data / Command
#define	TFT_CS				10										// LCD Display SPI Select Device
#define	TFT_RST				33										// LCD Display Reset
#define	TFT_BL				34										// LCD Display Backlight


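// NOTE: the display object and its setup calls are missing from the post.
// Lines marked "assumed" are a reconstruction, not the poster's code.
ILI9341_t3n tft = ILI9341_t3n(TFT_CS, TFT_DC, TFT_RST, TFT_MOSI, TFT_SCK, TFT_MISO);	// assumed

void setup() {
	Serial.println("ILI9341 Test!");

	pinMode(TFT_BL, OUTPUT);
	digitalWriteFast(TFT_BL, HIGH);

	tft.begin();				// assumed
	tft.useFrameBuffer(true);	// assumed
}

void loop(void) {
	testText();					// assumed: draw into the frame buffer

	uint32_t updateTime = micros();
	tft.updateScreen();			// assumed: the call being timed (see results below)
	Serial.println(micros() - updateTime);
}

void testText() {
	uint32_t start = micros();
	tft.setCursor(0, 0);
	tft.println("Hello World!");
	tft.setCursor(0, 20);
	// Elapsed time in seconds (micros() counts microseconds)
	float_t t1 = (float_t)(micros() - start) * 0.000001f;
	tft.printf("%.7f\n", t1);
	tft.println(micros() - start);
}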

T3.6 180MHz, SPI 30MHz

testText() : 1200us
updateScreen() : 44800us

I ran the exact same demo with Frank's library in auto refresh and the whole lot takes 800us

Am I doing something wrong with your library?
My UpdateScreen will wait until the update completes... Not sure what Frank's actually does...
If you wish for Async update... use UpdateScreenAsync.

The Update just does the equivalent of doing something like a fillScreen...
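
A quick sanity check on the numbers: pushing a full frame over SPI has a floor set by the clock alone. The sketch below only does that arithmetic, using the 320x240 panel, 16 bits per pixel and the 30MHz SPI clock quoted in the test above. It ignores command bytes and any gaps between words, and the function name is made up.

```cpp
#include <cstdint>

// Minimum time in microseconds to clock out one full frame at spi_hz,
// ignoring command bytes and any gaps between transfers.
constexpr uint64_t frame_transfer_us(uint32_t w, uint32_t h,
                                     uint32_t bits_per_pixel, uint32_t spi_hz) {
    return static_cast<uint64_t>(w) * h * bits_per_pixel * 1000000ull / spi_hz;
}

static_assert(frame_transfer_us(320, 240, 16, 30000000u) == 40960, "full frame at 30MHz");
```

So roughly 41ms of the measured 44.8ms is just the wire time of a blocking full screen update. A figure like 800us for the other library cannot include a full transfer, which fits with the refresh running in the background there.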
Got it, I cloned your library from github and found the branch Async-Support. By checking that one out UpdateScreenAsync is available and working a treat :)
Thank you very much for your work.