ILI948x_t41_p - a parallel display driver for Teensy 4.1

Quick update: Now have version that builds for both Teensy 41 and Micromod with same library...
As shown on my clean desk.

Should get all of it checked in tomorrow... Today had no net until recent...
1718679069711.png
 
Another quick update:

@mjs513 and I have been doing some testing with this, and since the library now builds for both MicroMod and Teensy 4.1,
we (he) renamed the library up on GitHub. The current project and branch are up at:

We also merged our WIP branch of the other library into its main branch:

With this: we updated the class name and renamed all of the source files, plus include files ...

In addition, all of the examples were updated to allow them to build on either MicroMod or Teensy 4.1. On the MicroMod
we also changed the default IO pin usage so as not to use pins 11-13, which allows us to use the main SPI object.
This will also allow us to, for example, try hooking up the BuyDisplay ILI9488 display and connect some of the other IO pins
for capacitive touch to SPI.

Fixed a few other things today.

Now back to this or other diversions. :D
Nice, Now I don't feel so bad about my clean desks😀
You should see my desk when it is cluttered 🤣
 
I'm definitely going to be trying this out once Devboard v5 arrives!

So what kind of delta do we see between SPI and Parallel? How much faster is it in reality based on testing and benchmarking?
 
Only timing tests were with the graphics test sketch (times in microseconds):
Benchmark                Micromod SPI   MM Parallel
Screen fill              615251         2592
Text                     10915          430
Lines                    154798         4847
Horiz/Vert Lines         51142          1092
Rectangles (outline)     28386          413
Rectangles (filled)      1487087        6436
Circles (filled)         177501         2558
Circles (outline)        103593         1827
Triangles (outline)      33562          932
Triangles (filled)       487144         2105
Rounded rects (outline)  54398          838
Rounded rects (filled)   1626225        12433
 
615251/2592= 237.4 times faster

Which does not feel right... I might expect it to be maybe up to 8 times faster?

Code:
Note: mine appear to be a bit different:
Device Status: F4530400
    Order: BGR
    interface pixel format: 16 bit
Benchmark                Time (microseconds)
Screen fill              138255
Text                     7357
Lines                    213504
Horiz/Vert Lines         11567
Circles (filled)         76909
Circles (outline)        91726
Rectangles (outline)     6867
Rectangles (filled)      334102
Triangles (outline)      40666
Triangles (filled)       122419
Rounded rects (outline)  33574
Rounded rects (filled)   372867
Done!
ILI948x_t4x_p::MulBeatWR_nPrm_DMA(2c, 60001670, 153600
Wonder if you have Framebuffer turned on: for example the Screen fill test:
Code:
unsigned long testFillScreen() {
    unsigned long start = micros();
    lcd.fillScreen(ILI9488_BLACK);
    lcd.fillScreen(ILI9488_RED);
    lcd.fillScreen(ILI9488_GREEN);
    lcd.fillScreen(ILI9488_BLUE);
    lcd.fillScreen(ILI9488_BLACK);
    return micros() - start;
}
We fill the screen 5 times... Now if you have the frame buffer turned on, the fills will not update the screen at all; any update would be done outside of this function.
If you put an updateScreen() call just before the return, the timing will include only one screen update. The rest of the time is simply how long it takes to write to memory.
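To make the frame buffer effect concrete, here is a minimal host-side sketch. `MockLcd` and `pixelsSentByFillTest` are hypothetical stand-ins written for this illustration, not part of the ILI948x_t4x_p library; the sketch only counts how many pixels would cross the bus, assuming a 480x320 panel (153600 pixels) and that a frame-buffered fill touches RAM only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a display driver; NOT the real library API.
class MockLcd {
public:
    MockLcd(std::size_t width, std::size_t height, bool useFrameBuffer)
        : pixels_(width * height), frame_(width * height, 0),
          useFrameBuffer_(useFrameBuffer) {
        assert(pixels_ > 0);
    }

    // With a frame buffer the fill only writes RAM; without one it goes to the bus.
    void fillScreen(uint16_t color) {
        if (useFrameBuffer_) {
            for (auto &p : frame_) p = color;
        } else {
            busPixels_ += pixels_;
        }
    }

    // Pushes the whole frame buffer to the panel (no-op without a frame buffer).
    void updateScreen() {
        if (useFrameBuffer_) busPixels_ += pixels_;
    }

    std::size_t busPixels() const { return busPixels_; }

private:
    std::size_t pixels_;
    std::vector<uint16_t> frame_;
    bool useFrameBuffer_;
    std::size_t busPixels_ = 0;
};

// Same shape as testFillScreen(): five fills, optionally one update per fill.
inline std::size_t pixelsSentByFillTest(bool useFrameBuffer, bool updateEachFill) {
    MockLcd lcd(480, 320, useFrameBuffer);
    const uint16_t colors[] = {0x0000, 0xF800, 0x07E0, 0x001F, 0x0000}; // RGB565
    for (uint16_t c : colors) {
        lcd.fillScreen(c);
        if (updateEachFill) lcd.updateScreen();
    }
    return lcd.busPixels();
}
```

In this model, five frame-buffered fills with no update send nothing to the panel, which is why a benchmark run that way mostly measures memory writes.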
 
@KurtE.

Running on the Micromod without framebuffer
Rich (BB code):
Benchmark                Time (microseconds)
Screen fill              138254
Text                     7652
Lines                    217629
Horiz/Vert Lines         11572
Circles (filled)         77834
Circles (outline)        94868
Rectangles (outline)     6886
Rectangles (filled)      334115
Triangles (outline)      41759
Triangles (filled)       123022
Rounded rects (outline)  34875
Rounded rects (filled)   373094
Done!

So I am seeing the same as you. And yeah I goofed
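As a rough cross-check of the ratios, the sketch below divides the screen-fill times posted in this thread. The constant and function names are made up for the illustration, and it assumes the SPI and parallel runs are comparable (same sketch, same board), which the thread does not fully confirm.

```cpp
#include <cassert>

// Ratio of two benchmark times; a value above 1 means `fasterUs` is the quicker run.
inline double speedup(double slowerUs, double fasterUs) {
    assert(fasterUs > 0.0);
    return slowerUs / fasterUs;
}

// Screen-fill times in microseconds, copied from the posts above.
constexpr double kSpiFillUs = 615251.0;            // MicroMod, SPI
constexpr double kParallelFbFillUs = 2592.0;       // parallel, frame buffer on, no updateScreen()
constexpr double kParallelDirectFillUs = 138254.0; // parallel, no frame buffer
```

`speedup(kSpiFillUs, kParallelFbFillUs)` reproduces the suspicious ~237x figure, while `speedup(kSpiFillUs, kParallelDirectFillUs)` comes out between 4 and 5, which sits inside the "up to 8 times" expectation.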


with the framebuffer graphics test example fixed:

Code:
Screen fill              169861
Text                     449
Lines                    5589
Horiz/Vert Lines         1587
Rectangles (outline)     409
Rectangles (filled)      5786
Circles (filled)         2761
Circles (outline)        2139
Triangles (outline)      1083
Triangles (filled)       2559
Rounded rects (outline)  931
Rounded rects (filled)   11922
Done!

Now that's with doing this in the testFillScreen function:
C++:
unsigned long testFillScreen() {
  unsigned long start = micros();
  lcd.fillScreen(ILI9488_BLACK);
  lcd.updateScreen();
  lcd.fillScreen(ILI9488_RED);
  lcd.updateScreen();
  lcd.updateScreen();
  lcd.fillScreen(ILI9488_GREEN);
  lcd.updateScreen();
  lcd.fillScreen(ILI9488_BLUE);
  lcd.updateScreen();
  lcd.fillScreen(ILI9488_BLACK);
  lcd.updateScreen();
  return (micros() - start);
}

Actually seeing an increase in the screen fill time.
 
@mjs513 @KurtE have you tried hooking up two displays to one FlexIO port on different CS pins?
I would be VERY interested to see how that works and what performance can be achieved.
 
@KurtE has been working on async updates to screen buffering on the Teensy MM and T41. So far the MM seems to be working. As a test, I decided to try my Teensy_OpenGl lib examples (3D models) to test screen updates using sync and async with both SPI and the parallel library. Running at 600 MHz, with the LowRes Teapot example with Gouraud shading:

Parallel display:
Sync: ScreenUpdate(us): 111785 typ
Async: ScreenUpdate(us): 96300 typ

SPI Display:
Sync: ScreenUpdate(us): 215699 typ
Async: ScreenUpdate(us): 216973 typ

So this should give you an idea of the deltas.
 
Thanks, we fixed a few issues yesterday with the MMOD, like needing to flush the cache before doing DMA...

As for T4.1, I have what I think are most of the pieces of doing Async using Interrupts integrated in, but it is not working yet. The T41 library works sort of, but found it breaks down if you try for example to change from using 8 shifters to 4. Out of Curiosity may try the 8 in our combined library and see if it helps, but thinking I might try some reworking on how to do the interrupts. In particular, the current code interrupts when Buffer 0 is empty, and then fills 4 (or 8) buffers with new data. I believe the issue is we are not feeding the data fast enough to shifter code...

So I may see how hard it is to do something like HardwareSerial with watermark-type handling: set up to interrupt when buffer 1 or buffer 2 is empty, and fill as many as are empty by the time the ISR is active... And see how doable that is and if it helps.

EDIT: Needless to say if you have a choice of using FlexIO 1 or 2 versus 3, go for 1 or 2!
 
@KurtE check out my T4.1 library (first post in this thread)
It should have interrupts implemented for the T4.1 if my memory serves right
 
Thanks, that is the library I mentioned in the previous post.
The T41 library works sort of, but found it breaks down if you try for example to change from using 8 shifters to 4.
With 8 shifters it is close, although it feels like the alignment is off.
1719101668383.png

If I shift to just 4 shifters, it is completely screwed up:
1719101728235.png


My port of it bails out real quick, and very little shows up on the screen. Still debugging, although I did not get much done today.
 
@KurtE - I ran into the same problem on the T4.1 with the RA8876. I think it was in the IRQ callback. It seems to me there was a missing piece of code. I have to go out for a while but when I get back I'll check on it. You might want to check the IRQ section of the T4.1 branch of my RA8876LiteTeensy repo on GitHub...

EDIT: Found it. Quicker than I thought. Here is the code:
Code:
FASTRUN void RA8876_t3::flexIRQ_Callback(){
  if (p->TIMSTAT & (1 << TIMER_IRQ)) { // interrupt from end of burst
        p->TIMSTAT = (1 << TIMER_IRQ); // clear timer interrupt signal
        bursts_to_complete--;
        if (bursts_to_complete == 0) {
            p->TIMIEN &= ~(1 << TIMER_IRQ); // disable timer interrupt
            asm("dsb");
            WR_IRQTransferDone = true;
            CSHigh();
            _onCompleteCB();
            return;
        }
  }
  if (p->SHIFTSTAT & (1 << SHIFTER_IRQ)) { // interrupt from empty shifter buffer
        // note, the interrupt signal is cleared automatically when writing data to the shifter buffers
        if (bytes_remaining == 0) { // just started final burst, no data to load
            p->SHIFTSIEN &= ~(1 << SHIFTER_IRQ); // disable shifter interrupt signal
        } else if (bytes_remaining < BYTES_PER_BURST) { // just started second-to-last burst, load data for final burst
            uint8_t beats = bytes_remaining / BYTES_PER_BEAT;
            p->TIMCMP[0] = ((beats * 2U - 1) << 8) | (_baud_div / 2U - 1); // takes effect on final burst
            readPtr = finalBurstBuffer;
            bytes_remaining = 0;
            for (int i = 0; i < SHIFTNUM; i++) {
                uint32_t data = *readPtr++;
                p->SHIFTBUFHWS[i] = ((data >> 16) & 0xFFFF) | ((data << 16) & 0xFFFF0000);
                while(0 == (p->SHIFTSTAT & (1U << SHIFTER_IRQ))) {}
            }
        } else {
            bytes_remaining -= BYTES_PER_BURST;
            for (int i = 0; i < SHIFTNUM; i++) {
                uint32_t data = *readPtr++;
                p->SHIFTBUFHWS[i] = ((data >> 16) & 0xFFFF) | ((data << 16) & 0xFFFF0000);
                while(0 == (p->SHIFTSTAT & (1U << SHIFTER_IRQ))) {} // add this line: it is missing in the original code
        }
    }
  }
    asm("dsb");
}
I think this was the problem. It was not waiting for completion :D
 
I now have it working :D

The issue is the order we are filling in the shift buffers. If you look at the DMA code, there is some interesting looking stuff:
Code:
        sourceAddress = (uint16_t *)value + minorLoopBytes / sizeof(uint16_t) - 1; // last 16bit address within current minor loop
        sourceAddressOffset = -sizeof(uint16_t);                                   // read values in reverse order
        minorLoopOffset = 2 * minorLoopBytes;                                      // source address offset at end of minor loop to advance to next minor loop
        sourceAddressLastOffset = minorLoopOffset - TotalSize;                     // source address offset at completion to reset to beginning
        destinationAddress = (uint32_t *)&p->SHIFTBUFBYS[SHIFTNUM - 1];            // last 32bit shifter address (with reverse byte order)
        destinationAddressOffset = -sizeof(uint32_t);                              // write words in reverse order
        destinationAddressLastOffset = 0;
In the Minor loop, the first item it fills is: SHIFTBUFBYS[SHIFTNUM - 1];

So I now have that portion of code in the interrupt case like:
Code:
    if (p->SHIFTSTAT & (1 << SHIFTER_IRQ)) { // interrupt from empty shifter buffer
        DBGWrite('S');
        // note, the interrupt signal is cleared automatically when writing data to the shifter buffers
        if (bytes_remaining == 0) { // just started final burst, no data to load
            p->SHIFTSIEN &= ~(1 << SHIFTER_IRQ); // disable shifter interrupt signal
        } else if (bytes_remaining < BYTES_PER_BURST) { // just started second-to-last burst, load data for final burst
            uint8_t beats = bytes_remaining / BYTES_PER_BEAT;
            p->TIMCMP[0] = ((beats * 2U - 1) << 8) | (_baud_div / 2U - 1); // takes effect on final burst
            readPtr = finalBurstBuffer;
            bytes_remaining = 0;
            for (int i = SHIFTNUM - 1; i >= 0; i--) {
                digitalToggleFast(3);
                uint32_t data = readPtr[i];
                p->SHIFTBUFBYS[i] = ((data >> 16) & 0xFFFF) | ((data << 16) & 0xFFFF0000);

            }
        } else {
            bytes_remaining -= BYTES_PER_BURST;
            // try filling in reverse order
            for (int i = SHIFTNUM - 1; i >= 0; i--) {
                digitalToggleFast(3);
                uint32_t data = readPtr[i];
                p->SHIFTBUFBYS[i] = ((data >> 16) & 0xFFFF) | ((data << 16) & 0xFFFF0000);

            }
            readPtr += SHIFTNUM;
        }
        if (bytes_remaining == 0) {
            DBGWrite('L');
            p->SHIFTSIEN &= ~(1 << SHIFTER_IRQ);
        }
    }
And everything started to work.

There were a few other things to fix, plus turn off debug code, but all of that is now checked into the t41_async branch...

I am thinking about PR it back into our master branch.

Also maybe at some point, I may have to try out my new shiny unused:
1719161013930.png


Side note: This line threw me for a loop:
Code:
p->SHIFTBUFHWS[i] = ((data >> 16) & 0xFFFF) | ((data << 16) & 0xFFFF0000);
If I am reading it correctly: you reverse the words that you pass in to the SHIFTBUFHWS which reverses the words?
If so you might try:
Code:
p->SHIFTBUF[i] = data;
 
Definitely time to merge.

Yeah thought about it but running out of jumpers :)

Did test with the openGL sketch and it works no issues with synch or asynch
 
So that was with DMA and not Async? I was using pushPixels16BitAsync() from the ILI948x_t41_p library when I encountered the problem.
This was in the 16Bit parallel mode on the RA8876 display. Glad it's working :D
 
@KurtE - This is what I was getting without the added line:
Aasync_Error.jpg

The only difference is that it is skewed on the left of the image. Interesting. Is your new display using the GT9271 CTS controller?
EDIT: Something else that is interesting is if I use:
Code:
p->SHIFTBUFBYS[i] = ((data >> 16) & 0xFFFF) | ((data << 16) & 0xFFFF0000);
instead of:
Code:
p->SHIFTBUFHWS[i] = ((data >> 16) & 0xFFFF) | ((data << 16) & 0xFFFF0000);
I get this:
Endianess.jpg


Must be a difference in endianness between the two types of display...
 
@Rezo
I decided to try your T4.0 library, and it doesn't seem to be working. Pins are wired per the README:

Code:
Next, wire up your LCD - use Teensy pins:

pin 21 - WR
pin 20 - RD
pin 19 - D0
pin 18 - D1
pin 14 - D2
pin 15 - D3
pin 17 - D4
pin 16 - D5
pin 22 - D6
pin 23 - D7
Note: it doesn't work even if I hook up RD to 3.3V.

with
Code:
ILI948x_t40_p lcd = ILI948x_t40_p(10, 8, 9); //(dc, cs, rst)

This is in the monitor

Code:
ILI9488 Initialized
CMD: 0x4, SHIFT: 0x4
Dummy 0x0, data 0x0
Manufacturer ID: 0x00
CMD: 0xB, SHIFT: 0xB
Dummy 0x0, data 0x0
MADCTL Mode: 0x00
CMD: 0xC, SHIFT: 0xC
Dummy 0x0, data 0x0
Pixel Format: 0x00
CMD: 0xD, SHIFT: 0xD
Dummy 0x0, data 0x0
Image Format: 0x00
CMD: 0xF, SHIFT: 0xF
Dummy 0x0, data 0x0
Self Diagnostic: Failed (0x00)

And all I get is a white screen - rehooked it up a few times now.
 
Code:
p->SHIFTBUFBYS[i] = ((data >> 16) & 0xFFFF) | ((data << 16) & 0xFFFF0000);
With this one, say the data is in byte order: 0 1 2 3
The shifts, ANDs, and OR put the bytes into the order: 2 3 0 1
and then SHIFTBUFBYS puts them into the order: 1 0 3 2
So the half words are still in the original order, but the two bytes are swapped within each half word...

Code:
p->SHIFTBUFHWS[i] = ((data >> 16) & 0xFFFF) | ((data << 16) & 0xFFFF0000);
Whereas with this one, you do the shifts, ANDs, and OR and have: 2 3 0 1
And then SHIFTBUFHWS, I believe, swaps the two half words again, so you are back to: 0 1 2 3

Which I believe gives the same result as:
Code:
p->SHIFTBUF[i] = data;
But could easily be wrong. I believe it has happened before :unsure: 😆
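The byte-order reasoning above can be checked on a host machine with a small sketch. It assumes that writing SHIFTBUFBYS behaves like reversing all four bytes of the word and that SHIFTBUFHWS swaps the two half words; that is my reading of the register names and of the orderings listed above, not something verified on hardware. The helper names are invented for the example, and the test word 0x00010203 simply labels the bytes 0 1 2 3.

```cpp
#include <cassert>
#include <cstdint>

// Swap the two 16-bit halves: bytes 0 1 2 3 -> 2 3 0 1 (the expression used in the ISR).
constexpr uint32_t swapHalfWords(uint32_t v) {
    return ((v >> 16) & 0xFFFFu) | ((v << 16) & 0xFFFF0000u);
}

// Reverse all four bytes: a b c d -> d c b a (ASSUMED SHIFTBUFBYS behaviour).
constexpr uint32_t reverseBytes(uint32_t v) {
    return ((v >> 24) & 0xFFu) | ((v >> 8) & 0xFF00u) |
           ((v << 8) & 0xFF0000u) | ((v << 24) & 0xFF000000u);
}

// What ends up in the shifter for p->SHIFTBUFBYS[i] = swapHalfWords(data).
constexpr uint32_t viaBys(uint32_t data) { return reverseBytes(swapHalfWords(data)); }

// What ends up in the shifter for p->SHIFTBUFHWS[i] = swapHalfWords(data).
constexpr uint32_t viaHws(uint32_t data) { return swapHalfWords(swapHalfWords(data)); }

constexpr uint32_t kData = 0x00010203u; // bytes labelled 0 1 2 3

static_assert(swapHalfWords(kData) == 0x02030001u, "2 3 0 1 after the manual swap");
static_assert(viaBys(kData) == 0x01000302u, "1 0 3 2: bytes swapped within each half word");
static_assert(viaHws(kData) == kData, "double half-word swap is the identity");
```

Under those assumptions the HWS path is indeed equivalent to writing the unmodified word to SHIFTBUF, and the BYS path differs only by a byte swap inside each half word.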

You might try the way I did it, by filling the buffers in reversed order, in my case 3 2 1 0.
I believe what happens is that if you fill in 0 first, it might trigger the timer starting, and then you might be in
a race with it shifting that 0th item into the output register and finding that it does not have 1 yet, and funny things happen...

But again just a guess
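The guess above can be illustrated with a toy model. Everything here is hypothetical: `ToyShifters` is not FlexIO, and it simply ASSUMES that a write to buffer 0 is what starts the transfer. It then counts how many buffers are still empty at that moment for each fill order.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kShiftNum = 4; // the 4-shifter case discussed above

// Toy model only: ASSUMES a write to buffer 0 starts the transfer.
struct ToyShifters {
    std::array<uint32_t, kShiftNum> buf{};
    std::array<bool, kShiftNum> loaded{};
    bool started = false;
    int emptyAtStart = 0; // buffers still empty when the transfer began

    void write(int i, uint32_t value) {
        assert(i >= 0 && i < kShiftNum);
        buf[i] = value;
        loaded[i] = true;
        if (i == 0 && !started) {
            started = true;
            for (bool isLoaded : loaded) {
                if (!isLoaded) ++emptyAtStart;
            }
        }
    }
};

// Fill 0, 1, 2, 3: buffer 0 is written first.
inline int emptyAtStartAscending(const uint32_t (&src)[kShiftNum]) {
    ToyShifters s;
    for (int i = 0; i < kShiftNum; ++i) s.write(i, src[i]);
    return s.emptyAtStart;
}

// Fill 3, 2, 1, 0: buffer 0 is written last.
inline int emptyAtStartDescending(const uint32_t (&src)[kShiftNum]) {
    ToyShifters s;
    for (int i = kShiftNum - 1; i >= 0; --i) s.write(i, src[i]);
    return s.emptyAtStart;
}
```

In this model the ascending fill starts with three buffers still empty, while the descending fill starts with none empty. Whether the real hardware races this way is exactly the open question in the post.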
 
@mjs513 I’m not sure I ever got the 4.0 version working, although if the 4.1 version works and supports an 8 bit bus, it should work on the 4.0 as well.
 
Thanks for the input :D That makes it a lot more clear. Will have to experiment...
 
Thanks. @KurtE mentioned that you need 8 consecutive FlexIO pins. Looking at the T4, it is not possible to get there from here: the T4 does not have 8 consecutive pins. The pins you have:

Code:
#define DISPLAY_D0 19   // FlexIO3: 0/ 19
#define DISPLAY_D1 18   // FlexIO3: 1/ 18
#define DISPLAY_D2 14   // FlexIO3: 2/ 14
#define DISPLAY_D3 15   // FlexIO3: 3/ 15
#define DISPLAY_D4 17   // FlexIO3: 6/ 17
#define DISPLAY_D5 16   // FlexIO3: 7/ 16
#define DISPLAY_D6 22   // FlexIO3: 8/ 22
#define DISPLAY_D7 23   // FlexIO3: 9/ 23

So going to ignore T4 support for now.
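The "8 consecutive FlexIO pins" requirement is easy to check mechanically. The sketch below is a host-side illustration; `isConsecutive` and `kT40DataPins` are names invented here, and the indices are the FlexIO3 pin numbers from the comments in the defines above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when every entry is exactly one more than the previous one, i.e. an unbroken run.
template <std::size_t N>
inline bool isConsecutive(const uint8_t (&pins)[N]) {
    for (std::size_t i = 1; i < N; ++i) {
        if (pins[i] != pins[i - 1] + 1) return false;
    }
    return true;
}

// FlexIO3 pin indices for D0..D7, copied from the defines above.
constexpr uint8_t kT40DataPins[8] = {0, 1, 2, 3, 6, 7, 8, 9};
```

`isConsecutive(kT40DataPins)` returns false because of the jump from index 3 to index 6, which is the gap that rules out an 8-bit bus on these pins.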
 