Yet another highly optimized ILI9341 library: for T4/4.1 with diff updates and vsync!

vindar

Well-known member
Hello,

I have been happily using (and abusing) user-contributed libraries for a few years now so I think it is time I try to give something back... So here is my SPI driver for the ubiquitous ILI9341 screen:

[github: ILI9341_T4]

I know there are already several good libraries for driving this screen (such as KurtE's ILI9341_t3n lib) but I think my library adds a few neat tricks that make it unique.

First, a warning: the library's sole purpose is to push memory frame buffers to the screen as fast and as efficiently as possible. The library does not contain any drawing primitives of any kind. These are to be handled by a dedicated canvas library drawing onto the memory frame buffer. Also, this library only works with the Teensy 4/4.1. Using multiple buffers requires lots of RAM, so it would not really make sense to adapt it to less powerful MCUs...

Here are the library's main features:

  • smart 'diff' redraw. The driver compares the frame buffer to be uploaded with the previous one (mirroring the current screen content) and uploads (almost) only the pixels that differ. It does so in a smart way to minimize SPI transactions / RAMWR commands. Uploading only part of the screen makes it possible to achieve extremely high frame rates when moderate changes occur between frames (hundreds of FPS for simple cases like UIs). Note that diffs can be arbitrarily complex and are not restricted to rectangular regions...
  • async. updates via DMA. Uploads can be performed directly or asynchronously using DMA (even in the case of complicated diff updates), which means that the MCU is free to do other tasks (like generating the next frame) during updates. DMA SPI transfers can be clocked up to 50MHz and can use the bus at its full capacity when needed. There is almost no delay toggling between command and data packets (thanks to KurtE's DMA code :))
  • adjustable framerate. The screen refresh rate can be adjusted and a fixed frame rate can be set within the driver. Uploads are then timed to meet the requested frame rate. Everything happens using timers and DMA so there is no "busy wait".
  • vsync and screen tearing prevention. Now, this really is the best part :) The driver monitors the position of the scan line currently being refreshed on the screen and orders the pixel updates so that they always trail behind it. This makes it possible to completely suppress screen tearing, provided the update can be done in less than two screen refresh periods! In most cases, this makes it possible to reach a solid 50FPS without any screen tearing by setting the screen refresh rate to 100Hz. And all this with modest SPI speeds...
  • multiple buffering methods. Supports direct upload, double buffering and triple buffering configurations.
  • driver for XPT2046 touchscreen. If present, the driver can manage the associated touchscreen on the same SPI bus. This simplifies the wiring since only 1 or 2 additional wires are needed in that case (for the touch cs pin and, possibly, the irq pin).
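The diff-update strategy described above can be sketched on the host side. This is only the concept, not the library's actual algorithm, and the names below are made up: compare two RGB565 frame buffers and collect the runs of pixels that differ, merging runs separated by a gap smaller than the repositioning overhead, since re-sending a few unchanged pixels is cheaper than issuing a new CASET/PASET/RAMWR sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Span { size_t start; size_t len; };

// Collect runs of differing pixels; merge runs whose gap is below
// `gap` pixels, since re-issuing a repositioning sequence costs more
// than simply re-sending a few unchanged pixels.
std::vector<Span> computeDiff(const uint16_t* prev, const uint16_t* cur,
                              size_t n, size_t gap) {
    std::vector<Span> spans;
    for (size_t i = 0; i < n; ++i) {
        if (prev[i] == cur[i]) continue;
        if (!spans.empty() && i - (spans.back().start + spans.back().len) < gap) {
            spans.back().len = i - spans.back().start + 1;  // merge into previous run
        } else {
            spans.push_back({i, 1});
        }
    }
    return spans;
}
```

With a gap threshold of a few pixels, large unchanged regions are skipped entirely while dense clusters of changes coalesce into a handful of transactions.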

In practice, the library makes it simple to achieve a high frame rate combined with a high quality display. But of course, there is no such thing as a free lunch, and the price to pay is large RAM consumption: around 320KB when using double buffering.

The library is still in beta and there are certainly many bugs to be found, but it already does the job quite nicely. The screen tearing prevention is really fun to observe: if you blit 2 colors alternately fast enough, they will simply melt into one instead of making an awful mess... There is a demo of this in the example folder of the library [and you can check the code: the violet color displayed on the screen is never uploaded to the screen ^^].

I did not write a separate documentation for the library but the code itself is pretty well documented. Info about public methods can be found in the header file in front of their declarations.

I hope someone finds this library useful!
Best,
 
Terrific! Couple of thoughts/questions:

- Have you thought about mixing this with Adafruit_GFX hierarchy to provide compatibility with that API? Not sure where the actual primitives would come from, I assume that there is a RAM frame buffer based library out there that would work.

- Would this work with the T4.1 QSPI RAM add-on?
 
Adafruit_GFX already has frame buffer/canvas objects that work with all their primitives.
 
Terrific! Couple of thoughts/questions:

- Have you thought about mixing this with Adafruit_GFX hierarchy to provide compatibility with that API? Not sure where the actual primitives would come from, I assume that there is a RAM frame buffer based library out there that would work.

- Would this work with the T4.1 QSPI RAM add-on?

Hi blackketter,

As vjmuzik mentioned, there are already good libraries dedicated to drawing on a memory frame buffer. I think adding drawing primitives to this library would clutter it needlessly. It makes sense to do so when a library implements direct drawing of primitives (as the ILI9341_t3(n) library does), because that enables a huge speedup compared to drawing pixels one after the other, but this does not apply here.

Concerning QSPI RAM, that is a very good question... Well, I just tried it and, surprisingly enough, it works! Both the user and the library's internal frame buffers can be put in external RAM. This means that SPI DMA can push pixels directly from the external RAM to the screen, which I thought would not work. However, as expected, access to the external RAM is slow, so it creates a significant slowdown. Here are the framerates obtained for the "99luftballons" example of the library when setting the vsync_spacing parameter to 0 to measure the maximum fps. I tested different memory locations by simply adding the DMAMEM/EXTMEM directive in front of the buffers.

Code:
Example: "99luftballons.ino"
SPI@30Mhz
vsync_spacing = 0
2x6K diff buffers in DTCM.

------------------------------------
|   User FB   | Internal FB |  FPS |
------------------------------------
|    DTCM     |    DTCM     |  70  |
|   DMAMEM    |   DMAMEM    |  66  |
|   EXTRAM    | DTCM/DMAMEM |  44  |
| DTCM/DMAMEM |   EXTRAM    |  41  |
|   EXTRAM    |   EXTRAM    |  21  |
------------------------------------

To put these results in perspective, the theoretical max speed that can be achieved with SPI at 30MHz when pushing full frames is about 24fps, so we still get a nice speed bump even with one frame buffer in external memory. The best choice is probably to put the internal frame buffer in EXTMEM and keep the user frame buffer, which is accessed frequently for drawing, in DTCM... And since EXTMEM is huge, it doesn't really cost anything to do triple buffering with 2 internal buffers in EXTMEM. This may provide a few more fps (but probably not much).
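The 24fps figure above is just bus arithmetic: a full 320x240 RGB565 frame is 1,228,800 bits, so a 30MHz SPI clock can carry at most about 24.4 of them per second (ignoring command overhead):

```cpp
#include <cassert>
#include <cmath>

// Theoretical full-frame upload rate: each of the w x h pixels is a
// 16-bit RGB565 word, and command/repositioning overhead is ignored.
double maxFullFrameFps(double spiHz, int w = 320, int h = 240, int bitsPerPixel = 16) {
    return spiHz / (static_cast<double>(w) * h * bitsPerPixel);
}
```

At 50MHz, the same formula gives about 40.7fps, which is why full-frame video needs the faster clock.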
 
I look forward to trying this library. I will have to do a bit of rewiring to use it. I am currently running ILI9341_t3n with pin 9 wired as the D/C control. Since pin 9 is not an SPI CS pin, I will have to move the D/C to pin 36 or 37. I am using the ILI9341 as a display with my OV7670 camera and the fast frame refresh capability should come in handy.

I have noticed a bit of streaking when I send camera buffers directly to the ILI9341 as both the camera and display are competing for DMAMEM. Do you think your anti-tearing code will have any effect on this?

The "refresh only what has changed" code seems interesting, but I'm not sure whether it will help much with camera images, where one or two bits of difference in pixel values between successive frames are quite common.
 
Hi,

I look forward to trying this library. I will have to do a bit of rewiring to use it. I am currently running ILI9341_t3n with pin 9 wired as the D/C control. Since pin 9 is not an SPI CS pin, I will have to move the D/C to pin 36 or 37. I am using the ILI9341 as a display with my OV7670 camera and the fast frame refresh capability should come in handy.

I have noticed a bit of streaking when I send camera buffers directly to the ILI9341 as both the camera and display are competing for DMAMEM. Do you think your anti-tearing code will have any effect on this?

I have no idea ! Please try it and let me know :)
But in fact, I suspect my library may make things even worse in this case, since it uses more interrupts than ILI9341_t3n and thus makes memory access less predictable, which is not a good thing for the cache...


The "refresh only what has changed" code seems interesting, but I'm not sure whether it will help much with camera images, where one or two bits of difference in pixel values between successive frames are quite common.

Yes, the typical usage is for UIs and other CPU-generated frames with solid colors that do not randomly fluctuate over time. I am considering adding an option to only update pixels whose value changes by more than a given threshold. This might be interesting in use cases like yours and may even cancel some noise, but I am still thinking about the best way to implement it.
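One possible shape for such a threshold test (purely a sketch of the idea, not anything the library implements today): unpack each RGB565 pixel and flag it as changed only when some channel moves by more than the threshold, so single-LSB camera noise is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Return true when two RGB565 pixels differ by more than `thresh`
// in any one channel (channels compared at their native precision:
// 5-bit red, 6-bit green, 5-bit blue).
bool pixelChanged(uint16_t a, uint16_t b, int thresh) {
    int dr = std::abs(((a >> 11) & 0x1F) - ((b >> 11) & 0x1F));
    int dg = std::abs(((a >> 5)  & 0x3F) - ((b >> 5)  & 0x3F));
    int db = std::abs((a & 0x1F) - (b & 0x1F));
    return dr > thresh || dg > thresh || db > thresh;
}
```

A diff built from this predicate instead of strict equality would skip pixels whose fluctuation stays under the threshold.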

I guess you want 30FPS, which is the framerate of the camera, right? If you are doing mostly full redraws because the camera image fluctuates continuously, then you will need a 50MHz SPI. I noticed that using SPI speeds faster than 55MHz seems to cause trouble but, unfortunately, I cannot really debug it because my logic analyzer gives up at 40MHz...

Anyway, you can try getting a stable 30Hz framerate without screen tearing by setting:
Code:
tft.setDiffSplit(8);
tft.setRefreshRate(60);
tft.setVsyncSpacing(2);

Moreover, if you know that the "diffs" produced are mostly trivial because too many pixels change each time, then you may want to put
Code:
tft.forceFullRedraw();
just before the tft.update() call. This will prevent the driver from computing the diff, saving about 1ms of CPU time every frame (but a diff buffer must still initially be loaded into the driver to enable double buffering).

Most importantly: if you can afford it, use screen orientation 0, because this is the only orientation where the frame buffer pixel layout matches the screen refresh order. In this case, the number of SPI transactions is much lower for trivial diffs because they do not need to be broken at each new line. In orientation 0, the library should theoretically perform at least as well as ILI9341_t3n... but only theoretically, of course :)
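The orientation effect can be made concrete with a little counting: a full-width band of changed rows is one contiguous run when the buffer layout matches the refresh order, but the same band addressed against the grain splinters into one run per line, each needing its own repositioning sequence. A toy illustration, counting contiguous runs of changed pixels in a row-major mask:

```cpp
#include <cassert>
#include <vector>

// Count contiguous runs of "changed" pixels in a row-major mask.
// Each run roughly corresponds to one CASET/PASET/RAMWR sequence.
int countRuns(const std::vector<bool>& changed) {
    int runs = 0;
    bool inRun = false;
    for (bool c : changed) {
        if (c && !inRun) ++runs;
        inRun = c;
    }
    return runs;
}
```

For an 8-wide buffer with two full rows changed, the native layout yields a single run, while the 90-degree-rotated layout of the same change yields one run per row of the rotated buffer.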

One last thing: the printStats() method (implemented for both the driver and diff objects) is really handy for debugging and optimizing the settings...
 
Wow, really glad to see you used the vsync info! I had long dreamed of doing that with ILI9341_t3, and months ago I even added the functions to read the scan line... but never had the time to actually put it to good use.
 
I wasn't aware of that Paul, otherwise I would have contributed my code for that, which I wrote last year :)
 
@vindar and @Frank B and all - I would always be glad to merge it into ili9341_t3n...

Sounds like a great thing to have!
 
Most importantly: if you can afford it, use screen orientation 0, because this is the only orientation where the frame buffer pixel layout matches the screen refresh order. In this case, the number of SPI transactions is much lower for trivial diffs because they do not need to be broken at each new line. In orientation 0, the library should theoretically perform at least as well as ILI9341_t3n... but only theoretically, of course :)

Unfortunately, screen rotation 0 does not play well with the OV7670. You need to use rotation 3 to get the long axis of the display to align with the long axis of the camera window. Does the rotation matter as much if I am going to skip the computation of the diffs and always send the full image?

There is also the possibility of rotating the camera image to match the display with an external function like this:
Code:
void rotate(uint16_t *pS, uint16_t *pD, uint16_t row, uint16_t col) {
    uint32_t r, c;
    for (r = 0; r < row; r++) {
        for (c = 0; c < col; c++) {
            *(pD + c * row + (row - r - 1)) = *(pS + r * col + c);
        }
    }
}
It should take only a millisecond or two to do that rotation. If run time becomes an issue I suppose I could pre-compute an array of output indices and the rotation would reduce to
Code:
for(i = 0; i< imageSize; i++ ) rotmap[rotindex[i]] = srcmap[i];

however the rotindex array has to be of type uint32_t, so it would require 307,200 bytes. There's plenty of room in EXTMEM, but the slower access times might end up taking more time than just computing the indices on the fly as in the first example.

Still on my to-do list is to learn to use the pixel pipeline hardware of the IMXRT1062. It seems that it should be able to handle the rotation of the image without extra computation time and memory.
 
Unfortunately, screen rotation 0 does not play well with the OV7670. You need to use rotation 3 to get the long axis of the display to align with the long axis of the camera window. Does the rotation matter as much if I am going to skip the computation of the diffs and always send the full image?

There is also the possibility of rotating the camera image to match the display with an external function like this:
Code:
void rotate(uint16_t *pS, uint16_t *pD, uint16_t row, uint16_t col) {
    uint32_t r, c;
    for (r = 0; r < row; r++) {
        for (c = 0; c < col; c++) {
            *(pD + c * row + (row - r - 1)) = *(pS + r * col + c);
        }
    }
}
It should take only a millisecond or two to do that rotation. If run time becomes an issue I suppose I could pre-compute an array of output indices and the rotation would reduce to
Code:
for(i = 0; i< imageSize; i++ ) rotmap[rotindex[i]] = srcmap[i];

however the rotindex array has to be of type uint32_t, so it would require 307,200 bytes. There's plenty of room in EXTMEM, but the slower access times might end up taking more time than just computing the indices on the fly as in the first example.

Still on my to-do list is to learn to use the pixel pipeline hardware of the IMXRT1062. It seems that it should be able to handle the rotation of the image without extra computation time and memory.
Note: I will probably be playing with at least a little of the stuff in this library in the _t3n library, to at a minimum set speeds and add some rudimentary check for the scan line, and see how to do some quick and dirty synchronization, at least as an experiment.

Also again remember back on the other thread: https://forum.pjrc.com/threads/6319...IRQ-Teensy-4-0?p=260348&viewfull=1#post260348
I created some configurations of the camera setup that, instead of giving you a 320x240 image by shrinking by 2 in the camera, give you a 240x320 image by creating a window in the 640x480 raw image... Not sure if that would help you here or not.
 
@mborgeson

I wanted to see how the driver handles video, so I hooked up an esp-cam streaming video to the Teensy via SPI (with the Teensy as slave, receiving jpeg-encoded images and decoding them on the fly), but it can only receive 320x240 frames at 20FPS. The images I get are very noisy and, as you predicted, this almost completely negates the benefit of the diff buffer :(

As such, the driver performs almost identically to KurtE's library and the max fps is pretty much capped by the SPI bus speed. And if you want to enable screen tearing prevention, the fps will drop by a few frames (and it did not make a significant visual improvement for video, in my opinion). But of course, my images are really noisy and a lot of the CPU time is consumed by the jpeg decoding, so this may also partly explain the relatively disappointing performance. I will try to investigate further and see if some filtering can help. In the meantime, you may not want to go through the hassle of rewiring everything if it is complicated... at least not yet :)

@Kurte

I will try to see what I can PR to your great library, especially since I have stolen all your DMA/SPI code!
I can only imagine how long it took you to understand all this, with all the corner cases and hard-to-debug bugs... Without your code I would have had no idea where to even begin, thanks! For the time being, I feel the library is neither mature nor stable enough to PR parts of it, but I will not forget.
 
Yes, that's why I did not use it. But I guess it is OK for a lot of applications.

Yes, I agree. It heavily depends on the application and where to set the trade-off between fps drop and screen tearing... Also, this frame drop is the reason I wanted to push diff updates, to compensate somehow.

And speaking of applications, my underlying motivation is to use the driver with a custom 3D engine and I think this is exactly the kind of application where it should shine :)
 
Since some of Vindar's functionality prefers rotation 0 on the ILI9341, I tested the time to rotate an image 90 degrees to work with that screen rotation.
Here are the results for rotating a QVGA image as a function of the memory type for the source and destination (rotated image):
Code:
Time to rotate QVGA image by 90 degrees
Optimization setting: Faster
Source	Dest		Time (microseconds)
--------------------------------
DTCM	DTCM		515   to 516
DMAMEM	DTCM		644   to 645
DMAMEM	DMAMEM		896   to 905
EXTMEM	DTCM		5430  to 5443
EXTMEM	DMAMEM		5516  to 5519
DMAMEM	EXTMEM		14180 to 14667
EXTMEM	EXTMEM		19186 to 19622

Here is the rotation code:

Code:
void rotate(uint16_t *pS,  uint16_t *pD, uint16_t row,  uint16_t col) { 
    uint32_t r, c; 
    for (r = 0; r < row; r++)  { 
        for (c = 0; c < col; c++)  { 
            *(pD + c * row + (row - r - 1)) =  *pS++; 
        } 
    }
}

As you can see from the data table, rotation times go up greatly when EXTMEM is involved. The penalty is pretty small if the rotated image is in DTCM and still under a millisecond if both source and destination are in DMAMEM.
 
Since using some of Vindar's functionality prefers rotation 0 on the ILI9341, I tested the time to rotate an image 90 degrees to work with that screen rotation.
Here are the results for rotating a QVGA image as a function of the memory type for the source and destination (rotated image):
Code:
Time to rotate QVGA image by 90 degrees
Optimization setting: Faster
Source	Dest		Time (microseconds)
--------------------------------
DTCM	DTCM		515   to 516
DMAMEM	DTCM		644   to 645
DMAMEM	DMAMEM		896   to 905
EXTMEM	DTCM		5430  to 5443
EXTMEM	DMAMEM		5516  to 5519
DMAMEM	EXTMEM		14180 to 14667
EXTMEM	EXTMEM		19186 to 19622

Here is the rotation code:

Code:
void rotate(uint16_t *pS,  uint16_t *pD, uint16_t row,  uint16_t col) { 
    uint32_t r, c; 
    for (r = 0; r < row; r++)  { 
        for (c = 0; c < col; c++)  { 
            *(pD + c * row + (row - r - 1)) =  *pS++; 
        } 
    }
}

As you can see from the data table, rotation times go up greatly when EXTMEM is involved. The penalty is pretty small if the rotated image is in DTCM and still under a millisecond if both source and destination are in DMAMEM.

The problem with rotating an image is that it is not possible to access both the src and dst buffers linearly, and when using EXTMEM, optimizing cache access is a must. As you can see, EXTMEM -> DMAMEM is much faster than DMAMEM -> EXTMEM, but I think that is mostly because your code accesses the src buffer linearly and not the destination. If you create symmetric code which accesses the destination linearly instead, I think DMAMEM -> EXTMEM should be much faster...
 
The problem with rotating an image is that it is not possible to access both the src and dst buffers linearly, and when using EXTMEM, optimizing cache access is a must. As you can see, EXTMEM -> DMAMEM is much faster than DMAMEM -> EXTMEM, but I think that is mostly because your code accesses the src buffer linearly and not the destination. If you create symmetric code which accesses the destination linearly instead, I think DMAMEM -> EXTMEM should be much faster...

I changed the algorithm to step incrementally in destination addresses and got the following:
Code:
DMAMEM	EXTMEM		7919  to 8109
DMAMEM	DMAMEM		1062  to 1110

Here is the source for the modified rotate function:
Code:
// step through destination pixels in sequence
void rotate(uint16_t *pS,  uint16_t *pD, const uint16_t row, const  uint16_t col) {
  uint32_t r, c;
  for (r = 0; r < row; r++)  {
    for (c = 0; c < col; c++)  {
      *(pD++)  =  *(pS + c * row + (row - r - 1));
    }
  }
}

I'm not sure why the DMAMEM->DMAMEM times here are longer than when the code steps incrementally through the source pixels. It may have something to do with the fact that I had to swap the input dimensions to get the image to come out right.

In any case, having either source or destination in EXTMEM adds a heavy penalty in execution time just because of the slower access to the PSRAM.
 
In any case, having either source or destination in EXTMEM adds a heavy penalty in execution time just because of the slower access to the PSRAM.
As mentioned in the other thread, I am playing with the camera as well and looking at custom camera settings to read an ILI9341-sized image in the other orientation...

Also, another side note: the PSRAM is configured by default to run at 88MHz.

You might get away with speeding up the memory access:

It is configured in the core's startup.c, in FLASHMEM void configure_external_ram():

Code:
	// turn on clock  (TODO: increase clock speed later, slow & cautious for first release)
	CCM_CBCMR = (CCM_CBCMR & ~(CCM_CBCMR_FLEXSPI2_PODF_MASK | CCM_CBCMR_FLEXSPI2_CLK_SEL_MASK))
		| CCM_CBCMR_FLEXSPI2_PODF(5) | CCM_CBCMR_FLEXSPI2_CLK_SEL(3); // 88 MHz
This is configured for 88MHz.
The CCM_CBCMR_FLEXSPI2_CLK_SEL(3):
Code:
Selector for flexspi2 clock multiplexer
00 derive clock from PLL2 PFD2
01 derive clock from PLL3 PFD0
10 derive clock from PLL3 PFD1
11 derive clock from PLL2 (pll2_main_clk)   <-- selected
The CCM_CBCMR_FLEXSPI2_PODF(5):
Code:
000 divide by 1
001 divide by 2
010 divide by 3
011 divide by 4
100 divide by 5
101 divide by 6   <-- selected
110 divide by 7
111 divide by 8
528MHz / 6 = 88MHz...
If you change to CCM_CBCMR_FLEXSPI2_PODF(4), you get 528 / 5 = 105.6MHz, which I have used.
Or
If you change to CCM_CBCMR_FLEXSPI2_PODF(3), you get 528 / 4 = 132MHz, which failed on one board I tried...
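The divider arithmetic above is easy to check: with the mux selecting PLL2 (528MHz), the FlexSPI2 clock comes out to 528 divided by (PODF field value + 1):

```cpp
#include <cassert>
#include <cmath>

// FlexSPI2 clock in MHz when CCM_CBCMR_FLEXSPI2_CLK_SEL selects PLL2
// (528MHz): the 3-bit PODF field encodes "divide by field + 1".
double flexspi2Mhz(int podfField) {
    return 528.0 / (podfField + 1);
}
```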
 
Hi again, I was thinking of adding the ReadScanline-like code into my library, but I am wondering if your current version is correct?
Code:
    int ILI9341Driver::_getScanLine(bool sync)
        {
        if (!sync)
            {
            return (((((uint64_t)_synced_em) * ILI9341_T4_NB_SCANLINE) / _period) + _synced_scanline) % ILI9341_T4_NB_SCANLINE;
            }
        int res[3] = { 255 }; // invalid value.
        _beginSPITransaction(_spi_clock_read);
        _maybeUpdateTCR(_tcr_dc_assert | LPSPI_TCR_FRAMESZ(7) | LPSPI_TCR_CONT);
        _pimxrt_spi->TDR = 0x45; // send command
        delayMicroseconds(5); // wait as requested by manual. 
        _maybeUpdateTCR(_tcr_dc_not_assert | LPSPI_TCR_FRAMESZ(7));
        _pimxrt_spi->TDR = 0; // send nothing
        _maybeUpdateTCR(_tcr_dc_not_assert | LPSPI_TCR_FRAMESZ(7)); // <-- highlighted line
        _pimxrt_spi->TDR = 0; // send nothing        
        uint8_t rx_count = 3;
        while (rx_count)
            { // receive answer. 
            if ((_pimxrt_spi->RSR & LPSPI_RSR_RXEMPTY) == 0)
                {
                res[--rx_count] = _pimxrt_spi->RDR;
                }
            }
        _synced_em = 0;
        _synced_scanline = res[0];
        _endSPITransaction();
        return res[0];
        }

In particular: I believe from the spec that there is a command and 3 bytes of parameters, where the first is a dummy byte, followed by the MSB (2 bits) and the LSB (8 bits).
I am only seeing two parameter bytes being sent? Paul's version in ILI9341_t3 makes more sense:
Code:
	beginSPITransaction(ILI9341_SPICLOCK_READ);
	if (_dcport) {
		// DC pin is controlled by GPIO
		DIRECT_WRITE_LOW(_dcport, _dcpinmask);
		IMXRT_LPSPI4_S.SR = LPSPI_SR_TCF | LPSPI_SR_FCF | LPSPI_SR_WCF;
		IMXRT_LPSPI4_S.TCR = LPSPI_TCR_FRAMESZ(7) | LPSPI_TCR_RXMSK | LPSPI_TCR_CONT;
		IMXRT_LPSPI4_S.TDR = 0x45;
		while (!(IMXRT_LPSPI4_S.SR & LPSPI_SR_WCF)) ; // wait until word complete
		DIRECT_WRITE_HIGH(_dcport, _dcpinmask);
		IMXRT_LPSPI4_S.TDR = 0;
		IMXRT_LPSPI4_S.TCR = LPSPI_TCR_FRAMESZ(15);
		IMXRT_LPSPI4_S.TDR = 0;
		while (!(IMXRT_LPSPI4_S.SR & LPSPI_SR_WCF)) ; // wait until word complete
		while (((IMXRT_LPSPI4_S.FSR >> 16) & 0x1F) == 0) ; // wait until rx fifo not empty
		line = IMXRT_LPSPI4_S.RDR >> 7;
		//if (IMXRT_LPSPI4_S.FSR != 0) Serial.println("ERROR: junk remains in FIFO!!!");
	} else {
		// DC pin is controlled by SPI CS hardware
		// TODO...
	}
	endSPITransaction();
	return line;
What I am wondering is if the line I highlighted in your code might instead be:
Code:
        _maybeUpdateTCR(_tcr_dc_not_assert | LPSPI_TCR_FRAMESZ(15));
which would imply the last word returned in RDR contains both bytes.
Although I am also wondering about the >> 7 in Paul's?
 
Hi again, I was thinking of adding in the ReadScanline like code into my library, but I am wondering if your current one is correct?
Code:
    int ILI9341Driver::_getScanLine(bool sync)
        {
        if (!sync)
            {
            return (((((uint64_t)_synced_em) * ILI9341_T4_NB_SCANLINE) / _period) + _synced_scanline) % ILI9341_T4_NB_SCANLINE;
            }
        int res[3] = { 255 }; // invalid value.
        _beginSPITransaction(_spi_clock_read);
        _maybeUpdateTCR(_tcr_dc_assert | LPSPI_TCR_FRAMESZ(7) | LPSPI_TCR_CONT);
        _pimxrt_spi->TDR = 0x45; // send command
        delayMicroseconds(5); // wait as requested by manual. 
        _maybeUpdateTCR(_tcr_dc_not_assert | LPSPI_TCR_FRAMESZ(7));
        _pimxrt_spi->TDR = 0; // send nothing
        _maybeUpdateTCR(_tcr_dc_not_assert | LPSPI_TCR_FRAMESZ(7)); // <-- highlighted line
        _pimxrt_spi->TDR = 0; // send nothing        
        uint8_t rx_count = 3;
        while (rx_count)
            { // receive answer. 
            if ((_pimxrt_spi->RSR & LPSPI_RSR_RXEMPTY) == 0)
                {
                res[--rx_count] = _pimxrt_spi->RDR;
                }
            }
        _synced_em = 0;
        _synced_scanline = res[0];
        _endSPITransaction();
        return res[0];
        }

In particular: I believe from the spec that there is a command and 3 bytes of parameters, where the first is a dummy byte, followed by the MSB (2 bits) and the LSB (8 bits).
I am only seeing two parameter bytes being sent? Paul's version in ILI9341_t3 makes more sense:
Code:
	beginSPITransaction(ILI9341_SPICLOCK_READ);
	if (_dcport) {
		// DC pin is controlled by GPIO
		DIRECT_WRITE_LOW(_dcport, _dcpinmask);
		IMXRT_LPSPI4_S.SR = LPSPI_SR_TCF | LPSPI_SR_FCF | LPSPI_SR_WCF;
		IMXRT_LPSPI4_S.TCR = LPSPI_TCR_FRAMESZ(7) | LPSPI_TCR_RXMSK | LPSPI_TCR_CONT;
		IMXRT_LPSPI4_S.TDR = 0x45;
		while (!(IMXRT_LPSPI4_S.SR & LPSPI_SR_WCF)) ; // wait until word complete
		DIRECT_WRITE_HIGH(_dcport, _dcpinmask);
		IMXRT_LPSPI4_S.TDR = 0;
		IMXRT_LPSPI4_S.TCR = LPSPI_TCR_FRAMESZ(15);
		IMXRT_LPSPI4_S.TDR = 0;
		while (!(IMXRT_LPSPI4_S.SR & LPSPI_SR_WCF)) ; // wait until word complete
		while (((IMXRT_LPSPI4_S.FSR >> 16) & 0x1F) == 0) ; // wait until rx fifo not empty
		line = IMXRT_LPSPI4_S.RDR >> 7;
		//if (IMXRT_LPSPI4_S.FSR != 0) Serial.println("ERROR: junk remains in FIFO!!!");
	} else {
		// DC pin is controlled by SPI CS hardware
		// TODO...
	}
	endSPITransaction();
	return line;
What I am wondering is if the line I highlighted in your code might instead be:
Code:
        _maybeUpdateTCR(_tcr_dc_not_assert | LPSPI_TCR_FRAMESZ(15));
which would imply the last word returned in RDR contains both bytes.
Although I am also wondering about the >> 7 in Paul's?

Hi,

Yes, Paul's version is probably better than mine :)

Indeed, I am missing a bit in my code since I am not using the lsb of the second byte received. I do not know why, but when I first experimented, I had reliability issues with this last bit, which seemed to randomly flip, so I decided to just drop it. That is why my scanline only goes up to 161 (= 322 / 2, with 322 = 320 visible lines + front porch + back porch). Losing 1 bit of precision does not matter for the vsync purpose, so I never tried to fix this! I guess Paul's version is the correct one and his >> 7 shift gives him the complete 9-bit scanline.

Anyway, the good thing is that getting the scanline does not need to be particularly fast, because it can then be extrapolated once the display refresh rate has been sampled (beware: it really varies from display to display, so it needs to be sampled at initialization). This extrapolation is stable enough to give the exact scanline for a few frames before it starts drifting. This is why I only query the scanline once at the start of each frame to resync, but not during sub-frame synchronization.
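The extrapolation described here matches the `!sync` branch of `_getScanLine()` quoted earlier: scale the time elapsed since the last hard read by scanlines-per-refresh-period, add the scanline read at that sync, and wrap modulo the scanline count. As a standalone sketch (162 scanlines is used purely for illustration, matching the 0..161 range mentioned above):

```cpp
#include <cassert>
#include <cstdint>

// Estimate the current scanline without touching the SPI bus:
// elapsed_us microseconds have passed since `synced_line` was read,
// and one full refresh takes period_us microseconds.
int estimateScanline(uint64_t elapsed_us, uint64_t period_us,
                     int synced_line, int nb_scanline) {
    return static_cast<int>((elapsed_us * nb_scanline / period_us + synced_line)
                            % nb_scanline);
}
```

With a 10ms refresh period (100Hz), half a period after syncing at line 0, the estimate lands halfway down the panel.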
 
Also, I just realized a few things today which gave me a huge improvement in upload speed:

- The PASET and CASET commands are supposed to send two 16-bit values: the start and the end value. But in fact, it is possible to send only the first word (the start index) and omit the second (just set it once to the maximum possible value). Also, if the x or y coordinate does not change, the corresponding CASET/PASET command need not be resent before the next RAMWR command. With these shortcuts, it turns out that, on average, a "repositioning sequence" CASET/PASET/RAMWR costs no more than sending 3-4 pixels. Thus it is possible to compute a much finer diff and break the stream of pixels much more often, which gives a faster overall upload rate. The 99luftballons example went from 70fps to almost 90fps at full speed.
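That observation can be phrased as a simple break-even rule: with the shortened repositioning sequence costing roughly what 3-4 pixels cost on the wire, a gap of unchanged pixels is worth skipping only when it is longer than that overhead. A sketch of the resulting cost model (the 4-pixel overhead figure is the estimate from the post above; function names are made up):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Wire cost (in pixel-equivalents) of an update, given the runs of
// changed pixels and the gaps between them: each gap is either
// overwritten (cost = gap length) or skipped with a new repositioning
// sequence (cost = overhead_px), whichever is cheaper.
size_t uploadCostPx(const std::vector<size_t>& runLens,
                    const std::vector<size_t>& gapLens, size_t overhead_px) {
    size_t cost = overhead_px;                 // initial CASET/PASET/RAMWR
    for (size_t r : runLens) cost += r;        // changed pixels are always sent
    for (size_t g : gapLens)
        cost += std::min(g, overhead_px);      // resend small gaps, skip big ones
    return cost;
}
```

The smaller the overhead, the finer the diff that pays off, which is exactly why trimming the repositioning sequence lets the driver break the pixel stream more often.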
 
Sounds like a good win in your case!

I played with that some in earlier stuff (more with skipping whole PASET and CASET commands), but in the generic case, with different processors, different SPI hardware, and maybe no DC on a hardware CS pin... I punted, at least back then.

Note: I am still playing around with some of the getscan code and running into some interesting issues trying to get valid data back.

But one difference is I am trying it on SPI instead of SPI1, which should make no difference, AND DC is not on the hardware CS pin, which I would like to make work.
For example, the test board I am playing with has DC on pin 9... Also, I figured out the code I posted HAS an issue with which pin is the CS pin (@defragster and @mjs513): the CS pin is pin 7, not 10 :eek:

Now I could redo the board and change DC from 9 to 10, BUT on this board I did not use pin 10 for SPI: I added the ability for sound and wanted to hook it up to one of the MQS pins (10 or 12); it could not be 12, as that is the only MISO pin... so I chose 10...

I probably cannot move to SPI1, as the board is set up to use the CSI pins: on SPI1 you would use pins 26=MOSI1 and 27=SCK1, but these are also the only CSI_D3 and CSI_D2 pins... I could maybe use SPI2, but those pins are only on the bottom pads, which I am using for external memory and the SD card; I could use an SD adapter but would prefer not to... So, the fun of so many functions and so few pin choices.

Now back to debugging.
 
If you are curious, I thought I would show the Logic Analyzer differences between running your library with your 99luft... example (top trace) and running my code in ILI9341_t3n, but again with software-driven DC...

Your code shows a reasonable progression of numbers; mine so far does not...
They both output the 0x45, have about a 3us gap, then output two 0s... The one difference I see is that yours holds the clock line high between the 0x45 and the first 0 output... They both ask for the CONT bit. But maybe there's a difference for DC... Still playing.
[attachment: screenshot.jpg]
 
Sounds like a good win in your case!
...
But one difference is I am trying it on SPI instead of SPI1 which should make no difference AND DC is not on the hardware CS pin, which I would like to make work as:
For example the Test board I am playing with has DC on pin 9... Also figured out my code I posted HAS an issue with which pin is the CS pin (@defragster and @mjs513) the CS pin is pin 7 not 10 :eek:
...

Good to know you found the sketch CS isn't 10, but 7. It seems I found and noted that once on an OV cam thread and had to wonder... but I have been away from that after seeing the hardware work with @mborgerson's code.
 