T4.x DMAMEM and RA8876 and SPI - (Paul?) - Large image does not display correct...

KurtE

Senior Member+
Thought about just adding this to the RA8876 thread, BUT there is an interesting issue having to do (I think) with SPI doing DMA transfer from DMAMEM not getting the right bits...

Note for those with an RA8876: this is a version of the embedded picture sketch that I have been playing with, but I made it for the current MASTER branch without PRs.

That is, it uses: tft.putPicture(start_x, start_y, image_width, image_height, (const unsigned char*)image);
So it might be interesting to see if you see similar results.

The whole sketch is included.

Some bits and pieces to explain what is happening.

The main image is converted into part of the sketch:
Code:
// Generated by   : ImageConverter 565 Online
// Generated from : T4.1-Cardlike.gif
// Time generated : Sat, 08 Aug 20 02:16:56 +0200  (Server timezone: CET)
// Image Size     : 575x424 pixels
// Memory usage   : 487600 bytes


const unsigned short teensy41_Cardlike[243800] PROGMEM={
0xCE79, 0xEF7D, 0xDEFB, 0xEF7D, 0xDEFB, 0xEF7D, 0xDEFB, 0xEF7D, 0xDEFB, 0xEF7D, 0xDEFB, 0xCE79, 0xEF7D, 0xDEFB, 0xEF7D, 0xDEFB,   // 0x0010 (16) pixels
0xEF7D, 0xDEFB, 0xEF7D, 0xDEFB, 0xEF7D, 0xDEFB, 0xEF7D, 0xDEFB, 0xEF7D, 0xDEFB, 0xEF7D, 0xCE79, 0xDEFB, 0xDF7D, 0xEEFD, 0xDF7D,   // 0x0020 (32) pixels
0xEEFD, 0xDF7D, 0xEEFD, 0xDF7D, 0xEEFD, 0xDF7D, 0xEEFD, 0xDF7D, 0xEEFD, 0xDF7D, 0xEEFD, 0xDF7D, 0xEEFD, 0xDF7D, 0xEEFD, 0xDF7D,   // 0x0030 (48) pixels
...
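
For anyone unfamiliar with the format: each entry in that array is one 16-bit RGB565 pixel, which is why 575x424 pixels come to 487600 bytes. A minimal sketch of the packing that can be compiled on a PC. The helper names are my own and are not part of the RA8876 library:

```cpp
#include <cstdint>

// Pack 8-bit R, G, B into one RGB565 pixel: 5 bits red, 6 bits green, 5 bits blue.
// Hypothetical helper for illustration; not part of the RA8876 library.
constexpr uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b) {
  return static_cast<uint16_t>(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}

// Number of bytes needed for a w x h image at 2 bytes per pixel.
constexpr uint32_t imageBytes(uint32_t w, uint32_t h) {
  return w * h * 2u;
}
```

With these, rgb565(255, 0, 0) gives 0xF800 (pure red) and imageBytes(575, 424) gives 487600.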

And I have a function that clears the screen, centers the image, and draws the rest of the screen in some color. It also puts up text showing how long the call (in this case to putPicture) took, and draws a second number showing how long it took including the output of the first number. This was done because the internal code in RA8876 calls async SPI.transfer(buffer, nullptr, size, event), so the transfer is still happening when the first call returns, but any other call will then wait for the current transfer to complete before it does its own output. So when I output the image which is stored in PROGMEM, it looks like:
IMG_1209.jpg

But if I copy this image from PROGMEM to DMAMEM, it does not all get through... To show it, I first fill the whole very large array with the color RED. The main loop code:

Code:
void loop(void) {
  tft.setFont(ComicSansMS_24);
  tft.fillScreen(RED);

  // Let's put something into the DMAMEM that is different...
  for (uint32_t i = 0; i < sizeof(teensy41_Cardlike_dmamem)/sizeof(teensy41_Cardlike_dmamem[0]); i++)
    teensy41_Cardlike_dmamem[i] = RED;
    
  Serial.print("Display T4.1 Extended card ");
  drawImage(575, 424, (uint16_t*)teensy41_Cardlike, BLUE);
  if (DelayOrStep()) return;
  Serial.print("DMAMEM Display T4.1 Extended card ");
  // Lets make a DMAMEM version of the card to see if it likes it or not...
  memcpy((void *)teensy41_Cardlike_dmamem, (const void *)teensy41_Cardlike, sizeof(teensy41_Cardlike));
  drawImage(575, 424, (uint16_t*)teensy41_Cardlike_dmamem, GREEN);
  if (DelayOrStep()) return;
  // 
  Serial.print("DMAMEM 2nd time Display T4.1 Extended card ");
  drawImage(575, 424, (uint16_t*)teensy41_Cardlike_dmamem, RED);
  if (DelayOrStep()) return;

}
The one from DMAMEM does not always draw the same, but it often looks like:
IMG_1211.jpg
You see the red streaks...

Now if we unwind the putPicture code:
Code:
void RA8876_t3::putPicture(ru16 x, ru16 y, ru16 w, ru16 h, const unsigned char *data) {
	//The putPicture_16bppData8 function in the base class is not ideal - it damages the activeWindow setting
	//It also is harder to make it DMA.
	//Ra8876_Lite::putPicture_16bppData8(x, y, w, h, data);

	//Using the BTE function is faster and will use DMA if available
    bteMpuWriteWithROPData8(currentPage, width(), x, y,  //Source 1 is ignored for ROP 12
                              currentPage, width(), x, y, w, h,     //destination address, pagewidth, x/y, width/height
                              RA8876_BTE_ROP_CODE_12,
                              data);
}

Which is:
Code:
void RA8876_t3::bteMpuWriteWithROPData8(ru32 s1_addr,ru16 s1_image_width,ru16 s1_x,ru16 s1_y,ru32 des_addr,ru16 des_image_width,
ru16 des_x,ru16 des_y,ru16 width,ru16 height,ru8 rop_code,const unsigned char *data)
{
  bteMpuWriteWithROP(s1_addr, s1_image_width, s1_x, s1_y, des_addr, des_image_width, des_x, des_y, width, height, rop_code);
  
  startSend();
  _pspi->transfer(RA8876_SPI_DATAWRITE);

#ifdef SPI_HAS_TRANSFER_ASYNC
  activeDMA = true;
  _pspi->transfer(data, NULL, width*height*2, finishedDMAEvent);
#else
  //If you try _pspi->transfer(data, length) then this tries to write received data into the data buffer
  //but if we were given a PROGMEM (unwriteable) data pointer then _pspi->transfer will lock up totally.
  //So we explicitly tell it we don't care about any return data.
  _pspi->transfer(data, NULL, width*height*2);
  endSend(true);
#endif
}
So after it configures stuff it calls the SPI transfer...

And note, the SPI transfer function in this case has:
Code:
	if (buf) {
		_dmaTX->sourceBuffer((uint8_t*)write_data, count);  
		_dmaTX->TCD->SLAST = 0;	// Finish with it pointing to next location
		if ((uint32_t)write_data >= 0x20200000u)  arm_dcache_flush(write_data, count);
To try to flush all: 243800 * 2 bytes from the cache.
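
For reference, the data cache on the Cortex-M7 works in 32-byte lines, so a flush ends up covering whole lines around the requested range. A small sketch of that rounding that can be compiled on a PC, assuming 32-byte lines. `flushedLines` is a made-up helper and is not the core's arm_dcache_flush:

```cpp
#include <cstdint>

constexpr uint32_t kCacheLine = 32;  // assumed Cortex-M7 data cache line size

// Count the cache lines touched by a flush of `size` bytes starting at `addr`.
// The start is rounded down to a line boundary and the end is rounded up.
constexpr uint32_t flushedLines(uint32_t addr, uint32_t size) {
  return size == 0 ? 0
       : ((addr + size + kCacheLine - 1) / kCacheLine) - (addr / kCacheLine);
}
```

For the full 487600 byte image at a 32-byte-aligned address this comes to 15238 lines, while a single 32767 byte piece covers only 1024 of them.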

Suggestions?

EDIT: Looks like I should have rotated the first picture 180 degrees, but...
 

Attachments

  • RA8876_pictureEmbed_test_mem-200811a.zip
    244.2 KB
Notes on SPI DMA, and things I have tried.

The SPI DMA code does not chain DMASetting objects, so any one DMA operation can transfer at most something like 32767 bytes. And that is what the released code is doing.
It sets up the DMA operation for the MAX and interrupts on completion, at which point we decrement a count of how much is still left to transfer, start it up again, and repeat until the count remaining goes to 0...

So this transfer requires us to actually do something like 15 transfers.

I thought maybe the transfers of 32767 bytes might be an issue, in that we are not starting off secondary transfers on a 32 byte boundary, so I tried changing the MAX transfer size in the class from 32767 to 32736 (a multiple of 32), and it did not appear to make a difference.
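
To make the arithmetic concrete, here is a sketch of how a transfer gets split when each DMA operation is capped. It can be compiled on a PC and is not the SPI library code; `Chunk`, `planChunks`, and `allStartsAligned32` are names I made up:

```cpp
#include <cstdint>
#include <vector>

struct Chunk {
  uint32_t offset;  // byte offset of this piece within the buffer
  uint32_t length;  // number of bytes in this piece
};

// Split `total` bytes into pieces of at most `maxChunk` bytes (maxChunk > 0).
std::vector<Chunk> planChunks(uint32_t total, uint32_t maxChunk) {
  std::vector<Chunk> chunks;
  uint32_t offset = 0;
  while (offset < total) {
    uint32_t remaining = total - offset;
    uint32_t length = remaining < maxChunk ? remaining : maxChunk;
    chunks.push_back({offset, length});
    offset += length;
  }
  return chunks;
}

// True if every piece starts on a 32-byte boundary relative to the buffer start.
bool allStartsAligned32(const std::vector<Chunk>& chunks) {
  for (const Chunk& c : chunks) {
    if (c.offset % 32 != 0) return false;
  }
  return true;
}
```

With total = 487600, both 32767 and 32736 give 15 pieces. Only the 32736 plan keeps every start on a 32-byte boundary, and that assumes the buffer itself starts on one.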

I also thought about adding code to flush the cache in the ISR when it is about to start the next transfer. That also did not appear to make a difference.

You might notice that in the sketch I actually output the copy in DMAMEM twice to see if it made a difference... It did not appear to. Note that some iterations through the image might work...

Next experiment: try changing my fill with RED to go from the end of memory to the start and see if it changes anything.... (Like maybe which things are actually cached...?)
 
I was curious, so I wondered what it would do on a T4.1 with PSRAM...
So I updated the sketch to check, when building for T4.1, whether at run time it has PSRAM. If so, it also copies the image to the equivalent PSRAM array. It also writes garbage to it first (all BLACK), then does the copy, and then calls the output routine.

And it appears like the image is drawing correctly from PSRAM...

Again the changes to the above sketch:

Up in the global area added:
Code:
#if defined(ARDUINO_TEENSY41)
unsigned short teensy41_Cardlike_extmem[243800] EXTMEM;
extern "C"
{
  extern uint8_t external_psram_size;
}
#endif

Currently the loop function:
Code:
void loop(void) {
  tft.setFont(ComicSansMS_24);
  tft.fillScreen(RED);

  // Let's put something into the DMAMEM that is different...
  //  for (uint32_t i = 0; i < sizeof(teensy41_Cardlike_dmamem)/sizeof(teensy41_Cardlike_dmamem[0]); i++)
  for (int i = sizeof(teensy41_Cardlike_dmamem) / sizeof(teensy41_Cardlike_dmamem[0]) - 1; i >= 0; i--)
    teensy41_Cardlike_dmamem[i] = BLACK;
#if defined(ARDUINO_TEENSY41)
  if (external_psram_size > 0) {
    for (uint32_t i = 0; i < sizeof(teensy41_Cardlike_extmem) / sizeof(teensy41_Cardlike_extmem[0]); i++)
      teensy41_Cardlike_extmem[i] = BLACK;  // pre-fill the PSRAM copy, not the DMAMEM one
  }
#endif
  Serial.print("Display T4.1 Extended card ");
  drawImage(575, 424, (uint16_t*)teensy41_Cardlike, BLUE);
  if (DelayOrStep()) return;
  Serial.print("DMAMEM Display T4.1 Extended card ");
  // Lets make a DMAMEM version of the card to see if it likes it or not...
  memcpy((void *)teensy41_Cardlike_dmamem, (const void *)teensy41_Cardlike, sizeof(teensy41_Cardlike));
  drawImage(575, 424, (uint16_t*)teensy41_Cardlike_dmamem, GREEN);
  if (DelayOrStep()) return;
  //
  Serial.print("DMAMEM 2nd time Display T4.1 Extended card ");
  drawImage(575, 424, (uint16_t*)teensy41_Cardlike_dmamem, RED);
  if (DelayOrStep()) return;
#if defined(ARDUINO_TEENSY41)
  if (external_psram_size > 0) {
    memcpy((void *)teensy41_Cardlike_extmem, (const void *)teensy41_Cardlike, sizeof(teensy41_Cardlike));
    Serial.print("EXTMEM Display T4.1 Extended card ");
    drawImage(575, 424, (uint16_t*)teensy41_Cardlike_extmem, CRIMSON);
    if (DelayOrStep()) return;
  }
#endif

}

Note: it is significantly slower to output that much data from the PSRAM than from either FLASH or DMAMEM:


Code:
Start RA8876 picture embed testScreen Width:1024 Height: 600
entering an 's' char will toggle on/off step mode
Display T4.1 Extended card Image: 7 Total: 115
Press any key to continue
DMAMEM Display T4.1 Extended card Image: 7 Total: 115
Press any key to continue
DMAMEM 2nd time Display T4.1 Extended card Image: 7 Total: 115
Press any key to continue
EXTMEM Display T4.1 Extended card Image: 7 Total: 278
Press any key to continue

Again, the timings all were about 7ms to return from the function, which does not wait for the SPI operation to complete.
The run from FLASH and the 2 runs from DMAMEM each took about 115ms, and drawing from PSRAM took 278ms.
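
As a rough sanity check on those numbers, the effective data rate can be worked out from the image size and the total time. This is only a sketch: the totals also include drawing the text, so the real SPI rate is a bit higher, and `effectiveMbps` is a name I made up:

```cpp
#include <cstdint>

// Effective throughput in megabits per second for `bytes` sent in `millis` ms.
// bits / (ms * 1000) is bits per microsecond, which equals Mbit/s.
constexpr double effectiveMbps(uint32_t bytes, double millis) {
  return (bytes * 8.0) / (millis * 1000.0);
}
```

For 487600 bytes that comes to roughly 34 Mbit/s at 115ms and roughly 14 Mbit/s at 278ms.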
 
@KurtE
Just pulled out my display and ran your test sketch. I am seeing the same thing you described. Interestingly, it seems to occur in the lower half of the image, and the position changes as you cycle through the image updates.
 
@mjs513 - Thanks,

It has been strange. I was seeing this when I tried doing this with a malloc-created image, where I have a version of the code that rearranges the bytes such that you can do a simple single output in different orientations...

Earlier I ran this and captured it with the LA, and you could see the pieces where the output was black like we are seeing.

Wondering if it would make sense to try to rewrite how the SPI code works here and see if it makes a difference?
As I mentioned, currently it does not set up any chain of DMASetting objects. Instead it interrupts, resets the count, and tells it to go again... Which has worked.

I probably don't want to set up a maximum possible chain, with something like 512KB/32K, or a chain of 17 of them.
But I could do a chain of 2 of them. Again, not sure if that would make any difference or not.

I suppose I could try it and see...
 
@KurtE
Maybe chain 2 DMA transfers with half of 32K? Wonder if that would work.
Something like that is what I am going to try. I will probably set up each DMASetting object to do up to 32K-32 byte transfers, such that if what was passed in was 32 byte aligned, all of the entries will continue to be... Actually not sure that makes a difference, but...

Also, I don't have high hopes for it actually working any differently than simply restarting SPI N times to finish, but maybe I will get lucky.
 
Just about to start hacking on SPI.

Before I started, I thought I would show a couple of screen captures with the Logic Analyzer showing the issues and differences in speed.

Here is a quick run of the sketch I mentioned with the edits to also use PSRAM...

screenshot.jpg

Note all 4 groups of screen update are using the same code to output, the only difference is memory locations used.
The first is the FLASH memory image, the next two are from DMAMEM. Note: in each pass I start by zeroing out the memory (writing BLACK), and then right before I output, I do a memcpy from the flash memory to DMAMEM.

The last one is like the DMAMEM version, except it goes to external memory (PSRAM)... You can see how much slower it is. Again, same code, but slowed down because, I assume, reading that much data from PSRAM takes longer...

Then if I zoom in to one of the outputs from DMAMEM, you can see gaps in the output (all zeros), where the data is not taken from the actual image but instead from the data I wrote earlier... In other versions I wrote RED first and RED showed up in these gaps. And note the gaps don't always show up or show up in the exact same locations.
screenshot2.jpg

Now off to code hacking

Edit: Thought I would again mention that I am using the Saleae Logic Beta builds (they hope to release a version soon)... One thing I missed from Version 1 is that there are no longer commands to save images to the clipboard or a file...
But I am finding that I don't really miss it any more with Windows 10 (Snip & Sketch).
You hit WINDOWS+SHIFT+S, which lets you select a portion of the screen to put on the clipboard. A message then appears at the bottom, and if you click on it, it brings up the app, which has features to allow you to save to a file...
 
@KurtE
Suggestion was based on what you did previously for one of the display drivers :) but seemed like it would fit.

Those LA screen shots really show the issue clearly, showing the gaps in the data. Wonder if it's more a problem with the RA8876?
 
Yep - I was trying to do a quick and dirty just using SPI to verify that it has nothing to do with display...

So I was trying to figure out a quick and dirty way to define a large PROGMEM array to be initialized... So far this has not worked:

Code:
const unsigned short teensy41_Cardlike[243800] PROGMEM = {[0 ... 243799] = 0xf800};

Compiler does not like it, probably needs a different version of GCC compiler... Will do it with a large init... instead.
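
As far as I know, the `[first ... last] = value` range initializer is a GNU C extension that g++ does not accept, so a different compiler version may not help here. One way to get the large init without hand editing is to generate the header text on a PC. A minimal sketch, where `makeFilledHeader` is a made-up helper and the output layout is just one possibility:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Build the text of a header that defines `name` as `count` copies of `value`,
// written as 4-digit hex constants, 16 per line.
std::string makeFilledHeader(const std::string& name, std::size_t count, uint16_t value) {
  char hex[8];
  std::snprintf(hex, sizeof(hex), "0x%04X", static_cast<unsigned>(value));
  std::string out = "const unsigned short " + name + "[" + std::to_string(count) +
                    "] PROGMEM = {\n";
  for (std::size_t i = 0; i < count; i++) {
    out += hex;
    if (i + 1 < count) out += ((i % 16 == 15) ? ",\n" : ", ");
  }
  out += "\n};\n";
  return out;
}
```

Writing the result of makeFilledHeader("all_red", 243800, 0xF800) to a file should give something that can be included from the sketch.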
 
@mjs513 @Paul...

I am not totally concentrating today, so may not get stuff finished today...
I did hack up a version of the code to just call SPI and ran it without the display... My guess is there are some main issues with the restart code of SPI, so I will rework it.

I hacked up a sketch that at first output all RED, but I decided to just output 0xffff...

I could include the file All_Red.h, which is now WHITE ;)

Code:
#include <SPI.h>
#include <EventResponder.h>
#include "All_Red.h"

//const unsigned short all_red[243800] PROGMEM={

const int ARRAY_SIZE = sizeof(all_red)/sizeof(all_red[0]);
unsigned short all_red_dmamem[ARRAY_SIZE] DMAMEM;

EventResponder event;
volatile bool event_happened;

#if defined(ARDUINO_TEENSY41)
unsigned short all_red_extmem[ARRAY_SIZE] EXTMEM;
extern "C"
{
  extern uint8_t external_psram_size;
}
#endif

void ev_function (EventResponderRef ev) {
  event_happened = true;
}

void setup() {
  while (!Serial && millis() < 4000) ;
  Serial.begin(115200);
  pinMode(10, OUTPUT);
  digitalWriteFast(10, HIGH);
  SPI.begin();
  event.attachImmediate(ev_function);
  Serial.println(ARRAY_SIZE, DEC);
}

void test(const unsigned short *image, uint32_t size_image) {
  event.clearEvent();
  event_happened = false;
  digitalWriteFast(10, LOW);
  SPI.beginTransaction(SPISettings(60000000, MSBFIRST, SPI_MODE0));
  SPI.transfer(image, nullptr, size_image, event);
  while (!event_happened) ;
  SPI.endTransaction();
  digitalWriteFast(10, HIGH);
}

void loop() {
  Serial.println("Press any key to continue");
  while (Serial.read() == -1) ;
  while (Serial.read() != -1);
  // Let's put something into the DMAMEM that is different...
  //  for (uint32_t i = 0; i < sizeof(all_red_dmamem)/sizeof(all_red_dmamem[0]); i++)
  for (int i = ARRAY_SIZE - 1; i >= 0; i--)
    all_red_dmamem[i] = 0;
#if defined(ARDUINO_TEENSY41)
  if (external_psram_size > 0) {
    for (uint32_t i = 0; i < ARRAY_SIZE; i++)
      all_red_extmem[i] = 0;  // pre-fill the PSRAM copy, not the DMAMEM one
  }
#endif
  test(all_red, sizeof(all_red));
  delay(10);
  memcpy((void*)all_red_dmamem, (const void *)all_red, sizeof(all_red));
  test(all_red_dmamem, sizeof(all_red_dmamem));
#if defined(ARDUINO_TEENSY41)
  if (external_psram_size > 0) {
    delay(10);
    memcpy((void*)all_red_extmem, (const void *)all_red, sizeof(all_red));
    test(all_red_extmem, sizeof(all_red_extmem));
  }
#endif
}
I included the same file as before, but did a search and replace of all of the hex numbers and set them to FFFF.

screenshot.jpg

But you actually see some noise on all three. The one from FLASH is showing some trailing 0s...

The other two are showing more junk as you can see and showing some stuff on MISO pin as well...
I had nothing connected to it. Don't remember if we init to PU or PD or not.
 
@mjs513 @Paul ... You might try my quick and dirty change to SPI.

It turns out I was not doing an arm_dcache_flush of the whole count of bytes being output, only of the max I could do for one operation...
I updated it to flush for the whole operation and things look cleaner!
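
To illustrate why flushing only part of the buffer leaves old data on the wire, here is a toy model that can be compiled on a PC. It is a deliberately simplified sketch and not how the Cortex-M7 cache actually behaves: it pretends every CPU write stays in the cache until an explicit flush, whereas the real cache evicts lines on its own, which would explain why only some streaks were stale. All names are made up:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy write-back memory: the CPU sees `cached`, DMA sees `ram`.
struct ToyMemory {
  std::vector<uint16_t> ram;
  std::vector<uint16_t> cached;
  std::vector<bool> dirty;

  explicit ToyMemory(std::size_t n) : ram(n, 0), cached(n, 0), dirty(n, false) {}

  void cpuWrite(std::size_t i, uint16_t v) { cached[i] = v; dirty[i] = true; }

  // Write dirty entries in [first, first + count) back to RAM.
  void flush(std::size_t first, std::size_t count) {
    for (std::size_t i = first; i < first + count && i < ram.size(); i++) {
      if (dirty[i]) { ram[i] = cached[i]; dirty[i] = false; }
    }
  }

  uint16_t dmaRead(std::size_t i) const { return ram[i]; }
};

// Fill with an old color, overwrite with a new one, flush only `flushCount`
// entries, then count how many entries DMA would still read as the old color.
std::size_t countStale(std::size_t total, std::size_t flushCount) {
  const uint16_t kOld = 0xF800, kNew = 0xFFFF;
  ToyMemory mem(total);
  for (std::size_t i = 0; i < total; i++) mem.cpuWrite(i, kOld);
  mem.flush(0, total);  // assume the old fill already reached RAM
  for (std::size_t i = 0; i < total; i++) mem.cpuWrite(i, kNew);
  mem.flush(0, flushCount);
  std::size_t stale = 0;
  for (std::size_t i = 0; i < total; i++) {
    if (mem.dmaRead(i) == kOld) stale++;
  }
  return stale;
}
```

In this model, flushing only the first piece leaves everything after it stale, and flushing the whole count leaves nothing stale.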

Will do PR after more testing:
https://github.com/KurtE/SPI/tree/T4_DMA_Flush_Whole_buffer

See, you got ambitious this afternoon :) Anyway, I downloaded the PR that you referenced and reran your original test sketch, and it appears that it fixed the issue with the streaks in the images. I ran through the sequence several times, as fast as I could hit the enter key, and it all seemed to work with no issue.

I got the following times on the first pass:
Code:
Start RA8876 picture embed testScreen Width:1024 Height: 600
entering an 's' char will toggle on/off step mode

Display T4.1 Extended card Image: 7 Total: 115
Press any key to continue

DMAMEM Display T4.1 Extended card Image: 7 Total: 115
Press any key to continue

DMAMEM 2nd time Display T4.1 Extended card Image: 7 Total: 115
Press any key to continue

EXTMEM Display T4.1 Extended card Image: 8 Total: 279
Press any key to continue

Display T4.1 Extended card Image: 8 Total: 116
Press any key to continue
Didn't notice any time differences.
 
Thanks @mjs513 - I removed the lines I commented out and then retried with that test.

I also switched back to my PR branch for RA8876, and the test sketch has gone through all 4 rotations and I am no longer seeing that issue with parts of the image not showing correctly...

So issued Pull Request to SPI: https://github.com/PaulStoffregen/SPI/pull/61

Probably will now put the RA8876 back on the shelf..
 
Yep - unless I get that burr under saddle again to add something else ;) Will steal that T4.1 back with the extra memory for other projects.
 