T4 DMA and Memory - DMAMEM and malloc/new

Status
Not open for further replies.

KurtE

Senior Member+
Sorry, I know that I have brought some of this up a few times during the beta thread, but I am still running into issues that I am trying to figure out how best to solve.

Probably the most recent and verbose :eek: was at: https://forum.pjrc.com/threads/54711-Teensy-4-0-First-Beta-Test?p=211230&viewfull=1#post211230

Put simply: if you are using DMA to either send data to a device or receive data from a device, and the memory is in the upper memory area, there are some interesting things to think about and work out. You will get memory from this area if you either declare your variables as DMAMEM or use malloc or new.

The issue is that DMA operations work directly with the actual physical memory, whereas ordinary reads and writes of variables in this range go through a cache. A value you have just set may or may not have made it to physical memory yet, and likewise any updates DMA makes to physical memory may not be seen by code that fetches from the cache...

There are system functions available to help out in some cases, which for example I put into the SPI library when we do non-blocking transfers.

For example, if you are doing a transfer from memory to a device, I call arm_dcache_flush(buffer, count);
which tells the system to flush the cache for that address range back to physical memory.

Likewise, if you are doing an SPI transfer from a device back into memory, we call arm_dcache_delete(retbuf, count), which tells the system to delete its cached copy so that later reads retrieve the data from physical memory...
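
To make the flush/delete distinction concrete, here is a toy model that runs on a desktop compiler. It is only a sketch: ToyCachedLine and its methods are made-up names, and it models a single write-back cache line rather than the real Cortex-M7 cache. flush() plays the role of arm_dcache_flush() and invalidate() plays the role of arm_dcache_delete().

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

const size_t kLineSize = 32;  // one 32-byte line, toy model only

// Hypothetical model: "ram" is what a DMA engine sees, "cache" is what the
// CPU sees while the line is valid. Not the real hardware behaviour in detail.
struct ToyCachedLine {
    uint8_t ram[kLineSize] = {};
    uint8_t cache[kLineSize] = {};
    bool valid = false;  // line is currently held in the cache
    bool dirty = false;  // cache holds data not yet written back to ram

    void loadIfNeeded() {
        if (!valid) { memcpy(cache, ram, kLineSize); valid = true; dirty = false; }
    }
    // CPU accesses go through the cache.
    void cpuWrite(size_t i, uint8_t v) { assert(i < kLineSize); loadIfNeeded(); cache[i] = v; dirty = true; }
    uint8_t cpuRead(size_t i) { assert(i < kLineSize); loadIfNeeded(); return cache[i]; }

    // DMA accesses bypass the cache and touch ram directly.
    uint8_t dmaRead(size_t i) const { assert(i < kLineSize); return ram[i]; }
    void dmaWrite(size_t i, uint8_t v) { assert(i < kLineSize); ram[i] = v; }

    // Analogue of arm_dcache_flush(): write dirty data back to ram.
    void flush() { if (valid && dirty) { memcpy(ram, cache, kLineSize); dirty = false; } }
    // Analogue of arm_dcache_delete(): drop the cached copy (unflushed data is lost).
    void invalidate() { valid = false; dirty = false; }
};
```

In this model a cpuWrite() is invisible to dmaRead() until flush() is called, and a dmaWrite() is invisible to cpuRead() until invalidate() is called, which is the pair of problems described above.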

With some of our display driver code, this can work for single updates: before we tell the system to update the display from the frame buffer, we flush the memory. But it breaks down if we try to do continuous updates. Could go into details, but... To work around this I instead have two smaller buffers, as part of the display object; I copy the data out of the frame buffer into them and then output from there... This works OK when the display object is a static object, as it was in all of my test cases... UNTIL:

Uncanny Eyes: I am trying a version where I have one display on SPI and another on SPI1, and am trying to have the two displays update at the same time using DMA, and it is NOT working!
Then I found that this program uses new to create the displays, which works except for my DMA support!

To verify this, I made a version of one of my st77xx test programs that allows me to either use a static tft object or one created by using new, and sure enough the one created by new is failing...
View attachment st7735_t3_simpletest_FB-190814a.zip
If you build it as is, it will use static object... If you uncomment the first line #define...
It will do a new...

hitting <CR> - will go through the different orientations of the screen
hitting a<CR> - will update the screen using async (one time)
hitting c<CR> - Will do a continuous update for several frames.

These async ones will fail.

And I am now trying to fully understand and figure out how/if I can resolve. Some of the issues are (I think):

a) My smaller buffers are in high memory so again that cache issue. That part is not hard to work around:
Code:
		// Copy the next chunk of the frame buffer into the idle DMA buffer,
		// then flush it if that buffer lives in the cached upper memory.
		if (_dma_sub_frame_count & 1) {
			memcpy(_dma_buffer1, &_pfbtft[_dma_pixel_index], _dma_buffer_size*2);
			if ((uint32_t)_dma_buffer1 >= 0x20200000u)  arm_dcache_flush(_dma_buffer1, _dma_buffer_size*2);
		} else {			
			memcpy(_dma_buffer2, &_pfbtft[_dma_pixel_index], _dma_buffer_size*2);
			if ((uint32_t)_dma_buffer2 >= 0x20200000u)  arm_dcache_flush(_dma_buffer2, _dma_buffer_size*2);
		}
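
The address test in that snippet could be pulled into a small helper so both branches share it. This is only a sketch: needsCacheMaintenance, kCachedRamStart and kCachedRamSize are hypothetical names, the start address is the 0x20200000 used above, and the 512 KB size is an assumption about the Teensy 4.0 upper RAM region that should be checked against the linker script.

```cpp
#include <stdint.h>

// Assumed bounds of the cached upper RAM region (where DMAMEM and the heap land).
constexpr uint32_t kCachedRamStart = 0x20200000u;
constexpr uint32_t kCachedRamSize  = 512u * 1024u;  // assumption, not verified here

// Hypothetical helper: true if addr lies inside the cached region, in which
// case a buffer there needs a flush before DMA reads it (or a delete after
// DMA writes it).
constexpr bool needsCacheMaintenance(uint32_t addr) {
    return addr >= kCachedRamStart && (addr - kCachedRamStart) < kCachedRamSize;
}
```

With buffer source addresses like the ones in the debug output in this post, 0x20001b34 (static object) would return false and 0x2020013c (object created with new) would return true.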

Note: the DMA ISR does get called a couple of times before it does not work (needs to get called several times).
Right now I have some simple debug printing turned on. Prints one . each time the ISR is entered, a ! if 2 before last one and $ for last one of frame...

So some debug output in the working case:
Code:
init CS:10 DC:9 MOSI:11 SCLK:13 RST:20
13:01:04.464 ->  Row Start:3  Col Start: 2
13:01:04.464 -> Set Rotation: 0 width: 128 height: 128
13:01:04.504 -> Hit any key to continue
13:01:04.504 -> Set Rotation: 1 width: 128 height: 128
13:01:04.504 -> Hit any key to continue
13:01:09.397 -> DMA Init buf size: 512 sub frames:32 spi num: 0
13:01:09.397 -> 20001b20 400e9020:SA:20001b34 SO:2 AT:101 NB:2 SL:-1024 DA:403a0064 DO: 0 CI:200 DL:20001b00 CS:12 BI:200
13:01:09.397 -> 20001aa0 20001ac0:SA:20001b34 SO:2 AT:101 NB:2 SL:-1024 DA:403a0064 DO: 0 CI:200 DL:20001b00 CS:12 BI:200
13:01:09.397 -> 20001ae0 20001b00:SA:20001f34 SO:2 AT:101 NB:2 SL:-1024 DA:403a0064 DO: 0 CI:200 DL:20001ac0 CS:12 BI:200
13:01:09.397 -> After Async Update
13:01:09.397 -> ..............................!
13:01:09.437 -> ..*
13:01:09.437 -> $
13:01:09.437 -> Async completed 14
13:01:10.477 -> Set Rotation: 2 width: 128 height: 128
13:01:10.477 -> Hit any key to continue
13:01:10.837 -> Set Rotation: 3 width: 128 height: 128
13:01:10.877 -> Hit any key to continue
13:01:11.717 -> Start Continuous update test
13:01:11.717 -> 20001b20 400e9020:SA:20001b34 SO:2 AT:101 NB:2 SL:-1024 DA:403a0064 DO: 0 CI:200 DL:20001b00 CS:12 BI:200
13:01:11.717 -> 20001aa0 20001ac0:SA:20001b34 SO:2 AT:101 NB:2 SL:-1024 DA:403a0064 DO: 0 CI:200 DL:20001b00 CS:12 BI:200
13:01:11.717 -> 20001ae0 20001b00:SA:20001f34 SO:2 AT:101 NB:2 SL:-1024 DA:403a0064 DO: 0 CI:200 DL:20001ac0 CS:12 BI:200
13:01:11.717 -> After updateScreenAsync
13:01:11.717 -> ................................*
13:01:11.756 -> ................................*
13:01:11.756 -> ................................*
13:01:11.756 -> ................................*
13:01:11.797 -> ................................*
13:01:11.797 -> ................................*
13:01:11.797 -> ................................*
13:01:11.797 -> ................................*
13:01:11.876 -> ................................*
13:01:11.876 -> ................................*
13:01:11.876 -> ................................*
13:01:11.876 -> ................................*
13:01:11.917 -> ................................*
13:01:11.917 -> ................................*
13:01:11.917 -> ................................*
13:01:11.957 -> ................................*
13:01:11.957 -> ................................*
13:01:11.957 -> ................................*
13:01:11.997 -> ................................*
13:01:11.997 -> ................................*
13:01:12.036 -> ................................*
13:01:12.036 -> ................................*
13:01:12.036 -> ................................*
13:01:12.077 -> ................................*
13:01:12.077 -> ................................*
13:01:12.077 -> ................................*
13:01:12.077 -> ................................*
13:01:12.117 -> ................................*
13:01:12.117 -> ................................*
13:01:12.156 -> ................................*
13:01:12.156 -> ................................*
13:01:12.206 -> ................................*
13:01:12.206 -> ................................*
13:01:12.206 -> ................................*
13:01:12.206 -> ................................*
13:01:12.247 -> Finished all frames
13:01:12.247 -> After call to endUpdateAsync
13:01:12.247 -> ..............................!
13:01:12.247 -> ..*
13:01:12.247 -> $
13:01:12.247 -> Test completed

So if I try now to run it with a new ST7735...
Code:
13:32:46.209 -> init CS:10 DC:9 MOSI:11 SCLK:13 RST:20
13:32:46.209 ->  Row Start:3  Col Start: 2
13:32:46.209 -> Set Rotation: 0 width: 128 height: 128
13:32:46.209 -> Hit any key to continue
13:32:58.775 -> Set Rotation: 1 width: 128 height: 128
13:32:58.815 -> Hit any key to continue
13:32:59.575 -> Set Rotation: 2 width: 128 height: 128
13:32:59.655 -> Hit any key to continue
13:33:01.656 -> DMA Init buf size: 512 sub frames:32 spi num: 0
13:33:01.656 -> 20200128 400e9020:SA:2020013c SO:2 AT:101 NB:2 SL:-1024 DA:403a0064 DO: 0 CI:200 DL:20200108 CS:f812 BI:200
13:33:01.656 -> 202000a8 202000c8:SA:2020013c SO:2 AT:101 NB:2 SL:-1024 DA:403a0064 DO: 0 CI:200 DL:20200108 CS:f812 BI:200
13:33:01.656 -> 202000e8 20200108:SA:2020053c SO:2 AT:101 NB:2 SL:-1024 DA:403a0064 DO: 0 CI:200 DL:202000c8 CS:12 BI:200
13:33:01.656 -> After Async Update
13:33:01.656 -> .

b) Suspected issue - The DMASetting and DMAChannel structures are part of this display object:
Code:
  #elif defined(__IMXRT1052__) || defined(__IMXRT1062__)  // Teensy 4.x
  // try work around DMA memory cached.  So have a couple of buffers we copy frame buffer into
  // as to move it out of the memory that is cached...
  DMASetting   _dmasettings[2];
  DMAChannel   _dmatx;
  volatile    uint32_t _dma_pixel_index = 0;
  volatile uint16_t _dma_sub_frame_count = 0; // Can return a frame count...
  uint16_t          _dma_buffer_size;   // the actual size we are using <= DMA_BUFFER_SIZE;
  uint16_t          _dma_cnt_sub_frames_per_frame;  
  static const uint16_t    DMA_BUFFER_SIZE = 512;
  uint16_t          _dma_buffer1[DMA_BUFFER_SIZE] __attribute__ ((aligned(4)));
  uint16_t          _dma_buffer2[DMA_BUFFER_SIZE] __attribute__ ((aligned(4)));
  uint32_t      _spi_fcr_save;    // save away previous FCR register value
So I am guessing that some of these settings are getting out of sync between the actual memory and the cache, but not sure which way. Should I flush or delete the cache at these points? Or ???

Now back to pulling a few more hairs (and I don't have many left) :D
 
@KurtE
I really have no clue about DMA, but have you tried flushing the cache? If it's really getting out of sync, flushing the cache periodically may get it back in sync.

I already don't have any hair left - still having problems with uncannyeyes! if I remove the static before those variables it works but there is no eye movement - argh. Only serial.printf seems to work reliably.
 
Sorry (not sorry) about inflicting uncanny eyes on you. :cool: :D
 
Note: since that time, I have tried several places to flush or delete the cache and the DMA operation does not get beyond the first DMA output... I am not sure yet if it did the second buffer or not (replaceOnCompletion), but it does not appear like I am getting a second ISR callback...

Just to verify a few things, I thought I might see where different things are allocated. So a simple program:
Code:
uint8_t lower_buffer[32];
uint8_t upper_buffer[32] DMAMEM;
void setup() {
  uint8_t stack_buffer[32];
  uint8_t *heap_buffer = (uint8_t *)malloc(32);  // C++ needs the cast from void *

  while (!Serial && millis() < 4000) ;
  Serial.begin(115200);
  delay(500);
  Serial.printf("Lower_Buffer: %x\n", (uint32_t)lower_buffer);
  Serial.printf("upper_buffer: %x\n", (uint32_t)upper_buffer);
  Serial.printf("stack buffer: %x\n", (uint32_t)stack_buffer);
  Serial.printf("Heap Buffer: %x\n", (uint32_t)heap_buffer);
  pinMode(13, OUTPUT);
}

void loop() {
  digitalWrite(13, !digitalRead(13));
  delay(500);
}

Printed out:
Code:
06:52:48.079 -> Lower_Buffer: 2000107c
06:52:48.079 -> upper_buffer: 20200000
06:52:48.079 -> stack buffer: 20077fd0
06:52:48.079 -> Heap Buffer: 20200028

And I am not sure what the statistics that print out during the build mean:
Code:
"D:\\arduino-1.8.9\\hardware\\teensy/../tools/arm/bin/arm-none-eabi-size" -A "C:\\Users\\kurte\\AppData\\Local\\Temp\\arduino_build_585570/foo.ino.elf"
Sketch uses 32528 bytes (1%) of program storage space. Maximum is 2031616 bytes.
Global variables use 35712 bytes (3%) of dynamic memory, leaving 1012864 bytes for local variables. Maximum is 1048576 bytes.
D:\arduino-1.8.9\hardware\teensy/../tools/teensy_post_compile -file=foo.ino -path=C:\Users\kurte\AppData\Local\Temp\arduino_build_585570 -tools=D:\arduino-1.8.9\hardware\teensy/../tools -board=TEENSY40 -reboot -port=usb:0/140000/0/7/4 -portlabel=hid#vid_16c0&pid_0478 Bootloader -portprotocol=Teensy

But I am assuming that with this program I have a lot of memory left in the lower (non-cached) region of memory.
That is, the difference between my one lower buffer and the stack is 487252 bytes.
Obviously there are other things in that memory range, like all of the system buffers and the like.
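
For anyone checking that figure, it is just the difference of the two printed addresses. A compile-time check, using only the values from the output above (the constant names are made up for illustration):

```cpp
#include <stdint.h>

// Addresses copied from the printed output above.
constexpr uint32_t kLowerBufferAddr = 0x2000107Cu;  // Lower_Buffer
constexpr uint32_t kStackBufferAddr = 0x20077FD0u;  // stack buffer

// Gap between the one lower buffer and the stack buffer, in bytes.
constexpr uint32_t kGapBytes = kStackBufferAddr - kLowerBufferAddr;
static_assert(kGapBytes == 487252u, "0x76F54 bytes between the two addresses");
```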

So question is, is there some easy way to allocate some of that memory range on the fly?

That is, is there some system symbol, like end on some boards, which marks where our last variable ends?

Can we then create some system function like: mallocLowerMem(size_t size) or allocateNonCachedBuffer() or ...
Which allocates a new buffer.
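
As a rough idea of what such a function could look like, here is a minimal bump allocator over a static pool. Everything in it is hypothetical: allocateNonCachedBuffer is just the name floated above, the pool size is arbitrary, and it assumes that an ordinary static array ends up in the lower memory region on T4 (as Lower_Buffer did in the test above). It never frees.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical pool; an ordinary static array is assumed to land in lower RAM.
const size_t kPoolSize = 2048;
alignas(32) static uint8_t g_lowerPool[kPoolSize];
static size_t g_lowerPoolUsed = 0;

// Hypothetical allocator: hands out 32-byte aligned blocks from the pool and
// rounds each size up to a multiple of 32. Returns nullptr when the request
// is zero or does not fit. Blocks are never returned to the pool.
void *allocateNonCachedBuffer(size_t size) {
    const size_t kAlign = 32;
    assert(g_lowerPoolUsed % kAlign == 0);  // invariant kept by the rounding below
    if (size == 0 || size > kPoolSize) return nullptr;
    size_t rounded = (size + kAlign - 1) & ~(kAlign - 1);
    if (rounded > kPoolSize - g_lowerPoolUsed) return nullptr;
    void *block = &g_lowerPool[g_lowerPoolUsed];
    g_lowerPoolUsed += rounded;
    return block;
}
```

A display class could then call allocateNonCachedBuffer(DMA_BUFFER_SIZE * 2) for each of its small DMA buffers instead of embedding them in the object, so the buffers stay in lower memory even when the object itself is created with new.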

Otherwise, for now I will probably waste some memory on T4 and have display classes like ST7735 or ILI9341_T3N each allocate their own pool of lower memory to use. But this issue will likely hit every code base that tries to use DMA, especially for classes where the user might do: myclass *a = new myclass(...)

Another option, would also have classes/files like DMAChannel have their own lower memory pools, where the user program can call to allocate some buffer...

But again personally I think we should also change the defines:

That is, my declaration above: uint8_t upper_buffer[32] DMAMEM;
puts upper_buffer in the cached region, which (despite the name) is a bad choice of memory to use for DMA.
Should maybe have something like: uint8_t upper_buffer[32] CACHEDMEM;
 

Thanks a lot Michael :). The only thing that seems to work is to put a Serial.printf into the code at a particular spot. Then it works. Has to be a memory problem - static variables. There is no problem with the code on the T3.x's.
 
@KurtE - haven't a clue on this one - this stuff is out of the ballpark for me - but I guess I will be learning it soon :)
 
Joebobsicle over at EEVblog forums mentioned that on STM32 H7, invalidating (better name than deleting, IMHO) the caches for the buffers only works if they're aligned to 32 bytes.

Maybe that is the case here also?

I recommend you allocate an extra 31 bytes per buffer, and use a pointer you align (within the buffer) yourself:
Code:
#include <stdint.h>

// Round ptr up to the next 32-byte boundary (unchanged if already aligned).
void *align32(void *ptr)
{
    uintptr_t  addr = (uintptr_t)ptr;
    if (addr & 31)
        return (void *)((addr | 31) + 1);
    else
        return ptr;
}

Apologies if you already do this in your code; I just wanted to point this out to anyone reading this thread. I don't have a T4.0 yet myself, so this is purely second-hand conjecture.

Actually, I suspect allocating an extra 48 bytes would work better. Then you can simply do (type*)((((uintptr_t)ptr)|31)+1) instead, and adjust the length similarly upwards for cache invalidation purposes, to next multiple of 32.
 
Thanks @Nominal Animal - Yes, I have tried that and the like. Note: the cache functions themselves round the address back down to the cache-line alignment. Doing the extra allocation does save one more iteration in the loop...
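
One consequence of that rounding is worth spelling out: a flush or delete on an unaligned buffer also touches whatever else shares its first and last cache lines. The sketch below computes which 32-byte lines a range covers. cacheLineSpan and LineSpan are hypothetical names, and the 32-byte line size is the Cortex-M7 value assumed throughout this thread.

```cpp
#include <stdint.h>

constexpr uint32_t kCacheLine = 32u;  // assumed data cache line size in bytes

struct LineSpan {
    uint32_t first;  // address of the first cache line touched
    uint32_t end;    // address just past the last cache line touched
};

// Hypothetical helper: the whole-line range that a cache operation on
// [addr, addr + size) ends up covering once the start is rounded down and
// the end is rounded up to line boundaries.
constexpr LineSpan cacheLineSpan(uint32_t addr, uint32_t size) {
    return LineSpan{ addr & ~(kCacheLine - 1u),
                     (addr + size + kCacheLine - 1u) & ~(kCacheLine - 1u) };
}
```

For the failing case above, the first DMA buffer is at 0x2020013c with 1024 bytes, so the span would be 0x20200120 to 0x20200540. That first line also contains 0x20200128, which shows up in the same debug output, so a delete on the buffer could in principle discard cached writes to a neighbouring member. Whether that is what actually breaks here is not established.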

But again, I am still trying to figure out how cached memory works with things like: DMAChannel ...
So for now I will probably do some hacks... But would be good to do it more at system level!



@MichaelMeissner - I look at it as an interesting test! May have to at some point get their new mask and T4it...

@mjs513 - Yep, going to hack it now. Set up a static class structure with the small DMA buffers, plus the DMAChannel and DMASettings, with an array of 3 sets of these, one for each SPI bus. Yes, it will waste a couple of K of memory for now, but hopefully at some point we will come up with a better system-level way to do it.
 

My one thought might be you are crossing the threshold between lower memory and upper memory, and maybe you need to add the appropriate magic to put all of the buffer in low memory (or flush the cache). Bear in mind, I have not read the full datasheet. Or maybe the fact that printf does I/O, perhaps something with turning off interrupts and re-enabling them allows things to work.
 
Yeah, I suspect when it is available, I will order one (to go along with the Hallowing they started selling last year that had a M0 processor and a single 128x128 display). But it would be nice to get it to work on Teensys.

I must admit having it all together eliminates my issue of keeping the wiring all connected (when you use it on a costume you take around for a full day, solder joints can break and/or form bridges, and the crimped wiring tends to come out).
 

@MichaelMeissner
Not quite sure how to do that, since these are static variables defined in the sketch. Shouldn't these be in low memory anyway? I did notice that things got better with Kurt's later update - it still hangs, but I can upload the sketch without doing a 15s reboot. Have to think more on this.
 

I was thinking that perhaps there is enough other stuff declared static, and the buffer array is crossing the line between the two memory halves. Print out the begin and end addresses in hex to see if you are crossing the threshold. Also you might try using a higher alignment (and being careful in the DMA code not to cross alignment boundaries).
 

To be honest - haven't a clue on DMA. That part of it has been in @KurtE's domain, so not quite sure how to do that. Sorry.

But I did do something from within the sketch. I added this:
Code:
   static boolean  eyeInMotion      = false;
   static int16_t  eyeOldX = 512, eyeOldY = 512, eyeNewX = 512, eyeNewY = 512;
   static uint32_t eyeMoveStartTime = 0L;
   static int32_t  eyeMoveDuration  = 0L;

   if(eyeMoveDuration == 0){
    // Cast each address to an integer so it matches the %x format.
    Serial.printf("eyeInMotion: %x, eyeOldX: %x\n",(uint32_t)&eyeInMotion,(uint32_t)&eyeOldX);
    Serial.printf("eyeOldY: %x, eyeNewX: %x\n",(uint32_t)&eyeOldY,(uint32_t)&eyeNewX);
    Serial.printf("eyeNewY: %x\n",(uint32_t)&eyeNewY);
    Serial.printf("eyeMoveStartTime: %x, eyeMoveDuration: %x\n",(uint32_t)&eyeMoveStartTime,(uint32_t)&eyeMoveDuration);
   }
and got this as output:
Code:
eyeInMotion: 2002886c, eyeOldX: 20027cb4
eyeOldY: 20027cb6, eyeNewX: 20027cba
eyeNewY: 20027cb8
eyeMoveStartTime: 20028868, eyeMoveDuration: 20028850
but the interesting thing I found is that if I do the printf just once from within that function it works without trashing USB.

EDIT: Not even sure with the scenario I am testing (both displays on same SPI and no framebuffer set) that we are even using DMA?
 
Note, the version of this sketch that was locking up the device was not using dma, or at least not in display code...

Yesterday I did make some progress on getting DMA to work, but now the issue is that one eye always outputs black. I think it has to do with timing, so I will try a few more hacks. It was showing a frame count of about 120.
 
Forgot to mention the really bad hangs before, was more like things did not link correctly or maybe load...
 