T4.0 Memory - trying to make sense of the different regions

Hi Paul,
Talking about FlexSPI, has anyone already tried to mmap another SPI device as PSRAM, in another region? For the SPI flash, is every block/page read cached in the FASTRUN too?
 
@Paul - I don't know if it makes sense to simply add another utility in the build process, that displays more information.
....
Which I know outputs data that is far too cryptic for most people, like a recent run of testing the IO pins of the solderless breakout board...

I'm seriously considering doing something like this, though less cryptic, more like Arduino's format.

But running it from the postbuild recipe prints too soon, probably not even visible without scrolling with the default size of Arduino's console. It needs to print together with Arduino's summary.


But yes, I think RAM1 is better than the current stuff, although I can also imagine there will be some complaining like: I thought I purchased a board with 1MB of memory, why is there only 512KB...

Yeah, I'm concerned about that too.


Talking about FlexSPI, has anyone already tried to mmap another SPI device as PSRAM, in another region?

I believe this came up during the beta test. I'm not aware of anyone actually getting it working. It's on my to-do list for Teensy 4.1.


For the SPI flash, is every block/page read cached in the FASTRUN too?

Caching is done in the M7 core, which is separate from RAM1 (aka ITCM or FASTRUN). There are 2 caches, both 32K, one for data and the other for code fetches.
 
I'm seriously considering doing something like this, though less cryptic, more like Arduino's format.

But running it from the postbuild recipe prints too soon, probably not even visible without scrolling with the default size of Arduino's console. It needs to print together with Arduino's summary.


Yeah, I'm concerned about that too.

For 'Compile DEBUG' output, having cryptic stuff can be useful to those looking for it, or for forum debug questions/answers - even with scrolling needed. And those not debugging or not interested would never see it, just like all the other debug spew.

But indeed, if the 'Arduino summary' allows better info, having an 'English' summary there would be good for all.
 
What exactly happens if you compile for smallest code size? Is all code remaining in Flash? So is it like every function was labeled PROGMEM? It typically frees up RAM for me.
 
Smallest code is just a compilation option at this point. Rather than expanding the code where that would add speed, it tends to reduce the resulting code size.

It might be a future option for Paul to consider having a default 'CODE FROM FLASH' build option to maximize available RAM. Some mention of alternate linker scripts was noted as possible but not detailed IIRC.
 
@Paul and others... As per the earlier question, wondering what marking some of the functions as FLASHMEM would do to startup time and run time...

I have tried adding flashing of pin 13 in the startup code, which is not showing much time in the copy of memory...

For the heck of it, I will try marking most of the functions for the ILI9488_t3 graphic test as FLASHMEM and see what it does to speed.
Memory profile before the changes:
Code:
FlexRAM section ITCM+DTCM = 512 KB
    Config : aaaaaaaf
    ITCM :  51600 B	(78.74% of   64 KB)
    DTCM :  17088 B	( 3.72% of  448 KB)
    Available for Stack: 441664
OCRAM: 512KB
    DMAMEM:   8272 B	( 1.58% of  512 KB)
    Available for Heap: 516016 B	(98.42% of  512 KB)
Flash:  64496 B	( 3.17% of 1984 KB)

And timings, with SPI requested at 60 MHz:
Code:
ILI9488_t3n: (T4) SPI automatically selected

MOSI:11 MISO:12 SCK:13

ILI9488 Test!
Display Power Mode: 0x0
MADCTL Mode: 0x0
Pixel Format: 0x0
Image Format: 0x0
Self Diagnostic: 0x0
Benchmark                Time (microseconds)
Screen fill              307994
Text                     7449
Lines                    92433
Horiz/Vert Lines         25838
Rectangles (outline)     14423
Rectangles (filled)      744864
Circles (filled)         99631
Circles (outline)        79308
Triangles (outline)      19609
Triangles (filled)       235923
Rounded rects (outline)  32314
Rounded rects (filled)   817065
Done!

After the changes (again, I only updated the functions within the main sketch), the sizes are different:
Code:
FlexRAM section ITCM+DTCM = 512 KB
    Config : aaaaaaaf
    ITCM :  48208 B	(73.56% of   64 KB)
    DTCM :  17088 B	( 3.72% of  448 KB)
    Available for Stack: 441664
OCRAM: 512KB
    DMAMEM:   8272 B	( 1.58% of  512 KB)
    Available for Heap: 516016 B	(98.42% of  512 KB)
Flash:  64688 B	( 3.18% of 1984 KB)

And timings:
Code:
Benchmark                Time (microseconds)
Screen fill              307987
Text                     7503
Lines                    92546
Horiz/Vert Lines         25846
Rectangles (outline)     14430
Rectangles (filled)      744884
Circles (filled)         99789
Circles (outline)        78841
Triangles (outline)      19697
Triangles (filled)       235983
Rounded rects (outline)  32279
Rounded rects (filled)   817101
Done!
So I only moved 3392 bytes out of ITCM. Actually that freed no RAM: it only saved copying those bytes down at startup, as I would need to get down to the next increment of 32KB to free anything...

And it did impact the timing a slight bit.
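
For what it's worth, the 32KB increment arithmetic can be sketched in a few lines. This is plain C++ that runs on a PC. The helper names itcmBanks and itcmSlack are made up for illustration (they are not part of the Teensy core), and it assumes ITCM is always handed out in whole 32KB banks, as described above:

```cpp
#include <cassert>
#include <cstdint>

// Assumption: FlexRAM is assigned to ITCM in whole 32 KB banks, 512 KB total.
constexpr uint32_t kBankBytes = 32 * 1024;
constexpr uint32_t kFlexRamBytes = 512 * 1024;

// Hypothetical helper: number of 32 KB banks needed for `codeBytes` of ITCM code.
inline uint32_t itcmBanks(uint32_t codeBytes) {
  assert(codeBytes <= kFlexRamBytes);
  return (codeBytes + kBankBytes - 1) / kBankBytes;  // round up to whole banks
}

// Hypothetical helper: bytes left unused at the top of the last ITCM bank.
inline uint32_t itcmSlack(uint32_t codeBytes) {
  return itcmBanks(codeBytes) * kBankBytes - codeBytes;
}
```

With the numbers above, both 51600 and 48208 bytes need 2 banks (64 KB), so the ITCM allocation does not shrink. The code would have to drop to 32768 bytes or less before a bank is released.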


P.S. - Need to add FLASHMEM to keywords.txt
 
@KurtE, …

In the case of :: ITCM : 51600 B (78.74% of 64 KB)

Where the 'next increment of 32KB' does leave a usable piece of RAM - could that be addressed for R/W usage at runtime?

In the case above, that is about 13,936 bytes.

The ITCM is marked the same as DTCM in the script as I saw it:
ITCM (rwx): ORIGIN = 0x00000000, LENGTH = 512K
DTCM (rwx): ORIGIN = 0x20000000, LENGTH = 512K

From:
Code:
	memory_copy(&_stext, &_stextload, &_etext);
where::

	_stext = ADDR(.text.itcm);
	_etext = ADDR(.text.itcm) + SIZEOF(.text.itcm);
	_stextload = LOADADDR(.text.itcm);

Would that be::
Code:
uint32_t  *SomeMemPtr = _stext + _etext;
uint32_t SizeOfSomeMem = (64*1024) - _etext;
 
Simple sample closer to generic case?

Though I am getting this output:
Code:
SizeOfSomeMem=15685 [KB=15]	SizeLeft_etext=15689 	len ITCM=3644801719
stext=0 	_stextload=0	etext=3644801719

With zero for _stext and _stextload - the numbers can't be right?

From:
Code:
extern unsigned long _stextload;
extern unsigned long _stext;
extern unsigned long _etext;

void setup()  {

  while (!Serial);  // Wait for Arduino Serial Monitor to open
  unsigned long SizeLeft_etext = (32 * 1024) - ((_etext-_stext)%(32*1024));
  unsigned long  *SomeMemPtr = (unsigned long*)((_etext+4)%4);
  unsigned long SizeOfSomeMem = SizeLeft_etext -4;
  Serial.printf( "\nSizeOfSomeMem=%u [KB=%u]\tSizeLeft_etext=%u \tlen ITCM=%u\n", SizeOfSomeMem,SizeOfSomeMem/1024, SizeLeft_etext, (_etext-_stext) );
  Serial.printf( "stext=%u \t_stextload=%u\tetext=%u \n", _stext, _stextload, _etext );
}

void loop() {}
 
Hi @defragster - It is the addresses of these names that are important... So I hacked a second part onto your program:
Code:
extern unsigned long _stextload;
extern unsigned long _stext;
extern unsigned long _etext;

void setup()  {

  while (!Serial);  // Wait for Arduino Serial Monitor to open
  unsigned long SizeLeft_etext = (32 * 1024) - ((_etext - _stext) % (32 * 1024));
  unsigned long  *SomeMemPtr = (unsigned long*)((_etext + 4) % 4);
  unsigned long SizeOfSomeMem = SizeLeft_etext - 4;
  Serial.printf( "\nSizeOfSomeMem=%u [KB=%u]\tSizeLeft_etext=%u \tlen ITCM=%u\n", SizeOfSomeMem, SizeOfSomeMem / 1024, SizeLeft_etext, (_etext - _stext) );
  Serial.printf( "stext=%u \t_stextload=%u\tetext=%u \n", _stext, _stextload, _etext );

  SizeLeft_etext = (32 * 1024) - (((uint32_t)&_etext - (uint32_t)&_stext) % (32 * 1024));
  SomeMemPtr = (unsigned long*)((_etext + 4) % 4);
  SizeOfSomeMem = SizeLeft_etext - 4;
  Serial.printf( "\nSizeOfSomeMem=%u [KB=%u]\tSizeLeft_etext=%u \tlen ITCM=%u\n", SizeOfSomeMem, SizeOfSomeMem / 1024, SizeLeft_etext,
                 ((uint32_t)&_etext - (uint32_t)&_stext) );

  Serial.printf( "&stext=%x(%u) \t&_stextload=%x(%u)\t&etext=%x(%u) \n",
                 (uint32_t)&_stext, (uint32_t)&_stext, (uint32_t)&_stextload, (uint32_t)&_stextload,
                 (uint32_t)&_etext, (uint32_t)&_etext );
}

void loop() {}
Which now gives you this output:
Code:
SizeOfSomeMem=11000 [KB=10]	SizeLeft_etext=11004 	len ITCM=3985822980
stext=0 	_stextload=0	etext=3985822980 

SizeOfSomeMem=9756 [KB=9]	SizeLeft_etext=9760 	len ITCM=23008
&stext=0(0) 	&_stextload=60001790(1610618768)	&etext=59e0(23008)
 
Hi @defragster - It is the addresses of these names that are important... So I hacked a second part onto your program:

Thanks. As soon as I quit for the night I figured I had missed the point of passing the address to memory_copy().

Given that correction - is that a valid R/W usable RAM pointer (above the CODE), with a usable size of SizeOfSomeMem?
 
@defragster - I am not sure whether it is valid to write into the ITCM memory area or not...

But one thing I have thought about doing is maybe using more of the memory in the DTCM for buffers and the like.

Currently at startup time we do:
Code:
	// Initialize memory
	GPIO7_DR_SET = (1<<3); // digitalWrite(13, HIGH);
	memory_copy(&_stext, &_stextload, &_etext);
	GPIO7_DR_CLEAR = (1<<3); // digitalWrite(13, LOW);
	memory_copy(&_sdata, &_sdataload, &_edata);
	GPIO7_DR_SET = (1<<3); // digitalWrite(13, HIGH);
	memory_clear(&_sbss, &_ebss);
	GPIO7_DR_CLEAR = (1<<3); // digitalWrite(13, LOW);
So basically nothing above the address &_ebss is used for anything, except the stack:

Again back to the modified program I had:
Code:
cmd /c "D:\\arduino-1.8.10\\hardware\\teensy\\..\\tools\\arm\\bin\\arm-none-eabi-gcc-nm -n C:\\Users\\kurte\\AppData\\Local\\Temp\\arduino_build_434165\\bar.ino.elf | D:\\GITHUB\\imxrt-size\\Debug\\imxrt-size.exe"

FlexRAM section ITCM+DTCM = 512 KB
    Config : aaaaaaab
    ITCM :  23008 B	(70.21% of   32 KB)
    DTCM :  12992 B	( 2.64% of  480 KB)
    Available for Stack: 478528
OCRAM: 512KB
    DMAMEM:   8272 B	( 1.58% of  512 KB)
    Available for Heap: 516016 B	(98.42% of  512 KB)
Flash:  32528 B	( 1.60% of 1984 KB)

So as I mentioned, we have > 475K for the stack which we typically don't use...
So for example, if I have a simple ST7789 display (240x240) and wanted a frame buffer that was in low memory... Today I could explicitly define it and pass it in, but suppose I don't. Today it will malloc it, it ends up in high memory, and DMA screws up...

So I was thinking we could maybe write a uint32_t at a memory location like _ebss at startup, holding the next address after that location (or maybe rounded up to 16 bytes or...).
Then have a simple function that looks at that location + the desired size, and if that does not get too close to the current stack address, it reserves that memory and updates the value stored at _ebss to be beyond the newly reserved memory...

i.e. a quick and dirty heap (that has no free...)
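
A minimal sketch of such a quick and dirty heap, just to show the idea. It is plain C++ so it can be tried on a PC. BumpAllocator and reserve() are hypothetical names, and on a real Teensy `next` would presumably start just above &_ebss while `limit` would be the current stack address minus a safety margin. None of this is existing core code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump allocator: memory is only ever reserved, never freed.
struct BumpAllocator {
  uintptr_t next;   // first address that has not been reserved yet
  uintptr_t limit;  // reservations must end at or below this address

  // Reserve `size` bytes aligned to `align` (must be a power of two).
  // Returns nullptr if the request would run past `limit`.
  void *reserve(size_t size, size_t align = 16) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t mask = static_cast<uintptr_t>(align) - 1;
    uintptr_t start = (next + mask) & ~mask;  // round up to the alignment
    if (start > limit || size > limit - start) return nullptr;
    next = start + size;                      // move the "next free" marker
    return reinterpret_cast<void *>(start);
  }
};
```

Usage would be something like `BumpAllocator heap{start, end}; uint16_t *fb = (uint16_t *)heap.reserve(240 * 240 * 2, 32);`, with a nullptr check before using the buffer.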
 
@defragster - I am not sure whether it is valid to write into the ITCM memory area or not...

...

If the pointer math to this is right >> SomeMemPtr

It is easy to test for the length of >> SizeOfSomeMem

Of course the size of that memory space is incidental - it could be anywhere from 2K to 30K, depending on how much code ends up in the uppermost 32K ITCM chunk after compile.

Will have to read the next part again - but having a way to dynamically get use of all RAM is important.
 
Thank you all for the wealth of information. I was going to ask these questions today if it weren't for this thread. I just moved all my projects to the T4.0 from the T3.6, and am now learning how to optimize for the new MCU. Would the opinion of a novice be helpful? I hope so...

I have two difficulties reconciling the T4.0 with the promise of 'Arduino-compatible'. As you all say, the Arduino size info hides what's really going on. Secondly, it is not possible for me to use DMAMEM in a typical Arduino style of C++ (Where hackers are taught to avoid malloc() like the plague and DMA is forever a mystery). From a beginner's perspective all this DTCM, ITCM, OCRAM falls flat and the really important caveat that there are two separate RAM banks is lost. So would it be a worthwhile stopgap to:

1. Show the max ram in Arduino size info as only the FlexRam (512KB)
2. Instead of just printing "`.data' will not fit in region `DTCM'", also add "consider using DMAMEM for second RAM bank". It seems obvious that this particular error will 99% of the time be hit by beginners like me who didn't deeply consider the memory layout early in their program design.

Hopefully then the very helpful (but hard to find) 'Memory Layout' section in the T4.0 product page will help users solve these issues.

Personally I would love to use the imxrt-size tool, but I don't feel confident I can install it properly on my own. Anyway, I'll still just test my sketches to see if I run out of RAM since manipulating the regions used is quite advanced for me.
 
Teensy 3.6 has 256 KB of RAM - the primary RAM1 area of Teensy 4.0 is 512 KB. That is shared with code not marked as FLASHMEM. Unless sketch code exceeds 256KB, the T4 will have more memory available with no other changes.

The memory called DMAMEM is just a name to indicate the upper 512KB RAM2 area - it does have different properties, but it is not really DMA specific when it comes to general usage. Indeed malloc() isn't generally the best idea when a small memory area will get fragmented, or perhaps when it is used very dynamically. But using malloc() for semi-static runtime allocations certainly won't be a problem, with suitable RAM up to the T4's limit.

It is an unfortunate reality that the 1MB RAM is split into two 512KB blocks, and the Arduino environment doesn't yet have a way to document the split or its usage - though Paul is working on that, and on detailing the usage on the teensy40.html page.
 
@defragster - I am not sure whether it is valid to write into the ITCM memory area or not...
...

Looks like read and write works there … Obviously on startup the code is written there to fill ITCM with FLASH CODE - this sketch finds that end, skips a word, then writes up to the next 32K boundary.
Adding the GET and SHOW region below to any sketch should work; only getFreeITCM() needs to be called in the sketch to locate and point to the space :: *ptrFreeITCM, and to find the usable length :: sizeofFreeITCM

Code:
// _________________________________________________________________________
//  GET and SHOW FreeITCM RAM :: uint32_t *ptrFreeITCM and sizeofFreeITCM
uint32_t *ptrFreeITCM;  // Set to Usable ITCM free RAM
uint32_t  sizeofFreeITCM; // sizeof free RAM in uint32_t units.
uint32_t  SizeLeft_etext;
extern unsigned long _stextload;  // FROM LINKER
extern unsigned long _stext;
extern unsigned long _etext;
void   getFreeITCM() { // end of CODE ITCM, skip full 32 bits
  SizeLeft_etext = (32 * 1024) - (((uint32_t)&_etext - (uint32_t)&_stext) % (32 * 1024));
  sizeofFreeITCM = SizeLeft_etext - 4;
  sizeofFreeITCM /= sizeof(ptrFreeITCM[0]);
  ptrFreeITCM = (uint32_t *) ( (uint32_t)&_etext + 4 ); // one word past the end of the code in ITCM
}
void showNumsITCM() {
  // sizeofFreeITCM=9180 [#uint32_t=2295] SizeLeft_etext=9184   len ITCM=23584
  Serial.printf( "\n\nsizeofFreeITCM=%u [#uint32_t=%u]\tSizeLeft_etext=%u \tlen ITCM=%u\n", sizeofFreeITCM * sizeof(uint32_t),
                 sizeofFreeITCM, SizeLeft_etext, ((uint32_t)&_etext - (uint32_t)&_stext) );
  // &stext=0(0)  &_stextload=60001720(1610618656)  &etext=5c20(23584)
  Serial.printf( "\n&stext=%x(%u) \t&_stextload=%x(%u)\t&etext=%x(%u) \n",
                 (uint32_t)&_stext, (uint32_t)&_stext, (uint32_t)&_stextload, (uint32_t)&_stextload,
                 (uint32_t)&_etext, (uint32_t)&_etext );
  // &stext=0(0)  &etext=5c20(23584)  ptrFreeITCM[0]=5c24(23588)  pLast=8000(32768)
  Serial.printf( "\n&stext=%x(%u) \t&etext=%x(%u) \tptrFreeITCM[0]=%x(%u) \tpLast=%x(%u) \n",
                 (uint32_t)&_stext, (uint32_t)&_stext, (uint32_t)&_etext, (uint32_t)&_etext,
                 (uint32_t)ptrFreeITCM, (uint32_t)ptrFreeITCM,
                 (uint32_t)ptrFreeITCM + sizeofFreeITCM * sizeof(uint32_t), (uint32_t)ptrFreeITCM + sizeofFreeITCM * sizeof(uint32_t) );
}
//  GET and SHOW FreeITCM RAM :: uint32_t *ptrFreeITCM and sizeofFreeITCM
// _________________________________________________________________________

void setup()  {
  while (!Serial);  // Wait for Arduino Serial Monitor to open
  Serial.println("\n\n++++++++++++++++++++++");
  getFreeITCM();
  showNumsITCM();
}

#define NUM_SHOW sizeofFreeITCM
void loop() {
  uint32_t ii;
  for ( ii = 0; ii < NUM_SHOW; ii++ ) {
    if ( !(ii % 5) )     Serial.printf( "\npITCM@%x \t", (ptrFreeITCM) + ii );
    Serial.printf( "%3u=%x\t", ii, ptrFreeITCM[ii] );
  }
  Serial.printf( "\nLast pITCM@%x \t", (ptrFreeITCM) + ii - 1 );
  for ( ii = 0; ii < NUM_SHOW; ii++ ) {
    ptrFreeITCM[ii] = micros();
  }

  Serial.println("\n    ============================");
  for ( ii = 0; ii < NUM_SHOW; ii++ ) {
    if ( !(ii % 5) )     Serial.printf( "\npITCM@%x \t", (ptrFreeITCM) + ii );
    Serial.printf( "%3u=%u\t", ii, ptrFreeITCM[ii] );
  }
  Serial.printf( "\nLast pITCM@%x \t", (ptrFreeITCM) + ii - 1 );
  showNumsITCM();
  while (1);
}

Sketch shows initial ITCM values - then writes in micros() to each and shows those values - with boundary calcs shown before and after:
Code:
…
pITCM@7fdc 	2270=671af	2271=671af	2272=671af	2273=671af	2274=671af	
pITCM@7ff0 	2275=671af	2276=671af	2277=671af	2278=671af	
Last pITCM@7ffc 	
    ============================

pITCM@5c64 	  0=419260	  1=419260	  2=419260	  3=419260	  4=419260	
pITCM@5c78 	  5=419260	  6=419260	  7=419260	  8=419260	  9=419260	
pITCM@5c8c 	 10=419261	 11=419261	 12=419261	 13=419261	 14=419261	
…
pITCM@7fc8 	2265=417411	2266=417411	2267=417411	2268=417411	2269=417411	
pITCM@7fdc 	2270=417411	2271=417411	2272=417411	2273=417411	2274=417411	
pITCM@7ff0 	2275=417411	2276=417411	2277=417412	2278=417412	
Last pITCM@7ffc 	

sizeofFreeITCM=9116 [#uint32_t=2279]	SizeLeft_etext=9120 	len ITCM=23648

&stext=0(0) 	&_stextload=60001720(1610618656)	&etext=5c60(23648) 

&stext=0(0) 	&etext=5c60(23648) 	ptrFreeITCM[0]=5c64(23652) 	pLast=8000(32768)
 
OCRAM and DMA operations:

@Paul and others... I know that I have posted about this in the past, and I know I keep seeing questions about it both forum and PMs and ...

And I don't know a correct answer to this...

Put simply: Doing DMA to and from the OCRAM (DMAMEM and malloc operations) Sucks!

Or another way of saying it: DMAMEM on T4 stands for the Worst memory to use for DMA operations.

I know during the beta, we could get around many of the issues by editing startup.c and turning off the caching in the lines:
Code:
	SCB_MPU_RBAR = 0x20200000 | REGION(3); // RAM (AXI bus)
	SCB_MPU_RASR = MEM_CACHE_WBWA | READWRITE | NOEXEC | SIZE_1M;
https://forum.pjrc.com/threads/54711-Teensy-4-0-First-Beta-Test?p=194674&viewfull=1#post194674

But I know that is throwing the baby out with the bathwater...

There appear to be DMA issues related to the alignment of the buffers, such as needing 32 byte boundaries.... I know that in some of our libraries we malloc size+32 and then set up the start at a 32 byte boundary, which is not the greatest. And it still did not help in continuous update cases.
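
For reference, the "malloc size+32 and round up" trick looks roughly like this. It is only a sketch in plain C++: AlignedBuffer and allocAligned are made-up names, not the code from any of the libraries mentioned. The 32 comes from the Cortex-M7 data cache line size. Rounding the length up to whole cache lines as well is my own assumption, so that the buffer does not share a cache line with whatever malloc places next to it:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr size_t kCacheLine = 32;  // Cortex-M7 data cache line size in bytes

// Hypothetical result type: keep `raw` around, it is what must go to free().
struct AlignedBuffer {
  void *raw;      // pointer returned by malloc()
  uint8_t *data;  // 32-byte aligned start of the usable area
  size_t size;    // usable bytes, rounded up to whole cache lines
};

// Over-allocate by one cache line, then round the start address up.
inline AlignedBuffer allocAligned(size_t bytes) {
  assert(bytes != 0);
  size_t rounded = (bytes + kCacheLine - 1) & ~(kCacheLine - 1);
  void *raw = malloc(rounded + kCacheLine);
  if (raw == nullptr) return {nullptr, nullptr, 0};
  uintptr_t mask = static_cast<uintptr_t>(kCacheLine) - 1;
  uintptr_t start = (reinterpret_cast<uintptr_t>(raw) + mask) & ~mask;
  return {raw, reinterpret_cast<uint8_t *>(start), rounded};
}
```

The downsides are the ones already mentioned: up to 63 bytes are wasted per buffer, and the caller has to remember to free `raw`, not `data`.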

I know that for DMA out of memory we need to do things like: arm_dcache_flush(write_data, count);

Again this is fine if you are doing one shot. But where is it documented that mortals will find it and do it?

And then when you do DMA into memory, there is the call: arm_dcache_delete(retbuf, count);

Again where is this documented? Should this be called when you startup the DMA operation? Or when it completes? If it should be done at completion, does this imply the calling code to get the data should first delete the cache memory before loading from it? ...

The problem is again I know, no one configuration will make everyone happy.

In most cases when I use it, I would be happy with it just turned off. Yes Frame buffer updates and like might be slightly slower, but not sure how much as when I do screen updates I have to call flush to write the whole thing out...

I don't know if it is possible to create multiple memory regions in OCRAM, where maybe DMAMEM goes to memory that is NOT WBWA cached, and maybe some other keyword like CACHEDMEM does?
And the user can somehow control where malloc comes from?

Likewise in many of our programs today, we end up with lots of DTCM memory available. Like near 400K that is currently only used by STACK. Again wondering if we should have memory allocator out of here? And/Or option to have DMAMEM objects put into this space?

Again sorry for rambling on this.... (again)
 
Question: RAM1 runs at CPU speed - what speed is RAM2, that it needs to be cached? Some edit to repeat the p#66 ITCM code on DMAMEM blocks could tell me. After posting that code I added another group where the ITCM RAM is filled as follows, and each RAM write was 3 clocks apart:
Code:
for ( ii = 0; ii < NUM_SHOW; ii++ ) {
  ptrFreeITCM[ii] = ARM_DWT_CYCCNT;
}
Doing that across a block too large to cache, or scattered in turn across multiple RAM2/DMAMEM blocks, might show whether the write rate differs from RAM1. That could also catch a cache flush delay, or show whether the MCU chooses write-through.

Paul posted on other thread about the CACHE func in imxrt.h - these don't include disabling cache on a RAM area? - Not sure if that is doable?:
Code:
// Flush data from cache to memory
static inline void arm_dcache_flush(void *addr, uint32_t size)

// Delete data from the cache, without touching memory
static inline void arm_dcache_delete(void *addr, uint32_t size)

// Flush data from cache to memory, and delete it from the cache
static inline void arm_dcache_flush_delete(void *addr, uint32_t size)


As a side ref - the ESP32 has multiple RAM sets as well - this doc goes through details on choosing and allocating: esp-idf/en/latest/api-guides/general-notes.html#memory-layout Not sure any of it directly relates, but they have many of the same issues, except their cache isn't implicated: they note it relates only to IROM (flash code) :: Access to this region is transparently cached using two 32kB blocks. So that would not impact DMA usage. Of course the ESP32 only runs at 240 MHz, so the RAM may all be at CPU speed.
 
Again sorry for rambling on this.... (again)
Thanks for doing it.
Especially of interest to me: what is the procedure for continuous DMA (say, in the case of I2S input DMA)? Does the ISR need to address caching (as done in Output_I2S) or not (as done in Input_I2S)?
 
Add to p#66 code using:
Code:
#define DMA_SIZE 9000
DMAMEM uint32_t pDMA[3][DMA_SIZE];

There are some regular MCU cycle waits between 2 and 22 cycles when using that DMAMEM - RAM1 looks to be 2 cycles [below]:
Code:
pDMA[0]@20208ca0 	after CNT of   2 Diff CycCnt is   6 	with 409571276!=409571282
pDMA[0]@2021a5e0 	after CNT of   2 Diff CycCnt is   2 	with 409571288!=409571290
pDMA[0]@20223280 	after CNT of   1 Diff CycCnt is   8 	with 409571290!=409571298
pDMA[0]@2022bf20 	after CNT of   1 Diff CycCnt is   2 	with 409571298!=409571300
pDMA[0]@204d97e0 	after CNT of  78 Diff CycCnt is  11 	with 409571454!=409571465
pDMA[0]@204e2480 	after CNT of   1 Diff CycCnt is   2 	with 409571465!=409571467
pDMA[0]@204eb120 	after CNT of   1 Diff CycCnt is   6 	with 409571467!=409571473
pDMA[0]@204f3dc0 	after CNT of   1 Diff CycCnt is   2 	with 409571473!=409571475
pDMA[0]@2051fce0 	after CNT of   5 Diff CycCnt is  14 	with 409571483!=409571497
pDMA[0]@20528980 	after CNT of   1 Diff CycCnt is   2 	with 409571497!=409571499

...
pDMA[1]@20208ca0 	after CNT of   2 Diff CycCnt is   2 	with 409607138!=409607140
pDMA[1]@2023d860 	after CNT of   6 Diff CycCnt is   3 	with 409607150!=409607153
pDMA[1]@20246500 	after CNT of   1 Diff CycCnt is   2 	with 409607153!=409607155
pDMA[1]@20260ae0 	after CNT of   3 Diff CycCnt is  18 	with 409607159!=409607177
pDMA[1]@20269780 	after CNT of   1 Diff CycCnt is   2 	with 409607177!=409607179
pDMA[1]@20272420 	after CNT of   1 Diff CycCnt is   6 	with 409607179!=409607185
pDMA[1]@2027b0c0 	after CNT of   1 Diff CycCnt is   2 	with 409607185!=409607187
pDMA[1]@202b8920 	after CNT of   7 Diff CycCnt is  18 	with 409607199!=409607217
pDMA[1]@202c15c0 	after CNT of   1 Diff CycCnt is   2 	with 409607217!=409607219
pDMA[1]@202ed4e0 	after CNT of   5 Diff CycCnt is  14 	with 409607227!=409607241
pDMA[1]@202f6180 	after CNT of   1 Diff CycCnt is   2 	with 409607241!=409607243
pDMA[1]@202fee20 	after CNT of   1 Diff CycCnt is   6 	with 409607243!=409607249
pDMA[1]@20307ac0 	after CNT of   1 Diff CycCnt is   2 	with 409607249!=409607251
pDMA[1]@20345320 	after CNT of   7 Diff CycCnt is  10 	with 409607263!=409607273
pDMA[1]@2034dfc0 	after CNT of   1 Diff CycCnt is   2 	with 409607273!=409607275

...
pDMA[2]@20208ca0 	after CNT of   2 Diff CycCnt is   2 	with 409644423!=409644425
pDMA[2]@20272420 	after CNT of  12 Diff CycCnt is  22 	with 409644447!=409644469
pDMA[2]@2027b0c0 	after CNT of   1 Diff CycCnt is   2 	with 409644469!=409644471
pDMA[2]@202b8920 	after CNT of   7 Diff CycCnt is  18 	with 409644483!=409644501
pDMA[2]@202c15c0 	after CNT of   1 Diff CycCnt is   2 	with 409644501!=409644503
pDMA[2]@202ed4e0 	after CNT of   5 Diff CycCnt is  14 	with 409644511!=409644525
pDMA[2]@202f6180 	after CNT of   1 Diff CycCnt is   2 	with 409644525!=409644527
pDMA[2]@202fee20 	after CNT of   1 Diff CycCnt is   6 	with 409644527!=409644533
pDMA[2]@20307ac0 	after CNT of   1 Diff CycCnt is   2 	with 409644533!=409644535
pDMA[2]@20345320 	after CNT of   7 Diff CycCnt is  10 	with 409644547!=409644557

Took off the DMAMEM descriptor and ran from RAM1 { memaddr changed but not the var name pDMA }. It looks like this, with the 902 anomaly always there in group [2] - but it shifts a few positions. And the Bank [0] group repeats the 6,2,8,2 pattern. And Bank [1] just shows 2 cycles difference:
Code:
pDMA[0]@20009bb0 	after CNT of   2 Diff CycCnt is   6 	with 263201668!=263201674
pDMA[0]@2001b4f0 	after CNT of   2 Diff CycCnt is   2 	with 263201680!=263201682
pDMA[0]@20024190 	after CNT of   1 Diff CycCnt is   8 	with 263201682!=263201690
pDMA[0]@2002ce30 	after CNT of   1 Diff CycCnt is   2 	with 263201690!=263201692
DONE after CNT of 8994 Diff CycCnt is   2 	with 263219678!=263219680
    ============================

pDMA[1]@20009bb0 	after CNT of   2 Diff CycCnt is   2 	with 263219697!=263219699
DONE after CNT of 8998 Diff CycCnt is   2 	with 263237693!=263237695
    ============================

pDMA[2]@20009bb0 	after CNT of   2 Diff CycCnt is   2 	with 263237709!=263237711
pDMA[2]@287d2790 	after CNT of 3955 Diff CycCnt is 902 	with 263245619!=263246521
pDMA[2]@287db430 	after CNT of   1 Diff CycCnt is   2 	with 263246521!=263246523
pDMA[2]@28cde770 	after CNT of 146 Diff CycCnt is   3 	with 263246813!=263246816
pDMA[2]@28ce7410 	after CNT of   1 Diff CycCnt is   2 	with 263246816!=263246818
DONE after CNT of 4895 Diff CycCnt is   2 	with 263256606!=263256608
    ============================

The original RAM1 test, with a pointer to ITCM, indexes from the pointer differently - not as an array - and results in 4 cycles, not 2. Need to use the same Diff CycCnt testing to be sure.

An hour past the start of my personal sleep cycle - I can add loops to repeat within a 32KB area a few times and see if I can get the cache to even out the times to prove it is working, and post the code if there is any interest.
 
But where is it documented that mortals will find it and do it?

Currently DMA on Teensy 4.0 is not really documented anywhere, except some comments in imxrt.h and messages scattered across this forum. Neither is DMA on Teensy 3.x or LC. Many other things are also sorely in need of documentation. I'm planning to work on many new web pages to cover these topics, starting around the end of this month. Ping me in early December...

However, I do not believe any amount of documentation is going to make DMA easily accessible to most people. It's an advanced topic. Using DMA successfully (without extreme luck) requires tough troubleshooting.



what speed is RAM2 that it needs to be cached?

The simple answer is 150 MHz, or 1/4 of whatever speed the M7 processor is running.

But the longer answer depends on details of how these buses and the bridges between them work. Sadly, NXP's documentation on those details is rather scant.
 
Thanks @Paul,

Another interesting sub-question is what, in practice, the performance differences and trade-offs are for the different caching options.

For example, we currently use MEM_CACHE_WBWA, which is write-back memory. What would the speed difference be for using MEM_CACHE_WT (write-through)?

Would going to write-through memory speed up or slow down programs that, for example, use DMAMEM or the like for a frame buffer for a display?

Example: an ILI9341 display with the frame buffer in upper memory, where we do a few graphic operations to update some things on the screen and then call the updateScreen function to output the data in the frame buffer to the display.

My first attempt to make this work was to call arm_dcache_flush on the entire frame buffer, to make sure the DMA operation got the real contents. Again, I have not tried to test the differences in timing to see what happens if your screen update code, for example, only touched a quarter of the screen: having the data output using _WT and not needing the flush, versus using _WBWA and needing to do the flush. For example, I don't know how the timings of functions/registers like SCB_CACHE_DCCMVAC work, and whether they differ depending on whether any memory in that 32 bytes was updated or not...

Note: That was the first attempt. However, I was still having issues with arm_dcache_flush when I tried turning on continuous update mode. When and where should you call the flush? Before, I was just doing it when I started up the DMA operation. But then suppose the sketch does stuff to update the screen, which is of course the idea of continuous updates. I ran into several issues with this, where for example you do something like a fillRect in the middle of the screen. The interesting thing is that some of the new color makes it through the cache to memory and some does not. So you end up with splats of different colors. There are probably several different ways to solve it, including maybe:
a) Have each of your DMASettings objects set its Interrupt on Completion bit, and do an arm_dcache_flush operation on its own memory to reflect the next frame. Again, this does not handle any difference in data between now and the next frame, but you should only see a splotch on one frame.
b) Maybe have each of the above interrupts try to flush the whole cache again... - Maybe fewer splotches, but how long does that take?
c) Have each of our graphic primitives try to do a cache flush for the region of memory they touched.

Or do what I am currently doing and NOT do the DMA from this memory. Currently I have two smaller buffers in DTCM, with the DMASettings pointing to these. I copy the first parts of the frame buffer into each of these buffers, interrupt on each buffer completion, and copy in the next set of pixels... Again I don't like this, but I am not sure of anything better yet...
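
To show the shape of that two-buffer scheme without any hardware, here is a small host-side simulation in plain C++. Everything in it is hypothetical: PingPong, refill() and sendFrame() are illustration names, the chunk size of 64 pixels is arbitrary, and the loop in sendFrame() merely stands in for the DMA engine and its completion interrupt. It is not the code from the display library:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kChunkPixels = 64;  // arbitrary chunk size for the demo

// Two small buffers (these would live in DTCM) fed from a big frame buffer.
struct PingPong {
  const uint16_t *frame;   // full frame buffer (this would be in cached OCRAM)
  size_t framePixels;      // number of pixels in the frame
  size_t nextPixel = 0;    // next frame pixel not yet copied into a chunk
  uint16_t chunk[2][kChunkPixels];
  size_t chunkLen[2] = {0, 0};

  PingPong(const uint16_t *f, size_t n) : frame(f), framePixels(n) {
    refill(0);  // preload both buffers before the first transfer starts
    refill(1);
  }

  // Copy the next slice of the frame into chunk `i`; length 0 means "done".
  void refill(int i) {
    assert(i == 0 || i == 1);
    size_t n = std::min(kChunkPixels, framePixels - nextPixel);
    std::memcpy(chunk[i], frame + nextPixel, n * sizeof(uint16_t));
    chunkLen[i] = n;
    nextPixel += n;
  }
};

// Stand-in for DMA + SPI: "transmits" the chunks alternately into `out`.
inline std::vector<uint16_t> sendFrame(const uint16_t *frame, size_t n) {
  PingPong pp(frame, n);
  std::vector<uint16_t> out;
  for (int i = 0; pp.chunkLen[i] != 0; i ^= 1) {
    out.insert(out.end(), pp.chunk[i], pp.chunk[i] + pp.chunkLen[i]);  // transfer chunk i
    pp.refill(i);  // "completion interrupt": reload the buffer that just finished
  }
  return out;
}
```

The point of the design is that the CPU copy reads the frame buffer through the cache, so the DMA engine only ever sees the uncached DTCM buffers. The cost is the extra copy and one interrupt per chunk.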
 
150 MHz Ram2 ( .25 of F_CPU ) - thanks Paul.

Updated my test code - I had addresses printed missing an array dimension in above post - so ignore those.

I added code for a DMA cycles-per-write test and did the same for RAM. RAM still shows that anomaly - I added three runs in succession in the code and it hits something odd there? Also, the DMA test is first run once, where the cache can't help on first access, as it runs through 36KB three times in the arrays. Then I did a shorter set of 900 32-bit elements 10 times in succession, and the results on the last are about 2.5 cycles instead of 4 cycles on average. I turned off the copious spew on DMA changes and just show the average; the RAM results follow. The other ODD thing: I used arm_dcache_flush_delete() on the DMA memory area under test - perhaps not the best way, but it made the DMA number below go up by a cycle count per write:
Code:
    ============================  DMA TEST Single

Avg CycCnt for 9000 is 3.983444
    ============================

Avg CycCnt for 9000 is 4.155334
    ============================

Avg CycCnt for 9000 is 3.998778
    ============================

    ============================  DMA TEST short Repeat

Avg CycCnt for 900 is 2.765556
    ============================

Avg CycCnt for 900 is 2.490000
    ============================

Avg CycCnt for 900 is 2.192222
    ============================

    ============================ RAM TEST

pRAM[0]@20001080 	after run of    2 Diff CycCnt is   7 	with 390861160!=390861167
pRAM[0]@20001088 	after run of    2 Diff CycCnt is   2 	with 390861174!=390861176
pRAM[0]@2000108c 	after run of    1 Diff CycCnt is   9 	with 390861176!=390861185
pRAM[0]@20001090 	after run of    1 Diff CycCnt is   2 	with 390861185!=390861187
DONE after run of 8994 Diff CycCnt is   2 	with 390879173!=390879175
Avg CycCnt for 9000 is 2.001889
    ============================

pRAM[1]@20009d20 	after run of    2 Diff CycCnt is   2 	with 390879193!=390879195
DONE after run of 8998 Diff CycCnt is   2 	with 390897189!=390897191
Avg CycCnt for 9000 is 2.000000
    ============================

pRAM[2]@200129c0 	after run of    2 Diff CycCnt is   2 	with 390897204!=390897206
pRAM[2]@200168f4 	after run of 4045 Diff CycCnt is 783 	with 390905294!=390906077
pRAM[2]@200168f8 	after run of    1 Diff CycCnt is   2 	with 390906077!=390906079
pRAM[2]@20016b38 	after run of  144 Diff CycCnt is   3 	with 390906365!=390906368
pRAM[2]@20016b3c 	after run of    1 Diff CycCnt is   2 	with 390906368!=390906370
DONE after run of 4807 Diff CycCnt is   2 	with 390915982!=390915984
Avg CycCnt for 9000 is 2.086889
    ============================

    ============================ RAM TEST

pRAM[0]@20001080 	after run of    2 Diff CycCnt is   7 	with 391222645!=391222652
pRAM[0]@20001088 	after run of    2 Diff CycCnt is   2 	with 391222659!=391222661
pRAM[0]@2000108c 	after run of    1 Diff CycCnt is   9 	with 391222661!=391222670
pRAM[0]@20001090 	after run of    1 Diff CycCnt is   2 	with 391222670!=391222672
DONE after run of 8994 Diff CycCnt is   2 	with 391240658!=391240660
Avg CycCnt for 9000 is 2.001889
    ============================

pRAM[1]@20009d20 	after run of    2 Diff CycCnt is   2 	with 391240678!=391240680
DONE after run of 8998 Diff CycCnt is   2 	with 391258674!=391258676
Avg CycCnt for 9000 is 2.000000
    ============================

pRAM[2]@200129c0 	after run of    2 Diff CycCnt is   2 	with 391258689!=391258691
pRAM[2]@200166bc 	after run of 3903 Diff CycCnt is 711 	with 391266495!=391267206
pRAM[2]@200166c0 	after run of    1 Diff CycCnt is   2 	with 391267206!=391267208
DONE after run of 5094 Diff CycCnt is   2 	with 391277394!=391277396
Avg CycCnt for 9000 is 2.078778
    ============================

    ============================ RAM TEST

pRAM[0]@20001080 	after run of    2 Diff CycCnt is   6 	with 391571329!=391571335
pRAM[0]@20001088 	after run of    2 Diff CycCnt is   2 	with 391571341!=391571343
pRAM[0]@2000108c 	after run of    1 Diff CycCnt is   8 	with 391571343!=391571351
pRAM[0]@20001090 	after run of    1 Diff CycCnt is   2 	with 391571351!=391571353
pRAM[0]@200028ac 	after run of 1543 Diff CycCnt is   3 	with 391574437!=391574440
pRAM[0]@200028b0 	after run of    1 Diff CycCnt is   2 	with 391574440!=391574442
DONE after run of 7450 Diff CycCnt is   2 	with 391589340!=391589342
Avg CycCnt for 9000 is 2.001667
    ============================

pRAM[1]@20009d20 	after run of    2 Diff CycCnt is   2 	with 391589358!=391589360
DONE after run of 8998 Diff CycCnt is   2 	with 391607354!=391607356
Avg CycCnt for 9000 is 2.000000
    ============================

pRAM[2]@200129c0 	after run of    2 Diff CycCnt is   2 	with 391607369!=391607371
pRAM[2]@20016acc 	after run of 4163 Diff CycCnt is 773 	with 391615695!=391616468
pRAM[2]@20016ad0 	after run of    1 Diff CycCnt is   2 	with 391616468!=391616470
DONE after run of 4834 Diff CycCnt is   2 	with 391626136!=391626138
Avg CycCnt for 9000 is 2.085667
    ============================
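To see how much of the pRAM[2] average comes from the one long stall, the averages can be rebuilt from the logged outliers. This is only a sketch: it assumes every write not listed in the log took exactly 2 cycles, and averageCycles is a made-up helper, not part of the test sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Average cycles per write, given the total number of writes, the baseline
// cost per write, and the cycle counts of the writes that deviated from it.
inline double averageCycles(uint32_t writes, uint32_t baseline,
                            const std::vector<uint32_t>& outliers) {
  assert(writes >= outliers.size());
  uint64_t total =
      static_cast<uint64_t>(writes - outliers.size()) * baseline;
  for (uint32_t c : outliers) total += c;  // add the slow writes back in
  return static_cast<double>(total) / writes;
}
```

Under that assumption, averageCycles(9000, 2, {783, 3}) reproduces the 2.086889 of the first pRAM[2] run, and {711} and {773} reproduce the 2.078778 and 2.085667 of the second and third. So the single stall of 700+ cycles accounts for essentially all of the excess over 2.0 in that array.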
 
@KurtE - I uninstalled VS2017 weeks back, after 2019 was released and you were using it, but the VS2019 install isn't done; I only just started it. So I can't run the imxrt-size.exe from github, yours or my old one ...

Code:
imxrt-size.exe - System Error
---------------------------
The code execution cannot proceed because VCRUNTIME140D.dll was not found. Reinstalling the program may fix this problem.

Maybe tomorrow if the download install ever gets done … almost 1 GB of 2.6 … or more …

Might be fun to have a readme showing:
recipe.hooks.postbuild.4.pattern.windows=cmd /c "{runtime.hardware.path}\..\tools\arm\bin\arm-none-eabi-gcc-nm -n {build.path}\{build.project_name}.elf | T:\Programs\TSet\imxrt-size.exe"
 
@KurtE : VS2019 installed - not built - but github debug imxrt-size.exe works now! And back in Sublimetext to compile and build.

Code:
FlexRAM section ITCM+DTCM = 512 KB
    Config : aaaaaaab
    ITCM :  23200 B	(70.80% of   32 KB)
    DTCM :  12992 B	( 2.64% of  480 KB)
    Available for Stack: 478528
OCRAM: 512KB
    DMAMEM:   8272 B	( 1.58% of  512 KB)
    Available for Heap: 516016 B	(98.42% of  512 KB)
Flash:  32768 B	( 1.61% of 1984 KB)

Paul - that shows a 9K ITCM fragment that is accessible for use as RAM, but orphaned.
Question: Is it possible or likely that, after putting FLASH code in ITCM, that area would ever get marked 'No WRITE'?
I wrote code that got that orphan start address and length and did read/write test to it without issues as it stands.
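For anyone wondering where the 9K comes from, here is a small sketch that decodes the Config word from the report and computes the leftover ITCM. It assumes the usual i.MX RT FlexRAM bank encoding (2 bits per 32 KB bank: 0b00 unused, 0b01 OCRAM, 0b10 DTCM, 0b11 ITCM); FlexRamLayout, decodeFlexRam and itcmOrphanBytes are names invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: bytes assigned to each FlexRAM role.
struct FlexRamLayout {
  uint32_t itcmBytes;
  uint32_t dtcmBytes;
  uint32_t ocramBytes;
};

constexpr uint32_t kBankBytes = 32 * 1024;  // each of the 16 banks is 32 KB

// Decode a bank config word, 2 bits per bank (assumed encoding):
// 0b00 = unused, 0b01 = OCRAM, 0b10 = DTCM, 0b11 = ITCM.
inline FlexRamLayout decodeFlexRam(uint32_t config) {
  FlexRamLayout out{0, 0, 0};
  for (int bank = 0; bank < 16; ++bank) {
    switch ((config >> (bank * 2)) & 0x3u) {
      case 1: out.ocramBytes += kBankBytes; break;
      case 2: out.dtcmBytes += kBankBytes; break;
      case 3: out.itcmBytes += kBankBytes; break;
      default: break;  // bank not used
    }
  }
  return out;
}

// Bytes of ITCM left over after the code placed there.
inline uint32_t itcmOrphanBytes(const FlexRamLayout& layout, uint32_t itcmUsed) {
  assert(itcmUsed <= layout.itcmBytes);
  return layout.itcmBytes - itcmUsed;
}
```

With the values above, decodeFlexRam(0xaaaaaaab) gives one ITCM bank (32 KB) and fifteen DTCM banks (480 KB), and itcmOrphanBytes(layout, 23200) gives 9568 bytes, which is the roughly 9K fragment. Because banks are assigned in whole 32 KB units, some leftover like this is hard to avoid unless the code happens to fill the last bank exactly.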
 