malloc/free for EXTMEM and DTCM

It looks like the standard Serial port is now interface 0 of a composite device, whereas the old version generates the usual standalone device. Is this intended?

Yes, planning to switch to composite device for all USB serial.

The CDC class at device level was needed to support Windows XP-SP2, Windows Vista (no service packs) and Macintosh OS-X 10.5 & 10.6.
 
Ok, got it. If this is permanent I'll log an issue at TyCommander. For the time being I can work with the dual serial.
 
It looks like you allocate all of the remaining free space in EXTRAM for the EXTRAM heap. This is a fantastic idea. So far I was only using user-defined heap buffers, but this of course makes it much easier to use.


Here's a quick test which queries the free space, then tries to allocate it all at once. Fails.
The reported free space is the total free space, not the largest free chunk. Don't know if you saw my video here: https://www.youtube.com/watch?v=s3U5QSO7Rd8 It prints out the reported stats on the right side.
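
To make the difference between the two numbers concrete, here is a minimal host-side sketch. It is not smalloc code: the heap is modelled as a plain list of free chunk sizes, and `FreeStats`, `summarize` and `canAllocate` are made-up names for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model of a heap: just the sizes of its free chunks, in bytes.
// (Not part of smalloc; it only shows why the two numbers can differ.)
struct FreeStats {
    std::size_t totalFree;    // sum of all free chunks
    std::size_t largestFree;  // biggest single contiguous chunk
};

FreeStats summarize(const std::vector<std::size_t>& freeChunks) {
    FreeStats s{0, 0};
    for (std::size_t c : freeChunks) {
        s.totalFree += c;
        s.largestFree = std::max(s.largestFree, c);
    }
    return s;
}

// A single allocation can only succeed if one chunk is big enough on its own.
bool canAllocate(const std::vector<std::size_t>& freeChunks, std::size_t request) {
    return summarize(freeChunks).largestFree >= request;
}
```

With free chunks of 100, 300 and 50 bytes the total is 450 bytes, but a single 450-byte request cannot be satisfied. A real allocator also needs room for its per-block bookkeeping, so even on a fresh heap a request for the full reported free space is likely to fail.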
 
BTW: it is still possible to use the full functionality of sm_alloc. E.g. Add an additional Heap on DMAMEM:

Code:
DMAMEM char dmaBuffer[10000];
smalloc_pool dmaHeap;

void setup()
{
  while (!Serial){}

  sm_set_pool(&dmaHeap, dmaBuffer, 10000, false, nullptr);

  char *extMemChunk = (char *)extmem_malloc(100);
  char *dmaMemChunk = (char *)sm_malloc_pool(&dmaHeap,100);

  Serial.printf("extMemPtr: %p dmaMemPtr %p\n", extMemChunk, dmaMemChunk);
}

void loop()
{
}

Which prints:
Code:
extMemPtr: 0x7000000c dmaMemPtr 0x2020000c

Maybe do the same pre allocation for DMAMEM as for EXTMEM to make usage simpler?
 
@luni

Ok think you need to give me another lesson here. Using your printInfo function which boils down to basically:
Code:
    size_t total, totalUser, totalFree;
    int nrBlocks;
    sm_malloc_stats(&total, &totalUser, &totalFree, &nrBlocks);
    Serial.printf(" %u %u %u %u\n",  total, totalUser, totalFree, nrBlocks);
should return bytes for total, totalUser and totalFree?

So if I add it to my example along with some of Paul's example I get:
Code:
{total, totalUser, totalFree, nrBlocks
 1028 1074774016 352 536874412
{pauls example"}
Free Space: 8388552
Total: 48
Initializing values...

Initialized values
1
3
5
7
9

Here is what I am using:
Code:
#include "Streaming.h"
#define cout Serial
#include "smalloc.h"

void setup() {
  while(!Serial)
    ;
  int *ptr;
  ptr = (int*) extmem_malloc(5*sizeof(int));
    size_t total, totalUser, totalFree;
    int nrBlocks;
    sm_malloc_stats(&total, &totalUser, &totalFree, &nrBlocks);
    Serial.printf(" %u %u %u %u\n",  total, totalUser, totalFree, nrBlocks);

    Serial.print("Free Space: ");
  size_t total1 = 0, freespace = 0;
  sm_malloc_stats_pool(&extmem_smalloc_pool, &total1, NULL, &freespace, NULL);
  Serial.println(freespace);
  Serial.print("Total: "); Serial.println(total1);
  if(!ptr)
  {
    cout << "Memory Allocation Failed"  << endl;
    exit(1);
  }
  cout << "Initializing values..." << endl << endl;

  for (int i=0; i<5; i++)
  {
    ptr[i] = i*2+1;
  }
  cout << "Initialized values" << endl;
  for (int i=0; i<5; i++)
  {
    /* ptr[i] and *(ptr+i) can be used interchangeably */
    cout << *(ptr+i) << endl;
  }

  extmem_free(ptr);

}

void loop() {
  // put your main code here, to run repeatedly:

}
 
Maybe do the same pre allocation for DMAMEM as for EXTMEM to make usage simpler?

Regular newlib malloc() already puts its heap on all the unused DMAMEM (the other 512K which isn't ITCM / DTCM).

Yeah, you could create a static array in DMAMEM and use it with smalloc, but what's the point when regular malloc already uses it?
 
Sorry, I meant DTCM of course.

Code:
#include "smalloc.h"

char dtcmBuffer[100000];   // plain global array, so it lands in DTCM
smalloc_pool dtcmHeap;

void setup()
{
  while (!Serial){}

  // hand the whole buffer to the pool (size taken from the array itself)
  sm_set_pool(&dtcmHeap, dtcmBuffer, sizeof(dtcmBuffer), false, nullptr);

  char *extMemChunk = (char *)extmem_malloc(100);
  char *dtcmMemChunk = (char *)sm_malloc_pool(&dtcmHeap,100);

  Serial.printf("extMemPtr: %p dtcmMemPtr %p\n", extMemChunk, dtcmMemChunk);
}

Code:
extMemPtr: 0x7000000c dtcmMemPtr 0x20000d50
 
@luni

Ok think you need to give me another lesson here. Using your printInfo function which boils down to basically:
Code:
    size_t total, totalUser, totalFree;
    int nrBlocks;
    sm_malloc_stats(&total, &totalUser, &totalFree, &nrBlocks);
    Serial.printf(" %u %u %u %u\n",  total, totalUser, totalFree, nrBlocks);
should return bytes for total, totalUser and totalFree?

Paul didn't set up the default pool, so sm_malloc_stats won't give correct values. You need to use the pool version:

Code:
void setup()
{
  while (!Serial);

  int* ptr = (int*)extmem_malloc(5 * sizeof(int));

  size_t total, totalUser, free;
  int blocks;
  sm_malloc_stats_pool(&extmem_smalloc_pool, &total, &totalUser, &free, &blocks);

  Serial.printf(
      "Total used:       %u bytes\n"
      "Total user vars:  %u bytes\n"
      "Free space:       %.4f MB\n"
      "Allocated blocks: %u\n\n",
      total, totalUser, free / 1024 / 1024.0f, blocks);
}

Which prints the correct values:

Code:
Total used:       48 bytes
Total user vars:  20 bytes
Free space:       15.9990 MB
Allocated blocks: 1

Each allocated chunk of memory is contained in one block. The block starts with a 12-byte header, followed by the user data plus some fill bytes and a closing 12-byte tag.

[Attached image: Anmerkung 2020-10-26 210335.jpg]

The fill bytes will fill up the user data to n*12 bytes. The returned pointer points to the beginning of the user data of course. If you want to get a pointer to the block you can use the macro USER_TO_HEADER as shown in #11.
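
The block arithmetic described above can be checked with a few lines of host-side C++. This is only a sketch built from the description in this post (12-byte header, user data padded to a multiple of 12, 12-byte tag); the constant and function names are made up and are not part of smalloc.

```cpp
#include <cstddef>

// Assumed from the description above: 12-byte header, 12-byte closing tag,
// user data padded up to a multiple of 12 bytes. Names are illustrative.
constexpr std::size_t kHeaderSize = 12;
constexpr std::size_t kTagSize = 12;

// Round the requested size up to the next multiple of the header size.
constexpr std::size_t paddedUserSize(std::size_t userBytes) {
    return ((userBytes + kHeaderSize - 1) / kHeaderSize) * kHeaderSize;
}

// Total number of bytes one allocation takes out of the pool.
constexpr std::size_t blockSize(std::size_t userBytes) {
    return kHeaderSize + paddedUserSize(userBytes) + kTagSize;
}

// 5 ints (20 bytes) -> 24 padded + 12 + 12 = 48 bytes in total.
static_assert(blockSize(5 * 4) == 48, "five 4-byte ints occupy one 48-byte block");
```

This matches the output above: 20 bytes of user data are reported as 48 bytes used. It also fits the pointers printed earlier, which end in 0x...0c, i.e. 12 bytes past the start of the pool.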

Hope that answers your question somehow?
 
@luni

Yep that answers my question including my unasked question about what is a chunk - thanks for explanation!
 
Maybe we should have a way to pad allocations to 32-byte cache row boundaries?

From the readme:
Searches are done by shifting a header-wide pointer across the pool.
Allocated block is found by testing each possible header for validity.

So, as it is now, the block size needs to be a multiple of the header size. If I understand the readme correctly, the header has 4 bytes for the block size, 4 bytes for the size of the user memory and 4 bytes for a hash. My gut feeling is that 32-byte blocks will need some major rewriting of the library. Might be interesting of course...
 
Indeed, seems likely we'll eventually have to make substantial changes to smalloc, or replace it with some other memory management scheme. For now (version 1.54) this is probably good enough. Long-term, we probably do need alignment to cache rows to avoid thorny issues when people try to use DMA on their allocated memory.

I've added a comment in smalloc.h to advise against using its API directly from Arduino sketches & libraries.

https://github.com/PaulStoffregen/cores/commit/a5736a3ffdc9b88d0a15aae565e918ce498a7df3

If we do end up changing the underlying memory management, at least there's a warning that smalloc.h may change in future versions.
 
Sounds good. My guess is that DMA from EXTMEM should be 32-byte aligned as well? Note: currently I don't think malloc is 32-byte aligned either.

Which is why in some of our display drivers that do DMA we have things like:
Code:
			_we_allocated_buffer = (uint8_t *)malloc(CBALLOC+32);
			if (_we_allocated_buffer == NULL)
				return 0;	// failed 
			_pfbtft = (RAFB*) (((uintptr_t)_we_allocated_buffer + 32) & ~ ((uintptr_t) (31)));

I was playing around with reading the OV... Camera (640x480, 2 bytes per pixel) and trying to output to an ILI9488 (480x320).

I had the camera buffer allocated using EXTMEM and I allocated the frame buffer using the new memory allocator...
Will probably move the frame buffer back to DMAMEM
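
The over-allocate-and-round-up trick from the display driver snippet can be sketched on its own as follows. `alignUp`, `AlignedBuffer` and `allocAligned32` are hypothetical helpers written for this post, not functions from the Teensy core or any driver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Round an address up to the next multiple of `alignment` (a power of two).
// An address that is already aligned is returned unchanged.
inline std::uintptr_t alignUp(std::uintptr_t addr, std::uintptr_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: keep the raw pointer for free() and hand out a
// 32-byte aligned pointer inside the raw block.
struct AlignedBuffer {
    void* raw;      // pass this one to free()
    void* aligned;  // use this one as the DMA buffer
};

inline AlignedBuffer allocAligned32(std::size_t bytes) {
    void* raw = std::malloc(bytes + 31);  // worst case needs 31 bytes of slack
    if (raw == nullptr) return {nullptr, nullptr};
    auto a = alignUp(reinterpret_cast<std::uintptr_t>(raw), 32);
    return {raw, reinterpret_cast<void*>(a)};
}
```

The driver code adds 32 rather than 31 before masking, so it always skips forward even when malloc happens to return an aligned address. Both variants stay inside the over-allocated block; the version here just wastes a little less.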
 
Re: Loss of Teensy with current cores about 10pm 10/26? was 4 hours old then ...

Sent over code and tyComm could not see them.

Switched IDE to TeensyLoader

Did 15s Restore and the Red LED for Button bootloader no longer appears.

They boot and blink - button stops Orange blink - but no RED led appears

Cannot program T_4.1 or T_4.0

USBView from MSFT and nirsoft USBDevView don't indicate anything Teensy.

Rebooted just 38 hours ago - suppose will do it again to see ...
 
Moved the T_4.0 and 4.1 in turn to cable on USB Port - Both act normally there

But it seems the HUB is Offline????

Unplugged and moved HUB to another port and it seems okay now ???
 
I've had success using tinyalloc before: https://github.com/thi-ng/tinyalloc

Initialised something like this. Chunk sizes etc could be adjusted to suit your application.

Code:
//Init external RAM and memory heap
EXTMEM uint8_t ext_ram[1]; //Just to get EXTMEM pointer
extern uint8_t external_psram_size; //in MB. Set in startup.c

void init_memory()
{
    uint32_t psram_bytes = 1024 * 1024 * external_psram_size;
    ta_init((void *)(ext_ram),               //Base of heap
            (void *)(ext_ram + psram_bytes), //End of heap
            psram_bytes / 32768,             //Number of memory chunks (32k/per chunk)
            16,                              //Smaller chunks than this won't split
            32);                             //Alignment in bytes (32)
}

This allows use of ta_alloc, ta_free, ta_calloc etc. directly into external RAM.
 
IMO, the memory allocation details should be abstracted away for the typical use case. Ie, a single mallocX() call where you optionally pass it a hint as to the speed desired (or other requirements like DMA alignment). If the requested speed section is full, it automatically provides the next fastest available memory.

Not sure I followed - memory allocated for C++ classes now comes from DMAMEM but would be significantly faster if it came from DTCM? Perhaps this could be done automatically - the first x bytes are allocated from DTCM and after that it comes from DMAMEM.
 
I partially agree. Maybe a new API could be created, something like heapAlloc, which could take optional flags saying which heap to prefer...

However, I don't see us replacing the malloc/free API signature, as that may be too risky for existing sketches.

I would also very much like some ability to allocate memory out of DTCM. How much? Would depend on how much of the memory is already used, and probably some way to reserve enough space for Stack. Hopefully with some way for the sketch to set that limit...
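
To make the idea of a preference hint with automatic fallback a bit more tangible, here is a rough host-side sketch. Everything in it is hypothetical: `Arena` is a toy bump allocator standing in for a real heap, and `heapAlloc` and `Speed` are invented names, not an existing Teensy API. Alignment and freeing are deliberately left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// A deliberately tiny bump allocator standing in for one heap
// (DTCM, RAM2, PSRAM ...). It never frees; it only shows the fallback logic.
class Arena {
public:
    Arena(std::uint8_t* base, std::size_t size) : base_(base), size_(size) {}

    void* alloc(std::size_t bytes) {
        if (bytes > size_ - used_) return nullptr;  // not enough room left
        void* p = base_ + used_;
        used_ += bytes;
        return p;
    }

    // True if p points into this arena's buffer.
    bool owns(const void* p) const {
        auto a = reinterpret_cast<std::uintptr_t>(p);
        auto b = reinterpret_cast<std::uintptr_t>(base_);
        return a >= b && a - b < size_;
    }

private:
    std::uint8_t* base_;
    std::size_t size_;
    std::size_t used_ = 0;
};

// Index into the tier list, fastest memory first.
enum class Speed { Fast = 0, Medium = 1, Slow = 2 };

// Hypothetical heapAlloc(): start at the preferred tier, then fall back to
// each slower tier until one has room. Returns nullptr if all are full.
template <std::size_t N>
void* heapAlloc(std::array<Arena*, N>& tiers, std::size_t bytes,
                Speed hint = Speed::Fast) {
    for (std::size_t i = static_cast<std::size_t>(hint); i < N; ++i) {
        if (void* p = tiers[i]->alloc(bytes)) return p;
    }
    return nullptr;
}
```

Existing sketches would keep calling malloc/free unchanged; only code that opts in to the new call would see the tiers.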
 
Not sure I followed - memory allocated for C++ classes now comes from DMAMEM but would be significantly faster if it came from DTCM?

Yes and no.

Yes, C++ classes, if created by "new", are indeed allocated in the OCRAM accessed by the slower AXI bus.

But usually it isn't "significantly faster". Typically there is little if any speed difference, thanks to the 32K L1 caches. While the AXI bus does use a slower clock, it's still a 64 bit wide bus with many advanced features. Unlike the PSRAM chip, it's no slouch on cache misses.
 
KurtE said:
I would also very much like some ability to allocate memory out of DTCM. How much? Would depend on how much of the memory is already used, and probably some way to reserve enough space for Stack. Hopefully with some way for the sketch to set that limit...

jonr said:
Not sure I followed - memory allocated for C++ classes now comes from DMAMEM but would be significantly faster if it came from DTCM? Perhaps this could be done automatically - the first x bytes are allocated from DTCM and after that it comes from DMAMEM.

Actually, smalloc is able to use more than one memory pool at the same time. Here is an example of how to set up an additional pool on DTCM. This example also shows how to use placement new to construct C++ objects in this pool. It is a bit clumsy but, AFAIK, there is no other possibility in C++ without redefining 'new'.

Code:
#include "smalloc.h"
#include <new>

uint8_t dtcmBuffer[100*1024]; // Generate a 100kB memory pool on DTCM
smalloc_pool dtcmPool;

IntervalTimer* timer;         // to be constructed on DTCM

void setup()
{
    while (!Serial){}

    sm_set_pool(&dtcmPool, dtcmBuffer, 100*1024, true, nullptr);    // initialize pool, zero allocated memory, no out of memory callback

    uint32_t* u1 = (uint32_t*)sm_malloc_pool(&dtcmPool,sizeof(uint32_t)); // one uint32_t on DTCM
    char* text   = (char*)sm_malloc_pool(&dtcmPool,100);                  // c-string, 100 bytes on DTCM

    *u1 = 100;
    text = strcpy(text, "Hello World");

    // c++ objects on dtcm heap:
    void* mem = sm_malloc_pool(&dtcmPool, sizeof(IntervalTimer));  // allocate memory for the timer object
    timer = new(mem) IntervalTimer();                              // placement new to construct object in allocated memory chunk
    timer->begin([] { Serial.println(millis()); }, 200'000);       // setup timer to print millis() every 200ms
    
    Serial.printf("var: u1   addr->%p content->%u\n", u1, *u1);
    Serial.printf("var: text addr->%p content->%s\n", text, text);
}

void loop(){
}

which prints:
Code:
var: u1   addr->0x20000d90 content->100
var: text addr->0x20000db4 content->Hello World
587
787
987
1187
...
 
Thanks @luni -

Yes, I know that you can have more than one memory pool. The interesting question is whether there is a top-level API, or class, or... that handles, for example, three or four memory pools out of PSRAM (really slow), DMAMEM (sort of slow), DTCM (fast) and ITCM (fast and desperate). Once those are set up, I could choose which pool to use depending on some criteria passed in, maybe knowing how much space is free in the different heaps. Also, one where you can hopefully just pass your pointer to something like heapFree() and it will know which pool it belongs to...
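
One way a heapFree() could find out where a pointer came from is to look at its address, since each memory sits in its own address range. The sketch below is only an illustration: the names are invented, and the base addresses and sizes are assumptions based on the pointers printed earlier in this thread (0x2000xxxx for DTCM, 0x2020xxxx for DMAMEM/RAM2, 0x7000xxxx for EXTMEM) and on the Teensy 4.1 memory sizes.

```cpp
#include <cstdint>

enum class Region { DTCM, RAM2, PSRAM, Unknown };

struct Range {
    std::uint32_t base;
    std::uint32_t size;
    Region region;
};

// Assumed address ranges for Teensy 4.1 (illustrative, see text above).
constexpr Range kRanges[] = {
    {0x20000000u, 512u * 1024u, Region::DTCM},          // RAM1, data side
    {0x20200000u, 512u * 1024u, Region::RAM2},          // DMAMEM / malloc heap
    {0x70000000u, 16u * 1024u * 1024u, Region::PSRAM},  // EXTMEM, if fitted
};

// Classify an address by the range it falls into.
constexpr Region regionOf(std::uint32_t addr) {
    for (const Range& r : kRanges) {
        if (addr >= r.base && addr - r.base < r.size) return r.region;
    }
    return Region::Unknown;
}
```

A hypothetical heapFree() could then dispatch on the result, for example to extmem_free() for PSRAM, free() for RAM2 and the pool-specific smalloc free call for a DTCM pool.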

I also wonder about if there should be a more automatic way to get the DTCM heap than:
Code:
uint8_t dtcmBuffer[100*1024]; // Generate a 100kB memory pool on DTCM
smalloc_pool dtcmPool;

That may work in one sketch but not another, and it also masks how much space the sketch is actually using. For example, if you run a sketch on a T3.6, the heap starts just after all defined variables and grows up toward the stack. When you build the sketch on T3.x, the linker output gives you a hint of just how much data space you have defined within your sketch. But with these big arrays, which may or may not be used up, you don't see that information.

So a question is, should there be a DTCM heap setup on T4.x which does similar? That is, its lowest address is just after the end of all variables, and it grows up to some maximum. Maybe by default have DTCM_HEAP_END or the like defined at the high stack pointer minus some reserved stack space....
 
Maybe this is a good moment to mention I've been continuing to work on the Teensy 4.1 page. Quite a lot was (hastily) written on the Teensy 4.0 page about memory. I've been converting it to the new format, and adding EXTMEM stuff, including an updated diagram with PSRAM and LittleFS in the flash.

https://www.pjrc.com/store/teensy41.html#memory

Please take a look and let me know if I've missed anything important?

I'm planning to add another page with examples and discussion of the performance for each memory. The Teensy 4.0 page has some of that performance discussion, which I've intentionally left off the 4.1 page (and will disappear from the 4.0 page soon) because it's meant to go to a dedicated page and get rewritten with actual benchmarks.


The interesting question is, is there a top level api, or class or...

No API or C++ class exists in the core library for heap on DTCM or unused ITCM.


So a question is, should there be a DTCM heap setup on T4.x which does similar?

Maybe. The "should" part is a difficult question.

On the plus side, it makes more memory available and maybe allows certain use cases to achieve better performance. But the performance part is rather questionable, since the AXI bus is still quite fast and M7's 32K cache probably closes almost all the performance gap for common use cases. The main downside is adding even more complexity to an already pretty complicated memory system, on a chip loaded with a tremendous amount of advanced but complex features.

My gut feeling is DTCM heap belongs in a library which users can install if they want an easy way to get a 3rd heap. A library could have its own readme on github or a dedicated web page to explain how to make use of another heap. Ideally such a library would have several examples, hopefully some with benchmarks demonstrating the cases where DTCM heap offers a practical performance improvement.

As a library with a github page, I could add a brief mention and link. Hopefully that would give a good balance between offering people maximum capability without requiring them to digest even more complexity.
 
@Paul - Good morning,

T4.1 product page - Memory - Lots of new good stuff!
Should we put comments here or in the needs updates... thread?

Maybe some of this would go on to the new page about memory you mention.

Things like:

Picture showing PSRAM: instead of 8192-16384K, maybe it should say 0 or 8192 or 16384, to stress that by default, unless the user (or reseller) has added a chip or two on the bottom, this area will not exist... Sorry for my bad wording.

Picture shows extmem_malloc, what happens if I call this and I don't have any PSRAM? What happens to my EXTMEM variables if I don't have any EXTMEM?
Can EXTMEM variables be initialized? (still mentioned as todo in startup.c)

RAM1 (DTCM/ITCM) - Not sure again how to mention, but ITCM grows in 32KB chunks, so if code is 32KB+1b it takes 64KB...
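
The 32KB granularity is easy to show with a little arithmetic. This is a host-side sketch with made-up names; the 512K total for RAM1 is an assumption taken from the Teensy 4.x memory layout, not from anything in the core's source.

```cpp
#include <cstdint>

constexpr std::uint32_t kItcmGranule = 32u * 1024u;  // ITCM grows in 32 KB steps
constexpr std::uint32_t kRam1Size = 512u * 1024u;    // assumed RAM1 total (ITCM + DTCM)

// ITCM actually reserved for a given amount of FASTRUN code.
constexpr std::uint32_t itcmReserved(std::uint32_t codeBytes) {
    return ((codeBytes + kItcmGranule - 1) / kItcmGranule) * kItcmGranule;
}

// What is left of RAM1 for DTCM (variables and stack).
constexpr std::uint32_t dtcmAvailable(std::uint32_t codeBytes) {
    return kRam1Size - itcmReserved(codeBytes);
}

// One byte over 32 KB of code costs a full second 32 KB block.
static_assert(itcmReserved(32u * 1024u + 1u) == 64u * 1024u, "rounds up to 64 KB");
```

So a sketch with 32KB + 1 byte of code in ITCM leaves 448KB for DTCM rather than nearly 480KB.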

RAM2 is optimized for access by DMA? - Sorry I never really understood this. Maybe might need to expand on this. For example if you use this area of memory for DMA, you then need to know about the memory caching and probably to explain the need for calls like:
Code:
arm_dcache_flush((uint8_t *)buf, count);
arm_dcache_delete(retbuf, count);
For example, when should you call these? Start of operation, end of operation? Before/after each Read or Write? Again maybe there is a need for a page on DMA?
(Side note: as DMA is mentioned several times on the page, you might want to mention what DMA is. Yes, you say Direct Memory Access, but it might help users if it said something like: DMA is used by some subsystems to do input or output operations without tying up the main processor. Again, sorry for the bad wording.)

Side Note: I am not sure if DMASettings can work out of DMAMEM area. Especially when they are chained to each other... At least it did not work when I originally tried it.
Ran into this, for example, when the ST7735_t3 code was set up with the uncannyEyes example; the DMA operations were crashing when the sketch code was
doing: mytft = new ST7735_t3(....);
I was able to get it to work by making the DMASettings static members, where I allocated enough of them for at least one DMA display per SPI bus. Which is not ideal.

----

As for malloc, versus ext_malloc and maybe dtcm_malloc()... Again it is nice to have this ability. At times it would be nice to have a unified allocate/free setup, but...

Again side note: Earlier when I was playing with an ESP32, I ported over part of my ILI9341_t3n code to ESP32. And when I tried to allocate a frame buffer it would fail, but if I tried allocating it in two parts, it would succeed. Wonder if their memory is split up as well? I will try it again soon, as I ordered a Sparkfun MicroMod... To prepare for maybe a Teensy version at some point ;)
 
Picture showing PSRAM: instead of 8192-16384K, maybe it should say 0 or 8192 or 16384, to stress that by default, unless the user (or reseller) has added a chip or two on the bottom, this area will not exist...

Done.


Picture shows extmem_malloc, what happens if I call this and I don't have any PSRAM? What happens to my EXTMEM variables if I don't have any EXTMEM?

In the text, under Dynamic Allocation -> External Heap it says "When no PSRAM is present, extmem_malloc() automatically allocated memory from the normal heap in RAM2."


Can EXTMEM variables be initialized?

Nope. I've added the words "These variables can not be initialized, your program must write their initial values, if needed."


RAM1 (DTCM/ITCM) - Not sure again how to mention, but ITCM grows in 32KB chunks, so if code is 32KB+1b it takes 64KB...

Other than the "FASTRUN Unused" in the picture, this is one of many details I'm considering to be too small to list on this top-level page.




RAM2 is optimized for access by DMA? - Sorry I never really understood this. Maybe might need to expand on this.

I've added "Normally large arrays & data buffers are placed in RAM2, to save the ultra-fast RAM1 for normal variables."


For example if you use this area of memory for DMA, you then need to know about the memory caching and probably to explain the need for calls like:
....
Again maybe there is a need for a page on DMA?

Yes, a page specifically about DMA is needed for those sorts of details.

This top-level page has 4 main goals. In order of importance:

1: Show Teensy 4.1's many capabilities. Moreso than any other page, this is the sales pitch.

2: Provide links to the detailed interior pages (at least the ones which exist so far). We've all seen this come up over and over on this forum, where someone has a question about something like serial port capability which is answered on the serial page, but they didn't ever find that page. The 2nd highest priority is not to document everything here, but to mention it in a way people can find and discover the links to pages with the detailed info.

3: Answer some of the most common questions (before they're even questions) by highlighting certain features.

4: Reference material. While this is the least of 4 goals, quite a bit of reference material is going on the page. I've been trying to keep most of it in the last "Technical Information" section and mostly include things which are images rather than lots of text to read.

Detailed info about DMA and cache management is so far beyond the scope of this top level page. It really needs a dedicated page inside the site.


As for malloc, versus ext_malloc and maybe dtcm_malloc()... Again it is nice to have this ability. At times it would be nice to have a unified allocate/free setup, but...

Right now I'm focusing on documentation. So yeah, I probably should have asked on the other thread about website updates.

Indeed a unified malloc() which automatically manages all 3 memories would be pretty awesome. So would DMA tutorials, serial NAND bad block management, seamless transition between audio play objects, massive multi-channel audio in/out, USB video & webcam support, releasing the bootloader chips, supporting encrypted & authenticated code, WebUSB, better API for USB host detection of USB device connect & disconnect, a high-performance alternative to LittleFS, the installer detecting & warning for library override conflicts, a non-blank back side of the Teensy 4.1 card, and a ton of other stuff.

My general plan is to leave malloc() and extmem_malloc() as they are, so I can focus on other stuff. There are only so many hours in every day (and sadly, a lot less for me until we can rehire after the pandemic social distancing requirements ease up). Messing with malloc again just isn't on my dev time priority list.

But if you or Luni or anyone else writes a good library for dynamic allocation of DTCM / ITCM, I'll be happy to give it a brief mention and link from the Teensy 4.0 & 4.1 pages.
 
When memory questions arise seeing the map is helpful.

Adding this ",--print-memory-usage" to any of the Teensy boards.txt ( or boards.local.txt ) gives a more detailed look at memory use like:

Code:
teensy41.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax "-T{build.core.path}/imxrt1062_t41.ld"
teensyMM.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax "-T{build.core.path}/imxrt1062_mm.ld"
teensy40.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax "-T{build.core.path}/imxrt1062.ld"
teensy36.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax,--defsym=__rtc_localtime={extra.time.local} "-T{build.core.path}/mk66fx1m0.ld"
teensy35.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax,--defsym=__rtc_localtime={extra.time.local} "-T{build.core.path}/mk64fx512.ld"
teensy31.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax,--defsym=__rtc_localtime={extra.time.local} "-T{build.core.path}/mk20dx256.ld"
teensy30.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax,--defsym=__rtc_localtime={extra.time.local} "-T{build.core.path}/mk20dx128.ld"
teensyLC.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax,--defsym=__rtc_localtime={extra.time.local} "-T{build.core.path}/mkl26z64.ld"


Not sure if that is an easy enough addition?
 