Ok, got it. If this is permanent I'll log an issue at TyCommander. For the time being I can work with the dual serial.
Looks like you allocate all of the remaining free space in EXTRAM for the EXTRAM heap. This is a fantastic idea. I was only using user-defined heap buffers so far, but this makes it much easier to use, of course.
The reported free space is the total free space, not the largest free chunk. Don't know if you saw my video here https://www.youtube.com/watch?v=s3U5QSO7Rd8 It prints out the reported stats on the right side. Here's a quick test which queries the free space, then tries to allocate it all at once. Fails.
BTW: it is still possible to use the full functionality of sm_alloc. E.g. Add an additional Heap on DMAMEM:
Code:
DMAMEM char dmaBuffer[10000];
smalloc_pool dmaHeap;

void setup()
{
  while (!Serial){}
  sm_set_pool(&dmaHeap, dmaBuffer, 10000, false, nullptr);

  char *extMemChunk = (char *)extmem_malloc(100);
  char *dmaMemChunk = (char *)sm_malloc_pool(&dmaHeap,100);
  Serial.printf("extMemPtr: %p dmaMemPtr %p\n", extMemChunk, dmaMemChunk);
}

void loop()
{
}
Which prints:
Code:
extMemPtr: 0x7000000c dmaMemPtr 0x2020000c
Maybe do the same pre-allocation for DMAMEM as for EXTMEM to make usage simpler?
@luni
OK, I think you need to give me another lesson here. Using your printInfo function, which boils down to basically:
Code:
size_t total, totalUser, totalFree;
int nrBlocks;
sm_malloc_stats(&total, &totalUser, &totalFree, &nrBlocks);
Serial.printf(" %u %u %u %u\n", total, totalUser, totalFree, nrBlocks);
should return bytes for total, totalUser and totalFree?
So if I add it to my example along with some of Paul's example I get:
Code:
{total, totalUser, totalFree, nrBlocks
 1028 1074774016 352 536874412
{pauls example"}
Free Space: 8388552
Total: 48
Initializing values...

Initialized values
1
3
5
7
9
Here is what I am using:
Code:
#include "Streaming.h"
#define cout Serial
#include "smalloc.h"

void setup() {
  while(!Serial) ;
  int *ptr;
  ptr = (int*) extmem_malloc(5*sizeof(int));

  size_t total, totalUser, totalFree;
  int nrBlocks;
  sm_malloc_stats(&total, &totalUser, &totalFree, &nrBlocks);
  Serial.printf(" %u %u %u %u\n", total, totalUser, totalFree, nrBlocks);

  Serial.print("Free Space: ");
  size_t total1 = 0, freespace = 0;
  sm_malloc_stats_pool(&extmem_smalloc_pool, &total1, NULL, &freespace, NULL);
  Serial.println(freespace);
  Serial.print("Total: ");
  Serial.println(total1); // pool total; the output above (48) matches this, not 'total'

  if(!ptr)
  {
    cout << "Memory Allocation Failed" << endl;
    exit(1);
  }
  cout << "Initializing values..." << endl << endl;
  for (int i=0; i<5; i++)
  {
    ptr[i] = i*2+1;
  }
  cout << "Initialized values" << endl;
  for (int i=0; i<5; i++)
  {
    /* ptr[i] and *(ptr+i) can be used interchangeably */
    cout << *(ptr+i) << endl;
  }
  extmem_free(ptr);
}

void loop() {
  // put your main code here, to run repeatedly:
}
Sorry, I meant DTCM of course
Code:
#include "smalloc.h"

char dtcmBuffer[100000];
smalloc_pool dmaHeap;

void setup()
{
  while (!Serial){}
  sm_set_pool(&dmaHeap, dtcmBuffer, 10000, false, nullptr);

  char *extMemChunk = (char *)extmem_malloc(100);
  char *dtcmMemChunk = (char *)sm_malloc_pool(&dmaHeap,100);
  Serial.printf("extMemPtr: %p dtcmMemPtr %p\n", extMemChunk, dtcmMemChunk);
}
Code:
extMemPtr: 0x7000000c dtcmMemPtr 0x20000d50
Paul didn't set up the default pool, so sm_malloc_stats won't give correct values. You need to use the pool version:
Code:
void setup()
{
  while (!Serial);

  int* ptr = (int*)extmem_malloc(5 * sizeof(int));

  size_t total, totalUser, free;
  int blocks;
  sm_malloc_stats_pool(&extmem_smalloc_pool, &total, &totalUser, &free, &blocks);

  Serial.printf(
    "Total used: %u bytes\n"
    "Total user vars: %u bytes\n"
    "Free space: %.4f MB\n"
    "Allocated blocks: %u\n\n",
    total, totalUser, free / 1024 / 1024.0f, blocks);
}
Which prints the correct values:
Code:
Total used: 48 bytes
Total user vars: 20 bytes
Free space: 15.9990 MB
Allocated blocks: 1
Each allocated chunk of memory is contained in one block. The block starts with a 12-byte header, followed by the user data + some fill bytes and a closing 12-byte tag.
The fill bytes will fill up the user data to n*12 bytes. The returned pointer points to the beginning of the user data of course. If you want to get a pointer to the block you can use the macro USER_TO_HEADER as shown in #11.
Hope that answers your question somehow?
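The block layout described above can be turned into a small size calculation. This is only a sketch based on the numbers given in this thread (12-byte header, user data padded to a multiple of 12, 12-byte closing tag); the names `paddedUserBytes`, `blockBytes` and `nextUserAddress` are made up for illustration and are not part of smalloc. `nextUserAddress` additionally assumes that consecutive allocations are packed back to back, which the addresses printed in this thread suggest but which is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sizes as described in the thread; not read from smalloc itself.
constexpr std::size_t kHeaderBytes = 12;  // header in front of the user data
constexpr std::size_t kTagBytes = 12;     // closing tag behind the user data

// Round the requested size up to the next multiple of the header size.
constexpr std::size_t paddedUserBytes(std::size_t userBytes) {
    return (userBytes + kHeaderBytes - 1) / kHeaderBytes * kHeaderBytes;
}

// Total number of pool bytes one allocation occupies.
constexpr std::size_t blockBytes(std::size_t userBytes) {
    return kHeaderBytes + paddedUserBytes(userBytes) + kTagBytes;
}

// Hypothetical helper: user address of the following allocation,
// ASSUMING blocks are placed directly one after another.
inline std::uintptr_t nextUserAddress(std::uintptr_t userAddr, std::size_t userBytes) {
    assert(userAddr >= kHeaderBytes);  // a user pointer always sits behind a header
    return userAddr + blockBytes(userBytes);
}

// 5 ints of 4 bytes each: 12 + 24 + 12 = 48, the "Total used" value printed above.
static_assert(blockBytes(20) == 48, "matches the stats output in the thread");
```

So a 20-byte request costs 48 bytes of pool space, and a single `uint32_t` costs 36 bytes, which is worth keeping in mind when allocating many small objects.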
Maybe we should have a way to pad allocations to 32-byte cache row boundaries?
@luni
Yep that answers my question including my unasked question about what is a chunk - thanks for explanation!
From the readme:
Quote: Searches are done by shifting a header-wide pointer across the pool. Allocated block is found by testing each possible header for validity.
So, as it is now, the block size needs to be a multiple of the header size. If I understand the readme correctly, the header has 4 bytes for the block size, 4 bytes for the size of the user memory and 4 bytes for a hash -> My gut feeling is that 32-byte blocks will need some major rewriting of the library. Might be interesting of course...
Indeed, seems likely we'll eventually have to make substantial changes to smalloc, or replace it with some other memory management scheme. For now (version 1.54) this is probably good enough. Long-term, we probably do need alignment to cache rows to avoid thorny issues when people try to use DMA on their allocated memory.
I've added a comment in smalloc.h to advise against using its API directly from Arduino sketches & libraries.
https://github.com/PaulStoffregen/co...e918ce498a7df3
If we do end up changing the underlying memory management, at least there's a warning that smalloc.h may change in future versions.
Sounds good. My guess is that DMA from EXTMEM should be 32-byte aligned as well? Note: Currently I don't think malloc is 32-byte aligned either.
Which is why in some of our display drivers that do DMA we have things like:
Code:
_we_allocated_buffer = (uint8_t *)malloc(CBALLOC+32);
if (_we_allocated_buffer == NULL)
  return 0; // failed
_pfbtft = (RAFB*) (((uintptr_t)_we_allocated_buffer + 32) & ~ ((uintptr_t) (31)));
I was playing around with reading the OV... Camera (640x480 2 bytes) and trying to output to ILI9488 (480x320);
I had the Camera buffer allocated using EXTMEM and I allocated the frame buffer using the new memory allocator ...
Will probably move the frame buffer back to DMAMEM
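The over-allocate-and-round trick from the display drivers can be wrapped in a small helper. The following is a minimal sketch in portable C++, not code from any Teensy library: `AlignedBuffer`, `alignUp`, `allocAligned` and `freeAligned` are hypothetical names. It also differs slightly from the driver snippet: it rounds up with `+ 31`, so a pointer that is already aligned is left where it is, while the driver version adds a full 32 and therefore always moves forward.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::uintptr_t kCacheLine = 32;  // cache row size mentioned in the thread

// Keeps the raw malloc() pointer (needed for free) next to the aligned one.
struct AlignedBuffer {
    void* raw = nullptr;           // what malloc() returned
    std::uint8_t* data = nullptr;  // 32-byte aligned pointer inside 'raw'
};

// Round an address up to the next multiple of 32 (no change if already aligned).
inline std::uintptr_t alignUp(std::uintptr_t addr) {
    return (addr + (kCacheLine - 1)) & ~(kCacheLine - 1);
}

// Allocate 'bytes' usable bytes starting on a 32-byte boundary.
inline AlignedBuffer allocAligned(std::size_t bytes) {
    AlignedBuffer buf;
    buf.raw = std::malloc(bytes + kCacheLine);  // spare room for the shift
    if (buf.raw != nullptr) {
        const auto rawAddr = reinterpret_cast<std::uintptr_t>(buf.raw);
        buf.data = reinterpret_cast<std::uint8_t*>(alignUp(rawAddr));
        // The aligned region must still fit inside the raw allocation.
        assert(alignUp(rawAddr) + bytes <= rawAddr + bytes + kCacheLine);
    }
    return buf;
}

// Always free the raw pointer, never the aligned one.
inline void freeAligned(AlignedBuffer& buf) {
    std::free(buf.raw);
    buf.raw = nullptr;
    buf.data = nullptr;
}
```

Usage would be along the lines of `AlignedBuffer fb = allocAligned(CBALLOC);`, using `fb.data` for the transfer and calling `freeAligned(fb)` afterwards. For cache maintenance it may also help to round the length up to a multiple of 32 so the buffer does not share a cache row with other data.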
Re: Loss of Teensy with current cores about 10pm 10/26? was 4 hours old then ...
Sent over code and tyComm could not see them.
Switched IDE to TeensyLoader
Did 15s Restore and the Red LED for Button bootloader no longer appears.
They boot and blink - button stops Orange blink - but no RED led appears
Cannot program T_4.1 or T_4.0
USBView from MSFT and nirsoft USBDevView don't indicate anything Teensy.
Rebooted just 38 hours ago - suppose will do it again to see ...
Moved the T_4.0 and 4.1 in turn to cable on USB Port - Both act normally there
But it seems the HUB is Offline????
Unplugged and moved HUB to another port and it seems okay now ???
I've had success using tinyalloc before https://github.com/thi-ng/tinyalloc
Initialised something like this. Chunk sizes etc could be adjusted to suit your application.
Code:
//Init external RAM and memory heap
EXTMEM uint8_t ext_ram[1]; //Just to get EXTMEM pointer
extern uint8_t external_psram_size; //in MB. Set in startup.c

void init_memory()
{
  uint32_t psram_bytes = 1024 * 1024 * external_psram_size;
  ta_init((void *)(ext_ram),               //Base of heap
          (void *)(ext_ram + psram_bytes), //End of heap
          psram_bytes / 32768,             //Number of memory chunks (32k/per chunk)
          16,                              //Smaller chunks than this won't split
          32);                             //32 word size alignment
}
This allows use of ta_alloc, ta_free, ta_calloc etc. directly into external RAM.
IMO, the memory allocation details should be abstracted away for the typical use case. Ie, a single mallocX() call where you optionally pass it a hint as to the speed desired (or other requirements like DMA alignment). If the requested speed section is full, it automatically provides the next fastest available memory.
Not sure I followed - memory allocated for C++ classes now comes from DMAMEM but would be significantly faster if it came from DTCM? Perhaps this could be done automatically - the first x bytes are allocated from DTCM and after that it comes from DMAMEM.
I partially agree. That is, maybe a new API could be created, like heapAlloc or similar, which could take optional flags saying which heap to give preference to...
However I don't see us replacing malloc/free api signature as maybe too risky to existing sketches.
I would also very much like some ability to allocate memory out of DTCM. How much? Would depend on how much of the memory is already used, and probably some way to reserve enough space for Stack. Hopefully with some way for the sketch to set that limit...
Yes and no.
Yes, C++ classes, if created by "new", are indeed allocated in the OCRAM accessed by the slower AXI bus.
But usually the speed difference isn't "significantly faster". Typically there is little if any speed difference, thanks to the 32K L1 caches. While the AXI bus does use a slower clock, it's still a 64 bit wide bus with many advanced features. It's no slouch (like the PSRAM chip) for cache misses.
Actually, smalloc is able to use more than one memory pool at the same time. Here is an example of how to set up an additional pool on DTCM. This example also shows how to use placement new to construct C++ objects in this pool. It is a bit clumsy but, AFAIK, there is no other possibility in C++ without redefining 'new'.
Code:
#include "smalloc.h"
#include <new>

uint8_t dtcmBuffer[100*1024]; // Generate a 100kB memory pool on DTCM
smalloc_pool dtcmPool;

IntervalTimer* timer; // to be constructed on DTCM

void setup()
{
  while (!Serial){}
  sm_set_pool(&dtcmPool, dtcmBuffer, 100*1024, true, nullptr); // initialize pool, zero allocated memory, no out of memory callback

  uint32_t* u1 = (uint32_t*)sm_malloc_pool(&dtcmPool,sizeof(uint32_t)); // one uint32_t on DTCM
  char* text = (char*)sm_malloc_pool(&dtcmPool,100);                    // c-string, 100 bytes on DTCM
  *u1 = 100;
  text = strcpy(text, "Hello World");

  // c++ objects on dtcm heap:
  void* mem = sm_malloc_pool(&dtcmPool, sizeof(IntervalTimer)); // allocate memory for the timer object
  timer = new(mem) IntervalTimer();                             // placement new to construct object in allocated memory chunk
  timer->begin([] { Serial.println(millis()); }, 200'000);      // setup timer to print millis() every 200ms

  Serial.printf("var: u1 addr->%p content->%u\n", u1, *u1);
  Serial.printf("var: text addr->%p content->%s\n", text, text);
}

void loop(){
}
which prints:
Code:
var: u1 addr->0x20000d90 content->100
var: text addr->0x20000db4 content->Hello World
587
787
987
1187
...
Last edited by luni; 01-01-2021 at 08:06 AM.
Thanks @luni -
Yes, I know that you can have more than one memory pool. The interesting question is whether there is a top-level API, or class, or... for the case where there are, for example, three or four memory pools: PSRAM (really slow), DMAMEM (sort of slow), DTCM (fast), ITCM (fast and desperate). Something that can be set up so that I can choose which of these pools to use depending on some criteria passed in, and maybe knowing how much space is free in the different heaps? Also one where you can hopefully just pass your pointer to something like heapFree() and it will know which pool that pointer belongs to...
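To make the idea of one front end over several pools a bit more concrete, here is a rough sketch in portable C++. Everything in it is hypothetical: `MemHint`, `Pool`, `HeapSet`, `heapAlloc` and `heapFree` are illustration names, not an existing Teensy or smalloc API, and the pools are toy bump allocators standing in for real ones (for example smalloc pools). It shows two things discussed in this thread: falling back to the next slower pool when the preferred one is full, and finding the owning pool of a pointer by its address range.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical speed classes, fastest first (names follow the thread).
enum class MemHint : std::size_t { DTCM = 0, DMAMEM = 1, PSRAM = 2 };

// Toy bump allocator; a real version would wrap an actual heap implementation.
struct Pool {
    std::uint8_t* base = nullptr;
    std::size_t size = 0;
    std::size_t used = 0;
    std::size_t live = 0;  // allocations handed out and not yet freed

    void* alloc(std::size_t bytes) {
        if (bytes == 0 || bytes > size - used) return nullptr;
        void* p = base + used;
        used += bytes;
        ++live;
        return p;
    }
    bool owns(const void* p) const {
        const auto a = reinterpret_cast<std::uintptr_t>(p);
        const auto b = reinterpret_cast<std::uintptr_t>(base);
        return base != nullptr && a >= b && a < b + size;
    }
};

class HeapSet {
public:
    // Register the memory region backing one speed class.
    void setPool(MemHint which, void* buffer, std::size_t bytes) {
        Pool& p = pools_[static_cast<std::size_t>(which)];
        p = Pool{};
        p.base = static_cast<std::uint8_t*>(buffer);
        p.size = bytes;
    }
    // Try the hinted pool first, then each slower pool in turn.
    void* heapAlloc(std::size_t bytes, MemHint hint = MemHint::DTCM) {
        for (std::size_t i = static_cast<std::size_t>(hint); i < pools_.size(); ++i) {
            if (void* p = pools_[i].alloc(bytes)) return p;
        }
        return nullptr;
    }
    // Look up the owning pool by address; returns false for foreign pointers.
    bool heapFree(void* ptr) {
        for (Pool& p : pools_) {
            if (p.owns(ptr)) {
                assert(p.live > 0);
                if (--p.live == 0) p.used = 0;  // toy policy: reset once the pool is empty
                return true;
            }
        }
        return false;
    }
    std::size_t freeBytes(MemHint which) const {
        const Pool& p = pools_[static_cast<std::size_t>(which)];
        return p.size - p.used;
    }

private:
    std::array<Pool, 3> pools_{};
};
```

A sketch would register its buffers once with `setPool()` and then call `heapAlloc(n, MemHint::DTCM)` and `heapFree(p)` without caring where the memory ended up. The address-range lookup only works because the pools do not overlap, which holds for the separate memory regions named above.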
I also wonder about if there should be a more automatic way to get the DTCM heap than:
Code:
uint8_t dtcmBuffer[100*1024]; // Generate a 100kB memory pool on DTCM
smalloc_pool dtcmPool;
That may work in one sketch but not another, and it also masks how much space the sketch is actually using. For example, if you run a sketch on a T3.6, the heap starts just after all defined variables and grows up toward the stack. When you build the sketch on T3.x, the linker output will give you a hint of just how much data space you have defined within your sketch. But with these big defines that may or may not be used up, you don't see that information.
So a question is, should there be a DTCM heap setup on T4.x which does similar? That is, its lowest address would be at the point just after the end of all variables, and it would grow up to some maximum. Maybe by default have DTCM_HEAP_END or the like defined at the high stack pointer minus some reserved stack space....
Maybe this is a good moment to mention I've been continuing to work on the Teensy 4.1 page. Quite a lot was (hastily) written on the Teensy 4.0 page about memory. I've been converting it to the new format, and adding EXTMEM stuff, including an updated diagram with PSRAM and LittleFS in the flash.
https://www.pjrc.com/store/teensy41.html#memory
Please take a look and let me know if I've missed anything important?
I'm planning to add another page with examples and discussion of the performance for each memory. The Teensy 4.0 page has some of that performance discussion, which I've intentionally left off the 4.1 page (and will disappear from the 4.0 page soon) because it's meant to go to a dedicated page and get rewritten with actual benchmarks.
No API or C++ class exists in the core library for heap on DTCM or unused ITCM.
Quote: So a question is, should there be a DTCM heap setup on T4.x which does similar?
Maybe. The "should" part is a difficult question.
On the plus side, it makes more memory available and maybe allows certain use cases to achieve better performance. But the performance part is rather questionable, since the AXI bus is still quite fast and M7's 32K cache probably closes almost all the performance gap for common use cases. The main downside is adding even more complexity to an already pretty complicated memory system, on a chip loaded with a tremendous amount of advanced but complex features.
My gut feeling is DTCM heap belongs in a library which users can install if they want an easy way to get a 3rd heap. A library could have its own readme on github or a dedicated web page to explain how to make use of another heap. Ideally such a library would have several examples, hopefully some with benchmarks demonstrating the cases where DTCM heap offers a practical performance improvement.
As a library with a github page, I could add a brief mention and link. Hopefully that would give a good balance between offering people maximum capability without requiring them to digest even more complexity.
@Paul - Good morning,
T4.1 product page - Memory - Lots of new good stuff!
Should we put comments here or in the needs updates... thread?
Maybe some of this would go on to the new page about memory you mention.
Things like:
Picture showing PSRAM: instead of 8192-16384K, maybe it should say 0 or 8192 or 16384, to stress that by default, unless the user (or reseller) has added a chip or two on the bottom, this area will not exist... Sorry for my bad wording
Picture shows extmem_malloc, what happens if I call this and I don't have any PSRAM? What happens to my EXTMEM variables if I don't have any EXTMEM?
Can EXTMEM variables be initialized? (still mentioned as todo in startup.c)
RAM1 (DTCM/ITCM) - Not sure again how to mention, but ITCM grows in 32KB chunks, so if code is 32KB+1b it takes 64KB...
RAM2 is optimized for access by DMA? - Sorry I never really understood this. Maybe might need to expand on this. For example if you use this area of memory for DMA, you then need to know about the memory caching and probably to explain the need for calls like:
Code:
arm_dcache_flush((uint8_t *)buf, count);
arm_dcache_delete(retbuf, count);
For example, when should you call these? Start of operation, end of operation? Before/after each Read or Write? Again maybe there is a need for a page on DMA?
(Side Note: As DMA is mentioned several times on the page, you might want to mention what DMA is. Yes, you say Direct Memory Access, but I'm not sure if it would help users if it said something like: DMA is used by some sub-systems to be able to do input or output operations without tying up the main processor? Again, sorry for the bad wording.)
Side Note: I am not sure if DMASettings can work out of DMAMEM area. Especially when they are chained to each other... At least it did not work when I originally tried it.
Ran into this, for example, when the ST7735_t3 code was set up with the uncannyEyes example: the DMA operations were crashing when the sketch code was doing: mytft = new ST7735_t3(....);
I was able to get it to work by making the DMASettings static members, where I allocated enough of them for at least one DMA display per SPI bus. Which is not ideal.
----
As for malloc, versus ext_malloc and maybe dtcm_malloc()... Again it is nice to have this ability. At times it would be nice to have a unified allocate/free setup, but...
Again side note: Earlier when I was playing with an ESP32, I ported over part of my ILI9341_t3n code to ESP32. And when I tried to allocate a frame buffer it would fail, but if I tried allocating it in two parts, it would succeed. Wonder if their memory is split up as well? I will try it again soon, as I ordered a Sparkfun MicroMod... To prepare for maybe a Teensy version at some point
Done.
Quote: Picture shows extmem_malloc, what happens if I call this and I don't have any PSRAM? What happens to my EXTMEM variables if I don't have any EXTMEM?
In the text, under Dynamic Allocation -> External Heap it says "When no PSRAM is present, extmem_malloc() automatically allocated memory from the normal heap in RAM2."
Quote: Can EXTMEM variables be initialized?
Nope. I've added the words "These variables can not be initialized, your program must write their initial values, if needed."
Quote: RAM1 (DTCM/ITCM) - Not sure again how to mention, but ITCM grows in 32KB chunks, so if code is 32KB+1b it takes 64KB...
Other than the "FASTRUN Unused" in the picture, this is one of many details I'm considering to be too small to list on this top-level page.
Quote: RAM2 is optimized for access by DMA? - Sorry I never really understood this. Maybe might need to expand on this.
I've added "Normally large arrays & data buffers are placed in RAM2, to save the ultra-fast RAM1 for normal variables."
Quote: For example if you use this area of memory for DMA, you then need to know about the memory caching and probably to explain the need for calls like: .... Again maybe there is a need for a page on DMA?
Yes, a page specifically about DMA is needed for those sorts of details.
This top-level page has 4 main goals. In order of importance:
1: Show Teensy 4.1's many capabilities. Moreso than any other page, this is the sales pitch.
2: Provide links to the detailed interior pages (at least the ones which exist so far). We've all seen this come up over and over on this forum, where someone has a question about something like serial port capability which is answered on the serial page, but they didn't ever find that page. The 2nd highest priority is not to document everything here, but to mention it in a way people can find and discover the links to pages with the detailed info.
3: Answer some of the most common questions (before they're even questions) by highlighting certain features.
4: Reference material. While this is the least of 4 goals, quite a bit of reference material is going on the page. I've been trying to keep most of it in the last "Technical Information" section and mostly include things which are images rather than lots of text to read.
Detailed info about DMA and cache management is so far beyond the scope of this top level page. It really needs a dedicated page inside the site.
Right now I'm focusing on documentation. So yeah, I probably should have asked on the other thread about website updates.
Quote: As for malloc, versus ext_malloc and maybe dtcm_malloc()... Again it is nice to have this ability. At times it would be nice to have a unified allocate/free setup, but...
Indeed a unified malloc() which automatically manages all 3 memories would be pretty awesome. So would DMA tutorials, serial NAND bad block management, seamless transition between audio play objects, massive multi-channel audio in/out, USB video & webcam support, releasing the bootloader chips, supporting encrypted & authenticated code, WebUSB, better API for USB host detection of USB device connect & disconnect, a high-performance alternative to LittleFS, the installer detecting & warning for library override conflicts, a non-blank back side of the Teensy 4.1 card, and a ton of other stuff.
My general plan is to leave malloc() and extmem_malloc() as they are, so I can focus on other stuff. There are only so many hours in every day (and sadly, a lot less for me until we can rehire after the pandemic social distancing requirements ease up). Messing with malloc again just isn't on my dev time priority list.
But if you or Luni or anyone else writes a good library for dynamic allocation of DTCM / ITCM, I'll be happy to give it a brief mention and link from the Teensy 4.0 & 4.1 pages.
When memory questions arise seeing the map is helpful.
Adding this ",--print-memory-usage" to any of the Teensy boards.txt ( or boards.local.txt ) gives a more detailed look at memory use like:
Code:
teensy41.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax "-T{build.core.path}/imxrt1062_t41.ld"
teensyMM.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax "-T{build.core.path}/imxrt1062_mm.ld"
teensy40.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax "-T{build.core.path}/imxrt1062.ld"
teensy36.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax,--defsym=__rtc_localtime={extra.time.local} "-T{build.core.path}/mk66fx1m0.ld"
teensy35.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax,--defsym=__rtc_localtime={extra.time.local} "-T{build.core.path}/mk64fx512.ld"
teensy31.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax,--defsym=__rtc_localtime={extra.time.local} "-T{build.core.path}/mk20dx256.ld"
teensy30.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax,--defsym=__rtc_localtime={extra.time.local} "-T{build.core.path}/mk20dx128.ld"
teensyLC.build.flags.ld=-Wl,--print-memory-usage,--gc-sections,--relax,--defsym=__rtc_localtime={extra.time.local} "-T{build.core.path}/mkl26z64.ld"
Not sure if that is an easy enough addition?