Teensy 4.1 linker script is really wrong

[...]
I use "freertos-teensy" in my project. It allocates the task stacks on the MCU stack, so it "steals" stack space:
even if "free for local variables:xxxx" looks large, FreeRTOS grabs a lot of it, and I have no idea how much is really left for local variables.
This can be neither seen nor reported by the compiler and linker script. How much of the stack will be "pre-allocated" for tasks/threads? I want a clue how much stack is allocated
by third-party code (FreeRTOS) and by the FW code used...
[...]

freertos-teensy (or FreeRTOS in general) doesn't allocate the task's stacks on the MCU stack. If you use xTaskCreate() to create a task, the stack is allocated with malloc() on the heap. If you use xTaskCreateStatic(), you must pass a pointer to an array defined by yourself to be used for the stack. The stack size must be passed as a parameter to xTaskCreate() or xTaskCreateStatic(), so you need to select it and therefore should know how much memory is reserved for a task's stack. See the documentation for more details.
 
For my huge project and my requirements (in terms of memory space, performance, code size), I "have" to divert to another board.

Hopefully you will find hardware and software support that meets your needs.

But regarding MaaxBoard RT, which you mentioned in msg #17 and which does have twice as much RAM as Teensy 4.1, I believe you will discover it has exactly the same 512K FlexRAM, which can be partitioned between ITCM and DTCM. The other 1.5M is 1.25M of slower OCRAM (aka DMAMEM) plus 256K meant for the M4 core. Maybe another 768K in DMAMEM will help? But you won't get any more fast ITCM / DTCM by switching to MaaxBoard RT.

However, you will get NXP's SDK rather than Teensy's core library. I'm pretty sure their linker script is designed for the more conventional fixed partition approach.
 
I agree with you

freertos does malloc:
Code:
sm_set_pool(&extmem_smalloc_pool, &_extram_end,
			external_psram_size * 0x100000 -
			((uint32_t)&_extram_end - (uint32_t)&_extram_start),
			1, NULL);
	} else {
		// No PSRAM
		memset(&extmem_smalloc_pool, 0, sizeof(extmem_smalloc_pool));
	}

But where does the memory for malloc() come from (where is it allocated)?

It can only come from DTCM or DMAMEM (but not EXTMEM).

"freertos" steals memory (and does not report how much remains available).

Sorry to "simplify":
With "freertos", more memory is now allocated (used) at runtime. How much is neither reported nor visible in the linker script output (compile log).
(this was my point)

Yes, malloc() takes the memory from DMAMEM:
file "malloc.c":
Code:
extern char __heap_start;
Linker script:
Code:
_heap_start = ADDR(.bss.dma) + SIZEOF(.bss.dma);

It remains as: "freertos steals memory and it CANNOT be reported on the compile log" (how much, or resulting in "out of memory" during runtime).

Sorry, "freertos" does not grab from DTCM (and stack) space; it seems to grab from "DMAMEM"
(also causing a crash if I have too much data in DMAMEM).
 
Thank you Paul.
I "know" (the support for this board is pretty b..)
We will see.

I love the Teensy 4.1 board, I like the forum, and I like your expert responses.

My project is "so huge" that I have to start tweaking the memory spaces
(which is still possible with Teensy 4.1).
I am just lazy, changing to a board with more memory and a real debugger.

Thank you (for doing a great job to support Teensy 4.1).
 
@PaulStoffregen
With all the explanations and examples, it has been a great learning exercise in the linker. I even remember this thread that @KurtE started a very long time ago: https://forum.pjrc.com/threads/5732...ense-of-the-different-regions?highlight=ocram

It even got me to fix Frank B's PowerButton library to dump all the memory regions. I ran it on the last example in post #24, which gave me:
Code:
FLASH: 502784  6.19% of 7936kB (7623680 Bytes free) FLASHMEM, PROGMEM
ITCM:  483920 98.45% of  480kB (   7600 Bytes free) (RAM1) FASTRUN
PSRAM: none
OCRAM:
   524288 Bytes (512 kB)
-   12416 Bytes (12 kB) DMAMEM
-     208 Bytes (0 kB) Heap
   511664 Bytes heap free (499 kB), 12624 Bytes OCRAM in use (12 kB).
DTCM:
    32768 Bytes (32 kB)
-    6944 Bytes (6 kB) global variables
-    1408 Bytes (1 kB) max. stack so far
=========
    24416 Bytes free (23 kB), 8352 Bytes in use (8 kB).
This shows that if you add up ITCM + DTCM it comes to 512 kB (480 kB + 32 kB). It works every time with the sketches I tested.
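
As a sanity check on the numbers in that dump, the arithmetic can be sketched like this. It assumes the 512 kB of FlexRAM is handed out in 32 kB banks, so ITCM is rounded up to a whole number of banks and DTCM gets the remainder. The function and constant names are made up for this illustration and are not taken from the library.

```cpp
#include <cstdint>

constexpr std::uint32_t kFlexRamBytes = 512 * 1024;  // ITCM + DTCM together
constexpr std::uint32_t kBankBytes    = 32 * 1024;   // assumed FlexRAM bank size

// Round the ITCM code size up to a whole number of 32 kB banks.
constexpr std::uint32_t itcm_reserved(std::uint32_t itcm_used) {
    return (itcm_used + kBankBytes - 1) / kBankBytes * kBankBytes;
}

// Whatever is not reserved for ITCM is left for DTCM.
constexpr std::uint32_t dtcm_size(std::uint32_t itcm_used) {
    return kFlexRamBytes - itcm_reserved(itcm_used);
}

// Numbers from the dump: 483920 bytes of ITCM in use.
static_assert(itcm_reserved(483920) == 480 * 1024, "ITCM rounds up to 480 kB");
static_assert(itcm_reserved(483920) - 483920 == 7600, "7600 bytes free in ITCM");
static_assert(dtcm_size(483920) == 32 * 1024, "32 kB remain for DTCM");
```

The three checks reproduce the 480 kB, 7600 bytes free, and 32 kB figures printed in the dump.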

And if I change the linker script as suggested to 256k/256k, guess what: it fails!
Code:
c:/users/merli/appdata/local/arduino15/packages/teensy/tools/teensy-compile/11.3.1/arm/bin/../lib/gcc/arm-none-eabi/11.3.1/../../../../arm-none-eabi/bin/ld.exe: C:\Users\Merli\AppData\Local\Temp\arduino\sketches\5B6013AE85C2FB40F554A89D1791588D/sketch_jul30a.ino.elf section `.text.itcm' will not fit in region `ITCM'
c:/users/merli/appdata/local/arduino15/packages/teensy/tools/teensy-compile/11.3.1/arm/bin/../lib/gcc/arm-none-eabi/11.3.1/../../../../arm-none-eabi/bin/ld.exe: region `ITCM' overflowed by 221780 bytes
collect2.exe: error: ld returned 1 exit status

exit status 1

Compilation error: exit status 1

This is just for what it's worth.
 
Yes, if you change the linker script now, it fails (and "should").
It "should fail" (telling me that there is no memory space left).

My "concern" was: the original linker script hides the issue that you can "run out of memory" (no compile errors, but run-time errors).

I think we should close the thread, the topic.
All is OK, the linker script is correct;
just be aware: "my project can crash by just adding a few more lines of code".
 
It remains as: "freertos steals memory and it CANNOT be reported on the compile log" (how much, or resulting in "out of memory" during runtime).

Sorry, "freertos" does not grab from DTCM (and stack) space; it seems to grab from "DMAMEM"
(also causing a crash if I have too much data in DMAMEM).

Not really, it takes as much memory as you tell it to use for the stacks. If you use a constant value for the stack size of each task (as most users do), you know the amount of used memory at compile time. If you don't want to take memory from the heap (DMAMEM), then simply use xTaskCreateStatic(). The used memory is then reported in the compile log. It's up to you to use it in the right way for your needs.

If you want to check how much RAM is used and available at runtime, you can use the functions freertos::ram1_usage() and freertos::ram2_usage() that I've added to the port. It's basically a wrapper around mallinfo() which is a standard libc function that gives you information about the dynamic memory usage at runtime.
 
Yes, if you change the linker script now, it fails (and "should").
It "should fail" (telling me that there is no memory space left).

My "concern" was: the original linker script hides the issue that you can "run out of memory" (no compile errors, but run-time errors).

I think we should close the thread, the topic.
All is OK, the linker script is correct;
just be aware: "my project can crash by just adding a few more lines of code".

Sorry, maybe I am missing something obvious. I thought that is why we added the teensy_size app to the build process:

Example build
Code:
"C:\\Users\\kurte\\AppData\\Local\\Arduino15\\packages\\teensy\\tools\\teensy-tools\\0.59.2/teensy_size" "C:\\Users\\kurte\\AppData\\Local\\Temp\\arduino\\sketches\\AB88CCE6F6BA9C3A97C62FDC84D63CCC/KurtsRA8875_FB_and_clip_tests.ino.elf"
Memory Usage on Teensy 4.0:
  FLASH: code:59640, data:20904, headers:8540   free for files:1942532
   RAM1: variables:24736, code:57872, padding:7664   free for local variables:434016
   RAM2: variables:12416  free for malloc/new:511872
This code understands the DTCM and ITCM split and will exit with an error code if the two won't fit. It will also say something like -400 bytes free for local...

Beyond that, I am not sure what we can do if your code is failing because it runs out of stack or heap space, other than the suggestions given before and the recent one by @mjs513, which shows Frank's library for detecting stack usage.

Now if it is due to things like threads, which malloc their stack space, you can often detect space problems earlier by preallocating space for each of them. Then the linker and other tools can help detect that.
 
My "concern" was: the original linker script hides the issue that you can "run out of memory" (no compile errors, but run-time errors).

You do indeed get a compile error if your ITCM + DTCM usage is over 512K. This compile time check and error is built into teensy_size, because the linker can't do it.

Of course stack overflow can't be checked at compile time. No linker script can do that.

If you're using an RTOS and each task has its own stack, you can overflow any of those and corrupt whatever else happens to be in memory.
 
But perhaps we should modify teensy_size to also give a non-fatal warning if the unused DTCM is less than some reasonable amount of space for stack? I believe Arduino's default size printing has that feature, but it was based on a percentage and designed around Arduino Uno where the entire RAM is only 2048 bytes.

The big question is what that unused DTCM threshold number should be? Obviously less than a couple hundred bytes is sure to result in a stack overflow for most programs. But if we set the threshold too large, we'll just create a needless warning that people ignore.
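
To make the idea concrete, here is a rough sketch of such a three-way check. It is not how teensy_size is implemented. The enum, the function name, and the 8 kB threshold are all placeholders invented for this example, and picking the real threshold is exactly the open question.

```cpp
enum class StackRoom { Ok, Warning, Error };

// free_bytes: the "free for local variables" figure, which may be negative.
// min_stack_bytes: project-specific minimum stack reserve (placeholder value).
constexpr StackRoom classify_stack_room(long free_bytes, long min_stack_bytes) {
    return free_bytes < 0                 ? StackRoom::Error    // does not fit at all
           : free_bytes < min_stack_bytes ? StackRoom::Warning  // fits, but tight
                                          : StackRoom::Ok;
}

constexpr long kMinStackBytes = 8 * 1024;  // arbitrary placeholder threshold

// 434016 bytes free, as in the teensy_size example earlier in the thread.
static_assert(classify_stack_room(434016, kMinStackBytes) == StackRoom::Ok, "plenty of room");
// A negative figure such as -400 is already a hard error today.
static_assert(classify_stack_room(-400, kMinStackBytes) == StackRoom::Error, "does not fit");
// A couple hundred bytes would only trigger the proposed non-fatal warning.
static_assert(classify_stack_room(200, kMinStackBytes) == StackRoom::Warning, "too tight");
```

A project that wants a hard failure below its own minimum could treat Warning as an error in its build, which is roughly what the request for a guaranteed minimum stack amounts to.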
 
But perhaps we should modify teensy_size to also give a non-fatal warning if the unused DTCM is less than some reasonable amount of space for stack?

I don't think that's necessary, but I think it would be helpful for both teensy_size and the diagrams on the PJRC T4.0 and T4.1 pages to use the terms "heap" and "stack" instead of, or perhaps in addition to, "new/malloc" and "local variables". Stack has to accommodate more than local variables, so I find that misleading.
 
Sorry,
last topic was:
reducing the ITCM and/or DTCM space in the linker script results in a compiler error.
This is "intended" (and should happen):
I want to see when DTCM gets below a minimum.
(I want to make sure I have at least 256K for DTCM, and not let it be reduced "automatically" (and invisibly) just by adding more code.)

Sure, a linker script cannot "check" for stack overflows, but it could make sure I have enough stack space
(my minimum required stack space is specified and checked; otherwise give me a compiler/linker error, e.g. due to too much code).

Never mind:
No need to keep arguing and fighting with/against me.
The case is closed for me (I found my root cause).
(you can keep going to consider/treat me as an "idiot"... if you think it is a nice attitude... to get me back to your forum)
 
The correct way to do that would be to statically allocate the stack space and use the correct thread creation functions.

I really doubt FreeRTOS isn't returning an error when it tries to create a new thread and fails to malloc the stack, so this is likely a simple coding mistake (not checking return values or failure to gracefully handle errors).
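
A minimal sketch of what checking the return value could look like. To keep it compilable on a desktop, BaseType_t and pdPASS are stand-ins declared here (on the Teensy they come from the FreeRTOS headers), and both helper functions are hypothetical names, not part of FreeRTOS or freertos-teensy.

```cpp
#include <cstdio>

// Desktop stand-ins (assumptions) so this compiles without FreeRTOS.
// On the Teensy, BaseType_t and pdPASS come from the FreeRTOS headers.
using BaseType_t = long;
constexpr BaseType_t pdPASS = 1;

// Hypothetical helper: true only if xTaskCreate() reported success.
constexpr bool task_create_ok(BaseType_t rc) {
    return rc == pdPASS;
}

// Hypothetical helper: log a failed creation instead of silently ignoring it.
inline bool report_task_create(const char* name, BaseType_t rc) {
    if (!task_create_ok(rc)) {
        std::fprintf(stderr, "task '%s' not created (rc=%ld)\n",
                     name, static_cast<long>(rc));
        return false;
    }
    return true;
}

// Intended use on target (not compiled here; udp_task and priority are
// placeholders):
//   BaseType_t rc = xTaskCreate(udp_task, "udp", THREAD_STACK_SIZE_UDP,
//                               nullptr, priority, nullptr);
//   if (!report_task_create("udp", rc)) { /* stop or fall back */ }
```

With a check like this, a failed stack allocation shows up as a message at startup rather than as an unexplained crash later.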
 
Since you mentioned this, I took another quick look at the code. A quick "grep" search finds 6 places calling xTaskCreate().

https://github.com/tjaekel/Teesny_4...4e27c42e836eb5edf359aaa726/CMD_thread.cpp#L42

https://github.com/tjaekel/Teesny_4_1/blob/53c435c7d90fca4e27c42e836eb5edf359aaa726/GPIO.cpp#L223

https://github.com/tjaekel/Teesny_4_1/blob/53c435c7d90fca4e27c42e836eb5edf359aaa726/GPIO.cpp#L224

https://github.com/tjaekel/Teesny_4...e27c42e836eb5edf359aaa726/TCP_Server.cpp#L206

https://github.com/tjaekel/Teesny_4_1/blob/53c435c7d90fca4e27c42e836eb5edf359aaa726/TFTP.cpp#L110

https://github.com/tjaekel/Teesny_4_1/blob/53c435c7d90fca4e27c42e836eb5edf359aaa726/UDP_send.cpp#L47

None of them check the return value to handle success versus failure. But perhaps the problem isn't whether malloc() was able to allocate the requested stack on the heap, but whether the stack is big enough for the success case where the thread starts running? These are the stack sizes I found (with more "grep") inside SYS_config.h.

Code:
/* thread definitions */
/* stack size is in number of words, not bytes? */
#define THREAD_STACK_SIZE_CMD         ((2*1024) / 1)
#define THREAD_STACK_SIZE_GPIO        ((2*1024) / 1)
#define THREAD_STACK_SIZE_HTTPD       ((1*1024) / 1)
#define THREAD_STACK_SIZE_TFTP        ((1*1024) / 1)
#define THREAD_STACK_SIZE_UDP         ((1*1024) / 1)

The 3 threads for network stuff have only 1024 words, which would be 4K. Maybe that's on the skimpy side? On the earlier thread when we discussed the compiler options to list the stack frame sizes, many of the QNEthernet functions regarding DNS had stack frames over 1K. I have no idea if those are really being used, but if they are... it wouldn't take much to overflow a tiny 4K stack.
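
For reference, the words-to-bytes arithmetic can be written out as below. This is only a sketch: it assumes 32-bit stack words (StackType_t) on this port, and that each of the five constants is used for exactly one task. Since there are six xTaskCreate() calls, the real total may be higher, and task control blocks and heap bookkeeping are not counted. The constant names are shortened stand-ins for the macros in SYS_config.h.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 32-bit stack words on this Cortex-M7 port.
using StackType_t = std::uint32_t;

// Stack depths in words, taken from the SYS_config.h excerpt.
constexpr std::size_t kStackWordsCmd   = 2 * 1024;
constexpr std::size_t kStackWordsGpio  = 2 * 1024;
constexpr std::size_t kStackWordsHttpd = 1 * 1024;
constexpr std::size_t kStackWordsTftp  = 1 * 1024;
constexpr std::size_t kStackWordsUdp   = 1 * 1024;

// Convert a stack depth in words to the bytes taken from the heap.
constexpr std::size_t stack_bytes(std::size_t words) {
    return words * sizeof(StackType_t);
}

// Heap consumed by stacks alone if each size is used for one task.
constexpr std::size_t kTotalStackBytes =
    stack_bytes(kStackWordsCmd + kStackWordsGpio + kStackWordsHttpd +
                kStackWordsTftp + kStackWordsUdp);

static_assert(stack_bytes(kStackWordsUdp) == 4096, "1024 words is 4K");
static_assert(kTotalStackBytes == 28 * 1024, "28K of heap for the five stacks");
```

So under these assumptions roughly 28K of the RAM2 heap goes to task stacks, a figure that is known at compile time even though malloc() hands it out at runtime.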


The correct way to do that would be to statically allocate the stack space and use the correct thread creation functions.

Agreed. If I were writing this code, I would definitely go with xTaskCreateStatic() and use static DMAMEM buffers.
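
Roughly what that could look like is sketched below. On a desktop, DMAMEM is defined as an empty stand-in and StackType_t as a 32-bit type so the snippet compiles. On the Teensy both come from the core and FreeRTOS headers. The buffer name and size are hypothetical, and the xTaskCreateStatic() call is shown only as a comment because it needs the real headers.

```cpp
#include <cstddef>
#include <cstdint>

#ifndef DMAMEM
#define DMAMEM  // Teensy core macro that places data in RAM2; empty stand-in here
#endif

// Stand-in (assumption): 32-bit stack words, as on the Cortex-M7 port.
using StackType_t = std::uint32_t;

constexpr std::size_t kUdpStackWords = 1024;  // hypothetical stack depth in words

// Statically allocated stack buffer. Because it is an ordinary array, the
// linker and the size report count it at build time.
DMAMEM StackType_t udp_stack[kUdpStackWords];

static_assert(sizeof(udp_stack) == kUdpStackWords * sizeof(StackType_t),
              "the buffer size is fixed at compile time");

// On target with the real headers, creation would then look roughly like
// (udp_task and priority are placeholders):
//   static StaticTask_t udp_tcb;
//   TaskHandle_t h = xTaskCreateStatic(udp_task, "udp", kUdpStackWords,
//                                      nullptr, priority, udp_stack, &udp_tcb);
//   if (h == nullptr) { /* handle the failure */ }
```

The point of the design is that an oversized buffer becomes a build-time problem in the RAM2 numbers instead of a failed malloc() at runtime.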
 
But perhaps the problem isn't whether malloc() was able to allocate the requested stack on the heap, but whether the stack is big enough for the success case where the thread starts running?

Also very possible, multithreading without an MMU to track stack usage can be quite a challenge...
Both possibilities still don't explain the initial failure outlined in the opening post (where the program crashes when using >256KB of ITCM), that *has* to be an overflow of the initial DTCM stack... and we don't know if it's hitting the 32-byte NO ACCESS region set up by the MPU, managing to jump right over it by allocating a large amount of local space (e.g. "alloca(1024)"), or something else entirely.
 
Yes indeed. This need to manage stack sizes while using an enormous Arduino ecosystem of libraries that don't come with any guidance on stack usage and (mostly) weren't designed and tested to be thread safe is the main reason I've resisted building the Teensyduino core library around a preemptive RTOS. They come with some nice benefits, but this uncertainty about so many stacks possibly overflowing is a huge hidden cost and a really difficult problem for anyone who creates a program that would want to use a lot of the memory.
 
In the ESP world, this is usually not a big problem. The libraries take this into account and have been adapted over time.
But of course, if you never start, or can't start because there is no PJRC-supported RTOS, there will never be adapted libraries.
However, using dual-core MCUs like the often-mentioned 1170 sensibly, and especially efficiently, without an RTOS will be extremely exciting and is, in my opinion, doomed to fail.
Sooner or later, one core will end up waiting unnecessarily for the other, and that cannot seriously be wanted.

But that is another topic.
 
...and you already have this situation.

There are many problems here in the forum because the oh so fast 4.x is waiting for something. Then you wonder what all the MHz are actually good for, and certain things in the end are not faster than on a UNO, when using the std. Teensyduino.
 