Teensy 4.1 linker script is really wrong

tjaekel

Well-known member
Moderator Edit: linker script is fine, see this message for info.


I can confirm now, the linker script for Teensy 4.1 is really wrong:
It has this original definition:
Code:
MEMORY
{
    ITCM (rwx):  ORIGIN = 0x00000000, LENGTH = 512K
    DTCM (rwx):  ORIGIN = 0x20000000, LENGTH = 512K
    RAM (rwx):   ORIGIN = 0x20200000, LENGTH = 512K
    FLASH (rwx): ORIGIN = 0x60000000, LENGTH = 7936K
    ERAM (rwx):  ORIGIN = 0x70000000, LENGTH = 16384K
}
(Note the 512K for both ITCM and DTCM: together with RAM that implies 1.5MB of memory available, which is not true.)

But it should have this definition:
Code:
MEMORY
{
    ITCM (rwx):  ORIGIN = 0x00000000, LENGTH = 256K
    DTCM (rwx):  ORIGIN = 0x20000000, LENGTH = 256K
    RAM (rwx):   ORIGIN = 0x20200000, LENGTH = 512K
    FLASH (rwx): ORIGIN = 0x60000000, LENGTH = 7936K
    ERAM (rwx):  ORIGIN = 0x70000000, LENGTH = 16384K
}
(ITCM and DTCM: each 256K, in sum 512K, startup configures as 256K each)

The difference is:
  • The original one does not fail at compile time and does not complain if ITCM overflows! It lets me generate more code than is possible, which will crash!
  • An overflow of the DTCM (too much data) is not reported either, and might also crash!
  • In the end your program will crash and end up in the error handler (double-flashing LED, and dead for flashing a new FW again if you run into it immediately!).
    The generated code overflows the MCU memory size - WITHOUT warning.

When I compile my project with MODIFIED linker script (256K each), I get this as compile results:
Code:
Memory Usage on Teensy 4.1:
  FLASH: code:277352, data:89084, headers:8344   free for files:7751684
   RAM1: variables:183616, code:262104, padding:40   free for local variables:78528
   RAM2: variables:149248  free for malloc/new:375040

It tells me:
RAM1 as 512K total, but 256K ITCM and 256K DTCM, the code: 262104 plus padding: 40 results in 262,144 bytes = exactly 256KB.
And it works.
And it makes complete sense.

But with the original linker script, where ITCM as well as DTCM are set as 512K each - I get this compile result (error free !!!!) - and with a tiny bit more of code in ITCM:
Code:
Memory Usage on Teensy 4.1:
  FLASH: code:277608, data:89084, headers:9112   free for files:7750660
   RAM1: variables:183616, code:262360, padding:32552   free for local variables:45760
   RAM2: variables:149248  free for malloc/new:375040

This is compile error free (!!!!) and tells me:
RAM1 has now code: 262360 plus padding: 32552 (MUCH MORE PADDING NOW!), but the total is: 294,912 bytes = 288 KB (for ITCM).
THIS IS WRONG!!!
This MCU and the startup config has just 256K ITCM (and 256K DTCM)!
It "must" crash - and it does!

I am waiting for UART (luckily):
right afterwards it crashes, with double blinking LED and nothing works anymore.
Not possible to flash this board again in this state (as long as you see a double blinking LED).
Due to waiting for UART - I have still a chance to flash the board again.

Imagine:
I would not wait for UART and run into this "crash" immediately - no way to flash the board again.
It would be dead forever.

The "test source code" is here:
https://github.com/tjaekel/Teesny_4_1

See in file "cmd_dec.cpp" for command "CMD_test" the additional code generated:
  • it fails if you add more code (more NOP() calls): set the #if 0 to #if 1 at the end of function NOP() and it will FAIL!
  • it complains CORRECTLY if you fix the Linker Script (NOT more than 256K ITCM code): make it correct and it reports correctly
  • it does NOT warn you about this overflow with the original Linker Script: you can generate "too much" code and data without warnings!
 
Getting to the bottom of why your program crashes would require a huge amount of time to debug many thousands of lines. Your code is over 18,000 lines in several dozen files!

But I can quickly and easily show your belief about Teensy 4.1 memory is mistaken. Specifically this:

THIS IS WRONG!!!
This MCU and the startup config has just 256K ITCM (and 256K DTCM)!
It "must" crash - and it does!

If your wrong assumption were actually true (it isn't), then any program with larger than 256K DTCM or larger than 256K ITCM would crash.

In msg #15 on your prior thread, I showed a small program which uses large DTCM. Anyone can quickly copy it into Arduino IDE and upload to Teensy 4.1 to see it does not crash, and the memory usage summary does indeed confirm it allocates large DTCM.

Here is a similar small test program which uses over 460K ITCM. Anyone can also copy this into Arduino IDE and Upload to see it uses large ITCM and does indeed run properly.

Code:
FASTRUN const uint8_t buf[465000] = {} ;

void setup() {
  Serial.begin(9600);
  while (!Serial) ; // wait for serial monitor
  Serial.println("hello world");
  uint8_t c = *(volatile uint8_t *)buf;
  Serial.println(c);
}

void loop() {
}

I do not believe the linker script is defective. I believe the main problem here (aside from some nasty bug within your code) is you have not understood the FlexRAM memory partitioning hardware, and you're applying your knowledge of simpler hardware. I tried to explain in msg #15 on your other thread.

To explain again, briefly, the FlexRAM memory partitioning hardware allows 512K of RAM to be partitioned in 32K chunks between ITCM, DTCM, and OCRAM. Teensy's linker script and startup code automatically partition the RAM for only as much as your program needed in ITCM (rounded up to the nearest 32K chunk) and the rest of the 512K FlexRAM is partitioned to DTCM. We never partition any of it to OCRAM, since there is another 512K OCRAM-only memory inside the chip.
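
For anyone who wants to check that rounding against the numbers in this thread, here is a rough sketch in plain C++ that runs on a PC, not on the Teensy. The names (splitFlexRam, FlexRamSplit) are made up for illustration, and this is not the actual linker script or startup code, just the arithmetic described above.

```cpp
#include <cassert>
#include <cstdint>

// Sizes taken from the explanation above: 512K of FlexRAM in 32K banks.
constexpr uint32_t kFlexRamBytes = 512 * 1024;
constexpr uint32_t kBankBytes    = 32 * 1024;

struct FlexRamSplit {
  uint32_t itcmBytes;  // ITCM after rounding up to whole banks
  uint32_t dtcmBytes;  // whatever remains of the 512K
  uint32_t padding;    // unused bytes at the end of the last ITCM bank
};

// Hypothetical helper: round the FASTRUN code size up to whole 32K banks
// and give the rest of the FlexRAM to DTCM.
inline FlexRamSplit splitFlexRam(uint32_t codeBytes) {
  assert(codeBytes <= kFlexRamBytes);  // more code than FlexRAM cannot fit at all
  uint32_t banks = (codeBytes + kBankBytes - 1) / kBankBytes;
  uint32_t itcm = banks * kBankBytes;
  return FlexRamSplit{itcm, kFlexRamBytes - itcm, itcm - codeBytes};
}
```

Feeding it code:262104 gives 256K of ITCM with padding 40, and code:262360 gives 288K with padding 32552, the same values as in the two memory usage reports in the first message.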

Again, this hardware is documented in the reference manual in chapter 31 starting on page 1783, and also in GPR registers documented on pages 362-363.

Repeating more of what I wrote only 8 days ago, the linker script must declare the maximum possible sizes in the memory section, because of a chicken-and-egg type problem. The actual size is configurable. But until the linker has assigned everything, we don't know what size will actually be needed. This is why you see 512K in the linker script memory section, even though both share a dynamically partitioned 512K physical memory. Later in the linker script, the actual sizes and partition info is computed. After the linker has completed, the teensy_size program does the final check for overflow beyond the actual 512K of physical FlexRAM memory.
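
As a rough sketch of that final check (an illustration of the arithmetic only: freeForLocals and fitsInFlexRam are hypothetical names, and this is not the teensy_size source), ITCM is rounded up to whole 32K banks, the DTCM variables are subtracted, and whatever remains of the 512K is what the report calls "free for local variables". A negative result means the program exceeds the physical FlexRAM.

```cpp
#include <cassert>
#include <cstdint>

constexpr int32_t kFlexRamBytes = 512 * 1024;  // physical FlexRAM shared by ITCM and DTCM
constexpr int32_t kBankBytes    = 32 * 1024;   // partition granularity

// Hypothetical helper: bytes of DTCM left over once ITCM has been rounded up
// to whole banks and the DTCM variables are placed. Negative means overflow.
inline int32_t freeForLocals(int32_t codeBytes, int32_t variableBytes) {
  assert(codeBytes >= 0 && variableBytes >= 0);
  int32_t itcm = ((codeBytes + kBankBytes - 1) / kBankBytes) * kBankBytes;
  return kFlexRamBytes - itcm - variableBytes;
}

// The program fits only if nothing spills past the 512K of physical FlexRAM.
inline bool fitsInFlexRam(int32_t codeBytes, int32_t variableBytes) {
  return freeForLocals(codeBytes, variableBytes) >= 0;
}
```

With code:262360 and variables:183616 this returns 45760, matching the report in the first message. With code:267416 and variables:251648 it returns -22272, which is the case reported further down this thread as exceeding memory space.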

Anyone can easily see it does indeed work by running these tiny example programs.

Of course something is wrong, because your program crashes in a bad way. I do not know why your huge program is crashing. But I can say with confidence it is absolutely not because of a fundamental problem in the linker script where any program using more than 256K ITCM or any program using more than 256K DTCM would crash. These small examples, and plenty of much larger "real" programs people have created, do indeed take advantage of larger ITCM or DTCM (but never both larger than 256K) and anyone can run them to see they don't crash.

I'm explaining all this, again, so hopefully you can get past your incorrect belief about the linker script and ideally start focusing on finding what is causing the crash within your program.
 
To expand on this simple test program, here is a slightly longer version which actually sweeps through reading throughout the memory. Just in case anyone suspects the small program doesn't crash only because it never accesses all of the large ITCM.

Code:
FASTRUN const uint8_t buf[465000] = {} ;

void setup() {
  Serial.begin(9600);
  while (!Serial) ; // wait for serial monitor
  Serial.println("hello world");
  uint8_t c = *(volatile uint8_t *)buf;
  Serial.println(c);
}

void loop() {
  static uint32_t index = 0;
  index = (index + 85237) % sizeof(buf);
  Serial.print("buf[");
  Serial.print(index);
  Serial.print("] = ");
  uint8_t c = *(volatile uint8_t *)(buf + index);
  Serial.println(c);
  delay(50);
}
 
This still smells like a stack overflow and I bet if you properly debugged the fault (you said it ends up in an error handler, so what's the specific error?) it would prove it.
 
I looked briefly at this program. It's only 64 files with 19488 lines, depending on 2 libraries. How bad could that be?

Well, so far I have not found any version of Arduino + Teensyduino that successfully compiles. I'm using the latest QNEthernet and freertos-teensy, downloaded directly from github today. Maybe older versions were used? Or maybe this isn't even with any specific Arduino + Teensyduino, but perhaps PlatformIO where the compiler setup could be pretty much anything?!

With Arduino 1.8.19 and Teensyduino 1.59-beta2, I get this:

Code:
Memory Usage on Teensy 4.1:
  FLASH: code:282608, data:88436, headers:8856   free for files:7746564
   RAM1: variables:251648, code:267416, padding:27496   free for local variables:-22272
   RAM2: variables:149280  free for malloc/new:375008
Error program exceeds memory space

Older versions end up with various compile errors.

I'm going to keep this message short (not post the errors from other tests) and simply ask for exactly which version of Arduino, Teensyduino, QNEthernet, and freertos-teensy are used to reproduce the crash?

If the answer involves PlatformIO, please reproduce the problem with an Arduino IDE build... PIO adds way too many other variables into the build process!
 
This still smells like a stack overflow

Could be stack overflow. But the problem is said to cause the bootloader to give 2 blink error. Claims to be unrecoverable (but no mention of trying 15 sec restore process). A "normal" stack overflow shouldn't do this.

Could be stack overflow happening inside a fault or exception handler (FreeRTOS takes over that stuff). Could be somehow trashing memory in a way that disables the JTAG pins.

Or could be some previously undiscovered way of badly crashing the IMXRT chip. Some of these do exist, like the FlexIO not enabled crash (which I believe Kurt and/or Defragster found...) If this really is a new way to crash the hardware, I want to at least add it to MyFault so we don't forget about it, since all such things are undocumented by NXP and probably never get added to the official errata.

I really would like to get to the bottom of *why* it's crashing in such a severe way as to cause the 2 blink error. But with such a large program, which doesn't even compile with any specific Arduino+Teensyduino and the latest versions of libraries, I can't even begin to investigate the true cause.
 
Editing CODE4CODE - code that writes repeat code that was testing code and data retrieval on encrypted Lockable T_4.x's. Somehow along the way this happened - stack fault reported.
Not to have this FIXED or addressed - just showing that it is detected:
Code:
CrashReport:
  A problem occurred at (system time) 2:45:6
  Code was executing from address 0x6009A174
  CFSR: 82
	(DACCVIOL) Data Access Violation
	(MMARVALID) Accessed Address: 0x20053B80 (Stack problem)
	  Check for stack overflows, array bounds, etc.
  Temperature inside the chip was 38.87 °C
  Startup CPU clock speed is 600MHz
  Reboot was caused by auto reboot after fault or bad interrupt detected
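
For reference, the CFSR value can be decoded by hand. The sketch below is plain host-side C++ and assumes the "82" printed by CrashReport is hexadecimal. The bit positions are the standard Cortex-M MemManage fault status bits, while the function and table names are made up for illustration.

```cpp
#include <cstdint>
#include <string>

// MemManage fault status bits (low byte of the Cortex-M CFSR register).
struct FaultBit {
  uint32_t mask;
  const char* name;
};

constexpr FaultBit kMemManageBits[] = {
  {1u << 0, "IACCVIOL"},   // instruction access violation
  {1u << 1, "DACCVIOL"},   // data access violation
  {1u << 3, "MUNSTKERR"},  // fault while unstacking on exception return
  {1u << 4, "MSTKERR"},    // fault while stacking on exception entry
  {1u << 5, "MLSPERR"},    // fault during lazy floating-point state preservation
  {1u << 7, "MMARVALID"},  // the fault address register holds a valid address
};

// Hypothetical helper: list the names of all MemManage bits set in cfsr.
inline std::string decodeMemManage(uint32_t cfsr) {
  std::string out;
  for (const FaultBit& b : kMemManageBits) {
    if (cfsr & b.mask) {
      if (!out.empty()) out += ' ';
      out += b.name;
    }
  }
  return out;
}
```

decodeMemManage(0x82) returns "DACCVIOL MMARVALID", the same two flags the report lists.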

Not sure what got crossed up there ...

Just updated: github.com/Defragster/T4LockBeta/tree/main/Code4Code

Oops: I built this but didn't run it before typing the text following the CrashReport. Running this code as it is now overflows the stack; 63KB isn't enough! I would have to rebuild Dual Serial, capture the code, and rebuild at 900 to see it work for the following to apply - 72+ minutes late to log off ...
> posted the CrashReport as it didn't die at the same place {code or data} - so it shows some suggestion that (Stack problem)'s won't go unnoticed and just HANG or cause '2RED blink.
Test on standard production T_4.0:
Code:
03:10:13.834 (loader): File "Code4Code.ino.hex". 438272 bytes, 22% used
03:10:13.849 (loader): elf appears to be for Teensy 4.0 (IMXRT1062) (2031616 bytes)
03:10:13.849 (loader): elf binary data matches hex file
03:10:13.849 (loader): elf file is for Teensy 4.0 (IMXRT1062)
03:10:13.849 (loader): using hex file - Teensy not configured for encryption
Code:
CrashReport:
  A problem occurred at (system time) 3:13:17
  Code was executing from address 0x16AD0
  CFSR: 82
	(DACCVIOL) Data Access Violation
	(MMARVALID) Accessed Address: 0x200089C0 (Stack problem)
	  Check for stack overflows, array bounds, etc.
  Temperature inside the chip was 40.19 °C
  Startup CPU clock speed is 600MHz
  Reboot was caused by auto reboot after fault or bad interrupt detected

It only runs 1,000 calls calling calls with Code in FASTRUN and data in FASTRUN. It works and has Free/Stack space to spare [NOT], each of the 1000 deep calls use only two DWORDs on stack:
Code:
Memory Usage on Teensy 4.0:
  FLASH: code:405084, data:23984, headers:9200   free for files:1593348
   RAM1: variables:35264, code:400280, padding:25704   free for local variables:63040
   RAM2: variables:12416  free for malloc/new:511872

Using that And Adding some recursive code to calc Factorial of an ever increasing value, with some abusive "filler" stack usage { perhaps calling the "ThisFunc1( 0, seePi( PI_DIGITS, szPi ), &sumPi60dig );" in between would show if stack corruption occurs (when ThisFunc1() does its chained calls as it has stored data compared against) before it faults as above where stack overflow is 'detected' by Crash.
 
I remember a few years ago, @Frank B, @defragster and @KurtE were playing with getting more size: https://forum.pjrc.com/threads/5732...ons?p=227539&highlight=OCRAM_START#post227539

If I change your code as a test to add in the code posted in that thread
Code:
/*
  (c) Frank B, 2020
  License: MIT
  Please keep this info.
*/
FASTRUN const uint8_t buf[453000] = {} ;

inline
unsigned memfree(void) {
  extern unsigned long _ebss;
  extern unsigned long _sdata;
  extern unsigned long _estack;
  const unsigned DTCM_START = 0x20000000UL;
  unsigned dtcm = (unsigned)&_estack - DTCM_START;
  unsigned stackinuse = (unsigned) &_estack -  (unsigned) __builtin_frame_address(0);
  unsigned varsinuse = (unsigned)&_ebss - (unsigned)&_sdata;
  unsigned freemem = dtcm - (stackinuse + varsinuse);
  return freemem;
}

FLASHMEM
void flexRamInfo(void) {

#if defined(ARDUINO_TEENSY40)
  static const unsigned DTCM_START = 0x20000000UL;
  static const unsigned OCRAM_START = 0x20200000UL;
  static const unsigned OCRAM_SIZE = 512;
  static const unsigned FLASH_SIZE = 1984;
#elif defined(ARDUINO_TEENSY41)
  static const unsigned DTCM_START = 0x20000000UL;
  static const unsigned OCRAM_START = 0x20200000UL;
  static const unsigned OCRAM_SIZE = 512;
  static const unsigned FLASH_SIZE = 7936;
#endif

  Serial.println(__FILE__ " " __DATE__ " " __TIME__ );
  Serial.print("Teensyduino version ");
  Serial.println(TEENSYDUINO / 100.0f);
  Serial.println();

  int itcm = 0;
  int dtcm = 0;
  int ocram = 0;
  uint32_t gpr17 = IOMUXC_GPR_GPR17;

  char __attribute__((unused)) dispstr[17] = {0};
  dispstr[16] = 0;

  for (int i = 15; i >= 0; i--) {
    switch ((gpr17 >> (i * 2)) & 0b11) {
      default: dispstr[15 - i] = '.'; break;
      case 0b01: dispstr[15 - i] = 'O'; ocram++; break;
      case 0b10: dispstr[15 - i] = 'D'; dtcm++; break;
      case 0b11: dispstr[15 - i] = 'I'; itcm++; break;
    }
  }

  Serial.printf("ITCM: %dkB, DTCM: %dkB, OCRAM: %d(+%d)kB [%s]\n", itcm * 32, dtcm * 32, ocram * 32, OCRAM_SIZE, dispstr);
  const char* fmtstr = "%-6s%7d %5.02f%% of %4dkB (%7d Bytes free) %s\n";

  extern unsigned long _stext;
  extern unsigned long _etext;
  extern unsigned long _sdata;
  extern unsigned long _ebss;
  extern unsigned long _flashimagelen;
  extern unsigned long _heap_start;
  extern unsigned long _estack;

  Serial.printf(fmtstr, "ITCM:",
                (unsigned)&_etext - (unsigned)&_stext,
                (float)((unsigned)&_etext - (unsigned)&_stext) / ((float)itcm * 32768.0f) * 100.0f,
                itcm * 32,
                itcm * 32768 - ((unsigned)&_etext - (unsigned)&_stext), "(RAM1) FASTRUN");

  Serial.printf(fmtstr, "OCRAM:",
                (unsigned)&_heap_start - OCRAM_START,
                (float)((unsigned)&_heap_start - OCRAM_START) / (OCRAM_SIZE * 1024.0f) * 100.0f,
                OCRAM_SIZE,
                OCRAM_SIZE * 1024 - ((unsigned)&_heap_start - OCRAM_START), "(RAM2) DMAMEM, Heap");

  Serial.printf(fmtstr, "FLASH:",
                (unsigned)&_flashimagelen,
                ((unsigned)&_flashimagelen) / (FLASH_SIZE * 1024.0f) * 100.0f,
                FLASH_SIZE,
                FLASH_SIZE * 1024 - ((unsigned)&_flashimagelen), "FLASHMEM, PROGMEM");

  // Serial.println();
  unsigned _dtcm = (unsigned)&_estack - DTCM_START; //or, one could use dtcm * 32768 here.
  unsigned stackinuse = (unsigned) &_estack -  (unsigned) __builtin_frame_address(0);
  unsigned varsinuse = (unsigned)&_ebss - (unsigned)&_sdata;
  unsigned freemem = _dtcm - stackinuse - varsinuse;
  Serial.printf("DTCM:\n  %7d Bytes (%d kB)\n", _dtcm, _dtcm / 1024);
  Serial.printf("- %7d Bytes (%d kB) global variables\n", varsinuse, varsinuse / 1024);
  Serial.printf("- %7d Bytes (%d kB) stack (currently)\n", stackinuse, stackinuse / 1024);
  Serial.println("=========");
  Serial.printf("  %7d Bytes free (%d kB), %d Bytes in use (%d kB).\n",
                _dtcm - (varsinuse + stackinuse), (_dtcm - (varsinuse + stackinuse)) / 1024,
                varsinuse + stackinuse, (varsinuse + stackinuse) / 1024
               );
}


void setup() {
   while (!Serial && millis() < 4000);  

  Serial.begin(9600);
  while (!Serial) ; // wait for serial monitor
  Serial.println("hello world");
  uint8_t c = *(volatile uint8_t *)buf;
  Serial.println(c);
   flexRamInfo();
}

void loop() {
  static uint32_t index = 0;
  index = (index + 85237) % sizeof(buf);
  //Serial.print("buf[");
  //Serial.print(index);
  //Serial.print("] = ");
  uint8_t c = *(volatile uint8_t *)(buf + index);
  //Serial.println(c);
  delay(50);
}
It shows:
Code:
hello world
0
D:\Users\Merli\Documents\Arduino\sketch_jul22b\sketch_jul22b.ino Jul 22 2023 08:49:58
Teensyduino version 1.59

ITCM: 480kB, DTCM: 32kB, OCRAM: 0(+512)kB [DIIIIIIIIIIIIIII]
ITCM:  486352 98.95% of  480kB (   5168 Bytes free) (RAM1) FASTRUN
OCRAM:  12416  2.37% of  512kB ( 511872 Bytes free) (RAM2) DMAMEM, Heap
FLASH: 504832  6.21% of 7936kB (7621632 Bytes free) FLASHMEM, PROGMEM
DTCM:
    32768 Bytes (32 kB)
-    6944 Bytes (6 kB) global variables
-      96 Bytes (0 kB) stack (currently)
=========
    25728 Bytes free (25 kB), 7040 Bytes in use (6 kB).
I had to change the size of your buf (from 465000 to 453000), otherwise it keeps reconnecting.
 
Funny thing is this piece of the puzzle causes restarts:
Code:
extern "C" {
void startup_late_hook(void) {
  extern unsigned long _ebss;
  unsigned long * p =  &_ebss; 
  size_t size = (size_t)(uint8_t*)__builtin_frame_address(0) - 16 - (uintptr_t) &_ebss;
  memset((void*)p, 0, size);  
}
}

unsigned long maxstack() {
  extern unsigned long _ebss;
  extern unsigned long _estack;
  unsigned long * p =  &_ebss;  
  while (*p == 0) p++;
  return (unsigned) &_estack - (unsigned) p;
}
 
Sorry, my eyes sort of glaze over when I try to read through threads like this, where it feels like the exact same thing as the previous threads.

Especially when I don't see anything new, like I tried the following things mentioned in post X and the results were Y... Or I tried Z to see if I could help localize it...

As I mentioned in the previous thread, or was that the previous thread to the previous thread, I am wondering if you are barking up the wrong tree...

Like some others have mentioned, I am guessing that you are getting some form of stack corruption. Or potentially it could be something like the usage of an uninitialized variable and what might be in it, depending on the code... Could be variable index out of range...

For example: I had a sketch that wiped out a variable used as the size of something. Then there was a call that did something like: memset(some_addr, 0, size_var);
where size_var was something like 0xffffffff. Needless to say it did not work... CrashReport failed in this case. Note: It may have been a memcpy or memmove call...
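
A guard like the following would turn that kind of corrupted size into a visible failure instead of a wipe of memory. This is only a sketch: checkedClear is a hypothetical helper, not something from the Teensy core, and it assumes the caller knows the capacity of the destination.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical wrapper: refuse to clear more bytes than the destination holds.
// Returns false (and touches nothing) when the request is out of range.
inline bool checkedClear(void* dst, std::size_t capacity, std::size_t count) {
  if (dst == nullptr || count > capacity) {
    return false;  // e.g. count == 0xFFFFFFFF from a trashed size variable
  }
  std::memset(dst, 0, count);
  return true;
}
```

The caller can then print or log the failure at the point where the bad size first shows up, rather than chasing a fault address somewhere inside memset.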

So in that case, was able to capture the happening. What I did was to turn on the core debug output on Serial4.
Teensy4/debug/printf.h
uncomment the line: //#define PRINT_DEBUG_STUFF

And rebuild. Have some form of USB to Serial connected up to Serial4 pins... At least the TX
The system will start up Serial4 early on at 115200 baud. Its output is primitive, in that it will output the characters directly
using the hardware UART registers.


This will print out some lower level debug data at sketch startup. That is everything in the core that has printf(xyz) will be output.

At the time I added some printf statements in: void unused_interrupt_vector(void)
And confirmed it was hit. At the time I printed out the similar stuff to what CrashReport printed later...
Confirmed that the address was in one of the locations like memset, which did not help much. So I had the code dump some of the stack from that area, and then looked at the different things in it which looked like addresses and finally found who was calling it...

You might try something like that. At least the basics, to see if it is crashing and where.

As I mentioned in the previous post, we have some debug stuff, which tries to see how much of the stack area has been used. One version of it is in the ST7789_t3 example code for the eyes with large displays... I believe there was also a library setup that did this as well, don't remember for who, but guessing FrankB? So I would then try seeing if this helps localize... Or at least points out there is an issue.
 
KurtE said:
As I mentioned in the previous post, we have some debug stuff, which tries to see how much of the stack area has been used. One version of it is in the ST7789_t3 example code for the eyes with large displays... I believe there was also a library setup that did this as well, don't remember for who, but guessing FrankB? So I would then try seeing if this helps localize... Or at least points out there is an issue.

The library was called T4_PowerButton. But Frank B has since archived that library and removed all the memory info stuff. What I posted is basically what was in Frank's library, except for the maxstack info, since it's causing the T4 to crash on me. And yes, what you said is right.
 
Funny thing is this piece of the puzzle causes restarts:

We ran into that earlier...
Later on 16 was not enough of a buffer, so went to 32.
Code:
...
  for (uint32_t *pfill = (&_ebss + 32); pfill < (&itcm_size - 10); pfill++) {
    *pfill = 0x01020304;  // some random value
  }
#endif
}
void EstimateStackUsage() {
#if defined(__IMXRT1062__) && defined(DEBUG_MEMORY)
  uint32_t *pmem = (&_ebss + 32);
  while (*pmem == 0x01020304) pmem++;
  Serial.printf("Estimated max stack usage: %d\n", (uint32_t)&_estack - (uint32_t)pmem);
#endif
}
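
The idea behind both versions is stack painting: fill the unused region with a known marker early at startup, then later scan from the low end until the marker stops. The sketch below simulates it on a PC with a std::vector standing in for the memory between _ebss and _estack. The function names are hypothetical, and the scan is bounds-checked, which the snippets above are not.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kFill = 0x01020304;  // same marker value as the snippet above

// Fill the whole simulated region with the marker (done once, early at startup).
inline void paint(std::vector<uint32_t>& region) {
  for (uint32_t& word : region) word = kFill;
}

// Simulate a stack that grows downward from the end of the region and
// overwrites `words` words with something other than the marker.
inline void simulateStackUse(std::vector<uint32_t>& region, std::size_t words) {
  for (std::size_t i = 0; i < words && i < region.size(); ++i) {
    region[region.size() - 1 - i] = 0xDEADBEEF;
  }
}

// Scan up from the bottom until the marker stops; everything above was touched.
inline std::size_t estimateMaxStackBytes(const std::vector<uint32_t>& region) {
  std::size_t i = 0;
  while (i < region.size() && region[i] == kFill) ++i;
  return (region.size() - i) * sizeof(uint32_t);
}
```

The result is a high-water mark and only an estimate: words at the deep end of the stack that were reserved but never written, or that happen to equal the marker, are counted as unused.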
 
@tjaekel - If you're still watching this thread, please answer my question in msg #7.

I tried to compile your program. But it does not compile with any version of Arduino IDE + Teensyduino and the latest QNEthernet and freertos-teensy. Maybe you're using an older version of those libraries? Or if this is PlatformIO (tends to add a lot of variance in how the compiler is run) please test with Arduino IDE so I or anyone else can consistently get the same result you have seen.

Please say *EXACTLY* how to reproduce the specific crashing problem (2 blinks error from the bootloader) with the code you've shown.
 
Thank you, Paul,
for your effort to dive so deep into this issue. I appreciate your great support, you are doing a great job.
(I was not available the last few weeks, I am sorry for late response).

Comments and more details:
To get it compiled - you have to modify the linker script!:
the .rodata (all constant data and strings) is moved to .progmem!
I need the RAM space for other stuff (e.g. Pico-C scripts).
The modified linker script is now part of the GitHub project (just copy over).
It has this change:
Code:
.text.progmem : {
		*(.progmem*)
		*(SORT_BY_ALIGNMENT(SORT_BY_NAME(.rodata*)))	/* this works */
		. = ALIGN(4);
	} > FLASH
(see the .rodata now on FLASH, not DTCM anymore).

Updated GitHub:
https://github.com/tjaekel/Teesny_4_1

You gave an example, I guess, that defines a huge data array (in DTCM, e.g. more than 256KB):
as long as it is never used or accessed (e.g. not initialized to zero during startup), all looks fine (it compiles clean and seems to run properly).

My project on GitHub has now a demonstration what happens if you generate too much code.
And I can still confirm: if you overflow the 256K ITCM with instructions - it crashes during runtime.

When you define in Linker Script as:
Code:
	ITCM (rwx):  ORIGIN = 0x00000000, LENGTH = 256K
you get an error (FAILS on compile), because you generate more as 256K code.

But when you keep going with the original setting, as:
Code:
	ITCM (rwx):  ORIGIN = 0x00000000, LENGTH = 512K
it compiles clean (no errors), results in a report telling me: I have now 288K on ITCM (no idea why 288K now),
it crashes, immediately (after UART is connected: I get a double flashing orange LED and nothing works anymore).

It is just one single __NOP() instruction more to generate in code. One less OK - one more FAILS.
And this issue is "hidden" when ITCM is set to 512K (compiles OK, but CRASHES).
With 256K ITCM (as the FW will configure - I get the error during compile time) - reasonable.

How to replicate:
Do similar stuff as I do in my command "test":
- define NOP instructions and sequences of it and increase ITCM usage until it fails
- modify the linker script to use 256K vs. 512K ITCM space available (and see the difference on compile and runtime behavior)
==> see the difference (still compile clean but crashing with just one more __NOP() )

Here how I have forced the issue:
Code:
void __nop(void) {
  asm("nop");
}

#define NOP10 __nop();__nop();__nop();__nop();__nop();__nop();__nop();__nop();__nop();__nop();
#define NOP100 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10
#define NOP1000 NOP100 NOP100 NOP100 NOP100 NOP100 NOP100 NOP100 NOP100 NOP100 NOP100

void testCode(void) {
  print_log(UART_OUT, FSTR("helllo from flash\r\n"));
  NOP1000
  NOP1000
  NOP100
  NOP100
  NOP100
  NOP100
  NOP100
  NOP100

  NOP10
  NOP10
  NOP10
  NOP10
  NOP10
  NOP10

  __nop();
  __nop();
  __nop();
  __nop();
  __nop();
  __nop();
  __nop();
  __nop();
  /* linker script: ITCM 256K - OK, works */

/* change it to #if 1 and 512K ITCM in linker script - FAILS!
 * use in linker script 256K ITCM - it FAILS on compile - correct!
 * compile clean with #if 1 and 512K ITCM - but it CRASHES after UART connected!
 */
#if 0         /* #if 1 - FAILS !!!!! */
  /* linker script: ITCM 256K - FAILS (on compile), too much code! - OK (correct)
   * change linker script to ITCM 512K - OK on compile, on startup it crashes! (after UART connected)
   * it reports 288 KB of code for ITCM (WRONG! - cannot work)
   */
  __nop();
#endif
}

Just increase step by step until the 256K ITCM overflows.
With 256K ITCM in Linker Script - you will see when it should start to fail (compile/linker error, check how much is left for ITCM).
When you change Linker Script to 512K - it compiles fine (no errors, is flashed), but on startup it fails.

For me "obvious":
the startup code copies now code from external flash to ITCM memory. But just 256K ITCM is available.
The last instruction to copy is now outside the (configured) memory region: it creates a "bus trap" (even the MPU config would allow: you try to write to non-existing memory: bus trap).

Never mind, all up to you.
I just do not understand, why Linker Script set to 512K reports now 288K ITCM code size, just by one single additional __NOP() instruction more.
Just changing one __NOP() instruction more or less gives me a "strange" code size difference.

It works fine for me (and safe), when I set ITCM and DTCM to 256K each - as it is configured in FW startup.

---------------
All the other issues, like:
"let's have 512K each for ITCM as well as DTCM", "keep the MPU config as 512K for each region" ...
I understand:
"if you want to modify later the split of the memories (in FW code) - all fine, e.g. 512K ITCM (and no DTCM) - no other changes to follow"
It is still OK with MPU config and Linker Script (even it will generate "wrong" code (allocations) - which I am concerned here: to get wrong compile clean reports,
not fitting to the FW startup/config).

I am fine, it works for me with "my corrected" Linker Script (256K ITCM, 256K DTCM, as configured on startup).

BTW:
The crash report generated seems to be wrong:
I tried to use a char * pointer, which was NULL.
OK, the MPU is configured in a way to catch any access to address 0x00000000 + 32bytes and stack bottom - 32bytes.
Yes, MPU has seen my read access to address NULL, but the crash report about where the causing read instruction was executed (and would located, in *.LST file) was completely wrong
(reporting a bug in a FW function, not in my function).

BTW: my issue is not a stack overflow issue:
it is too much code generated for (256K) ITCM memory!

Never mind:
great job done by you, a great board (except the lack of a real HW debugger), great support provided by you.

Good bye
(changing to a MaaxBoard RT now...)
 
I just do not understand, why Linker Script set to 512K reports now 288K ITCM code size, just by one single additional __NOP() instruction more.
Just changing one __NOP() instruction more or less gives me a "strange" code size difference.
The ITCM is allocated in 32K pages, so 256K+32K=288K. The "padding:xxxx" in the teensy_size output is the unused remainder of the last 32K page.
 
As noted - the PJRC Linker script is not the problem - as done for Teensy with 1062's it works well and good to offer full 512KB TCM for use with a dynamic split { ITCM and DTCM } as the program dictates with 32KB granularity - and constrained by the processor design.

The same will exist - twice over in some fashion - on the 1176:
2MB of Fast On-Chip SRAM
(includes 512KB of A7 TCM and 256KB M4 TCM)

avnet.com/ ... /maaxboard/maaxboard-rt/

Bummer the PJRC work on RT1176 got waylaid by cascading pandemic issues ... Might not have had all those features - but certainly not $150 for a Teensy version.
 
How to replicate:

You still haven't answered my questions from msg #7. Until you say exactly which versions of Arduino IDE and Teensyduino and QNEthernet and freertos-teensy are used, I'm going to consider this problem to not be reproducible.

I did spend time trying to compile your program with several versions of Teensyduino and Arduino IDE. Your code simply does not compile with the PJRC published packages.

Just to be clear, I will NOT investigate usage on PlatformIO (using packages published by PIO developers, which have a long history of problems as anyone can see from several recent threads on this forum). I will not investigate a problem which happens with a modified linker script.

I will only investigate if the problem is reproducible with Arduino IDE and a specific version of Teensyduino, which uses the toolchain and linker script PJRC publishes in the Teensyduino package.
 
I just skimmed this thread, but I don't think anyone felt the need to explain things to him.

it is too much code generated for (256K) ITCM memory!

Tjaekel, I think you just have a basic misunderstanding:
The CPU allows a flexible configuration, i.e. a flexible division of the 512KB memory area in question. Nowhere is it specified that the 512K must be split into 256K for ITCM and 256K for data.


The linker does not know about this, and there is no way to specify it in the MEMORY definition.
No matter which sizes you specify there - they are always "wrong" - and especially if you would specify less than 512K. Because every specification would be too inflexible. There may be programs that need only 32K RAM (for data), but 480K code. It does not make sense to write a separate script for each case.

Therefore, the only feasible way was taken here, and the maximum size was specified for both. The linker can't do it any other way. Because the MCU must be configured in this respect AT RUN TIME.

Further down in the linker script the required ITCM size is calculated. It can only be defined in blocks of 32K.
Ergo the size of ITCM (and DTCM, too) is always a multiple of 32KB. The DTCM is then configured to use whatever remains of the available 512KB.

This is not done by the linker - because it simply CAN'T do it. Code is needed. In the startup code, this memory allocation is then set according to the above calculation - and voilà - all is well - and correct.

You could now object that the linker can no longer determine whether the 512 KB are sufficient for ITCM + DTCM. That is absolutely correct. But for this there is the additional teensy_size tool, which checks the sizes accordingly and outputs errors if necessary.
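
A rough sketch of that kind of check, and of why a fixed split in the linker script would be too inflexible. This is not the teensy_size source; `fits_dynamic_split` and `fits_fixed_split` are hypothetical helpers, and the only assumptions are the 512K total and the 32K granularity described above.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kBank     = 32u * 1024u;   // ITCM grows in 32K blocks
constexpr uint32_t kTcmTotal = 512u * 1024u;  // ITCM and DTCM share this area

// ITCM can only grow in whole blocks, so round the code size up.
inline uint32_t round_up_to_bank(uint32_t bytes) {
    return (bytes + kBank - 1) / kBank * kBank;
}

// Dynamic split: whatever the rounded-up code leaves over is available as DTCM.
inline bool fits_dynamic_split(uint32_t code_bytes, uint32_t data_bytes) {
    assert(code_bytes <= kTcmTotal && data_bytes <= kTcmTotal);  // keeps the sum from wrapping
    return round_up_to_bank(code_bytes) + data_bytes <= kTcmTotal;
}

// Fixed split, as a MEMORY block with 256K for ITCM and 256K for DTCM would enforce it.
inline bool fits_fixed_split(uint32_t code_bytes, uint32_t data_bytes) {
    return code_bytes <= 256u * 1024u && data_bytes <= 256u * 1024u;
}
```

The example from above (480K code, 32K data) passes `fits_dynamic_split` but is rejected by `fits_fixed_split`, although the hardware can run it. One byte more code than 480K no longer fits with 32K of data, because the code then needs all 16 blocks.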


Good bye
(changing to a MaaxBoard RT now...)

Unfortunately, if you don't read the manuals and further documentation for that either, you won't have more success with it.
But, as a system engineer at a chip design company, you know that.
 
[...]
My project on GitHub has now a demonstration what happens if you generate too much code.
And I can still confirm: if you overflow the 256K ITCM with instructions - it crashes during runtime.
[...]

FYI, a simple program using more than 470 kB of ITCM that compiles (Arduino IDE 2.1.1 and Teensyduino 1.58.1) and runs correctly:

Code:
FASTRUN uint8_t placeholder[440 * 1'024]; // increase ITCM memory usage

void setup() {
    Serial.begin(0);
    delay(3'000);

    Serial.println("\r\n\nSTART\r\n\n");
    delay(1'000);
}

void loop() {
    static size_t col, row;

    Serial.print("*");

    if (++col == 64 && placeholder[0] == 0) {
        Serial.println();
        Serial.printf("%8u: ", ++row);
        col = 0;
    }
}

void loop() is located at 0x0006e0ec which is above 256 kB of ITCM.
It's pretty clear that the original linker script does the right thing for a 1062, but just in case someone prefers to have a running example.
 
Linker Script IS CORRECT!
as well as the MPU config.

I take it back.
Linker Script has just a "risk".

Details:
Linker Script, "imxrt1062_t41.ld":
Code:
	_flexram_bank_config = 0xAAAAAAAA | ((1 << (_itcm_block_count * 2)) - 1);
file "startup.c":
Code:
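void ResetHandler(void)
{
	IOMUXC_GPR_GPR17 = (uint32_t)&_flexram_bank_config;	/* bank assignment computed by the linker script */
	/* ... remainder of the reset handler omitted ... */
}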
MCU reference manual:
31.4.2 RAM Bank Allocation, page 1896
11.3.18 GPR17 register, page 363
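
To see what value ends up in GPR17, here is a small host-side sketch of the linker expression quoted above. The function names are mine, not from the Teensy core. It assumes the 2-bit-per-bank encoding from the reference manual sections listed above, where 0b11 selects ITCM and 0b10 selects DTCM.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the linker script line:
//   _flexram_bank_config = 0xAAAAAAAA | ((1 << (_itcm_block_count * 2)) - 1);
// 0xAAAAAAAA marks all 16 banks as DTCM (0b10); the low bits flip the
// first itcm_banks banks to ITCM (0b11).
inline uint32_t flexram_bank_config(unsigned itcm_banks) {
    assert(itcm_banks <= 16);
    // 64-bit shift so that 16 banks (a shift by 32) is still well defined
    const uint64_t itcm_mask = (uint64_t{1} << (itcm_banks * 2)) - 1;
    return 0xAAAAAAAAu | static_cast<uint32_t>(itcm_mask);
}

// Count the banks whose 2-bit field equals `field`.
inline unsigned count_banks(uint32_t config, uint32_t field) {
    unsigned n = 0;
    for (unsigned bank = 0; bank < 16; ++bank) {
        if (((config >> (bank * 2)) & 0x3u) == field) ++n;
    }
    return n;
}
```

For 8 ITCM banks this yields 0xAAAAFFFF (8 banks ITCM, 8 banks DTCM). For 9 ITCM banks it yields 0xAAABFFFF, so only 7 banks (224K) remain for DTCM.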

What happens? (Why does my project fail, "the risk")
A bit more code can cause the next 32K ITCM bank to be allocated.
This additional ITCM bank takes one bank (32K) away from the DTCM.
If the DTCM size is reduced, the space left for the stack is reduced as well.
So, a bit more code - and the available stack size "suddenly" drops by 32K.

There is NO check for a "minimum stack size required". ("the risk")

What to do?
Have a clue about how much stack your FW needs while the project is running (at run-time).
Check the compile report carefully, esp. the "free for local variables:xxxx" - is it large enough?

Be aware of: "an increase of code size might reduce the available stack size" ("the risk").
As in my case: a bit more code added - and it crashes.
It is not the added code itself - it is the "risk" that the stack now gets too small (and during runtime you get a stack overflow, or an access outside the available memory regions...).
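
A small worked example with the numbers from the compile report in the first post (code 262104, variables 183616). It is only a sketch that runs on a PC: `free_for_locals` and `stack_is_sufficient` are hypothetical helpers, and the 64K minimum stack used in the note below is an arbitrary assumption, not a value from my project.

```cpp
#include <cassert>
#include <cstdint>

constexpr int32_t kBank     = 32 * 1024;   // ITCM/DTCM bank size
constexpr int32_t kTcmTotal = 512 * 1024;  // shared between ITCM and DTCM

// Bytes left in DTCM for the stack ("free for local variables").
// Negative if code plus variables do not fit at all.
inline int32_t free_for_locals(int32_t code_bytes, int32_t variable_bytes) {
    assert(code_bytes >= 0 && variable_bytes >= 0);
    const int32_t itcm_bytes = (code_bytes + kBank - 1) / kBank * kBank;  // whole banks only
    return kTcmTotal - itcm_bytes - variable_bytes;
}

// The check the build does not make: is the remaining stack at least min_stack?
inline bool stack_is_sufficient(int32_t code_bytes, int32_t variable_bytes, int32_t min_stack) {
    return free_for_locals(code_bytes, variable_bytes) >= min_stack;
}
```

`free_for_locals(262104, 183616)` gives 78528, the same figure as in the report. With just 44 bytes more code (262148) it drops to 45760, so a project that needs 64K of stack passes the check before and fails it after, without any warning from the build.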

Remark:
I use "freertos-teensy" in my project. It allocates the task stacks on the MCU stack. So, it "steals" stack size:
even if "free for local variables:xxxx" looks large, FreeRTOS grabs a lot of it - no idea how much is really left for local variables.
This can neither be seen nor reported by the compiler and linker script. How much of the stack will be "pre-allocated" for tasks/threads? Get a clue how much stack is allocated by third-party code (FreeRTOS) and by the FW code used...

How to solve?
In my case: I had to free some memory in order to get more space in the DTCM (where the stack lives, i.e. "increase the available stack size again").
You can do this by:
Move code, .data and .rodata (const data and strings) to other memories.
Use the __attribute__(()) macros to "reorganize the memory layout".
Use FLASHMEM (or PROGMEM) to move code away from ITCM (leaves more banks for DTCM and the stack in it).
Use DMAMEM to move data away from DTCM (leaves more of the DTCM for the stack).
It is even possible to move some data, strings... to ITCM with FASTRUN:
yes, placing data in ITCM works (even R/W).
You can also move const data and strings to FLASHMEM:
Code:
.text.progmem : {
		*(.progmem*)
		*(SORT_BY_ALIGNMENT(SORT_BY_NAME(.rodata*)))	/* this works */

But: be aware of the "side effects", such as:
DMAMEM is NOT initialized! The speed there is also different, which can affect the runtime performance (e.g. FLASHMEM
is slower; DMAMEM is slower too, even though it goes through the DCache, which might "compensate" - but cache maintenance might then be required).
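
A minimal sketch of how the placement macros can be combined, including the explicit clearing that DMAMEM needs. The buffer and function names are hypothetical. The `#else` branch only exists so that the snippet also compiles on a PC, where the macros expand to nothing; cache maintenance for DMA transfers is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(ARDUINO)
#include <Arduino.h>   // Teensyduino provides FLASHMEM, DMAMEM and FASTRUN
#else
#define FLASHMEM       // host build: placement macros expand to nothing
#define DMAMEM
#define FASTRUN
#endif

// Large buffer moved out of DTCM (hypothetical example data).
DMAMEM static uint8_t rx_buffer[4096];

// Rarely called setup code does not need ITCM speed, so keep it in flash.
FLASHMEM void init_buffers() {
    // DMAMEM is not initialized at startup, so clear it before first use.
    std::memset(rx_buffer, 0, sizeof rx_buffer);
}

// Hot path stays in ITCM.
FASTRUN uint32_t buffer_checksum() {
    uint32_t sum = 0;
    for (size_t i = 0; i < sizeof rx_buffer; ++i) sum += rx_buffer[i];
    return sum;
}
```

On the Teensy you would call `init_buffers()` once from `setup()`. Every function moved to FLASHMEM and every buffer moved to DMAMEM leaves more of the 512K for DTCM and the stack.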

When a bit more code allocates the next 32K bank for ITCM (and reduces DTCM by 32K), most of that 32K ITCM bank is not used (see "padding" in the compile report).
You can still make use of it by moving some data into ITCM (away from DTCM). Just be aware: a data access in ITCM now briefly blocks the code fetch from there.

How to make sure I have "minimum of required DTCM" (and stack size)?
You can modify the linker script. Change one or both of these lines in the linker script "imxrt1062_t41.ld":
Code:
ITCM (rwx):  ORIGIN = 0x00000000, LENGTH = 512K
DTCM (rwx):  ORIGIN = 0x20000000, LENGTH = 512K
for instance into:
Code:
ITCM (rwx):  ORIGIN = 0x00000000, LENGTH = 256K
DTCM (rwx):  ORIGIN = 0x20000000, LENGTH = 256K

This makes sure you have at least 256K for DTCM. If your code size now grows beyond 256K (and would take banks away from this "minimal DTCM size"), you get an error at link time.
Based on the report you can now see whether it is your code that overflows, or too much data.
Still watch the "free for local variables:xxxx" carefully (it tells you how much stack remains available).

This avoids getting trapped by "DTCM is automatically (!!) decreased" (and the stack size in particular) when adding code.
(With 512K each, the split is decided by the code size and only configured during startup: "out of stack" only shows up as a crash - when, where and how it crashes is a bit unpredictable.)

Personal Comments
1. I think I tried to talk about technical issues without offending people, without using an offensive style, and esp. without attacking people personally.
All the comments were helpful, even if it took me a while to get it (I am sorry) and to find my root cause (which is in the end: "too much code").
Thank you, esp. to "Mcu32" for his kind words.

2. The Teensy 4.1 board is great, the support is great (just the forum is sometimes a bit rude).
The board, the IDE and the FW work fine.
For my huge project and my requirements (in terms of memory space, performance, code size), I "have" to switch to another board.
Nothing else! Not because the Teensy 4.1 is "bad" - it is NOT! - just because my requirements make it tough to implement on this board
(why so aggressive about my decision? It was just to tell you: "I might not be so active here anymore").

All fine. I found the root cause for my issue, even with the need to "brainstorm with you here". Thank you.
 