Teensy 4.1 : Linker Script correct? MPU config? DMAMEM?

Status
Not open for further replies.

tjaekel

Well-known member
Moderator Edit: linker script is fine, see this message for info.


I think (actually, I am sure) - the Teensy 4.1 Linker Script is NOT correct:

I extend my project, I shuffle code and data around (e.g. FASTRUN, DMAMEM, regular DTCM RAM...),
the code starts crashing even though it was working before: I only placed things in a different memory, without changing a line of code.
And it can crash immediately on startup of my code - very tough to recover from this situation!

Reason:
The Linker Script does not seem to be correct.
It can generate code and data access outside the available memories.

Details:
The linker script "imxrt1062_t41.ld" has this definition:
Code:
MEMORY
{
    ITCM (rwx):  ORIGIN = 0x00000000, LENGTH = 512K
    DTCM (rwx):  ORIGIN = 0x20000000, LENGTH = 512K
    RAM (rwx):   ORIGIN = 0x20200000, LENGTH = 512K
    FLASH (rwx): ORIGIN = 0x60000000, LENGTH = 7936K
    ERAM (rwx):  ORIGIN = 0x70000000, LENGTH = 16384K
}

So, it means: you have actually 512K ITCM and 2*512K for data (DTCM + RAM) = 1.5 MB of internal RAM.
As I understand the NXP RT1062 datasheet:
it has total 1 MB (2x 512K) internal - not 1.5MB!
The RAM1 can be configured as "split" memory for ITCM and DTCM:
so, the 512 KB can be split into ITCM and DTCM, so both together in total 512K - not 1 MB!

This results in this issue:
I can generate up to 512K code (for ITCM) plus 512K data (for DTCM).
The Linker will not complain, because it was "told" to have 512K each.
BUT NOT TRUE!
The MCU has just 512K total for ITCM PLUS DTCM, not 1M. So, code or data or both become located outside a valid memory region in MCU.
This must crash (with a Bus Error or Hard Fault) - and it does.

Why 2*512K for ITCM and DTCM?
When I see the code for the MPU configuration, which is not really correct (but it is not the root cause), it configures also
512K for ITCM, 512K for DTCM, 512K for DMAMEM (RAM2).
Maybe, the Linker Script config was set in correlation with the MPU config (in effect), even the MPU config is not really correct
(it should have 256K ITCM, 256K DTCM regions, instead).

MPU Config and DMAMEM
I thought, "DMAMEM" means: this memory is intended for DMA operations, buffers, used by DMAs etc. It should be "coherent":
no need for cache maintenance, a DMA can use and update this memory without "coherency issues".

But the code I saw for the MPU config tells me:
this region (RAM2) is configured as WBWA. I had assumed it would be "not cached" or WT.
This WBWA tells me:
you have to use cache maintenance operations, like Clean and Invalidate, before and after a DMA.

I have no clue if my SPI in DMA does it. It seems to work, so I guess, there is cache maintenance.
If you implement your own DMA, using DMAMEM - make sure to use Clean and Invalidate in relation with running a DMA.
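
A minimal sketch of the address arithmetic behind that advice, assuming the Cortex-M7 data cache line of 32 bytes: Clean and Invalidate act on whole cache lines, so a DMA buffer is safest when it starts on a 32-byte boundary and its size is a multiple of 32. `cache_span` and `dma_safe` are hypothetical helpers for illustration, not part of the Teensy core. On a Teensy 4.x the resulting range would be passed to the core's cache routines (the core appears to provide `arm_dcache_flush()` and `arm_dcache_delete()`; treat those names as an assumption and check imxrt.h in your install).

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 32u  /* assumed Cortex-M7 data cache line size in bytes */

/* Hypothetical helper: a buffer range expanded to whole cache lines. */
typedef struct {
    uintptr_t start;   /* first byte of the first cache line touched */
    size_t    length;  /* number of bytes, a multiple of CACHE_LINE  */
} cache_span_t;

static cache_span_t cache_span(uintptr_t addr, size_t size)
{
    cache_span_t s;
    uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1u);
    uintptr_t end  = (addr + size + CACHE_LINE - 1u) & mask; /* round up   */
    s.start  = addr & mask;                                  /* round down */
    s.length = (size_t)(end - s.start);
    return s;
}

/* Returns 1 if the buffer shares no cache line with neighbouring data,
   so maintenance on it cannot disturb unrelated variables. */
static int dma_safe(uintptr_t addr, size_t size)
{
    return (addr % CACHE_LINE == 0u) && (size % CACHE_LINE == 0u);
}
```

A 40-byte buffer at 0x20200010 touches two cache lines (0x20200000 to 0x2020003F), so `dma_safe` rejects it, while a 64-byte buffer at 0x20200000 passes.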

Why is const data on DTCM?
It is a bit annoying, that all Read-Only, constant data, e.g. const data structures, const strings ... are all placed as well on DTCM RAM
(the memory for high speed data access).
It blows up my DTCM (default) memory for all data, even all const go there.
I thought const goes into a Read-Only memory.
OK: I found, this "feature" is documented (but still unclear why this way and a bit annoying to have it this way).

I do not see a reason why to have it this way. It reduces at the end the memory for my read-write data available.
And the ITCM memory region might have still enough free space to keep my const data there.

BTW: you can use ITCM (intended for code), via FASTRUN, also as data memory. It works!
Data can be located on ITCM, even write-able.

BTW2: It does NOT work to move *.rodata to ITCM (assuming, it cannot be/is not initialized during startup)
It works if you move *.rodata to FLASHMEM (or PROGMEM, the same).

Modification of linker script "imxrt1062_t41.ld":
Code:
.text.progmem : {
        *(.progmem*)
        *(SORT_BY_ALIGNMENT(SORT_BY_NAME(.rodata*)))    /* this works */
        . = ALIGN(4);
    } > FLASH

.text.itcm : {
        . = . + 32; /* MPU to trap NULL pointer deref */
        *(.fastrun)
        *(.text*)
        /* *(SORT_BY_ALIGNMENT(SORT_BY_NAME(.rodata*))) - does not work! */
        . = ALIGN(16);
    } > ITCM  AT> FLASH

.data : {
        *(.endpoint_queue)   
        /* *(SORT_BY_ALIGNMENT(SORT_BY_NAME(.rodata*))) - don't have const on data memory */
        *(SORT_BY_ALIGNMENT(SORT_BY_NAME(.data*)))
        KEEP(*(.vectorsram))
    } > DTCM  AT> FLASH

Conclusion
Very strange to see this discrepancy between Linker Script and physical features of MCU (as: 1 MB total internal RAM available, not 1.5 MB).
If not realized and fixed yet - it "tells" me:
nobody has ever created a large project (with more than 512K code and 512K data), or nobody has tested such a large project.

Due to this "incorrect" linker script - you do not get any warning that your code or data size is too large.
Instead: you will just realize when the code crashes during runtime, worst case: randomly depending which code or data "outside of available memories"
is invoked.
Maybe people have been trapped into this issue, seeing their project is crashing, when they extend their project or reorganize the memory use/locations.

I think, the Linker Script should be fixed to this definition:
Code:
MEMORY
{
    ITCM (rwx):  ORIGIN = 0x00000000, LENGTH = 256K
    DTCM (rwx):  ORIGIN = 0x20000000, LENGTH = 256K
    RAM (rwx):   ORIGIN = 0x20200000, LENGTH = 512K
    FLASH (rwx): ORIGIN = 0x60000000, LENGTH = 7936K
    ERAM (rwx):  ORIGIN = 0x70000000, LENGTH = 16384K
}
(see the 256K there)

The MPU config could be fixed as well, but it should be fine to leave it as it is
(as long as no code is generated which could access outside available memories - as it is possible right now!)
 
Is this building under Arduino? PJRC build handles overflow of shared 512KB between the two elements that Share TCM:
ITCM (rwx): ORIGIN = 0x00000000, LENGTH = 512K
DTCM (rwx): ORIGIN = 0x20000000, LENGTH = 512K

Declaring them each 256K would be limiting and wrong and it is not how it works with PJRC toolset as published and implemented for the 1062 MCU's.

It is done this way as All code can be put into FLASH and that would allow DTCM to occupy that shared TCM area.
Likewise if no data was used in the DTCM all the TCM code could be allocated to ITCM.

obv. ALL is extreme - but the way the 1062 is set to work with PJRC TeensyDuino is 512KB of TCM is split between I and D where Instructions get what they need in 32KB blocks, and the remainder of that shared 512KB is then available for Data.

Not sure where - but in the build process supported by PJRC this is enforced and if the two halves end up consuming over 512KB combined it breaks the build with a note to that effect.

Teensy_size.exe is part of the build and it prints this information after the results are known:
FLASH: code:97372, data:13564, headers:8868 free for files:8006660
RAM1: variables:16736, code:91544, padding:6760 free for local variables:409248
RAM2: variables:6272 free for malloc/new:518016
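
The RAM1 line above can be reproduced by hand. The following is a minimal sketch of that arithmetic, assuming ITCM is granted in whole 32KB blocks out of the shared 512KB and the rest goes to DTCM, as described in this post. `ram1_report` is a hypothetical helper written for illustration; it is not the teensy_size source.

```c
#include <stdint.h>

#define TCM_TOTAL (512u * 1024u)  /* TCM shared by ITCM and DTCM     */
#define BANK_SIZE (32u * 1024u)   /* ITCM is granted in 32KB blocks  */

typedef struct {
    uint32_t itcm_bytes;  /* code rounded up to whole 32KB blocks        */
    uint32_t padding;     /* unused tail of the last ITCM block          */
    int32_t  free_local;  /* DTCM left over; negative means an overflow  */
} ram1_report_t;

static ram1_report_t ram1_report(uint32_t code, uint32_t variables)
{
    ram1_report_t r;
    uint32_t blocks = (code + BANK_SIZE - 1u) / BANK_SIZE;  /* round up */
    r.itcm_bytes = blocks * BANK_SIZE;
    r.padding    = r.itcm_bytes - code;
    r.free_local = (int32_t)TCM_TOTAL - (int32_t)r.itcm_bytes
                 - (int32_t)variables;
    return r;
}
```

With code:91544 and variables:16736 this gives 3 blocks (98304 bytes), padding 6760 and 409248 bytes free, matching the output shown.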

Yes, 'const' is not enough for the build to leave that data on FLASH without explicit PROGMEM.
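
A small sketch of the difference, assuming the Teensy 4.x core defines `PROGMEM` and that flash is memory-mapped so the data can be read through an ordinary pointer. The empty fallback macro and the names `banner_ram`, `banner_flash` and `banner_total` are only there so the example compiles on a host; they are assumptions for illustration.

```c
#include <string.h>

/* On Teensy 4.x the core supplies PROGMEM. This empty fallback is an
   assumption that only lets the example build on a host compiler. */
#ifndef PROGMEM
#define PROGMEM
#endif

/* Plain const: per this thread, still ends up in DTCM (RAM1). */
static const char banner_ram[] = "hello from RAM1";

/* const plus PROGMEM: left in flash instead of being copied to DTCM. */
static const char banner_flash[] PROGMEM = "hello from flash";

/* Both strings are read the same way by the calling code. */
static size_t banner_total(void)
{
    return strlen(banner_ram) + strlen(banner_flash);
}
```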

There is NO cache on any TCM memory as it runs natively at full processor speed.
 
If you look at the memory map diagram on the PJRC site, you can see what's actually going on.

ITCM and DTCM are both 512KB, and while they have different base addresses, they both map to the same physical 512KB of Tightly Coupled Memory (TCM). In practice, the sum of ITCM (Instruction TCM) + DTCM (Data TCM) is 512KB, but the split is dependent on how you decide to use it. The second block of 512KB is (slightly slower) RAM, so the total amount of "RAM" in T4.x is 512KB + 512KB = 1MB.

EDIT:
Conclusion -- nobody has ever created a large project (with more than 512K code and 512K data), or nobody has tested such a large project.
I'm struggling to find a way to gently say that a little humility would go a long way.
 
I use Arduino IDE 2.1.1 with Teensyduino V1.58.1

"Declaring them each 256K would be limiting and wrong" - NO - it would be the correct way!
MCU does not have 3x 512K memories! (I am sure)

The other stuff with 'const', the "no caches on TCM" ... I am aware of and know.
(intending to post a separate info thread, PSTR becomes a bit "strange")

My compile results look like this - after changing to 256 ITCM and DTCM each (which I think is correct):
Code:
Memory Usage on Teensy 4.1:
  FLASH: code:270504, data:88660, headers:8448   free for files:7758852
   RAM1: variables:178592, code:256712, padding:5432   free for local variables:83552
   RAM2: variables:145152  free for malloc/new:379136

It makes sense now:
RAM1 has total 524,288 bytes = 512K (both ITCM and DTCM)
But as you see:
code (as ITCM) is already 256,712, so just short of the max. of 262,144 (256K).
But "free for local variables" tells me: 83552.

If I now increase a buffer, even one not used yet when the FW starts, by an additional 64KB - it crashes immediately on startup
(while waiting for the UART to be connected).

So, I cannot trust the "free" space information!
If I increase just a buffer by the remaining "free" - the compile, linker is still OK, no overflow.
BUT IT CRASHES NOW!
The "free" does not really mean you can make use of it, e.g. increase a buffer by this "free" amount: it crashes now.
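
The numbers in this post can be checked with a short sketch. It assumes that "free for local variables" is the DTCM left for the stack after static data, so growing a static buffer eats into stack headroom before the build-time total overflows. `check_headroom` is a hypothetical helper, and the stack requirement passed to it is the caller's own estimate, not something the build reports.

```c
#include <stdint.h>

typedef enum { STATICS_OVERFLOW, STACK_AT_RISK, LIKELY_OK } headroom_t;

/* free_local: "free for local variables" as printed by the build
   extra:      additional static data being added, in bytes
   stack_need: the caller's estimate of peak stack use (an assumption) */
static headroom_t check_headroom(int32_t free_local, int32_t extra,
                                 int32_t stack_need)
{
    int32_t left = free_local - extra;
    if (left < 0)
        return STATICS_OVERFLOW;  /* statics alone no longer fit        */
    if (left < stack_need)
        return STACK_AT_RISK;     /* links cleanly, may fault at run time */
    return LIKELY_OK;
}
```

Starting from 83552 bytes free, a 64KB (65536 byte) buffer leaves 18016 bytes. Whether that is enough depends entirely on how much stack the firmware actually uses, which the figures in this thread do not show.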
 
NO - they are not "both 512K"!
The datasheet says: the 512K can be shared, can be split, into ITCM and DTCM.
So, both, ITCM PLUS DTCM have max. 512K, but not 512K each.

Yes, it depends, how do you configure the split. But it remains as: 512K in total for both (not 1M).

Yes, ITCM + DTCM + RAM2 (DMAMEM) = 1M,
but not 1.5M as seen in linker script (which generates code for 1.5M internal memory).

I am quite at the end of maximum memory available, code is almost 256K, data is about 190K.
If I increase a buffer size by additional 64K ("free" tells me 83552 is still left) - all compile is clean -but FW crashes!

Something seems to be wrong: the linker script looks wrong to me.
 
I use Arduino IDE 2.1.1 with Teensyduino V1.58.1
...

Perhaps the build process of IDE 2.1.1 is giving bad results. Please provide a repro case with unaltered installed TD 1.58.1 cores.

This has been working as designed and expected for years - and the SHARED 512 KB TCM memory is properly handled, and errors when violated.

Perhaps something IN IDE 2 isn't right or the code at hand is faulty.

But both ITCM and DTCM target the same 512KB on a sliding division as needed. Setting either to a lower fixed value is a waste of resources.
 
Sorry, I am lost.
Not an IDE issue, the linker script (or runtime code) is wrong.

Sure, if nobody has ever generated a project with more than 1M code and data (and it would compile clean) ...

I think about: "Setting either to a lower fixed value is a waste of resources"
NO, it is already a waste of resources:
if I have few code but a lot of data - data memory (DTCM) will not be large enough to keep data, but ITCM has still a lot of free space.

No need to change the "split":
ITCM can also keep data, just define variables with FASTRUN.
When I see, my DTCM overflows, I can still relocate some buffers, variables ... to ITCM (code memory), via FASTRUN as attribute.

I am OK, with this 256K plus 256K split, I can still reshuffle data to ITCM.
My concern is just:
if linker script allows to generate and place code and/or data "outside" available memory regions - it might crash during runtime.
 
Are you talking about having "aliasing"?
Maybe, but it does not make sense:
if a data location shares the same address as a code location (due to aliasing), it will destroy my code.

I think, the Teensy startup is configured as:
ITCM = 256K
DTCM = 256K
but this is not reflected correctly in linker script.
 
ITCM and DTCM are the SAME physical memory. In other words, when the processor executes an instruction at 00000000, or accesses data at 20000000, it is accessing the same memory location. Either ITCM or DTCM can be as much 512K, but the TOTAL is limited to 512K, so all of these are possible:

ITCM = 0 DTCM = 512
ITCM = 512 DTCM = 0
ITCM = 256 DTCM = 256
 
Prior posts by @defragster and @joepasquariello are correct in practice and use as implemented.

The TCM memory addressable area is 512KB and runs at full processor speed.

How the sketch is built determines how much is allocated to ITCM code, which is copied from flash by the PJRC startup code.

Rounded up to a complete 32KB block the remainder is available to DTCM for data use and the build and processor make this work.

The build uses one address for CODE and another for DATA and the processor resolves this to properly run:
Code:
	ITCM (rwx):  ORIGIN = 0x00000000, LENGTH = 512K
	DTCM (rwx):  ORIGIN = 0x20000000, LENGTH = 512K

If ever the two combine to over 512KB the build breaks as this would be unusable.

There is WASTE in the ITCM, reported as padding: the final 32KB ITCM block may not be filled, but its unused part is not available to the data area.

The PJRC T_4.0 and T_4.1 have details on the memory and that evolved from forum posts back to Beta and has not changed as it works to allow full use of 512KB TCM memory space - but no more.
 
Sure, internally, inside the chip, the same memory (512K in total).
But fetching code as 0x0000000X cannot be the same as reading code from 0x2000000X.
I think, the "split configuration" sets a "memory mapping", so, when divided into 256K + 256K, the read/write on address 0x2000000X is
at 256K + X (not the same memory location!)

Back to the issue:
Do you think, that 512K + 512K + 512K in linker script is correct?
(I do not think so)
Do you want to say, that 512K ITCM + 512K DTCM is correct?
(based on your response - you might also disagree)
 
Indeed, in that regard the linker script has been performing perfectly for years in conjunction with the 1062 MCU function and features.

Telling the tool chain that both ITCM and DTCM have full access to the 512K is 'under the covers' and this is correct.

The only problem is detected later and that is when Code + Data end up using a combination over 512K.

> One problem not detected until runtime is when the available DTCM RAM for the Stack is insufficient - but this is a normal issue.

P#1 suggested a situation was found where the tools did not indicate a failed build and provided a compromised HEX file for upload?

If this is the case then it would need to be presented for repro as it has not been a problem before as the build enforces the 512KB limit on usage of TCM memory space.

At a minimum, a first step would be to show the verbose console build output where this information indicated an error and yet the build was allowed to complete:
Code:
Memory Usage on Teensy 4.1:
  FLASH: code:40704, data:9164, headers:8496   free for files:8068100
   RAM1: variables:10496, code:35944, padding:29592   free for local variables:448256
   RAM2: variables:12416  free for malloc/new:511872
 
where I am wrong?

To say "wrong" would be overly dramatic or judgemental. But you might be lacking understanding of the way FlexRAM partitioning works, and then reaching conclusions based on assumptions of more traditional hardware.

Before you continue this conversation, please read chapter 31 of the reference manual, starting on page 1783. In particular, pay attention to "RAM Array portioning" mentioned in the features list on page 1782, and section 31.3.2 "RAM Bank Allocation" on page 1784. Also pay close attention to the IOMUXC_GPR_GPR16 and IOMUXC_GPR_GPR17 registers documented on page 362-363. Chapter 31 alone could leave you with the false impression the fuses are the only way to configure FlexRAM partitioning, so it is critically important to notice the FLEXRAM_BANK_CFG_SEL bit in IOMUXC_GPR_GPR16, on page 362.



NO - they are not "both 512K"!

Of course not. But with the way FlexRAM partitioning works, either *could* be up to 512K (or 480K, as neither can really be zero for any practical program).

The linker wasn't designed for this sort of configurable hardware. So we have a sort of chicken-and-egg problem. We can't specify the final size of ITCM and DTCM in the linker script memory section, because the linker has not yet run to actually determine the required size. The maximum possible size needs to be given to allow the linker to work.

The actual size is determined later. The important part of the linker script you're missing is these 3 lines:

Code:
        _itcm_block_count = (SIZEOF(.text.itcm) + SIZEOF(.ARM.exidx) + 0x7FFF) >> 15;
        _flexram_bank_config = 0xAAAAAAAA | ((1 << (_itcm_block_count * 2)) - 1);
        _estack = ORIGIN(DTCM) + ((16 - _itcm_block_count) << 15);
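
To see what those three expressions produce, here is a minimal C sketch that mirrors them. It assumes the reference manual's two-bit encoding per 32KB bank (0b10 = DTCM, 0b11 = ITCM), so 0xAAAAAAAA means all 16 banks are DTCM and the low bits flip the first banks to ITCM. The function names are hypothetical and the 64-bit intermediate is a choice made here to avoid an undefined 32-bit shift when all 16 blocks are ITCM.

```c
#include <stdint.h>

#define DTCM_ORIGIN 0x20000000u  /* ORIGIN(DTCM) from the MEMORY section */

/* Number of 32KB banks needed for the ITCM content, rounded up. */
static uint32_t itcm_block_count(uint32_t itcm_bytes)
{
    return (itcm_bytes + 0x7FFFu) >> 15;
}

/* Every bank starts as 0b10; the lowest `blocks` banks become 0b11. */
static uint32_t flexram_bank_config(uint32_t blocks)
{
    uint64_t itcm_mask = ((uint64_t)1 << (blocks * 2u)) - 1u;
    return 0xAAAAAAAAu | (uint32_t)itcm_mask;
}

/* Initial stack pointer: the top of whatever is left for DTCM. */
static uint32_t estack(uint32_t blocks)
{
    return DTCM_ORIGIN + ((16u - blocks) << 15);
}
```

For 91544 bytes of ITCM content this gives 3 blocks, a bank config of 0xAAAAAABF and an initial stack pointer of 0x20068000.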

This works together with the first lines in startup.c.

Code:
__attribute__((section(".startup"), naked))
void ResetHandler(void)
{
        IOMUXC_GPR_GPR17 = (uint32_t)&_flexram_bank_config;
        IOMUXC_GPR_GPR16 = 0x00200007;
        IOMUXC_GPR_GPR14 = 0x00AA0000;
        __asm__ volatile("mov sp, %0" : : "r" ((uint32_t)&_estack) : "memory");

Because the linker can't understand the actual memory size limit, as it was given "wrong" info for the maximum possible size of regions which get partitioned portions of the actual memory, compile time checking is done inside the "teensy_size" utility. The source code is on github. Here's the relevant check, specifically "free_for_local".

https://github.com/PaulStoffregen/t...7dea42ee55035ed9499054e7e84/teensy_size.c#L99


If I increase a buffer size by additional 64K ("free" tells me 83552 is still left) - all compile is clean -but FW crashes!

This is the moment where I wish to gently remind you of the "Forum Rule" which appears in red text at the top of every page of this forum. The forum rule has some flexibility. It is often a matter of social expectation. Especially for novices and people experiencing some library that doesn't compile, there is a lot of leeway. But on the opposite extreme we have cases like this, where someone claiming to be an expert insists library code which has been very widely used for years has a crucial bug. I expect that sort of strident claim to be backed up with a small-as-possible test case which can be copied into Arduino IDE and run on a Teensy to reproduce the problem.

If you continue to insist the linker script is wrong, with this rather strong tone, you must provide a test case as a small but complete program to demonstrate the problem. Don't make me repeat this!
 
BTW:
the MPU config should look like this (but it does not solve the problem with incorrect linker script):
Code:
SCB_MPU_RBAR = 0x00000000 | REGION(i++); //https://developer.arm.com/docs/146793866/10/why-does-the-cortex-m7-initiate-axim-read-accesses-to-memory-addresses-that-do-not-fall-under-a-defined-mpu-region
	SCB_MPU_RASR = SCB_MPU_RASR_TEX(0) | NOACCESS | NOEXEC | SIZE_4G;
	
	SCB_MPU_RBAR = 0x00000000 | REGION(i++); // ITCM
	SCB_MPU_RASR = MEM_NOCACHE | READWRITE | SIZE_256K;

	// TODO: trap regions should be created last, because the hardware gives
	//  priority to the higher number ones.
	SCB_MPU_RBAR = 0x00000000 | REGION(i++); // trap NULL pointer deref
	SCB_MPU_RASR =  DEV_NOCACHE | NOACCESS | SIZE_32B;

	SCB_MPU_RBAR = 0x00200000 | REGION(i++); // Boot ROM
	SCB_MPU_RASR = MEM_CACHE_WT | READONLY | SIZE_128K;

	SCB_MPU_RBAR = 0x20000000 | REGION(i++); // DTCM
	SCB_MPU_RASR = MEM_NOCACHE | READWRITE | NOEXEC | SIZE_256K;
	
	SCB_MPU_RBAR = ((uint32_t)&_ebss) | REGION(i++); // trap stack overflow
	SCB_MPU_RASR = SCB_MPU_RASR_TEX(0) | NOACCESS | NOEXEC | SIZE_32B;

	SCB_MPU_RBAR = 0x20200000 | REGION(i++); // RAM (AXI bus)
	SCB_MPU_RASR = MEM_CACHE_WBWA | READWRITE | NOEXEC | SIZE_512K;

	SCB_MPU_RBAR = 0x40000000 | REGION(i++); // Peripherals
	SCB_MPU_RASR = DEV_NOCACHE | READWRITE | NOEXEC | SIZE_64M;

	SCB_MPU_RBAR = 0x60000000 | REGION(i++); // QSPI Flash
	SCB_MPU_RASR = MEM_CACHE_WBWA | READONLY | SIZE_8M;

	SCB_MPU_RBAR = 0x70000000 | REGION(i++); // FlexSPI2
	SCB_MPU_RASR = MEM_CACHE_WBWA | READWRITE | NOEXEC | SIZE_16M;

assuming ITCM = 256K and DTCM = 256K
 
The MPU regions should not look like that.
Declaring both ITCM and DTCM as a maximum of 512KB is the right thing to do, because it is the *maximum* size either of them can be. It means the linker will abort if either of them is too large, and teensy-size (run as a post-compile step) will abort if their combined size is larger than 512KB.

Reducing the "RAM (AXI bus)" MPU region from 1MB to 512KB is also not correct as FlexRAM can also be assigned as OCRAM, enlarging DMAMEM (although this does require modifying the linker script and upsets teensy-size if the static allocations are too big - I'm trying to figure out a pull request to fix this).
 
@PaulStoffregen
Thanks for the explanation in post #15. Really explains quite a bit. Now to save this thread so I don't lose it.
 
I get all of this, but I am sure the Linker Script is wrong.
I have test case for it (separate post).
 