Teensy4.1 Firmware size question

Hello,

I am not a firmware expert, and I would like some help understanding some odd behaviour in my firmware size. I cannot see the link between
the size of the variables in my code and the size of the firmware.
We are using a lot of big const float arrays in our firmware, and the firmware is too big to fit in the 8MB flash...

The problem is that the size of our array does not match the size in the firmware.

For example, if I add an array of 128 floats (32-bit float), I expect the new firmware to be 512 bytes bigger (128 * 4 bytes).
But my new firmware is 1445 bytes bigger.

I compared the two map files generated by the build, and I cannot explain this difference. Is this a normal result? Why do I need 3 times
the size of the variable in my final firmware?

Thanks,
Arthur.
 
I could understand 2x the size: at runtime the data is in RAM, but it needs to be in flash too, because the CPU has to copy it from flash to RAM at startup.
If you want to have it in flash only, you can add PROGMEM: const PROGMEM ...

But I cannot explain 3x the size. Probably there is a larger gap of unused memory; the linker often does not do a good job of packing.
Or are you comparing the HEX file sizes? (That would be wrong.)
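
For reference, a minimal sketch of the PROGMEM placement mentioned above. The array name `kCoeffs` and its contents are hypothetical, and the `#ifndef` fallback is only an assumption so that the snippet also compiles off-target, where PROGMEM has no effect.

```cpp
#include <cstddef>

// On Teensy the core headers define PROGMEM. This fallback (an assumption
// for off-target builds) turns it into a no-op elsewhere.
#ifndef PROGMEM
#define PROGMEM
#endif

// Hypothetical table: 128 floats, entries after the third are zero.
const PROGMEM float kCoeffs[128] = {0.5f, 0.25f, 0.125f};

// 128 floats of 4 bytes each: the payload this table adds to the flash image.
constexpr std::size_t kCoeffsBytes = sizeof(kCoeffs);
static_assert(kCoeffsBytes == 128 * sizeof(float), "unexpected table size");
```

With a 4-byte float, `kCoeffsBytes` is the 512 bytes expected in the question.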
 
Hi mcu32,

I am already using the PROGMEM keyword, and this is the result I get with it :(. It was impossible to put all this data in the 512K RAM1 or RAM2 space.
Yes, I am seeing this +1445 bytes in the final hex file, using the "du -b xxxx.hex" command.
When I check the map file, my data seems to use the expected space (around 512 bytes). I can see that the .o file is 512 bytes larger.
I can also see that .data, .text.csf and .text.progmem seem to be 512 bytes larger in the map file. I am not sure how to read the information in this
map file correctly. I will try to find more documentation about this...
 
The hex file size tells you nothing about flash usage. It is a text file containing hex digits and checksum values (open it in a simple text editor...)

Teensy displays the real size after compile, in the black window.
 
ah !

Yes, VS gives me the ITCM/DTCM, RAM and FLASH memory utilization report.
So if the hex file is 10 MB, is that not an issue? Does the Teensy CLI not copy these 10 MB to the board during the flashing process?
 
The hex file contains ASCII text. Each line is an "Intel hex" record that contains the flash address and data bytes. The Teensy bootloader receives and parses those ASCII text records and writes the binary data to flash. The hex file is typically 2.5 to 3 times the size in bytes of the flash image. A quick test here shows a flash image ~70K bytes leads to a hex file ~185K.
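
The ratio can be checked with a little arithmetic. The sketch below estimates the text size of the data records for a given payload; the function name `hexDataRecordBytes` is hypothetical, and it assumes 16 data bytes per record and a configurable line ending, ignoring extended-address and end-of-file records.

```cpp
#include <cassert>
#include <cstddef>

// Estimated size in bytes of the Intel HEX data records that carry `payload`
// binary bytes. Assumptions: `perRecord` data bytes per line (16 is common)
// and `eol` bytes of line ending (1 for LF, 2 for CRLF).
std::size_t hexDataRecordBytes(std::size_t payload,
                               std::size_t perRecord = 16,
                               std::size_t eol = 2) {
    assert(perRecord > 0);
    // Fixed text per record: ':' + count (2) + address (4) + type (2)
    // + checksum (2) + line ending.
    const std::size_t overhead = 1 + 2 + 4 + 2 + 2 + eol;
    const std::size_t fullRecords = payload / perRecord;
    const std::size_t rest = payload % perRecord;
    // Every data byte is written as two hex characters.
    std::size_t total = fullRecords * (overhead + 2 * perRecord);
    if (rest != 0) {
        total += overhead + 2 * rest;
    }
    return total;
}
```

Under these assumptions, `hexDataRecordBytes(512)` gives 32 records of 45 characters, or 1440 bytes of text, which is close to the 1445-byte growth reported at the start of the thread.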
 
If the hex file size is not my issue, I will redirect my investigation to find what is.
I was looking at my memory utilization report again, and it seems that my board stops booting when flash usage reaches 2MB:

Code:
Used ITCM: 68KB out of 512KB (13%) [+208]
Used DTCM: 45KB out of 512KB (8%)
Used RAM: 167KB out of 512KB (32%)
Used FLASH: 1961KB out of 7936KB (24%) [+1024]
-> OK (board boots)

Used ITCM: 68KB out of 512KB (13%) [-240]
Used DTCM: 45KB out of 512KB (8%)
Used RAM: 167KB out of 512KB (32%)
Used FLASH: 2058KB out of 7936KB (25%) [-1024]
-> KO (board does not boot)

I am using the 4.1 linker script from the official repository:

Code:
MEMORY
{
	ITCM (rwx):  ORIGIN = 0x00000000, LENGTH = 512K
	DTCM (rwx):  ORIGIN = 0x20000000, LENGTH = 512K
	RAM (rwx):   ORIGIN = 0x20200000, LENGTH = 512K
	FLASH (rwx): ORIGIN = 0x60000000, LENGTH = 7936K
}

Is there a preprocessor option or compiler flag required to be able to use the full 8MB of flash?
 
I just fixed the issue... I was wasting my time investigating the hex file size, but it was just an issue with the 8MB flash support...
I found an ARDUINO_TEENSY40 preprocessor definition in my VisualGDB project. I changed it to ARDUINO_TEENSY41...
And now it is working....

Thank you for your help! I was trying to fix the wrong issue from the beginning, haha. So sad!
 
Been there... we all have experiences like this. Maybe not exactly this one, but there are plenty of errors like it...
 
Another question about PROGMEM usage.

I am quite surprised by the current performance of my code. I am using a GPIO and an oscilloscope to benchmark it, and it is pretty bad.
It looks like my code that uses the PROGMEM keyword on variables (so they are in FLASH instead of RAM) is 5 times slower. Is that the expected result of code running in FLASH vs RAM?
For the variables which are in flash, I also tried adding the FLASHMEM keyword to the declarations of the functions that use these PROGMEM variables, but this made no real difference.

Are there any tips I can use to improve performance?
 
Teensy 4.0 and 4.1 use the NXP iMXRT1062 microcontroller, which has no internal flash. The program flash is an external 4-bit serial flash. The processor supports executing code from the serial flash, but this is slower than executing code from RAM. There is a cache, but if you are trying to execute a lot of code from flash, it will be slower. By default, code is copied from flash to RAM for execution. Is there some reason you don't want to execute code from RAM?
 
I am doing a lot of vector computation. I have around 2.5MB of const float arrays to process. There is not enough space in RAM1 or RAM2 :(
 
T4.1 supports 4-bit serial PSRAM, which is fast enough for many applications, but it might be about the same speed as serial flash for reads. If your application is that large, it might be better suited to a platform with an external memory bus, such as a Raspberry Pi.
 
It looks like my code using the PROGMEM keyword on variable (so in FLASH instead of RAM), is 5 times slower. Is it the expected result of code running in FLASH vs RAM?

The expected difference varies pretty dramatically depending on how well (or poorly) the pair of 32K caches inside the M7 processor and the buffer inside FlexSPI are utilized. On a cache miss, the underlying hardware is far more than 5X slower. So to answer your question, 5X is somewhere in the middle of a very wide range of expected speed differences.

It also matters, on a technical level, whether RAM means the tightly coupled ITCM & DTCM memory (aka "RAM1") or the AXI-bus memory ("RAM2"). By simple clock speed comparison, RAM1 is 4X faster than RAM2, but RAM2 also leverages the data cache. Both are 64-bit wide buses inside the chip, which is vastly faster than the 4-bit external bus, which also carries command and address overhead on the same 4 signals.


For the variables which are in the flash, I also tried to put FLASHMEM keyword in the function declarations which are using this PROGMEM variable, but there was no real difference this time.

They should be the same, since PROGMEM and FLASHMEM are identical. Two names exist only due to compiler limitations. PROGMEM is meant for const variables, FLASHMEM is meant for code.
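
A minimal sketch of how the two keywords are typically applied, assuming the Teensy core headers supply both macros. The fallback defines are an assumption so that the snippet also builds off-target, and `kGain` and `sumGain` are hypothetical names.

```cpp
#include <cstddef>

// Fallbacks for off-target builds only (assumption); on Teensy the core
// headers already define both macros.
#ifndef PROGMEM
#define PROGMEM
#endif
#ifndef FLASHMEM
#define FLASHMEM
#endif

// Constant data kept in flash only.
const PROGMEM float kGain[4] = {1.0f, 2.0f, 3.0f, 4.0f};

// Code kept in flash. On Teensy 4.x the flash is memory-mapped, so the
// table is read with ordinary array indexing.
FLASHMEM float sumGain() {
    float total = 0.0f;
    for (std::size_t i = 0; i < 4; ++i) {
        total += kGain[i];
    }
    return total;
}
```

Marking `sumGain` with FLASHMEM only changes where its instructions live; the reads of `kGain` still go through the same flash interface and cache either way.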

You can expect similar speed for EXTMEM accessing PSRAM chips, as they use FlexSPI2, which is about the same speed as FlexSPI accessing the flash memory. PSRAM might have a tiny improvement due to lower command overhead, but that should be a very minor difference. FlexSPI and FlexSPI2 are separate hardware which can run concurrently, so there might be some small possible gain if you use both in a crafty way. But both are cached by the same Cortex M7 data cache, so you could also pretty easily end up with a situation where using both puts more pressure on the cache, ending up with more cache misses and overall much worse performance as a result of not using the cache as well.


Are there tips I can use to improve my performance?

Try to structure your work and data storage for best locality of reference to utilize the processor's cache. This applies to code running on nearly all modern processors, so there is a lot of research and knowledge online about improving locality of reference for a variety of algorithms. But often there is only so much that can be done for certain types of algorithms.
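
As one illustration of locality of reference, here is a sketch under assumptions, since the thread does not describe the actual algorithm. It supposes a large row-major table in slow memory that is multiplied by several small input vectors held in RAM. Putting the table row in the outer loop means each row is read sequentially and reused for every input while it is still in cache, instead of streaming the whole table once per input. The function name `multiplyRowOuter` and the data layout are hypothetical.

```cpp
#include <cstddef>

// Multiply a row-major `rows` x `cols` table by `count` input vectors of
// length `cols`. Results are stored as out[k * rows + r].
// Assumption: one table row fits comfortably in the data cache.
void multiplyRowOuter(const float* table, std::size_t rows, std::size_t cols,
                      const float* inputs, std::size_t count, float* out) {
    for (std::size_t r = 0; r < rows; ++r) {
        // Contiguous row: fetched from slow memory once, in address order.
        const float* row = table + r * cols;
        for (std::size_t k = 0; k < count; ++k) {
            // Each input vector reuses the row that was just fetched.
            const float* x = inputs + k * cols;
            float acc = 0.0f;
            for (std::size_t c = 0; c < cols; ++c) {
                acc += row[c] * x[c];
            }
            out[k * rows + r] = acc;
        }
    }
}
```

Whether this ordering helps depends on the real access pattern; measuring with the same GPIO and oscilloscope setup would show whether it makes a difference.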
 
The best option is to optimize the algorithm. Maybe the data could be moved to RAM, if there is a way to reduce the amount of data.
 