Function and its stack size on Teensy 4

was-ja

Hello,

I am trying to fit a heavy numerical algorithm into a Teensy 4.1 and am about to run out of memory.

My code size is about 101K (padded to 128K), and I wish to optimize it so that it stays in fast memory or the instruction cache but takes less than 64K.

Hence, I need a complete listing for each function: its actual instruction size and its stack requirements.

Please advise: is there any method to get this information?

PS: I can print the stack pointer in each function, or play with FASTRUN/FLASHMEM, but that takes a lot of additional effort; there is probably an easier way to do this.

Thank you!
 
You can use GNU nm to print a nice symbol table. Here is what it prints for a random sketch:

Code:
00000000 000001d0 T _VectorsFlash
00000022 A _teensy_model_identifier
000001d0 00000200 T ResetHandler
00000400 00000010 T flashconfigbytes
00000410 t __do_global_dtors_aux
00000434 t frame_dummy
0000046c 000000a0 W TeensyStep::MotorControlBase<TimerField>::stepTimerISR()
0000050c 00000018 W TeensyStep::MotorControlBase<TimerField>::pulseTimerISR()
00000524 0000000c W IntervalTimer::~IntervalTimer()
00000524 0000000c W IntervalTimer::~IntervalTimer()
00000530 00000084 W TimerField::delayISR(unsigned int)
000005b4 0000018c W TeensyStep::RotateControlBase<LinRotAccelerator, TimerField>::accTimerISR()
000006aa t L_722_delayMicroseconds
00000740 00000008 T setup
00000748 00000078 T loop
000007c0 00000058 W TeensyStep::MotorControlBase<TimerField>::~MotorControlBase()
000007c0 00000058 W TeensyStep::MotorControlBase<TimerField>::~MotorControlBase()
00000818 00000014 W TeensyStep::RotateControlBase<LinRotAccelerator, TimerField>::~RotateControlBase()
00000818 00000014 W TeensyStep::RotateControlBase<LinRotAccelerator, TimerField>::~RotateControlBase()
0000082c 0000001c W TeensyStep::RotateControlBase<LinRotAccelerator, TimerField>::~RotateControlBase()
00000848 00000014 W TeensyStep::MotorControlBase<TimerField>::~MotorControlBase()
0000085c 00000154 t _GLOBAL__sub_I__Z10signed_sqrf
000009b0 0000006c T TeensyStep::Stepper::Stepper(int, int)
000009b0 0000006c T TeensyStep::Stepper::Stepper(int, int)
00000a1c 00000030 T TeensyStepFTM::removeDelayChannel(unsigned int)
00000a4c 00000044 T ftm0_isr
00000a90 0000001c T usb_serial_available
00000aac 00000048 T usb_serial_flush_callback
00000af4 00000048 W bus_fault_isr
00000af4 00000048 T fault_isr
00000af4 00000048 W hard_fault_isr
00000af4 00000048 W memmanage_fault_isr
00000af4 00000048 W usage_fault_isr
00000b3c 00000006 W adc0_isr
00000b3c 00000006 W adc1_isr
00000b3c 00000006 W can0_bus_off_isr
00000b3c 00000006 W can0_error_isr
00000b3c 00000006 W can0_message_isr
00000b3c 00000006 W can0_rx_warn_isr
00000b3c 00000006 W can0_tx_warn_isr
00000b3c 00000006 W can0_wakeup_isr
00000b3c 00000006 W can1_bus_off_isr
00000b3c 00000006 W can1_error_isr
00000b3c 00000006 W can1_message_isr
00000b3c 00000006 W can1_rx_warn_isr
00000b3c 00000006 W can1_tx_warn_isr
00000b3c 00000006 W can1_wakeup_isr
00000b3c 00000006 W cmp0_isr
00000b3c 00000006 W cmp1_isr
00000b3c 00000006 W cmp2_isr
00000b3c 00000006 W cmp3_isr
00000b3c 00000006 W cmt_isr
00000b3c 00000006 W dac0_isr
00000b3c 00000006 W dac1_isr
00000b3c 00000006 W debugmonitor_isr
00000b3c 00000006 W dma_ch0_isr
00000b3c 00000006 W dma_ch1_isr
00000b3c 00000006 W dma_ch10_isr
00000b3c 00000006 W dma_ch11_isr
00000b3c 00000006 W dma_ch12_isr
00000b3c 00000006 W dma_ch13_isr
00000b3c 00000006 W dma_ch14_isr
00000b3c 00000006 W dma_ch15_isr
00000b3c 00000006 W dma_ch2_isr
00000b3c 00000006 W dma_ch3_isr
00000b3c 00000006 W dma_ch4_isr
00000b3c 00000006 W dma_ch5_isr
00000b3c 00000006 W dma_ch6_isr
00000b3c 00000006 W dma_ch7_isr
00000b3c 00000006 W dma_ch8_isr
00000b3c 00000006 W dma_ch9_isr
00000b3c 00000006 W dma_error_isr
00000b3c 00000006 W enet_error_isr
00000b3c 00000006 W enet_rx_isr
00000b3c 00000006 W enet_timer_isr
00000b3c 00000006 W enet_tx_isr
00000b3c 00000006 W flash_cmd_isr
00000b3c 00000006 W flash_error_isr
00000b3c 00000006 W ftm1_isr
00000b3c 00000006 W ftm2_isr
00000b3c 00000006 W ftm3_isr
00000b3c 00000006 W i2c0_isr
00000b3c 00000006 W i2c1_isr
00000b3c 00000006 W i2c2_isr
00000b3c 00000006 W i2c3_isr
00000b3c 00000006 W i2s0_isr
00000b3c 00000006 W i2s0_rx_isr
00000b3c 00000006 W i2s0_tx_isr
00000b3c 00000006 W low_voltage_isr
00000b3c 00000006 W lptmr_isr
00000b3c 00000006 W lpuart0_status_isr
00000b3c 00000006 W mcg_isr
00000b3c 00000006 W mcm_isr
00000b3c 00000006 W nmi_isr
00000b3c 00000006 W pdb_isr
00000b3c 00000006 W pit_isr
00000b3c 00000006 W porta_isr
00000b3c 00000006 W portb_isr
00000b3c 00000006 W portc_isr
00000b3c 00000006 W portcd_isr
00000b3c 00000006 W portd_isr
00000b3c 00000006 W porte_isr
00000b3c 00000006 W randnum_isr
00000b3c 00000006 W rtc_alarm_isr
00000b3c 00000006 W rtc_seconds_isr
00000b3c 00000006 W sdhc_isr
00000b3c 00000006 W software_isr
00000b3c 00000006 W spi0_isr
00000b3c 00000006 W spi1_isr
00000b3c 00000006 W spi2_isr
00000b3c 00000006 W svcall_isr
00000b3c 00000006 W tpm0_isr
00000b3c 00000006 W tpm1_isr
00000b3c 00000006 W tpm2_isr
00000b3c 00000006 W tsi0_isr
00000b3c 00000006 W uart0_error_isr
00000b3c 00000006 W uart0_lon_isr
00000b3c 00000006 W uart0_status_isr
00000b3c 00000006 W uart1_error_isr
00000b3c 00000006 W uart1_status_isr
00000b3c 00000006 W uart2_error_isr
00000b3c 00000006 W uart2_status_isr
00000b3c 00000006 W uart3_error_isr
00000b3c 00000006 W uart3_status_isr
00000b3c 00000006 W uart4_error_isr
00000b3c 00000006 W uart4_status_isr
00000b3c 00000006 W uart5_error_isr
00000b3c 00000006 W uart5_status_isr
00000b3c 00000006 T unused_isr
00000b3c 00000006 W usb_charge_isr
00000b3c 00000006 W usbhs_isr
00000b3c 00000006 W usbhs_phy_isr
00000b3c 00000006 W wakeup_isr
00000b3c 00000006 W watchdog_isr
00000b44 0000000c t startup_default_early_hook
00000b44 0000000c W startup_early_hook
00000b50 00000002 t startup_default_late_hook
00000b50 00000002 W startup_late_hook
00000b54 0000002c T _sbrk
00000b80 00000002 W __cxa_pure_virtual
00000b84 00000034 T kinetis_hsrun_disable
00000bb8 00000034 T kinetis_hsrun_enable
00000bec 00000078 t pinMode.part.2
00000c64 00000024 T rtc_set
00000c88 00000002 t startup_default_middle_hook
00000c88 00000002 W startup_middle_hook
00000c8c 0000000a T pinMode
00000c98 000000a4 T delay
00000d3c 00000170 T _init_Teensyduino_internal_
00000eac 00000040 T usb_malloc
00000eec 0000004c T usb_free
00000f38 00000084 T usb_rx_memory
00000fbc 00000084 T usb_tx
00001040 00000874 T usb_isr
000018b4 000000c4 T usb_init
00001978 00000002 t dummy_funct()
0000197c 00000044 T IntervalTimer::end()
000019c0 00000014 T pit0_isr
000019d4 00000014 T pit1_isr
000019e8 00000014 T pit2_isr
000019fc 00000014 T pit3_isr
00001a10 000000ec W yield
00001afc 00000044 T EventResponder::runFromInterrupt()
00001b40 00000004 T pendablesrvreq_isr
....

The first column shows the address of the symbol (data or function), the second shows its size. The size of a function is somewhat 'undefined', since after optimization functions tend to share parts. Here is information on how to use nm: https://sourceware.org/binutils/docs/binutils/nm.html.
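To turn such a listing into numbers, the columns described above can be read by a small PC-side helper. This is only a minimal sketch, assuming the line layout shown in the listing (address, optional size, one-letter type, name); `NmSymbol`, `parseNmLine` and `uniqueCodeBytes` are made-up names, not part of nm or Teensyduino. Sizes are counted once per address because several weak aliases (e.g. the unused ISRs at 0xb3c) point at the same code.

```cpp
#include <cstdint>
#include <exception>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One parsed line of the listing (hypothetical helper type).
struct NmSymbol {
    std::uint32_t address = 0;
    std::uint32_t size = 0;   // stays 0 when the line has no size column
    char type = '?';
    std::string name;
};

// Parses "address [size] type name"; returns false for lines that do not fit.
bool parseNmLine(const std::string& line, NmSymbol& out) {
    std::istringstream in(line);
    std::string addressField, second;
    if (!(in >> addressField >> second)) return false;

    NmSymbol sym;
    try {
        sym.address = static_cast<std::uint32_t>(std::stoul(addressField, nullptr, 16));
        if (second.size() == 1) {
            sym.type = second[0];             // no size column on this line
        } else {
            sym.size = static_cast<std::uint32_t>(std::stoul(second, nullptr, 16));
            std::string typeField;
            if (!(in >> typeField) || typeField.size() != 1) return false;
            sym.type = typeField[0];
        }
    } catch (const std::exception&) {
        return false;                         // a field was not a hex number
    }
    std::getline(in >> std::ws, sym.name);    // the name may contain spaces
    out = sym;
    return true;
}

// Adds up text (T/t) and weak (W/w) symbols, counting each address once.
std::uint32_t uniqueCodeBytes(const std::vector<std::string>& lines) {
    std::map<std::uint32_t, std::uint32_t> sizeByAddress;
    for (const std::string& line : lines) {
        NmSymbol sym;
        if (!parseNmLine(line, sym)) continue;
        if (sym.type != 'T' && sym.type != 't' && sym.type != 'W' && sym.type != 'w') continue;
        std::uint32_t& slot = sizeByAddress[sym.address];
        if (sym.size > slot) slot = sym.size;
    }
    std::uint32_t total = 0;
    for (const auto& entry : sizeByAddress) total += entry.second;
    return total;
}
```

Feeding it the lines of the listing gives a rough per-address total; given the caveat about shared function parts, treat the result as an estimate rather than an exact figure.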

The user wiki has some info on how to integrate nm into your build system: https://github.com/TeensyUser/doc/wiki/GCC#List-and-symbol-files

Edit: Here is an interesting article on how to analyze memory usage using gcc: https://embeddedartistry.com/blog/2020/08/17/three-gcc-flags-for-analyzing-memory-usage/
 
Super, thank you very much, luni!!!

"nm" is what I expected, but I was unable to figure out exactly how to run it from the Teensy environment.

I tried it, got the old *.sym and the new *.symnm as you suggested, and have one follow-up question.

My algorithm has two parts: the first uses heavy numerics and SPI input/output with an external ADC, and the second is related to SD card I/O. I do not need the SD card to be fast, but the first part is critical for me. However, most of the SD card functions stay in ITCM:

Code:
...
00000684  w    F .text.itcm     0000001a SDFile::position()
00008730 g     F .text.itcm     00000094 SdSpiCard::writeStop()
000087c4 g     F .text.itcm     000000bc SdSpiCard::writeData(unsigned char const*)
000007d4  w    F .text.itcm     0000002c File::peek()
000083e4 g     F .text.itcm     000000c0 FatPartition::freeClusterCount()
000044a8  w    F .text.itcm     000000c4 SDClass::open(char const*, unsigned char)
00008880 g     F .text.itcm     000000f4 SdSpiCard::writeStart(unsigned long)
000008d4  w    F .text.itcm     00000034 StreamFile<FsBaseFile, unsigned long long>::write(unsigned char)
000006bc  w    F .text.itcm     00000018 SDFile::read(void*, unsigned int)
000005fc  w    F .text.itcm     00000044 SDFile::rewindDirectory()
000051ec g     F .text.itcm     0000000e ExFatFile::open(ExFatVolume*, char const*, int)
...

Please advise me how to force them out of ITCM into cached flash memory to save ITCM memory.

Thank you!
 
You need to place FLASHMEM in the function declarations. E.g. if you compile this:
Code:
void setup(){
}

void loop(){
}

You see that loop is placed in ITCM at address 0x80:

Code:
00000000 T _stext
00000001 A _itcm_block_count
00000020 t __do_global_dtors_aux
00000025 A _teensy_model_identifier
00000044 t frame_dummy
0000007c 00000002 T setup
00000080 00000002 T loop <=============================================
00000084 00000004 W TeensyTimerTool::ITimerChannel::getPeriod()
00000088 00000012 W std::function<void ()>::~function()
00000088 00000012 W std::function<void ()>::~function()
0000009c 00000020 t __tcf_0
000000bc 00000020 t __tcf_1

If you write FLASHMEM before loop:

Code:
void setup(){
}

FLASHMEM void loop(){
}

You see that loop now lives at address 0x60001654 which is in flash.

Code:
60001418 00000010 t memory_clear
60001428 0000022c T ResetHandler
60001654 00000002 T loop <=====================================
60001658 00000002 T startup_default_early_hook
60001658 00000002 W startup_early_hook
6000165c 00000002 T startup_default_middle_hook

If you want to keep library functions in flash, you need to add FLASHMEM to them in the library (I don't know of any more elegant way to do this). You can also change the linker script. IIRC @FrankB posted a modified linker script which doesn't copy functions to ITCM by default.
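For a copied library class, adding FLASHMEM by hand could look roughly like the sketch below. `SlowCardLogger` and its members are purely hypothetical stand-ins (not the real SD or SdFat API), and the empty fallback macro is only there so the snippet also compiles on a PC; on a Teensy 4.x build the core headers already define FLASHMEM, so keep this after `#include <Arduino.h>` or in the .ino file. The functions are defined outside the class body on purpose, which is an assumption on my side to avoid mixing the section attribute with implicitly inline member functions.

```cpp
#include <cstdint>

// Host-only fallback: on Teensy the core already provides FLASHMEM.
#ifndef FLASHMEM
#define FLASHMEM
#endif

// Hypothetical stand-in for a copied, non-time-critical library class.
class SlowCardLogger {
public:
    FLASHMEM bool begin();
    FLASHMEM std::uint32_t checksum(const std::uint8_t* data, std::uint32_t length);

private:
    bool ready_ = false;
};

// Marked FLASHMEM so the code is meant to stay in flash instead of ITCM.
FLASHMEM bool SlowCardLogger::begin() {
    ready_ = true;
    return ready_;
}

// Simple byte sum, just to have a function body worth placing somewhere.
FLASHMEM std::uint32_t SlowCardLogger::checksum(const std::uint8_t* data,
                                                std::uint32_t length) {
    std::uint32_t sum = 0;
    for (std::uint32_t i = 0; i < length; ++i) {
        sum += data[i];
    }
    return sum;
}
```

After building, the symbol listing should show whether the marked functions actually moved to an address in flash, as in the loop example above.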

EDIT: Here is the mentioned thread showing the alternate linker script: https://forum.pjrc.com/threads/5732...ferent-regions?p=230521&viewfull=1#post230521 A lot of things have changed since that post, so the script may need adjustments for the current TD. Best ask FrankB.
 
Thank you very much, luni, for your kind assistance!

Yes, I played with FLASHMEM on my own function and it works, but what I actually need is for the SD card library to be in flash, and it seems to be quite heavy. Following your suggestion, I will probably copy the SD card class and all related functions and try to add FLASHMEM manually.

I searched for FrankB's linker script regarding ITCM but did not find it. If possible, please give me some hints on how to find it!

Thank you!
 
Also note that you will only save memory if you reduce the program space enough to need fewer 32KB pages for code...

For example, this is the last sketch I built:
Code:
SD_Program_SPI_QSPI_MTP-logger.ino.elf"
teensy_size: Memory Usage on Teensy 4.1:
teensy_size:   FLASH: code:145092, data:19084, headers:8876   free for files:7953412
teensy_size:    RAM1: variables:32224, code:136312, padding:27528   free for local variables:328224
teensy_size:    RAM2: variables:16128  free for malloc/new:508160
It says that the code was 136312 bytes, so this would take about 4.16 pages, which rounds up to 5.

So: 5*32768-136312 = 27528, which is the padding value.

Put another way, if I reduced the ITCM usage by 5240 bytes, it would need only 4 pages for ITCM,
which would add 32KB to DTCM...
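The page arithmetic above can be written down as a few compile-time helpers. This is only a sketch: the function names are made up for illustration, and the only thing taken from this thread is the 32KB page granularity and the numbers from the teensy_size output.

```cpp
#include <cstdint>

// ITCM is handed out in 32KB pages, as described in this thread.
constexpr std::uint32_t kPageBytes = 32 * 1024;

// Number of 32KB pages needed to hold codeBytes (rounded up).
constexpr std::uint32_t itcmPages(std::uint32_t codeBytes) {
    return (codeBytes + kPageBytes - 1) / kPageBytes;
}

// Unused bytes in the last page, i.e. the "padding" value.
constexpr std::uint32_t itcmPadding(std::uint32_t codeBytes) {
    return itcmPages(codeBytes) * kPageBytes - codeBytes;
}

// Bytes that must be removed from ITCM so the code needs one page fewer.
constexpr std::uint32_t bytesToDropOnePage(std::uint32_t codeBytes) {
    return codeBytes == 0 ? 0 : codeBytes - (itcmPages(codeBytes) - 1) * kPageBytes;
}

// The numbers from the teensy_size output above.
static_assert(itcmPages(136312) == 5, "136312 bytes need 5 pages");
static_assert(itcmPadding(136312) == 27528, "padding matches teensy_size");
static_assert(bytesToDropOnePage(136312) == 5240, "5240 bytes less gives 4 pages");
```

The same helpers show why small savings often change nothing: any reduction smaller than `bytesToDropOnePage` leaves the page count, and therefore the DTCM available, exactly as before.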
 
Thank you very much, KurtE! Yes, you are right!

Actually, if I remove the SD card library from my program, it saves about 36K of ITCM, which frees two 32K pages, and I really need this 64K of space for computations.
 
BTW: Did you try to switch to nanolib? You can easily test this by using the Optimize for smallest code option.
 
Thank you very much, luni, for the info regarding nanolib.

Please excuse the stupid question: I tried to find out myself how to enable nanolib in the Arduino/Teensyduino environment, but failed. Should I install VisualTeensy? Does it work on Ubuntu?

Regarding code optimization: some parts of my code are sensitive to the optimization option. As I see it right now, about 30 Kbytes of my code should be fully optimized and placed in ITCM or cache, while the rest (ca. 70 Kbytes) is just setup and I/O that executes at the beginning and end of the application, and its performance does not matter to me. Total memory usage is critical for me: I use hierarchical access to slow PSRAM and DMAMEM and to fast DTCM, and all of DTCM is currently occupied.

Is it possible to compile one part with -O3 and the other parts with -Os? I am very much a beginner with Teensys (the T4.1 is my first board from this series), so please excuse the stupid questions.
 

As mentioned above, just choose Optimize "Smallest Code"

Screenshot 2021-12-12 110802.png

If this brings the size of your sketch down far enough, you can selectively enable optimization by using #pragma GCC optimize (see e.g. https://stackoverflow.com/a/2220565)

Code:
#pragma GCC push_options     // this stores the current settings
#pragma GCC optimize ("O3")  // change to level O3

void myFastFunction()
{
// ....
}

void anotherFastFunction()
{
//...
}

#pragma GCC pop_options      // restore old settings

Using the IDE setting "Smallest Code" not only changes the optimization level to -Os but also links in nanolib (newlib-nano) instead of newlib, which usually brings down the code size significantly. Using #pragma GCC optimize ("O3") in a project built with the IDE setting "Smallest Code" keeps nanolib while compiling the functions between push_options and pop_options at the highest optimization level.
 
Thank you very much, luni! Your kind advice finally solved my issue!

With your suggestion, the total code size dropped from 101K to 63K, which saved me two 32K blocks in ITCM. In addition, nanolib requires fewer internal variables, which saved an additional 4K in DTCM.

Super! Thank you very much!
 