Problem with memory usage in Teensy build

dgnuff

Code:
1>------ Build started: Project: SmartHome, Configuration: Release x64 ------
1>Memory Usage on Teensy 4.1:
1>  FLASH: code:468856, data:78796, headers:8376   free for files:7570436
1>   RAM1: variables:88160, code:434672, padding:24080   free for local variables:-22624
1>   RAM2: variables:79488  free for malloc/new:444800
1>Error program exceeds memory space

Says it all. If it makes a difference, I'm using arduino-cli to compile the project from within Visual Studio. Building it using the Arduino IDE produces the same results.

I've pushed as much of my project as I can into FLASHMEM, but it's got to be pulling in a huge amount of library code which is causing the overflow. What options do I have to determine what library code is being pulled in to make the image so large?

A brief experiment showed that the QNEthernet library clocks in at about 100k. That's big, but I can live with it since Ethernet connectivity is vital to this project.

The one thing I'm beginning to suspect is the use of STL templated classes, e.g. std::vector, std::string, std::unordered_map etc. I'm aware of the String class that exists in the Arduino libraries; was that created precisely to avoid the code bloat that can potentially happen with std::string?

Many many many moons ago, one of the first projects I worked on in C++ had a simple rule: no use of standard template classes, and when I asked why, the answer was code bloat. They had been found to expand the output binary so much that it risked not fitting on one of the targets: the Sony PS1. That'll tell you how long ago this was. :) Could I be facing the same problem here?
 
If you use the "export compiled binary" option in the Arduino IDE it should place some extra files in the sketch folder, including a symbol listing with sizes to see what is taking up the most space.
 
Just did that to post and it gave build error so posted that ... Seems to have built and worked fine but some linkage to altered file location is missed and triggered 'error'.

So that should give info on sizes.
Run this on the .ELF at hand: arm-none-eabi-objdump -t BroadcastChatUDP.ino.elf
On Windows, objdump was found here: "C:\Users\<your acct>\AppData\Local\Arduino15\packages\teensy\tools\teensy-compile\11.3.1\arm\bin\arm-none-eabi-objdump.exe"

Note:
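The report shows that removing only about 9KB of code, or about 25KB of data, would allow it to build. Code in RAM1 appears to be allocated in 32KB blocks (hence padding:24080), so trimming 9KB of code would free a whole 32KB block.
Moving uninitialized data to DMAMEM/RAM2 might be an option, though any data not served from the 32KB cache will be slower to access.
Otherwise, find another 9KB of code to move out of RAM1.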
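
To make the arithmetic behind that note concrete, here is a small sketch that reproduces the RAM1 figures from the build report. It assumes RAM1 is 512KB and that code is rounded up to whole 32KB blocks; those assumptions are consistent with the reported padding:24080 and free:-22624, but the function names are made up for illustration.

```cpp
// Assumed sizes: 512KB of RAM1, code rounded up to 32KB blocks.
constexpr long kRam1Bytes = 512L * 1024;
constexpr long kBlockBytes = 32L * 1024;

// Number of 32KB blocks needed to hold the code.
constexpr long codeBlocks(long codeBytes) {
    return (codeBytes + kBlockBytes - 1) / kBlockBytes;
}

// Unused bytes at the end of the last code block.
constexpr long paddingBytes(long codeBytes) {
    return codeBlocks(codeBytes) * kBlockBytes - codeBytes;
}

// What the build report calls "free for local variables".
constexpr long freeForLocals(long codeBytes, long variableBytes) {
    return kRam1Bytes - codeBlocks(codeBytes) * kBlockBytes - variableBytes;
}

// Minimum amount of code to remove so that one fewer block is needed.
constexpr long codeToTrimForOneBlock(long codeBytes) {
    return codeBytes - (codeBlocks(codeBytes) - 1) * kBlockBytes;
}
```

With code:434672 and variables:88160 this gives 14 blocks, 24080 bytes of padding and -22624 bytes free, and shows that trimming 8688 bytes of code is enough to release a full block.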
 
If you use the "export compiled binary" option in the Arduino IDE it should place some extra files in the sketch folder, including a symbol listing with sizes to see what is taking up the most space.

I did as suggested, found a sym file, and gave it a good look over. Filtering out everything in section .text.itcm and then doing a column block sum on the sizes came up with almost exactly the count of bytes in RAM1, so that looks like I'm headed in the correct direction.
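
The same per-section sum can be scripted instead of done by hand in an editor. The sketch below is a host-side helper, not something from the Teensy tooling: it assumes the line layout that `objdump -t` prints for ELF files (address, flags, section, a tab, the size in hex, then the name), and `SymbolInfo`, `parseSymbolLine` and `sumBySection` are hypothetical names.

```cpp
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

struct SymbolInfo {
    std::string section;
    unsigned long size;
    std::string name;
};

// Parses one symbol line of the assumed form
//   <address> <flags> <section>\t<size-in-hex> <name>
// Returns false for headers, blank lines and anything without the tab.
bool parseSymbolLine(const std::string& line, SymbolInfo& out) {
    const std::size_t tab = line.find('\t');
    if (tab == std::string::npos || tab == 0) return false;

    // The section name is the last space-delimited token before the tab.
    const std::size_t sectionStart = line.rfind(' ', tab - 1);
    if (sectionStart == std::string::npos) return false;
    out.section = line.substr(sectionStart + 1, tab - sectionStart - 1);

    // The size is the hex field immediately after the tab.
    const char* sizeBegin = line.c_str() + tab + 1;
    char* sizeEnd = nullptr;
    out.size = std::strtoul(sizeBegin, &sizeEnd, 16);
    if (sizeEnd == sizeBegin) return false;

    // The symbol name is the last space-delimited token on the line.
    out.name = line.substr(line.rfind(' ') + 1);
    return !out.section.empty();
}

// Adds up symbol sizes per section, e.g. totals[".text.itcm"].
std::map<std::string, unsigned long> sumBySection(
        const std::vector<std::string>& lines) {
    std::map<std::string, unsigned long> totals;
    SymbolInfo sym;
    for (const std::string& line : lines) {
        if (parseSymbolLine(line, sym)) totals[sym.section] += sym.size;
    }
    return totals;
}
```

Feeding it the lines of the symbol listing and comparing the `.text.itcm` total against the RAM1 code figure is a quick sanity check before sorting by name or size.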

I then did a deep dive on just the .text.itcm items. TL;DR It's a combination of three problems, one is easy to solve, the other two less so.

1. All the STL stuff winds up with a std:: prefix, so a quick sort by symbol name got that all grouped, and then a column block sum over the sizes of those elements adds up to a whopping great almost 200k. That's the bulk of the problem right there. That's relatively easy to solve: with some work I can implement scaled back versions of what I'm using that will have just enough functionality to do what I need. I've already done that for C++17's std::string_view, since Arduino in general, and Teensy in particular, appear to be using C++14, and I was not about to rewrite my JSON parser to use std::string everywhere.

2. I don't know how much this adds up to, but it is painfully obvious that the linker is not doing "dead code elimination". That's where each function in a library is viewed as a separate object, and only the functions actually used are pulled in. By comparison, I'm only using strlen() from string.h, I see that in the sym file, along with just about everything else in the C string library, including a bunch of functions I'm not using. No idea how to turn that on, or if it's even possible, but I'd make book it'd help a lot.

3. I'm using printf / scanf for formatted I/O and they pull in about 64K all told. They're some of the bigger line items in the sym file, but I'm willing to pay that cost for the functionality that printf offers. I could very easily remove scanf; I'm only using it in one place and manual parsing of the string in question should be relatively trivial.
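
As an illustration of the "scaled back version" idea in item 1, here is roughly what a minimal stand-in for std::string_view could look like on a C++14 toolchain. This is only a sketch: the class name `StringView` and its member set are assumptions, not the code used in the project, and unlike the real std::string_view its `substr` clamps instead of throwing.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical minimal replacement for std::string_view (C++14).
// It never owns or copies the characters it points at.
class StringView {
public:
    static const std::size_t npos = static_cast<std::size_t>(-1);

    constexpr StringView() : data_(nullptr), size_(0) {}
    constexpr StringView(const char* s, std::size_t n) : data_(s), size_(n) {}
    StringView(const char* s) : data_(s), size_(s ? std::strlen(s) : 0) {}

    constexpr const char* data() const { return data_; }
    constexpr std::size_t size() const { return size_; }
    constexpr bool empty() const { return size_ == 0; }
    constexpr char operator[](std::size_t i) const { return data_[i]; }

    // Sub-view starting at pos, clamped to the available range.
    StringView substr(std::size_t pos, std::size_t n = npos) const {
        if (pos > size_) pos = size_;
        const std::size_t avail = size_ - pos;
        return StringView(data_ + pos, n < avail ? n : avail);
    }

    // Index of the first occurrence of c at or after pos, or npos.
    std::size_t find(char c, std::size_t pos = 0) const {
        for (std::size_t i = pos; i < size_; ++i) {
            if (data_[i] == c) return i;
        }
        return npos;
    }

    // Byte-wise equality of the two viewed ranges.
    bool equals(StringView other) const {
        return size_ == other.size_ &&
               (size_ == 0 || std::memcmp(data_, other.data_, size_) == 0);
    }

private:
    const char* data_;
    std::size_t size_;
};
```

A parser can then slice its input, for example `view.substr(0, view.find('='))`, without allocating.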

So that leaves the dead code elimination problem. Is there anything at all that can be done to ease the pain of that problem? Or is that just something that MSVC does by default that GCC simply hasn't caught up with?
 
I also use Visual Studio but with Visual Micro.
Using that you can use the LTO compile/link options.
I believe that it's the LTO option that takes out unused code.
 
You need to pass -fdata-sections and -ffunction-sections to the GCC compiler when compiling ALL code and --gc-sections to the linker to eliminate unused data and functions in the ELF executable. Perhaps those aren't being used.
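
A small way to see those flags at work, sketched here with a host GCC (the flags themselves are the ones named above; the file and function names are made up). When this file is linked into a program whose main() calls only `usedHelper()`, the linker can report the sections it discarded.

```cpp
// Build sketch, assuming a GNU toolchain:
//   g++ -Os -ffunction-sections -fdata-sections -c demo.cpp
//   g++ -Wl,--gc-sections,--print-gc-sections demo.o main.o -o demo
// With -ffunction-sections each function below gets its own .text.<name>
// section; with -fdata-sections each variable gets its own data section.
// --gc-sections can then drop every section nothing refers to.

// Referenced by the program, so its section is kept.
int usedHelper(int x) {
    return x * 2 + 1;
}

// Never referenced: a candidate for removal at link time.
int unusedHelper(int x) {
    return x * x * x;
}

// Never referenced data: also a candidate for removal.
int unusedTable[256] = {1, 2, 3};
```

Without the two -f options, everything in one object file shares a single section, and the linker has to keep all of it as soon as any one symbol is used.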
 
Code:
1>------ Build started: Project: SmartHome, Configuration: Release x64 ------
1>Memory Usage on Teensy 4.1:
1>  FLASH: code:195256, data:55244, headers:8568   free for files:7867396
1>   RAM1: variables:59456, code:172008, padding:24600   free for local variables:268224
1>   RAM2: variables:79488  free for malloc/new:444800

That seems to have fixed it. Code is now down to a little over 1/3 its size when problems started.

Call it a bit of programmer's intuition if you will, but a bunch of the symbols that were showing up in the std:: namespace didn't make any sense. Things to do with times, and money and who knows what. I found #include <sstream> lurking in one file, just for the benefit of class std::stringstream, used in one of my utility routines: a C++ version of PHP's explode / Python's split. It breaks a std::string into a std::vector<std::string>, splitting at a user defined delimiter. I refactored Explode() to use a C algorithm that I had around from a project I wrote in the late 1990's, thus allowing me to remove <sstream>. Next on the hit list was std::regex. Finally I went through and got rid of every instance of std::string, replacing them with a scaled back version with just the minimum functionality I need.
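
For anyone wanting to do the same, a split routine does not need std::stringstream at all. The sketch below is not the 1990's C algorithm mentioned above; it is a plain std::string::find loop that keeps the Explode() name, with the exact signature and the choice to keep empty fields both being assumptions.

```cpp
#include <string>
#include <vector>

// Splits text at every occurrence of delimiter, keeping empty fields.
// No <sstream> required.
std::vector<std::string> Explode(const std::string& text, char delimiter) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        const std::size_t end = text.find(delimiter, start);
        if (end == std::string::npos) {
            // Last field: everything after the final delimiter.
            parts.push_back(text.substr(start));
            break;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
    return parts;
}
```

An empty input yields a single empty field, and a trailing delimiter yields a trailing empty field, which mirrors how PHP's explode behaves.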

My programmer's intuition is screaming at me that it's <sstream> and/or <regex> that are the main guilty parties. I'll probably take a break, revisit this in the near future, and restore std::string and see what happens. I still have std::vector, std::unordered_map, std::function and a few others around the place, so not everything is that greedy. And I know that over half of my current payload is QNEthernet, so the rest of this is pretty small. Plus there's that scanf() call I can now get rid of. Removing std::regex caused me to write a few parsing routines that will very handily do the work currently assigned to sscanf().
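
To give an idea of how small such a hand-written parser can be, here is a sketch that does roughly the job of `sscanf(text, "%u:%u", &h, &m)` for a time of day. The "HH:MM" format and both function names are invented for the example, and there is no overflow check on very long digit runs.

```cpp
// Parses an unsigned decimal number at cursor and advances cursor past
// the digits. Returns false if there is no digit at cursor.
bool parseUnsigned(const char*& cursor, unsigned& value) {
    if (*cursor < '0' || *cursor > '9') return false;
    unsigned result = 0;
    while (*cursor >= '0' && *cursor <= '9') {
        result = result * 10 + static_cast<unsigned>(*cursor - '0');
        ++cursor;
    }
    value = result;
    return true;
}

// Parses "H:M" or "HH:MM" with range checks. Unlike sscanf it rejects
// leading whitespace and trailing characters.
bool parseHourMinute(const char* text, unsigned& hour, unsigned& minute) {
    const char* cursor = text;
    unsigned h = 0;
    unsigned m = 0;
    if (!parseUnsigned(cursor, h)) return false;
    if (*cursor != ':') return false;
    ++cursor;
    if (!parseUnsigned(cursor, m)) return false;
    if (*cursor != '\0') return false;
    if (h > 23 || m > 59) return false;
    hour = h;
    minute = m;
    return true;
}
```

The outputs are only written on success, so a caller can keep its previous values when the input is malformed.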

Overall takeaway from this. Watch your memory usage like a hawk if you start throwing the more esoteric parts of the STL in. There's wisdom in the decisions made by that project team back in the mid 1990's regarding templates.
 
You need to pass -fdata-sections and -ffunction-sections to the GCC compiler when compiling ALL code and --gc-sections to the linker to eliminate unused data and functions in the ELF executable. Perhaps those aren't being used.

I'll take another peek at what's going on by throwing Sysinternals ProcMon at the problem. That will capture the whole command line, and let me save to a file for later analysis. I need to take a break from this for a day or so, so I'll drop back in a while when I've figured what's going on.

Thanks for all the help everyone!
 
If over half the payload is QNEthernet, have you explored reducing the feature set included with the underlying lwIP stack? For example, removing TCP or reducing the number of possible sockets (all protocols), or even reducing the memory size, etc. Some questions:
1. What is its current footprint?
2. Have you made any modifications that might increase its size?
3. Which has higher usage, RAM1 or RAM2? How much of each?
 
Just for science, maybe try using -Os (optimize for size)? The string formatting code blows up considerably when it has to handle floating point numbers, -Os will exclude all that (may break things if you actually need it).
 
...

That seems to have fixed it. Code is now down to a little over 1/3 its size when problems started.
...

Great news if the code builds/uploads and runs as expected. Until recent TeensyDuino update (1.59 Beta?) the use of LTO was not functional.

Seeing notes on results or modifications to the build line from that included in the TeensyDuino release for safe use and maximal elimination of unused code would be welcome/important.
Regarding: p#6 -f'options' and p#8 "Sysinternals ProcMon"

And yes p#10 building 'smallest code' is a way to save perhaps a great deal of code space as seen in a recent thread - the size drop was very large - but that isn't the same as LTO removing unused code that tags along.
 
The thing is LTO should absolutely not be required to remove unused/unreferenced symbols, as already posted -fdata-sections and -ffunction-sections with --gc-sections should be enough to achieve the same. LTO is more for inlining external functions only called from one place, etc.
 
Interesting point. @MichaelMeissner worked in that world - I mistook it for a subset of LTO options.
 
You need to pass -fdata-sections and -ffunction-sections to the GCC compiler when compiling ALL code and --gc-sections to the linker to eliminate unused data and functions in the ELF executable. Perhaps those aren't being used.

Looking at: ...\AppData\Local\Arduino15\packages\teensy\hardware\avr\1.58.1\boards.txt

It shows these lines that seem to match the indicated for compiling and linking commands in boards.txt (for all boards):
Code:
teensy41.name=Teensy 4.1
...
teensy41.build.flags.ld=-Wl,--gc-sections,--relax "-T{build.core.path}/imxrt1062_t41.ld"
...
teensy41.build.flags.common=-g -Wall -ffunction-sections -fdata-sections -nostdlib
...
 
And platformio also includes those options. If you look in arduino.py, you will find
Code:
       CCFLAGS=[
            "-Wall",  # show warnings
            "-ffunction-sections",  # place each function in its own section
            "-fdata-sections",
            "-mthumb",
            "-mcpu=%s" % env.BoardConfig().get("build.cpu"),
            "-nostdlib"
        ],
and
Code:
       LINKFLAGS=[
            "-Wl,--gc-sections,--relax",
            "-mthumb",
            "-mcpu=%s" % env.BoardConfig().get("build.cpu"),
            "-Wl,--defsym=__rtc_localtime=$UNIX_TIME"
        ],
 
The same -f options have to be used for all linked static libraries as well, otherwise their unused symbols can't be pruned.
 
If over half the payload is QNEthernet, have you explored reducing the feature set included with the underlying lwIP stack? For example, removing TCP or reducing the number of possible sockets (all protocols), or even reducing the memory size, etc. Some questions:
1. What is its current footprint?
2. Have you made any modifications that might increase its size?
3. Which has higher usage, RAM1 or RAM2? How much of each?

The footprint estimate is very crude. Starting from the build that produced the numbers in the first message I posted, I simply stubbed out the socket layer so all functions were empty, removed QNEthernet.h from all files where it appears and then fixed the few errors that showed up as a result. That then produced a total code size of around 380,000 so I was taking a first order approximation and calling 468K - 380K ~100K. I'm not striving for accuracy to the byte, just trying to get a read as to the order of magnitude involved.

All that said, the total code footprint is 194268, of which FLASHMEM keeps about 27k out of RAM1. RAM1 data usage is relatively lightweight at 32K and change, leaving 294K available. RAM2 will be a bit bigger since I bumped from 8 to 16 TCP connections, but as shown by the build, about 80K of RAM2 is used, leaving 444K available. So that's just not a worry. Once I get the rest of the code done, I may well stop trying to run from FLASHMEM, and just let the whole thing run from RAM1. I can't find it documented anywhere what sort of performance hit you take running from FLASHMEM, but I have noted another thread where someone had a routine that flat out crashed if run from FLASHMEM.
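
For readers who haven't used them, this is roughly how the two placement macros mentioned in this thread are applied. FLASHMEM and DMAMEM come from the Teensy core; the empty fallback definitions below are only there so the sketch also compiles off-target, and the buffer and function names are invented.

```cpp
#include <cstddef>

// On a Teensy 4.x build these come from the core headers. The fallbacks
// are an assumption made so this sketch also compiles on a desktop.
#ifndef FLASHMEM
#define FLASHMEM
#endif
#ifndef DMAMEM
#define DMAMEM
#endif

// Large buffer placed in RAM2. DMAMEM variables are not initialized at
// startup, so the code must write before it reads.
DMAMEM static char bannerBuffer[4096];

// Rarely called code: FLASHMEM keeps it out of RAM1 and runs it from flash.
FLASHMEM std::size_t loadBanner(const char* text) {
    std::size_t n = 0;
    while (text[n] != '\0' && n + 1 < sizeof(bannerBuffer)) {
        bannerBuffer[n] = text[n];
        ++n;
    }
    bannerBuffer[n] = '\0';
    return n;
}

// Hot-path accessor left unmarked, so it lands in RAM1 by default.
const char* banner() {
    return bannerBuffer;
}
```

Setup and configuration routines are the usual candidates for FLASHMEM, since they run once and their speed rarely matters.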

About the only modification I've made is bumping MEMP_NUM_TCP_PCB to 16, and to my understanding that mostly affects RAM2 usage for the buffers.

Higher usage is RAM1 by a factor of 3 to 1, but that's because of the code. Current build stats are:

Code:
1>------ Build started: Project: SmartHome, Configuration: Release x64 ------
1>Memory Usage on Teensy 4.1:
1>  FLASH: code:194268, data:28620, headers:8532   free for files:7895044
1>   RAM1: variables:32864, code:167396, padding:29212   free for local variables:294816
1>   RAM2: variables:79488  free for malloc/new:444800

That doesn't include malloc / new, and using the various STL template classes that I am using will do a modest amount of allocation. Even so, I don't see RAM2 as being in any sort of danger.

As regards scaling back QNEthernet / lwIP, the only two things that I can obviously turn off are MDNS and IGMP. However, attempts to do so have all resulted in the following error:

Code:
Build started...
1>------ Build started: Project: SmartHome, Configuration: Release x64 ------
1>F:\Dev\Arduino\libraries\QNEthernet\src\lwip\apps\mdns\mdns.c(73,2): error GE39423A9: #error "If you want to use MDNS with IPv4, you have to define LWIP_IGMP=1 in your lwipopts.h"
1>   73 | #error "If you want to use MDNS with IPv4, you have to define LWIP_IGMP=1 in your lwipopts.h"

etc. etc. So it looks like I have IGMP disabled, but haven't yet found the incantation required to disable MDNS.

#define LWIP_MDNS_RESPONDER 0

above the include of QNEthernet.h doesn't help.

Changing to:

#define LWIP_AUTOIP 0 /* 0 */
#define LWIP_DHCP_AUTOIP_COOP 0 /* 0 */


doesn't help, even though the comment above notes "Add both for mDNS". You'd think that setting both to zero as the comments suggest should do the trick, but apparently not. And lastly:

#define LWIP_DNS_SUPPORT_MDNS_QUERIES 0 /* 0 */

doesn't help. At this point I'm out of ideas for how to turn off mDNS.

As noted above, I'm not really stressing it at this point. If I can do it, great, if not, I don't think I'm anywhere near in trouble. Cutting #include <sstream> and #include <regex> significantly reduced the code size, they were something over half the entire code size between them when you compare the recent build stats above to my original post at the top of the thread.
 
Just for science, maybe try using -Os (optimize for size)? The string formatting code blows up considerably when it has to handle floating point numbers, -Os will exclude all that (may break things if you actually need it).

Probably not an option. Among other things, this project has code to determine the time of civil sunrise and sunset on any given day at (almost) any point on Earth. That code is very highly dependent on floating point being available, I'm fairly certain it could not be reliably implemented using integer only math. Theoretically possible with a 16.16 fixed point format, I'm just not motivated to dig up and implement the Chebychev polynomials for sin, cos, tan etc. that I'm using. I've done that many moons ago on the Z80, back when CP/M was still a thing. Once in a lifetime is enough for me. :)
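
For reference, the 16.16 format mentioned here stores a value as a 32-bit integer scaled by 65536. The basic operations are short; the sketch below shows only multiply and divide, with made-up names, and leaves out the polynomial approximations for the trig functions, which is where the real work would be.

```cpp
#include <cstdint>

// 16.16 fixed point: 16 integer bits, 16 fractional bits.
typedef std::int32_t Fixed16;

constexpr Fixed16 kFixedOne = 65536;

constexpr Fixed16 fromInt(int value) {
    return value * kFixedOne;
}

// The product of two 16.16 values has 32 fractional bits, so widen to
// 64 bits and scale back down. Truncates toward zero.
constexpr Fixed16 multiply(Fixed16 a, Fixed16 b) {
    return static_cast<Fixed16>(
        (static_cast<std::int64_t>(a) * b) / kFixedOne);
}

// Pre-scale the dividend so the quotient keeps 16 fractional bits.
// The caller must make sure b is not zero.
constexpr Fixed16 divide(Fixed16 a, Fixed16 b) {
    return static_cast<Fixed16>(
        (static_cast<std::int64_t>(a) * kFixedOne) / b);
}
```

Results outside the 16-bit integer range overflow silently, which is one more reason to stay with floating point when the hardware has it.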
 
Setting those defines inside your project source code won’t work. You either need to set those for the whole build or change them inside lwipopts.h (inside the library). The first approach is harder to do via the Arduino IDE, but in PlatformIO, just add them to `build_flags` in your platformio.ini file. The second approach requires you to edit the library source code.

In other words, to exclude mDNS, either pass `-DLWIP_MDNS_RESPONDER=0` to the compiler for all files (including all library files) or change lwipopts.h inside the library. Note: to disable mDNS, it is sufficient to set only that one macro to zero.

See also:
* This section in the README: https://github.com/ssilverman/QNEth...4b123c4273d76f/README.md#configuration-macros
* This discussion about getting this to work using the Arduino IDE: https://github.com/ssilverman/QNEthernet/issues/33
 
Probably not an option. Among other things, this project has code to determine the time of civil sunrise and sunset on any given day at (almost) any point on Earth. That code is very highly dependent on floating point being available, I'm fairly certain it could not be reliably implemented using integer only math. Theoretically possible with a 16.16 fixed point format, I'm just not motivated to dig up and implement the Chebychev polynomials for sin, cos, tan etc. that I'm using. I've done that many moons ago on the Z80, back when CP/M was still a thing. Once in a lifetime is enough for me. :)

-Os doesn't disable floating point math, functions, instructions or calculations - just the formatting code for printing.
 
Setting those defines inside your project source code won’t work. You either need to set those for the whole build or change them inside lwipopts.h (inside the library). The first approach is harder to do via the Arduino IDE, but in PlatformIO, just add them to `build_flags` in your platformio.ini file. The second approach requires you to edit the library source code.

In other words, to exclude mDNS, either pass `-DLWIP_MDNS_RESPONDER=0` to the compiler for all files (including all library files) or change lwipopts.h inside the library. Note: to disable mDNS, it is sufficient to set only that one macro to zero.

Finally got it done, and it turned out not to be worth the effort. Removing MDNS and IGMP only saved 7k of code. Worse yet, I had to do a bit of hacking in the source itself, since the removal of IGMP caused a couple of undefined references. While I was able to get a successful build by commenting out a few lines of code, I decided against it. Who knows what could have been broken by the required changes, and for 7k of code, that's not a chance that's worth taking.

If you're curious, netif_set_igmp_mac_filter, groupaddr, igmp_joingroup_netif and igmp_leavegroup_netif showed up as either missing at compile time, or undefined references at link time.
 
Thanks for pointing that out (removing IGMP causes build errors). I've fixed this and it will be available the next time I push.

Now you should just be able to set `LWIP_IGMP` to `0` in _lwipopts.h_ and not have to worry about anything else.
 
I also just added a way to disable and exclude TCP, DHCP, DNS, and UDP. That's in addition to mDNS and IGMP.

Each should just be a simple one-line change to _lwipopts.h_.
 