Stack v Heap


roncos

I'm using the Teensy 4.1 for a new project. The project uses a lot of IO. I have maxed out the T41 pins and added a few analogue and digital multiplexers as well.

I am designing the code structure now, and I am considering avoiding hard-coding all the IO and instead using a data-defined structure for the program. All the state information would be stored in a linked list of objects that get created at setup time according to a config file and updated at run time according to a set of rules.

The low-level code that actually manipulates the pins, or rather makes the library calls to manipulate the pins, would be more or less the same in both cases.

One difference between this approach and a simpler hard-code-everything approach is that I think this approach would primarily use heap memory, whereas hard-coding everything would use stack memory. (Even though it would be using the heap, the memory would be allocated at setup and not changed afterwards, so there would not be a problem with fragmentation.)
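
Roughly, I am thinking of something like the sketch below (just a rough outline to show the idea - the struct and function names are made up for illustration):
Code:
#include <Arduino.h>

// One node per IO point; built at setup() time from the config file,
// then only updated (never reallocated) at run time.
struct IoNode {
  uint8_t pin;        // Teensy pin (or multiplexer channel)
  bool    isAnalog;   // which low-level call to make
  int     state;      // last value read
  IoNode* next;       // simple singly linked list
};

IoNode* ioList = nullptr;   // head of the list, lives on the heap

// Called once per config entry during setup(); allocation happens here
// only, so the heap layout never changes after setup.
void addIoNode(uint8_t pin, bool isAnalog) {
  IoNode* n = new IoNode{pin, isAnalog, 0, ioList};
  if (!isAnalog) pinMode(pin, INPUT);
  ioList = n;
}

// Run-time update: walk the list and refresh each node's state
// according to the rules.
void updateAll() {
  for (IoNode* n = ioList; n != nullptr; n = n->next) {
    n->state = n->isAnalog ? analogRead(n->pin) : digitalRead(n->pin);
  }
}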

I would be very grateful if someone experienced with the T41 could advise whether there is a big speed difference between using the heap vs. the stack, and whether there are any other pitfalls to this approach.
 
The heap is in the "slower" OCRAM. But it is cached by a large 32KB data cache which works really well, so it's not very likely that you will even notice a speed difference.
I'm pretty sure your "list of objects" has a larger impact than the heap detail.
But.. this thing runs at 600MHz. I don't think you have a feel for how fast that can be.. it is also "dual issue", so a huge amount of code gets executed in parallel.. I think you will be surprised re: speed.
A rule of thumb is.. 600MHz = "900MHz" effective..
Of course, every I/O access is slow. Way slower. And if done in a non-optimal way it will slow down your code - much, much more than the stack/heap detail.
 
Thanks Frank. Regarding the IO design issues that you allude to: are you speaking in general, or is this T41-specific? Can you point me to any relevant articles on this?
 
Partly.
On the T4, most of the I/O is connected through a slower bus running at half (EDIT: or a quarter) of the CPU speed.
I don't know of an article - I fear you need to consult the reference manual.
General: avoid polling or waiting loops wherever possible.
As soon as you wait, the CPU speed is irrelevant. The stack/heap detail will disappear in the noise then, anyway.
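
For example, instead of a delay()-based wait, something like this (a minimal sketch - elapsedMillis comes with Teensyduino, and the built-in LED is just a placeholder):
Code:
#include <Arduino.h>

elapsedMillis sinceBlink;   // Teensyduino helper, counts up by itself
bool ledState = false;

void setup() {
  pinMode(LED_BUILTIN, OUTPUT);
}

void loop() {
  // No delay(): the CPU stays free to do other work on every pass.
  if (sinceBlink >= 500) {
    sinceBlink = 0;
    ledState = !ledState;
    digitalWriteFast(LED_BUILTIN, ledState);
  }
  // ...other non-blocking work goes here...
}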


But.. just try it. As said, you will be surprised.
 
I second what Frank mentioned.

That is, without a lot of additional information about your setup and code and the like, my guess is that the speed differences between these two memory areas will most likely not be what impacts your code. Now if you were using PSRAM (the memory you solder on the bottom), maybe more so, but again the built-in hardware cache may hide most of this as well.

Bigger issues are how you are using the IO pins. For example: if you do a lot of analogRead()-like operations, these will by definition be slow, as each one waits for the hardware to complete a full analog sample and conversion...

You can gain a lot of this back by instead using something like the ADC library, where, for example, you can tell the two different ADC modules to start an analog read operation and continue on, and then either poll later to see if it is done and/or use interrupts and/or DMA.
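
Something roughly like this (only a sketch, assuming the startSingleRead()/isComplete()/readSingle() calls from pedvide's ADC library that ships with Teensyduino - check the library's examples for the exact usage):
Code:
#include <ADC.h>

ADC *adc = new ADC();

void setup() {
  adc->adc0->setAveraging(4);
  adc->adc0->setResolution(12);
  adc->adc0->startSingleRead(A0);    // kick off a conversion and return immediately
}

void loop() {
  // ...do other work here while the ADC hardware converts...

  if (adc->adc0->isComplete()) {     // poll: has the conversion finished?
    int value = adc->adc0->readSingle();
    (void)value;                     // use the value here
    adc->adc0->startSingleRead(A0);  // start the next conversion
  }
}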

Also, if you need to read lots of analog pins with as little overhead as possible, there are some additional capabilities in the hardware that may not be fully exposed in the library.
I know a few of us experimented with setups using the ADC subsystem along with the ADC_ETC module, where you can actually sort of chain up ADC reads for multiple pins.
You can set them up to, for example, read pin X, then pin Y, then pin Z... and have the results all stored out in memory one after the other...
This can also be set up to be done using DMA and driven by timers... But again, I don't think we merged all of our playing around with this into the library...
Chapters 66-67 in the RM give the details.
 
Some notes on the T_4.x's specific memory layout here: pjrc.com/store/teensy41.html#memory

As Frank notes, OCRAM / RAM2 / DMAMEM runs at one quarter the speed of on-chip RAM / RAM1 / DTCM, but it has a full-speed cache that can cover 32KB. It can be allocated at compile time with declarations like these at global scope - not on the stack:
Code:
DMAMEM int myData[100]; // allocate in RAM2 at compile time
// ...versus...
int myData[100];        // allocate in RAM1 at compile time

As noted, delay()s are wasteful and I/O is slower. When using digital I/O there are Fast() versions that are more efficient - but only when presented with a compile-time 'constant' pin value, so looking the pin up from a table at runtime won't allow that usage. When Fast(constant pin#) is used, the 'efficient' code is placed inline - so it also avoids a function call.
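
For example (a small sketch; pin 2 is just a placeholder):
Code:
#include <Arduino.h>

const int kLedPin = 2;               // compile-time constant pin number

void setup() {
  pinMode(kLedPin, OUTPUT);
}

void loop() {
  digitalWriteFast(kLedPin, HIGH);   // constant pin: inlined direct register write
  digitalWriteFast(kLedPin, LOW);

  uint8_t pinFromTable = kLedPin;    // pin value only known at run time
  digitalWrite(pinFromTable, HIGH);  // runtime pin: regular (slower) function call
  digitalWrite(pinFromTable, LOW);
}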
 