Noob question: Measuring Teensy 4.1 CPU and RAM utilization

graydetroit

I'm an artist who fell into software engineering who also fell into hardware tinkering, so please forgive my ignorance here, but how would I go about logging the specific load on the Teensy 4.1's CPU and RAM? Is there a helper function that spits this sort of information out? While building my project, I'd like to make sure I don't have a CPU intensive process or a memory leak. Appreciate it.
 
Measuring RAM consumption is straightforward. Use the search feature to find threads about it, or type 'free ram site:forum.pjrc.com' into Google if you prefer.

"CPU load" is somewhat trickier.

Unlike a desktop computer, the Teensy does not run an operating system. There is no task scheduler and no concept of threads. A single task runs as fast as possible and consumes all available execution resources, except while an interrupt (if any fire) is being serviced. Talking to the board over USB is one example of something that triggers interrupts.

You can set up one or more timers to run code at intervals, which is somewhat like having threads; and you can set up code to run if a hardware interrupt line is triggered. It is also possible to do some in-code profiling to measure how long something takes, so you can get a good idea of how much time is consumed.
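
One way to turn that kind of in-code profiling into a rough "CPU load" figure is to time the work done on each pass and compare it with the total elapsed time. This is a minimal sketch, not something the Teensy core provides: the LoadMeter name and its methods are made up for illustration, and it assumes the timestamps come from micros() on the Teensy. It is written as plain C++ so it can also be tried on a desktop.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: accumulates "busy" microseconds between begin()/end()
// and reports them as a percentage of a reporting window.
struct LoadMeter {
    uint32_t windowStart = 0;  // timestamp when the current window began
    uint32_t busyStart = 0;    // timestamp of the most recent begin()
    uint32_t busyTotal = 0;    // busy microseconds accumulated in this window

    void begin(uint32_t nowUs) { busyStart = nowUs; }

    // Unsigned subtraction keeps working across a timer wraparound.
    void end(uint32_t nowUs) { busyTotal += nowUs - busyStart; }

    // Returns busy time as a percentage of the window, then starts a new window.
    float report(uint32_t nowUs) {
        uint32_t elapsed = nowUs - windowStart;
        assert(elapsed > 0);
        float pct = 100.0f * busyTotal / elapsed;
        windowStart = nowUs;
        busyTotal = 0;
        return pct;
    }
};
```

On a Teensy you would call meter.begin(micros()) before the work in loop(), meter.end(micros()) after it, and print meter.report(micros()) about once a second. Anything you do not bracket counts as idle, so the figure is only as meaningful as what you choose to measure.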

A lot of code for this kind of device is set up to run in an infinite loop (which is forcefully encouraged by the Arduino paradigm) such that you have, say, five or ten things you want to do, and the loop() function calls them in sequence until forever.
 
Thanks so much for the reply with helpful info, that sort of low level explanation as to how it works under the hood was more what I was after.

Measuring RAM consumption is straightforward. Use the search feature to find threads about it, or type 'free ram site:forum.pjrc.com' into Google if you prefer.

Absolutely, and honestly apologies for posting this thread. A cursory google search seems to have provided me with just what I was looking for: https://forum.pjrc.com/threads/31664-measure-your-teensy-3-x-cpu-and-ram-usage
 
I have a related question: How many blocks on a T4.1? My patch is presently using upward of 256 blocks, can I go to 400 or 500?
 
An audio block is 256 bytes, correct? 128 16 bit samples? So 512 blocks would eat up essentially half available ram? Is that on top of the ram the compiler tells me I am using?
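
For reference, the arithmetic behind that question can be sketched as below. It assumes the Audio library defaults of 128 samples per block and 16-bit samples, and it ignores the few bytes of bookkeeping each block also carries. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kSamplesPerBlock = 128;                               // assumed library default
constexpr size_t kBytesPerBlock = kSamplesPerBlock * sizeof(int16_t);  // 256 bytes of samples
constexpr size_t kRamBankBytes = 512 * 1024;                           // one 512KB RAM bank

// Sample-data bytes needed for a given number of audio blocks.
size_t audioBlockBytes(size_t blocks) { return blocks * kBytesPerBlock; }

// Whole-number percentage of `total` taken up by `bytes`.
size_t percentOf(size_t bytes, size_t total) {
    assert(total > 0);
    return bytes * 100 / total;
}
```

With these assumptions, 512 blocks come to 128KB of sample data, which is about a quarter of one 512KB bank rather than half.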
 
I have a related question: How many blocks on a T4.1? My patch is presently using upward of 256 blocks, can I go to 400 or 500?

Sorry I am not sure what you mean by blocks and Patch... I am guessing your sketch is taking about 256KB?

As Defragster mentioned, the T4 product page has a lot of good memory information.

Before that I started a thread: https://forum.pjrc.com/threads/57326-T4-0-Memory-trying-to-make-sense-of-the-different-regions
that also tries to discuss some of the issues about the different memory regions.

As for 400, 500 it is really hard to say what is happening without seeing code.

For example: you have the 512KB of RAM1, which is used both to keep a faster copy of most of the code and to hold all of your normal program variables and the stack.

Now if you have a large area where you need to store things that are not constant, you can place those variables in RAM2, either by declaring a variable as DMAMEM
or by using a memory allocation function (malloc or new).

Or if you have a bunch of read-only tables and the like, you can mark them like: const uint8_t my_table[] PROGMEM = {....};
and these allocations will stay in the flash memory.

Also, back to code. As mentioned, all code is by default copied down into the lower 512KB space, in 32KB chunks. That is, if your code is 33KB, it will take 64KB away from variables.
You can tell some functions to stay up in flash by marking them with the keyword FLASHMEM.
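
The 32KB rounding described above can be written out as a small calculation. This is only a sketch of the arithmetic as stated in this post; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kChunk = 32 * 1024;  // granularity described in the post
constexpr uint32_t kRam1 = 512 * 1024;  // RAM1 is shared by code and variables

// RAM1 bytes reserved for code: code size rounded up to whole 32KB chunks.
uint32_t ram1ForCode(uint32_t codeBytes) {
    assert(codeBytes <= kRam1);
    return (codeBytes + kChunk - 1) / kChunk * kChunk;
}

// RAM1 bytes left over for variables and the stack.
uint32_t ram1ForData(uint32_t codeBytes) { return kRam1 - ram1ForCode(codeBytes); }
```

So 33KB of code reserves 64KB, leaving 448KB of RAM1 for data, while exactly 32KB of code reserves only 32KB.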

Note: The T4 page has not been updated to say FLASHMEM in this case instead of PROGMEM.
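
Putting the three keywords from this post side by side, a placement sketch might look like the following. On a Teensy 4.x build DMAMEM, PROGMEM and FLASHMEM come from the core; the empty fallback definitions are an assumption added only so the example also compiles on a desktop. The variable and function names are illustrative, and the RAM2 buffer is cleared by hand on the assumption that it is not initialized for you at startup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fallbacks for non-Teensy builds (assumption: the Teensy core defines these).
#ifndef DMAMEM
#define DMAMEM
#endif
#ifndef PROGMEM
#define PROGMEM
#endif
#ifndef FLASHMEM
#define FLASHMEM
#endif

// Large working buffer: placed in RAM2 instead of RAM1.
DMAMEM uint8_t scratch[64 * 1024];

// Read-only table: stays in flash.
const uint8_t my_table[] PROGMEM = {1, 2, 4, 8, 16, 32, 64, 128};

// Rarely called function: runs from flash rather than being copied into RAM1.
FLASHMEM uint8_t tableAt(size_t i) {
    assert(i < sizeof(my_table));
    return my_table[i];
}

// Clear the RAM2 buffer explicitly before first use.
void clearScratch() {
    for (uint8_t &b : scratch) b = 0;
}
```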

Hope that helps.
 
Thanks. I'm really trying to keep track of all this, I think I am getting it. So if I have arrays that need to be fast but not one clock fast I can put them in the upper 512KB? What about using that 512KB for the delay line? It's eating up most of my blocks right now.
 
The Teensy 4.x series has 1MB onboard RAM, of which 512K is "tightly coupled" and the remaining 512K is on the system bus (a bit slower.)

It also has support for 8 or 16MB of external RAM. If you want to hold loads of sample data, that would be a good choice. Looking at my .s3m files (tracked music) I see many between 512K and 1MB, so if you are using a similar footprint of patches, that would get a little tight if you tried to shoehorn them into either of the 512KB segments. On the other hand, 8MB is a yawning chasm, and it's all in the same address space. To my knowledge, this goes over QSPI. This is not super duper fast, but 10MHz * 4 bits wide gives you 40 million bits per second, or a bandwidth of about 5 million bytes per second (naively). I would think you could mix 32-64 voices in real time, even with that limitation.

This is based on a very cursory investigation, and my numbers may be wrong. Someone else can chime in if they know better.
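
The back-of-envelope numbers can be checked with a short calculation. This sketch takes the 10MHz, 4-bit figures from this post at face value and additionally assumes each voice is 16-bit mono at 44.1kHz; it ignores command and address overhead on the bus. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Raw bus throughput in bytes per second for a given clock and bus width.
uint32_t busBytesPerSecond(uint32_t clockHz, uint32_t busBits) {
    return static_cast<uint32_t>(static_cast<uint64_t>(clockHz) * busBits / 8);
}

// How many voices could be streamed at once at that throughput.
uint32_t maxVoices(uint32_t bytesPerSecond, uint32_t sampleRateHz, uint32_t bytesPerSample) {
    assert(sampleRateHz > 0 && bytesPerSample > 0);
    return bytesPerSecond / (sampleRateHz * bytesPerSample);
}
```

Under those assumptions, busBytesPerSecond(10000000, 4) is 5,000,000 bytes per second and maxVoices(5000000, 44100, 2) is 56, which sits inside the 32-64 voice estimate.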
 
One easy way to track some idea of the running performance of loop() is a simple count updated once a second.

Below is a piece of code showing the use of elapsedMicros in loop2() (the lower-resolution elapsedMillis is faster), alongside the cycle counter in setup() and loop() for comparison's sake. It also defines an empty void yield(){} to remove that calling overhead between loop() calls.
Code:
void yield() {} // empty yield() removes per-loop call overhead; only do this if no serialEvent() code is used.

uint32_t changeCycles;
elapsedMicros loop2T;
uint32_t l_cnt;
void setup() {
	Serial.begin(115200);
	while (!Serial && millis() < 4000 );
	if ( ARM_DWT_CYCCNT == ARM_DWT_CYCCNT ) { // Two back-to-back reads match only if the counter is stopped: enable CPU Cycle Count on T_3.x
		ARM_DEMCR |= ARM_DEMCR_TRCENA;
		ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;
	}
	Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
	l_cnt = 0;
	changeCycles = ARM_DWT_CYCCNT;
	while ( ARM_DWT_CYCCNT - changeCycles < F_CPU_ACTUAL ) l_cnt++;
	Serial.printf( " loop    # \t=%u\t", l_cnt);
	Serial.printf( "setup() while() loop\n" );

	l_cnt = 0;
	loop2T = 0;
	changeCycles = ARM_DWT_CYCCNT;
	while ( ARM_DWT_CYCCNT - changeCycles < F_CPU_ACTUAL )  loop2();
	if ( l_cnt ) Serial.printf( " loop2() # \t= %u  \t", l_cnt);
	Serial.printf( "setup() called loop2()\n\n");

	l_cnt = 0;
	uint32_t changeCycles2;
	changeCycles = ARM_DWT_CYCCNT;
	changeCycles2 = ARM_DWT_CYCCNT;
	while ( ARM_DWT_CYCCNT - changeCycles2 < F_CPU_ACTUAL ) loop();
	if ( l_cnt ) Serial.printf( " loop    # \t= %u  \t", l_cnt);
	Serial.printf( "setup() called loop\n\n");
	l_cnt = 0;
	changeCycles = ARM_DWT_CYCCNT;
}

void loop() {
	l_cnt++;
	if ( ARM_DWT_CYCCNT - changeCycles >= F_CPU_ACTUAL ) {
		Serial.printf( " loop #=%u  \n", l_cnt);
		l_cnt = 0;
		changeCycles = ARM_DWT_CYCCNT;
		while ( millis() > 5000 ) asm volatile( "wfi" ); // after 5 seconds, park here in wait-for-interrupt so the output stops
	}
}

void loop2() {
	l_cnt++;
	if ( loop2T > 1000000 ) {
		Serial.printf( " elapsedMicros loop # \t=%u  \n", l_cnt);
		l_cnt = 0;
		loop2T = 0;
	}
}

The output here on a T_4.0 looks like this - with manually added commas to better show magnitude:
Code:
T:\tCode\TIME\TeensyTime\TeensyTime.ino Oct  1 2020 00:53:22
 loop    # 	=119,993,653	setup() while() loop
 loop2() # 	= 13,042,786  	setup() called loop2()

 loop #=29,998,395  
setup() called loop

 loop #=16,215,383  
 loop #=16,215,384
 
I had thought of using samples and then a 6 or 7 second (max) delay, but for now only a short delay is needed. I have the 16MB RAM installed on this board, and it passes that memory test with flying colors, but I have no idea how to access it. I know when I try making something occupy that space I get back bad data.
 
Not sure what the 'try' and 'bad data' might include without sample code.

The PSRAM space is available through EXTMEM, for example: EXTMEM char mySample[4285];
However, the data has to be transferred there manually at runtime, from an SD card or flash memory, because it is not initialized in any way at build time or startup.

Data placed in flash at compile time: PROGMEM char mySampleFlash[4285] = { 0xa, 0xb, ... };

Then memcpy mySampleFlash into mySample in setup().
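
Those steps can be sketched as follows. The four sample bytes are placeholders rather than real audio data, and the empty EXTMEM and PROGMEM fallbacks are an assumption added only so the example compiles off the Teensy. The read-back check is a basic sanity test, not a full memory test.

```cpp
#include <cassert>
#include <cstring>

// Fallbacks for non-Teensy builds (assumption: the Teensy core defines these).
#ifndef EXTMEM
#define EXTMEM
#endif
#ifndef PROGMEM
#define PROGMEM
#endif

// Placeholder sample data, stored in flash at compile time.
const char mySampleFlash[] PROGMEM = {0x0a, 0x0b, 0x0c, 0x0d};

// Destination in external PSRAM; its contents are undefined until we copy into it.
EXTMEM char mySample[sizeof(mySampleFlash)];

// Call once from setup(): copy flash -> PSRAM, then read back to confirm the copy.
void loadSample() {
    memcpy(mySample, mySampleFlash, sizeof(mySampleFlash));
    assert(memcmp(mySample, mySampleFlash, sizeof(mySampleFlash)) == 0);
}
```

Reading mySample before loadSample() has run would be one way to end up with the "bad data" described earlier in the thread.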
 
I would try mimicking what the example code does. As Defragster says, the most logical way is to copy in from an SD card. It is possible that you could bake samples directly into the code, but that would require external tooling and be harder to work with than putting the files on an SD card.

The C way of doing this would be to malloc() memory on the PSRAM equal to the size of a patch, and then save the pointer to that along with the size in a struct. The C++ way would be to create an object that does the same thing; and which provides whatever functions might be directly useful, such as returning the pointer or the length of the patch. Such a class would make sure that external code always knows how to get the audio, and how long it is. It should know and do nothing else.
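
A rough sketch of the C++ version might look like this. It assumes the Teensy 4.1 core provides extmem_malloc() and extmem_free() for allocating in PSRAM; on any other platform it falls back to the ordinary heap so the class can be tried on a desktop. The Patch class and the patchAlloc/patchFree helpers are hypothetical names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Assumption: extmem_malloc()/extmem_free() exist in the Teensy 4.1 core.
#if defined(ARDUINO_TEENSY41)
#include <Arduino.h>
static void *patchAlloc(size_t n) { return extmem_malloc(n); }
static void patchFree(void *p) { extmem_free(p); }
#else
static void *patchAlloc(size_t n) { return malloc(n); }
static void patchFree(void *p) { free(p); }
#endif

// Owns one patch of 16-bit samples and knows only where it is and how long it is.
class Patch {
public:
    // Allocates storage and copies `samples` samples from `src` into it.
    Patch(const int16_t *src, size_t samples)
        : data_(static_cast<int16_t *>(patchAlloc(samples * sizeof(int16_t)))),
          length_(data_ ? samples : 0) {
        if (data_) memcpy(data_, src, samples * sizeof(int16_t));
    }
    ~Patch() { patchFree(data_); }

    // Non-copyable, so the buffer is never freed twice.
    Patch(const Patch &) = delete;
    Patch &operator=(const Patch &) = delete;

    const int16_t *data() const { return data_; }
    size_t length() const { return length_; }  // 0 if the allocation failed

    int16_t at(size_t i) const {
        assert(i < length_);
        return data_[i];
    }

private:
    int16_t *data_;
    size_t length_;
};
```

The source buffer would typically be filled from a file on the SD card before constructing the Patch. A length() of zero signals that the allocation failed.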
 