Measure your Teensy 3.x CPU and RAM usage!

pix-os

for those wanting to measure their arduino's CPU/RAM usage:
Code:
//auto detect speed of your teensy 3.2
#if F_CPU == (24000000L) //24mhz
#define UNLOADED_IDLE_COUNTS 152000
#elif F_CPU == (48000000L) //48mhz
#define UNLOADED_IDLE_COUNTS 280000
#elif F_CPU == (72000000L) //72mhz
#define UNLOADED_IDLE_COUNTS 360000
#else //96mhz
#define UNLOADED_IDLE_COUNTS 420000
#endif

//set cpu percent scale to a linear one
#define LOG 0
#define LIN 1
#define SCALE LIN

static IntervalTimer cpu;

uint32_t Idle_Counter;
uint8_t CPU_Utilization_Info_Read_To_Compute;
uint32_t Prev_Idle_Counter;
uint32_t Idle_Counts;
uint32_t Calculate_Idle_Counts (void);


extern "C" char* sbrk(int incr);
int freeRam() {
  char top;
  return &top - reinterpret_cast<char*>(sbrk(0));
}


void setup()
{

  delay(1000);
  // initialize serial communication (Teensy USB serial runs at USB speed,
  // so the baud rate given here is ignored):
  Serial.begin(250000);

  Serial.println("started");
  cpu.begin(Update_Task_Ready_Flags, 1000);

}
uint32_t Read_Idle_Counts(void)
{
  return Idle_Counts;
}
uint32_t Calculate_CPU_Utilization (uint32_t temp_counts)
{
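  // clamp so a count above the baseline reads 0% instead of wrapping around
  if (temp_counts >= UNLOADED_IDLE_COUNTS) return 0;
  return 100 - ((100 * temp_counts) / UNLOADED_IDLE_COUNTS);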
}
uint32_t Calculate_Idle_Counts (void)
{
  Idle_Counts = Idle_Counter - Prev_Idle_Counter;
  Prev_Idle_Counter = Idle_Counter;
  return Idle_Counts;
}
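// set in the timer ISR and cleared in loop(), so these must be volatile
volatile bool One_MS_Task_Ready;
volatile bool Ten_MS_Task_Ready;
volatile bool One_Hundred_MS_Task_Ready;
volatile bool One_S_Task_Ready;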
inline void One_MS_Task(void)
{

}
inline void Ten_MS_Task(void)
{
  //uncomment line below to put in some cpu load :)
  //delay(8);
}
inline void One_Hundred_MS_Task(void)
{

}
inline void One_S_Task(void)
{
  uint32_t idleCounts = Calculate_Idle_Counts();
#if SCALE == (LOG)
  // the "LOG" scale is really a quadratic curve: percent * percent / 100
  uint8_t linear = Calculate_CPU_Utilization(idleCounts);
  uint8_t percent = (linear * linear) / 100;
#elif SCALE == (LIN)
  uint8_t percent = Calculate_CPU_Utilization(idleCounts);
#endif
  //output percent to serial monitor
  Serial.print(F("CPU usage: "));
  Serial.print(percent);
  Serial.println(F("%"));
  Serial.print(F("Free RAM: "));
  Serial.println(freeRam());
}
inline void Run_Tasks(void)
{
  if(One_MS_Task_Ready)
  {
    One_MS_Task_Ready=0;
    One_MS_Task();
  }
  if(Ten_MS_Task_Ready)
  {
    Ten_MS_Task_Ready=0;
    Ten_MS_Task();
  }
  if(One_Hundred_MS_Task_Ready)
  {
    One_Hundred_MS_Task_Ready=0;
    One_Hundred_MS_Task();
  }
  if(One_S_Task_Ready)
  {
    One_S_Task_Ready=0;
    One_S_Task();
  }
}
void loop()
{
  Idle_Counter++; 
  Run_Tasks();
}
/* WARNING this function called from ISR */
void Update_Task_Ready_Flags(void)
{
  static uint32_t counter;
  One_MS_Task_Ready=1;
  counter++;
  if((counter%10)==0)
  {
    Ten_MS_Task_Ready=1;
  }
  if((counter%100)==0)
  {
    One_Hundred_MS_Task_Ready=1;
  }
  if(counter == 1000)
  {
    One_S_Task_Ready=1;
    counter=0;
  }  
}

note: might not be 100% perfect, but does the job :)
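The utilization figure in the sketch comes from comparing how many times loop() ran in the last second against how many times it runs when every task is empty. Here is a minimal standalone sketch of that arithmetic, with a clamp and a 64-bit intermediate added as safety assumptions. The name cpuUtilizationPercent is hypothetical and not part of the posted code.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the posted sketch.
// idleCounts:         loop() passes counted during the last second.
// unloadedIdleCounts: loop() passes counted in one second with all tasks empty.
uint32_t cpuUtilizationPercent(uint32_t idleCounts, uint32_t unloadedIdleCounts)
{
  // Baseline missing or set too low: report 0% rather than wrapping around.
  if (unloadedIdleCounts == 0 || idleCounts >= unloadedIdleCounts) {
    return 0;
  }
  // 64-bit intermediate so 100 * idleCounts cannot overflow.
  uint64_t idlePercent = (100ULL * idleCounts) / unloadedIdleCounts;
  return 100 - static_cast<uint32_t>(idlePercent);
}
```

With the 96 MHz baseline of 420000 from the sketch, an idle count of 210000 works out to 50%.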
 
If you want it to run as fast as possible, you want to eliminate the modulus operations on powers of 10. If you use powers of 2, the compiler can use bit masking, but typically a divide/modulus is fairly slow.
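Another way to drop the modulus without giving up the decimal periods is a chain of countdown counters. This is only a sketch: TickDivider and its members are hypothetical names, not taken from the posted code. In a real ISR you would call tick() once per millisecond.

```cpp
#include <cstdint>

// Hypothetical replacement for the counter/modulus logic in the timer ISR.
struct TickDivider {
  uint8_t tenLeft = 10;      // 1 ms ticks left until the 10 ms flag
  uint8_t hundredLeft = 10;  // 10 ms periods left until the 100 ms flag
  uint8_t secondLeft = 10;   // 100 ms periods left until the 1 s flag

  // volatile because the flags are set in the ISR and cleared in loop()
  volatile bool tenMsReady = false;
  volatile bool hundredMsReady = false;
  volatile bool oneSReady = false;

  // Call once per 1 ms tick. Only decrements and compares, no divide.
  void tick() {
    if (--tenLeft != 0) return;
    tenLeft = 10;
    tenMsReady = true;

    if (--hundredLeft != 0) return;
    hundredLeft = 10;
    hundredMsReady = true;

    if (--secondLeft != 0) return;
    secondLeft = 10;
    oneSReady = true;
  }
};
```

Each slower period is derived from the faster one, so most ticks cost a single decrement and compare.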
 
but typically a divide/modulus is fairly slow.

Teensy 3.2 has fast 32 bit integer divide, for both signed and unsigned numbers. They're single instructions which execute in just 2 cycles. The compiler is smart about using the divide instruction and the multiply-subtract instruction to efficiently implement modulus. There's no need to bother with powers of 2 on Teensy 3.2. If doing so requires even a slight amount of extra code, it's likely to end up slower than just using divide & modulus with regular non-power-of-2 integers.

Teensy LC lacks the divide instructions and other fancy math instructions, so there you'd want to structure things for powers of 2.
 
like
Code:
unsigned int x;

x = x % 10; // x mod 10
//versus
x = x & 7;   // x mod 8
x &= 255; // x mod 256

Old habits: I normally use a power of 2 and the AND operator. Sometimes AND with the one's complement, depending on purpose. Then the code is optimal no matter the target MCU.
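A small sketch of both masking forms, assuming "AND with the one's complement" means masking with the inverted mask to round down to a multiple of the power of 2. The function names are made up for illustration.

```cpp
#include <cstdint>

// Remainder after dividing by 8: keep the low three bits.
constexpr uint32_t mod8(uint32_t x) { return x & 7u; }

// Round down to a multiple of 8: AND with the complement of the mask.
constexpr uint32_t floorTo8(uint32_t x) { return x & ~7u; }

// Checked at compile time, so no runtime cost on any target.
static_assert(mod8(29) == 29 % 8, "mask matches modulus for unsigned values");
static_assert(floorTo8(29) == 24, "29 rounds down to 24");
static_assert(mod8(29) + floorTo8(29) == 29, "the two parts add back up");
```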

Also, isn't it true that on the Cortex-M3 and M4, using a uint16_t is no faster, and maybe slower, than a uint32_t? Of course, 16 bit uses less RAM if the var is static.
 
In terms of divide latency, I didn't see any instruction latency information in the datasheet at pjrc.com (at least searching for latency and/or divide).

I found this Cortex-M4 manual: http://users.ece.utexas.edu/~valvano/EE345L/Labs/Fall2011/CortexM4_TRM_r0p1.pdf On page 3-5, it lists the latency of the integer sdiv and udiv instructions as 2..12 cycles (multiply is a single-cycle instruction). The divide takes a variable number of cycles because it uses an early-out implementation. That means the modulus operation (divide, multiply, subtract) takes 4..14 cycles (plus more if there are pipelining artifacts in back-to-back instructions), compared to a single AND-immediate instruction. That is a lot better than many other processors (particularly those without a hardware divide instruction), but if you are trying for the fastest possible speed, stick with modulus by powers of 2 on unsigned types.
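The "unsigned types" part matters because a plain mask only equals the modulus when the value cannot be negative. A short sketch of the difference (function names are hypothetical):

```cpp
#include <cstdint>

// Unsigned: x % 8 and x & 7 are always the same value,
// so an optimizing compiler is free to emit a single AND.
uint32_t modUnsigned(uint32_t x) { return x % 8u; }

// Signed: the result of % takes the sign of x (C++11 and later),
// so the compiler needs extra instructions to handle negative inputs.
int32_t modSigned(int32_t x) { return x % 8; }

// Masking a signed value gives a different answer for negative inputs.
int32_t maskSigned(int32_t x) { return x & 7; }
```

For x = -1, modSigned gives -1 while maskSigned gives 7, so swapping one for the other by hand is only safe on unsigned values.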

FWIW, the latency of most single precision floating point ops is also 1 cycle (except for the fused multiply/add type instructions), but floating point divide/square root is 14 cycles.
 
FWIW, the latency of most single precision floating point ops is also 1 cycle (except for the fused multiply/add type instructions), but floating point divide/square root is 14 cycles.
You're referring to M3/M4 that have FPUs. A future Teensy 3 is expected to have such.
 
well, it might not be perfect, but it's a rather easy indication of how to speed up your code :)

like i was continuously checking if the wav was still playing, which consumed quite a bit of cpu; changed that to every half second :)
same for button presses and button holds: changed that to every 20ms. works the same, but with a slight, unnoticeable latency now.
the screen now updates play info 1x per second instead of 8x per second (and whenever the gui needs an update). note that this isn't the entire screen, though.

this saved a lot of cpu cycles!
before these 3 mods it was about 60% when playing a wav file, now it is <42% :D
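One possible way to rate-limit checks like these is a small helper that only reports "due" once per interval. This is a sketch, not the code actually used in the project above; Throttle is a hypothetical name, and the current time is passed in so it can be fed from millis() on a Teensy.

```cpp
#include <cstdint>

// Hypothetical helper: due() returns true at most once per interval.
struct Throttle {
  uint32_t intervalMs;  // minimum time between runs
  uint32_t lastRunMs;   // time of the last run that was allowed

  bool due(uint32_t nowMs) {
    // Unsigned subtraction stays correct when the millisecond counter wraps.
    if (nowMs - lastRunMs < intervalMs) {
      return false;
    }
    lastRunMs = nowMs;
    return true;
  }
};
```

In loop() this could look like `static Throttle buttons{20, 0}; if (buttons.due(millis())) { /* poll buttons */ }`, with a second instance at 500 ms for the playback check.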

serial3.png
 
update:

i've now tuned UNLOADED_IDLE_COUNTS so that an idle sketch shows 1% on the lin scale (it was showing around 11-14%). on the log scale this comes out as 0%, though.
in short: it's more accurate now :)
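If you would rather compute the baseline than tune it by hand, the linear formula can be inverted. The sketch below assumes you have printed the raw idle count of an otherwise empty sketch; baselineForOnePercent is a hypothetical helper, and the count of 415800 in the usage note is only an illustrative number, not a measurement from this thread.

```cpp
#include <cstdint>

// Hypothetical helper: largest baseline for which
//   100 - (100 * idle) / baseline
// reports 1% for the measured idle count (valid for counts of 99 or more).
uint32_t baselineForOnePercent(uint32_t measuredIdleCounts)
{
  // 64-bit intermediate so 100 * measuredIdleCounts cannot overflow.
  return static_cast<uint32_t>((100ULL * measuredIdleCounts) / 99);
}
```

For an illustrative idle count of 415800 this gives 420000, and plugging both back into the formula yields 100 - 99 = 1%.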
 
Hi,

I have some problems with my program, and it looks like the CPU is at 100%. To fix it I first want to measure the actual load, but I'm not sure how to do that on Teensy 4.1.

Any advice?

Kind regards
 
Teensys and other similar systems ALWAYS operate at 100%. They execute your program ALL the time.

They are not like PCs with operating systems with pre-emptive multi-tasking, where the PC constantly switches between running programs.

There is ONLY ONE program on the Teensy and it's yours.
 
This thread is 8 years old, back when Teensy 3.2 was the most powerful model.

I have some problems with my program, and it looks like the CPU is at 100%.

If the info and code on this old thread didn't help, you're probably better off starting a new conversation specific to the problem you're experiencing on Teensy 4.1.

Please keep in mind we can't see your program or computer screen or hardware. If you don't tell us anything about your program or the specific things you're actually observing that lead you to believe CPU usage is a problem, please understand how impossible it is for us to help you.
 