T4.x (maybe other) reduce serialEventX overhead on system (worth it?)

KurtE

While I was hacking (carefully crafting) changes to how yield works, as per the other thread, I came back to the T4.x code. There I would blindly enable yield to call the serialEventX function for every Serial port you called begin on, and then remove the call the first time our default implementation ran, since it served no useful purpose.

I did that because I did not know of a way to find out whether the sketch has provided its own serialEvent implementation.

Example sketches that do things like:
Code:
void serialEvent1() {
    while (Serial1.available()) Serial.write(Serial1.read());
}
But I think I figured out a way of detecting this...

What I just tried was to extract the default implementation of serialEvent1 out of HardwareSerial1.cpp and put it into a new file, serialEvent1.cpp.

Now in HardwareSerial1.cpp I add the following:
Code:
//void serialEvent1() __attribute__((weak));
//void serialEvent1() {Serial1.disableSerialEvents(); }		// No use calling this so disable if called...
uint8_t serialEvent1_default __attribute__((weak)) PROGMEM = 0 ;

and in the new serialEvent1.cpp the file has:
Code:
#include <Arduino.h>
#include "HardwareSerial.h"
void serialEvent1() __attribute__((weak));
void serialEvent1() {Serial1.disableSerialEvents(); }		// No use calling this so disable if called...
uint8_t serialEvent1_default PROGMEM = 1;

So I tried adding this to a simple sketch:
Code:
  extern const uint8_t serialEvent1_default;
  Serial.printf("Default serialEvent1? %d\n", serialEvent1_default);
With no serialEvent1 in the sketch this printed 1; when I added one it printed 0...

So I can create 9 new files like this (Serial1-8 plus the USB one), and change my begin method to check a flag like this and only set up the serialEvent call if the user's code actually makes use of it.
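A minimal host-side sketch of that begin-time check, assuming the flag convention shown above (1 means the default handler was linked, 0 means the sketch supplied its own). The function name `begin_port` and the bit value are hypothetical stand-ins, not the core's real API:

```cpp
#include <cstdint>

// Hypothetical bit value; the real core defines its own constants.
constexpr uint8_t YIELD_CHECK_HARDWARE_SERIAL = 0x02;

// Stand-in for the global that yield() tests before doing any work.
uint8_t yield_active_check_flags = 0;

// Mimics what begin() could do: only ask yield() to poll this port
// when the sketch overrides the weak default serialEventN().
// is_default_handler plays the role of serialEvent1_default.
void begin_port(uint8_t is_default_handler) {
    if (!is_default_handler) {
        yield_active_check_flags |= YIELD_CHECK_HARDWARE_SERIAL;
    }
}
```

With this shape, a sketch that never defines a serialEvent function leaves the flags at 0 no matter how many ports it begins.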

Does this make sense? Worth it?
 
If that lets yield() processing finish more quickly, doing less or nothing when those things are not in user code, that is very cool.

What does the resultant yield() processing look like?
 
I added the changes to the branch: https://github.com/KurtE/cores/tree/eventResponder_reduce_overhead
I have a PR open on that branch. I probably should have someone (including myself) build it on Linux or Mac to make sure the file name stuff works OK.

It appears that build file order makes a difference in whether a ((weak)) variable is used or another one is brought in...

Also, the changes now include the Serial changes for XBAR pins (a different PR), as I did not want myself, or Paul if he merges them in, to have to resolve conflicts.

But with these changes, if you have a simple sketch that doesn't implement any serialEvent-like functions and doesn't use any eventResponder objects that are set up to be called on yield, then yield reduces to:
Code:
void yield(void) __attribute__ ((weak));
void yield(void)
{
	static uint8_t running=0;
	if (!yield_active_check_flags) return;	// nothing to do
	if (running) return; // TODO: does this need to be atomic?
	running = 1;


	// USB Serial - Add hack to minimize impact...
	if (yield_active_check_flags & YIELD_CHECK_USB_SERIAL) {
		if (Serial.available()) serialEvent();
		if (_serialEvent_default) yield_active_check_flags &= ~YIELD_CHECK_USB_SERIAL;
	}

	// Current workaround until integrate with EventResponder.
	if (yield_active_check_flags & YIELD_CHECK_HARDWARE_SERIAL) HardwareSerial::processSerialEvents();

	running = 0;
	if (yield_active_check_flags & YIELD_CHECK_EVENT_RESPONDER) EventResponder::runFromYield();
	
};
where yield_active_check_flags will be 0.
So yield will simply do one if and return.
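To make the fast path concrete, here is a host-side model of that dispatch. The bit names follow the listing above, but the numeric values and the `checks_performed` counter are assumptions added purely for illustration:

```cpp
#include <cstdint>

// Bit names follow the listing above; the numeric values are assumptions.
constexpr uint8_t YIELD_CHECK_USB_SERIAL      = 0x01;
constexpr uint8_t YIELD_CHECK_HARDWARE_SERIAL = 0x02;
constexpr uint8_t YIELD_CHECK_EVENT_RESPONDER = 0x04;

uint8_t yield_active_check_flags = 0;
uint32_t checks_performed = 0;  // counts subsystem checks, for illustration only

// Host-side model of the dispatch: returns immediately when no bit is set.
void yield_model() {
    if (!yield_active_check_flags) return;  // fast path: one test, then return
    if (yield_active_check_flags & YIELD_CHECK_USB_SERIAL)      ++checks_performed;
    if (yield_active_check_flags & YIELD_CHECK_HARDWARE_SERIAL) ++checks_performed;
    if (yield_active_check_flags & YIELD_CHECK_EVENT_RESPONDER) ++checks_performed;
}
```

Each subsystem only costs anything once something has set its bit, so the common case is a single load and compare.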

From the other thread on eventResponder: I ran the test again with some stuff printing out properly.
Code:
SPI Test program

Default serialEvent? 1 1
Press any key to run test

start test yield_active_check_flags 0
  systick ISR: 2159
Start: 35

Test Immediate: 0 2159
After Immediate: 35

Test yield: 4 2159
After yield: 55

Test Interrupt: 4 2179
After Interrupt: 56

Press any key to run test
Where I am printing out the value of yield_active_check_flags.
I then test how many microseconds 1000 calls to yield take.

So at the start of the test, where yield does nothing: 35
If I have set up an eventResponder that is called on yield: 55
...
 
Thanks,

Again not sure how much it is worth it, as I don't know anyone who cares about the yield and eventResponder overhead ... So again probably waste of time, but I thought I would give an hour or two to see if I can bring some of it back in to the Teensy3 branch...

I hacked up the same sketch to disable printing some of the T4-specific (and changed) information and ran it on T3.5...
Some of the timings are a bit different from the T4.1 I just tested on.

Code:
SPI Test program

Press any key to run test

Start: 2466

Test Immediate: 0 2859
After Immediate: 2466

Test yield: 0 2859
After yield: 2474

Test Interrupt: 0 2859
After Interrupt: 2472

Press any key to run test
So in my last test on T4.1, calling yield 1000 times took: Start: 35
Starting off, T3.5 took: Start: 2466

So maybe an area for some slight improvements.

Also starting code/data sizes
Code:
"C:\\arduino-1.8.12\\hardware\\teensy/../tools/arm/bin/arm-none-eabi-size" -A "C:\\Users\\kurte\\AppData\\Local\\Temp\\arduino_build_604884/SPI_test_eventResponder.ino.elf"
Sketch uses 36720 bytes (7%) of program storage space. Maximum is 524288 bytes.
Global variables use 5360 bytes (2%) of dynamic memory, leaving 256776 bytes for local variables. Maximum is 262136 bytes.
C:\arduino-1.8.12\hardware\teensy/../tools/teensy_post_compile -file=SPI_test_eventResponder.ino -path=C:\Users\kurte\AppData\Local\Temp\arduino_build_604884 -tools=C:\arduino-1.8.12\hardware\teensy/../tools -board=TEENSY35 -reboot -port=usb:0/140000/0/1/1/1 -portlabel=hid#vid_16c0&pid_0478 Bootloader -portprotocol=Teensy

Now some quick hacking
 
Not sure I'm making sense of the test numbers?

One simple real test would be to monitor the count of loop() calls per second:
>> Yield as it was
>> User sketch :: void yield() {}
>> Yield as it is when no serialEvent() in use?

That will result in yield being called as it normally is, and in some 100K's to millions of loop()'s per second, unique to each Teensy.
 
The test numbers are simply:
Code:
void TimeYieldCalls(const char *sz) {
  yield();
  Serial.print(sz); Serial.flush();
  elapsedMicros em = 0;
  for (uint32_t i = 0; i < 1000; i++) yield();
  uint32_t elapsed = em;
  Serial.print(": ");
  Serial.println(elapsed, DEC);
  Serial.flush();
}
So with my updates for T4.x, with a sketch that does not do any serialEvent stuff and does not do eventResponder.attach(&my_func); (which adds this event object to a list of ones called by yield), 1000 calls to yield took 35us, while the current unchanged T3.5 took 2466us.

So now I am working on a version for T3.x, and I assume T-LC will fall out of that. In this case it should again logically end up simply doing:

Code:
void yield(void)
{
	static uint8_t running=0;
	if (!yield_active_check_flags) return;	// nothing to do
	...
}
where in this case yield_active_check_flags will be 0 and return.

But with my hacking, I will hopefully also get to a version where, on T3.x and T-LC, only the code and data for the objects you actually use are brought into your sketch. So for example if your sketch on T3.5 only uses Serial1, then you will not pay the space penalty for Serial2-6...
Although I am not sure how much that saves. So far with my hacking, using only Serial1 reduced data size by about 580 bytes.
Which is probably not too far off.

My current run on T3.5 is a lot better speed wise...
Code:
SPI Test program

Press any key to run test

Start: 212

Test Immediate: 0 26cd
After Immediate: 212

Test yield: 4 26cd
After yield: 395

Test Interrupt: 4 26dd
After Interrupt: 390
Obviously not as fast as T4.1 but ...

Will push up changes in the morning.
 
That makes more sense of the numbers and where the state of progress was for 4.x and 3.x.

Testing below on stock TD 1.52 - will wait for github update for 3.x and make sure 4.x is settled.

Below are runs of this sketch - Above TimeYieldCalls() - with and without void yield() and loopCnt.
>> manual line 1 edit for Teensy ###
>> line 2 :: #if 0 or 1
>> Line 4&6 :: change "TD 1.52" when cores changed ( or edit line 1 text )
Code:
const char szTeensy[] = "Teensy 4.1";
#if 0
void yield() {}
const char szTest[] = "TD 1.52 :: PRIVATE  yield() :: setup Test:";
#else
const char szTest[] = "TD 1.52 setup Test:";
#endif

elapsedMillis loopTime;
uint32_t loopCnt = 0;
void setup() {
  while (!Serial) ; // wait
  TimeYieldCalls( szTest );
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  Serial.println(szTeensy);
  TimeYieldCalls( szTest );
  loopTime = 0;
}

void loop() {
  if ( loopTime >= 1000 ) {
    loopTime -= 1000;
    Serial.printf("loop's per sec = %lu\n", loopCnt);
    loopCnt = 0;
  }
  loopCnt++;
}

void TimeYieldCalls(const char *sz) {
  yield();
  Serial.print(sz); Serial.flush();
  elapsedMicros em = 0;
  for (uint32_t i = 0; i < 1000; i++) yield();
  uint32_t elapsed = em;
  Serial.print(": ");
  Serial.println(elapsed, DEC);
  Serial.flush();
}


Code:
TD 1.52 :: PRIVATE  yield() :: setup Test:: 1

T:\tCode\Serial\YieldTest\YieldTest.ino May 20 2020 23:28:11
Teensy 4.1
TD 1.52 :: PRIVATE  yield() :: setup Test:: 0
loop's per sec = 17141019
loop's per sec = 17141119
loop's per sec = 17141119
loop's per sec = 17141120
Code:
TD 1.52 setup Test:: 64

T:\tCode\Serial\YieldTest\YieldTest.ino May 20 2020 23:28:37
Teensy 4.1
TD 1.52 setup Test:: 64
loop's per sec = 11761913
loop's per sec = 11763478
loop's per sec = 11763477
loop's per sec = 11763478

Teensy 3.6::
Code:
TD 1.52 :: PRIVATE  yield() :: setup Test:: 0

T:\tCode\Serial\YieldTest\YieldTest.ino May 20 2020 23:35:06
Teensy 3.6
TD 1.52 :: PRIVATE  yield() :: setup Test:: 1
loop's per sec = 3909437
loop's per sec = 3909477
loop's per sec = 3909493
loop's per sec = 3909500
Code:
TD 1.52 setup Test:: 847

T:\tCode\Serial\YieldTest\YieldTest.ino May 20 2020 23:35:50
Teensy 3.6
TD 1.52 setup Test:: 849
loop's per sec = 940876
loop's per sec = 941538
loop's per sec = 941541
loop's per sec = 941544

Teensy 3.1::
Code:
TD 1.52 :: PRIVATE  yield() :: setup Test:: 0

T:\tCode\Serial\YieldTest\YieldTest.ino May 20 2020 23:39:03
Teensy 3.1
TD 1.52 :: PRIVATE  yield() :: setup Test:: 0
loop's per sec = 2081188
loop's per sec = 2081255
loop's per sec = 2081258
loop's per sec = 2081261
Code:
TD 1.52 setup Test:: 1425

T:\tCode\Serial\YieldTest\YieldTest.ino May 20 2020 23:39:23
Teensy 3.1
TD 1.52 setup Test:: 1431
loop's per sec = 442589
loop's per sec = 442984
loop's per sec = 442984
loop's per sec = 442984

Teensy LC:
Code:
TD 1.52 :: PRIVATE  yield() :: setup Test:: 1

T:\tCode\Serial\YieldTest\YieldTest.ino May 20 2020 23:45:15
Teensy LC
TD 1.52 :: PRIVATE  yield() :: setup Test:: 1
loop's per sec = 851889
loop's per sec = 851922
loop's per sec = 852000
loop's per sec = 851994
Code:
TD 1.52 setup Test:: 3858

T:\tCode\Serial\YieldTest\YieldTest.ino May 20 2020 23:44:55
Teensy LC
TD 1.52 setup Test:: 3876
loop's per sec = 202910
loop's per sec = 203055
loop's per sec = 203073
loop's per sec = 203072
 
Wondered how many cycles are spent in and out of loop() per run, but CYCCNT won't work on T_LC:
Code:
const char szTeensy[] = "Teensy 4.1";
#if 1
void yield() {}
const char szTest[] = "TD 1.52 :: PRIVATE  yield() :: setup Test:";
#else
const char szTest[] = "TD 1.52 setup Test:";
#endif

elapsedMillis loopTime;
uint32_t loopCnt = 0;
uint32_t loopACC[16];
uint32_t yieldACC[16];
uint32_t lastACC = 0;

void setup() {
  // Two back-to-back reads only match when the cycle counter is not running
  if ( ARM_DWT_CYCCNT == ARM_DWT_CYCCNT ) {
    // Enable CPU Cycle Count
    ARM_DEMCR |= ARM_DEMCR_TRCENA;
    ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;
  }
  while (!Serial) ; // wait
  TimeYieldCalls( szTest );
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  Serial.println(szTeensy);
  TimeYieldCalls( szTest );
  loopTime = 0;
}

void loop() {
  yieldACC[loopCnt & 0xF] = ARM_DWT_CYCCNT - lastACC;
  lastACC = ARM_DWT_CYCCNT;
  if ( loopTime >= 1000 ) {
    for ( int ii = 1; ii <= 0xf; ii++ ) {
      loopACC[0] += loopACC[ii];
      yieldACC[0] += yieldACC[ii];
    }
    Serial.printf("loop's per sec = %lu\t", loopCnt);
    Serial.printf("ARM_Cycles's in loop = %lu\t in yield = %lu\n\n", loopACC[0]/16, yieldACC[0]/16);
    loopCnt = 0;
    loopTime = 0;
  }
  loopCnt++;
  loopACC[loopCnt & 0xF] = ARM_DWT_CYCCNT - lastACC;
  lastACC = ARM_DWT_CYCCNT;
}

void TimeYieldCalls(const char *sz) {
  yield();
  Serial.print(sz); Serial.flush();
  elapsedMicros em = 0;
  for (uint32_t i = 0; i < 1000; i++) yield();
  uint32_t elapsed = em;
  Serial.print(": ");
  Serial.println(elapsed, DEC);
  Serial.flush();
}

Shows how little a 'simple' loop() does, and how few cycles of difference there are in [return / run yield() / call loop()] between the void and current yield(), but it adds up ... about 60 million cycles per second for the current yield with extra loop()'s.
>> Even adding the CYCCNT tracking code slowed it down, and cycles are lost per loop recording the counts
Code:
TD 1.52 :: PRIVATE  yield() :: setup Test:: 1

T:\tCode\Serial\YieldTest\YieldTest.ino May 21 2020 02:12:48
Teensy 4.1
TD 1.52 :: PRIVATE  yield() :: setup Test:: 0
loop's per sec = 10525174	ARM_Cycles's in loop = 8	 in yield = 40

loop's per sec = 10525203	ARM_Cycles's in loop = 11	 in yield = 37  // accounts for 505209744 cycles

loop's per sec = 10525196	ARM_Cycles's in loop = 8	 in yield = 40

loop's per sec = 10525203	ARM_Cycles's in loop = 8	 in yield = 37
Code:
TD 1.52 setup Test:: 64

T:\tCode\Serial\YieldTest\YieldTest.ino May 21 2020 02:13:11
Teensy 4.1
TD 1.52 setup Test:: 64
loop's per sec = 8217224	ARM_Cycles's in loop = 8	 in yield = 53

loop's per sec = 8218294	ARM_Cycles's in loop = 8	 in yield = 56   // accounts for 525970816 cycles

loop's per sec = 8218294	ARM_Cycles's in loop = 8	 in yield = 56

loop's per sec = 8218294	ARM_Cycles's in loop = 8	 in yield = 56
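
The "accounts for" annotations in the two runs above appear to be the loops per second multiplied by the sum of the two per-iteration cycle counts on that line. A small helper, with a made-up name, reproduces both figures:

```cpp
#include <cstdint>

// Total cycles accounted for in one second: iterations multiplied by the
// cycles measured inside loop() plus the cycles measured outside it.
uint64_t cycles_accounted(uint64_t loops_per_sec,
                          uint32_t cycles_in_loop,
                          uint32_t cycles_in_yield) {
    return loops_per_sec * (cycles_in_loop + cycles_in_yield);
}
```

For example, 10525203 loops at 11 + 37 cycles gives 505209744, and 8218294 loops at 8 + 56 cycles gives 525970816, matching the annotations.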

Note:
static inline void yield() {} // runs the same fast 37 to 40 cycles so the build does this it seems.
inline void yield() {} // runs the same slower 56 cycles as calling the current PJRC yield()
 
Thanks @defragster - I just pushed up the changes for the teensy3 branch. I did compile my test sketch for some of this on T3.5 and ran it, plus compiled for 3.2 and then compiled for LC and ran it.

So I think everything appears to be working.
 
Note: Mostly talking to self ;)

Thinking of doing a quick cleanup on T3.x/LC code I did yesterday, trying to decide which way is cleaner/faster.

Currently I have code that populates an array by Serial object index with an individual callback function for each of the Serial objects. And then now yield does:


Code:
...
	if (yield_active_check_flags & YIELD_CHECK_HARDWARE_SERIAL) {
		if (serial_event_handler_checks[0]) (*serial_event_handler_checks[0])();
		if (serial_event_handler_checks[1]) (*serial_event_handler_checks[1])();
		if (serial_event_handler_checks[2]) (*serial_event_handler_checks[2])();
#ifdef HAS_KINETISK_UART3
		if (serial_event_handler_checks[3]) (*serial_event_handler_checks[3])();
#endif
#ifdef HAS_KINETISK_UART4
		if (serial_event_handler_checks[4]) (*serial_event_handler_checks[4])();
#endif
#if defined(HAS_KINETISK_UART5) || defined (HAS_KINETISK_LPUART0)
		if (serial_event_handler_checks[5]) (*serial_event_handler_checks[5])();
#endif
	}
First I am going to move the callback code over to the HardwareSerial code, as that is cleaner...
I am also thinking of populating the array by simply adding only the ports that have user callback functions to the list.
So a simple loop, calling each one...

Also thinking of adding the add-to-list and processing code to the HardwareSerial class, and maybe passing a pointer to each port's serialEvent function to the constructor.
So there is one simple callback member function that does: if(available()) (*_serialEvent)();

Instead of individual ones.
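
A rough host-side sketch of that list-based design, under the assumption that each port stores a pointer to its handler. `PortModel`, `add_to_event_list`, `process_serial_events`, and `demo_handler` are hypothetical names, not the core's actual classes or functions:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a hardware serial port.
struct PortModel {
    int pending;             // bytes waiting, as available() would report
    void (*serial_event)();  // user handler for this port

    int available() const { return pending; }

    // Single shared check: call the handler only when data is waiting.
    void do_yield_check() { if (available()) (*serial_event)(); }
};

constexpr size_t MAX_PORTS = 8;
PortModel *active_ports[MAX_PORTS];  // only ports with user handlers
size_t active_count = 0;

// Would be called from begin() when a user handler exists; ignores duplicates.
void add_to_event_list(PortModel *p) {
    for (size_t i = 0; i < active_count; ++i) {
        if (active_ports[i] == p) return;
    }
    if (active_count < MAX_PORTS) active_ports[active_count++] = p;
}

// Would be called from yield(): a plain loop over the registered ports.
void process_serial_events() {
    for (size_t i = 0; i < active_count; ++i) active_ports[i]->do_yield_check();
}

// Example handler that just counts invocations.
int handler_calls = 0;
void demo_handler() { ++handler_calls; }
```

The loop only ever visits ports that registered a handler, so unused ports cost nothing per yield() call.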

Side note: the T3.x core has support for serialEventUSB1 and serialEventUSB2, but does not currently have this... I see an extern defined for both but not anything in yield.
Maybe add in
 
Pulled the few hours old CORES to 2nd IDE 1.8.12 folder.

LOOKS AWESOME KURT!!!! That set of QUICK EXIT changes to yield processing looks to MATCH user added private 'void yield()' !!! :)

Testing PR code with two versions of loop() above shows great results. Did not actually test with use of serialEvents() but to compare to above:

Code:
TD 1.52 :: PRIVATE  yield() :: setup Test:: 0

T:\tCode\Serial\YieldTest\YieldTest.ino May 21 2020 12:26:39
Teensy 4.1 :: CORES event PR
TD 1.52 :: PRIVATE  yield() :: setup Test:: 1
loop's per sec = 10525718	ARM_Cycles's in loop = 8	 in yield = 37

loop's per sec = 10525695	ARM_Cycles's in loop = 8	 in yield = 38
Code:
TD 1.52 setup Test:: 37

T:\tCode\Serial\YieldTest\YieldTest.ino May 21 2020 12:27:14
Teensy 4.1 :: CORES event PR
TD 1.52 setup Test:: 35
loop's per sec = 10524895	ARM_Cycles's in loop = 8	 in yield = 38

loop's per sec = 10525716	ARM_Cycles's in loop = 8	 in yield = 38

Then changing to prior non-CYCCNT version of loop()::
Code:
TD 1.52 setup Test:: 35

T:\tCode\Serial\YieldTest\YieldTest.ino May 21 2020 12:29:04
Teensy 4.1 :: CORES event PR
TD 1.52 setup Test:: 35
loop's per sec = 17644653
loop's per sec = 17646083
Code:
TD 1.52 :: PRIVATE  yield() :: setup Test:: 0

T:\tCode\Serial\YieldTest\YieldTest.ino May 21 2020 12:29:29
Teensy 4.1 :: CORES event PR
TD 1.52 :: PRIVATE  yield() :: setup Test:: 1
loop's per sec = 17141903
loop's per sec = 17141920

And the same runs on a T_3.6 - doesn't recover the gains quite as well ... but better!:
Code:
TD 1.52 :: PRIVATE  yield() :: setup Test:: 0

T:\tCode\Serial\YieldTest\YieldTest.ino May 21 2020 12:42:13
Teensy 3.6 :: CORES event PR
TD 1.52 :: PRIVATE  yield() :: setup Test:: 0
loop's per sec = 2067462	ARM_Cycles's in loop = 25	 in yield = 44

loop's per sec = 2067448	ARM_Cycles's in loop = 26	 in yield = 43

TD 1.52 setup Test:: 141

T:\tCode\Serial\YieldTest\YieldTest.ino May 21 2020 12:42:32
Teensy 3.6 :: CORES event PR
TD 1.52 setup Test:: 141
loop's per sec = 1712587	ARM_Cycles's in loop = 26	 in yield = 61

loop's per sec = 1713034	ARM_Cycles's in loop = 25	 in yield = 62
Code:
TD 1.52 :: PRIVATE  yield() :: setup Test:: 0

T:\tCode\Serial\YieldTest\YieldTest.ino May 21 2020 12:43:08
Teensy 3.6 :: CORES event PR
TD 1.52 :: PRIVATE  yield() :: setup Test:: 0
loop's per sec = 3910247
loop's per sec = 3910269
TD 1.52 setup Test:: 140

T:\tCode\Serial\YieldTest\YieldTest.ino May 21 2020 12:43:26
Teensy 3.6 :: CORES event PR
TD 1.52 setup Test:: 141
loop's per sec = 2809698
loop's per sec = 2810521
 
Still doing some testing.

Ran into an issue with the T4 branch: Serial2-x always appear to think they are using a user-specified event handler.

The T3.x branch appears to work properly... So investigating.

Saw the above on both Windows and MAC.
 
@Kurt - did you make an #ifdef'd sketch that iterates the valid ports: if (serial_event_handler_default) Serial.print( "port USER Event" ); else Serial.print( "port NULL" );

oops - found that is private ...
 
I just pushed up a fix for the T4.x... I will double-verify on the Mac that it builds correctly there. Note: I changed some of the T4 processing to be more in line with the T3.x code. Instead of checking all 8 items for NULL to see if we should call the check code, I now just populate the array with the active ones and keep a count. I also got rid of the special function per Serial port; instead the hardware structure keeps a link to the event function, and a member function calls available() and, if it returns true, calls through the pointer to the event function...

I noticed that when I called begin on Serial2, the yield flags I mentioned earlier were not 0... So I wondered what was going on, and hacked up the test sketch a bit more, like:
Code:
void setup() {
  pinMode(CS_PIN, OUTPUT);
  digitalWriteFast(CS_PIN, HIGH);
  while (!Serial && millis() < 4000) ;  // wait for Serial port
  Serial.begin(115200);
  SPI.begin();
  Serial.println("SPI Test program");
  Serial1.begin(2000000);
  Serial2.begin(2000000);
  Serial3.begin(2000000);
  extern const uint8_t _serialEvent_default;
  extern const uint8_t _serialEvent1_default;
  extern const uint8_t _serialEvent2_default;
  extern const uint8_t _serialEvent3_default;
  Serial.printf("Default serialEvent? %d %d %d %d\n", _serialEvent_default, 
      _serialEvent1_default,_serialEvent2_default,_serialEvent3_default);
#if defined(__IMXRT1062__)
  Serial4.begin(2000000);
  Serial5.begin(2000000);
  Serial6.begin(2000000);
  Serial7.begin(2000000);
  //Serial8.begin(2000000);
  extern const uint8_t _serialEvent4_default;
  extern const uint8_t _serialEvent5_default;
  extern const uint8_t _serialEvent6_default;
  extern const uint8_t _serialEvent7_default;
//  extern const uint8_t _serialEvent8_default;
  Serial.printf("    %d %d %d %d\n", _serialEvent4_default,
      _serialEvent5_default,_serialEvent6_default,_serialEvent7_default);
#endif
...
}
And it was showing the default flags as 0 ...
Now all show default...

Back to MAC

Update: Works on MAC
 
Starting a sketch with JUMPER plugs on all T_4.1 Serial#'s. So far all 8 ports work without Event checking.

Every second it prints 'ii' from loop to each Serial#[1-8].
Every 2+ seconds it reads from each Serial#, if ->available(), and prints to Serial.
When I add serialEvent#()'s, the data will be read from there.

Initial results before adding any serialEvent#()'s:
Code:
Sketch uses 38032 bytes (0%) of program storage space. Maximum is 8126464 bytes.
Global variables use 49844 bytes (9%) of dynamic memory, leaving 474444 bytes for local variables. Maximum is 524288 bytes.


Code:
T:\tCode\Serial\SerialEventsTest\SerialEventsTest.ino May 22 2020 16:55:20
Teensy 4.1 :: CORES event PR
TD 1.52 :: PRIVATE  yield() :: setup Test:

loop's per sec = 9818908
loop's per sec = 14633190
loop's per sec = 14633187
0] 0	0] 0	0] 0	
1] 1	1] 1	1] 1	
2] 2	2] 2	2] 2	
3] 3	3] 3	3] 3	
4] 4	4] 4	4] 4	
5] 5	5] 5	5] 5	
6] 6	6] 6	6] 6	
7] 7	7] 7	7] 7

And NO PRIVATE yield() :: Sketch uses 38208 bytes (0%) of program storage space. Maximum is 8126464 bytes.
T:\tCode\Serial\SerialEventsTest\SerialEventsTest.ino May 22 2020 16:56:31

Teensy 4.1 :: CORES event PR

TD 1.52 setup Test:

loop's per sec = 8808613
loop's per sec = 13635471
loop's per sec = 13635468
0] 0 0] 0 0] 0
 
Code behaving and working as expected!!!

It is best not to use serialEvent() processing! But this improves the situation.

Code below for this and prior post. Adding a single :: void serialEvent1() { readMe( 0 ); }
Sketch uses 38336 bytes (0%) of program storage space. Maximum is 8126464 bytes.
Global variables use 49844 bytes (9%) of dynamic memory, leaving 474444 bytes for local variables. Maximum is 524288 bytes.

loop() count drops:
Code:
T:\tCode\Serial\SerialEventsTest\SerialEventsTest.ino May 22 2020 17:08:28

Teensy 4.1 :: CORES event PR

TD 1.52 setup Test:

loop's per sec = 5241743
0>> 0	
loop's per sec = 7894202
0>> 0	
loop's per sec = 7894202
1] 1	1] 1	1] 1	
2] 2	2] 2	2] 2	
3] 3	3] 3	3] 3	
4] 4	4] 4	4] 4	
5] 5	5] 5	5] 5	
6] 6	6] 6	6] 6	
7] 7	7] 7	7] 7	
0>> 0	
loop's per sec = 7893886
0>> 0	
loop's per sec = 7894202
0>> 0	
loop's per sec = 7894202
1] 1	1] 1	1] 1	
2] 2	2] 2	2] 2	
3] 3	3] 3	3] 3	
4] 4	4] 4	4] 4	
5] 5	5] 5	5] 5	
6] 6	6] 6	6] 6	
7] 7	7] 7	7] 7

All eight serialEvent():
Sketch uses 38336 bytes (0%) of program storage space. Maximum is 8126464 bytes.
Global variables use 49844 bytes (9%) of dynamic memory, leaving 474444 bytes for local variables. Maximum is 524288 bytes.

Code:
T:\tCode\Serial\SerialEventsTest\SerialEventsTest.ino May 22 2020 17:14:27

Teensy 4.1 :: CORES event PR

TD 1.52 setup Test:

loop's per sec = 2077276
0>> 0	1>> 1	2>> 2	3>> 3	4>> 4	5>> 5	6>> 6	7>> 7	loop's per sec = 3225545
0>> 0	1>> 1	2>> 2	3>> 3	4>> 4	5>> 5	6>> 6	7>> 7	loop's per sec = 3225544
0>> 0	1>> 1	2>> 2	3>> 3	4>> 4	5>> 5	6>> 6	7>> 7	loop's per sec = 3225544

SAME IDE 1.8.12 with release TD 1.52 :: USING serialEvent():
Sketch uses 38848 bytes (0%) of program storage space. Maximum is 8126464 bytes.
Global variables use 49844 bytes (9%) of dynamic memory, leaving 474444 bytes for local variables. Maximum is 524288 bytes.
Code:
T:\tCode\Serial\SerialEventsTest\SerialEventsTest.ino May 22 2020 17:19:02

Teensy 4.1 :: TD 1.52 RELEASE 

TD 1.52 setup Test:

loop's per sec = 222073
0>> 0	1>> 1	2>> 2	3>> 3	4>> 4	5>> 5	6>> 6	7>> 7	loop's per sec = 1657236
0>> 0	1>> 1	2>> 2	3>> 3	4>> 4	5>> 5	6>> 6	7>> 7	loop's per sec = 1657235
0>> 0	1>> 1	2>> 2	3>> 3	4>> 4	5>> 5	6>> 6	7>> 7	loop's per sec = 1657234
0>> 0	1>> 1	2>> 2	3>> 3	4>> 4	5>> 5	6>> 6	7>> 7	loop's per sec = 1657235

SAME IDE 1.8.12 with release TD 1.52 :: NOT using serialEvent():
Code:
loop's per sec = 10525092
loop's per sec = 10525120
0] 0	0] 0	0] 0	
1] 1	1] 1	1] 1	
2] 2	2] 2	2] 2	
3] 3	3] 3	3] 3	
4] 4	4] 4	4] 4	
5] 5	5] 5	5] 5	
6] 6	6] 6	6] 6	
7] 7	7] 7	7] 7


Code:
const char szTeensy[] = "Teensy 4.1 :: CORES event PR";
#if 0
void yield() {}
const char szTest[] = "TD 1.52 :: PRIVATE  yield() :: setup Test:";
#else
const char szTest[] = "TD 1.52 setup Test:";
#endif
HardwareSerial *pSer[8] = { &Serial1, &Serial2, &Serial3, &Serial4, &Serial5, &Serial6, &Serial7, &Serial8 };
#define DO_SE1  1
#if 1 // MOVE this to determine which are declared 
#define DO_SE2  1
#define DO_SE3  1
#define DO_SE4  1
#define DO_SE5  1
#define DO_SE6  1
#define DO_SE7  1
#define DO_SE8  1
#endif

elapsedMillis serWait;
elapsedMillis serWaitPrt;
elapsedMillis loopTime;
uint32_t loopCnt = 0;
void loop() {
  if ( loopTime >= 1000 ) {
    loopTime -= 1000;
    Serial.printf("loop's per sec = %lu\n", loopCnt);
    loopCnt = 0;
  }
  loopCnt++;
  if ( serWait >= 1000 ) {
    serWait = 0;
    for ( int ii = 0; ii < 8; ii++ ) {
      pSer[ii]->print( ii );
    }
    if ( serWaitPrt > 2500 ) {
      serWaitPrt = 0;
      for ( int ii = 0; ii < 8; ii++ ) {
        if ( pSer[ii]->available() ) {
          while ( pSer[ii]->available() ) {
            Serial.printf( "%u] %c\t", ii, pSer[ii]->read() );
          }
          Serial.print( "\n");
        }
      }
    }
  }
}

void setup() {
  // put your setup code here, to run once:
  for ( int ii = 0; ii < 8; ii++ ) {
    pSer[ii]->begin( 2000000 );
  }
  while (!Serial) ; // wait
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  Serial.println(szTeensy);
  Serial.println(szTest);
  loopCnt = 0;
}


void readMe( int ii ) {
  if ( pSer[ii]->available() ) {
    while ( pSer[ii]->available() ) {
      Serial.printf( "%u>> %c\t", ii, pSer[ii]->read() );
    }
    //Serial.print( "\n");
  }
}

#if DO_SE1
void serialEvent1() {
  readMe( 0 );
}
#endif
#if DO_SE2
void serialEvent2() {
  readMe( 1 );
}
#endif
#if DO_SE3
void serialEvent3() {
  readMe( 2 );
}
#endif
#if DO_SE4
void serialEvent4() {
  readMe( 3 );
}
#endif
#if DO_SE5
void serialEvent5() {
  readMe( 4 );
}
#endif
#if DO_SE6
void serialEvent6() {
  readMe( 5 );
}
#endif
#if DO_SE7
void serialEvent7() {
  readMe( 6 );
}
#endif
#if DO_SE8
void serialEvent8() {
  readMe( 7 );
}
#endif
 
Works as well on T_3.6!
Moved 5 jumpers to the T_3.6 topside pins with minor edits for array declare and #define of number of ports to loop.

Results below when using Events - look to be worse.

PR code last pulled works at BETTER loop speed when no Event()'s declared.
Code:
Teensy :: CORES event PR

TD 1.52 setup Test:

loop's per sec = 1083273
loop's per sec = 2043918
loop's per sec = 2043943
0] 0	0] 0	0] 0	
1] 1	1] 1	1] 1	
2] 2	2] 2	2] 2	
3] 3	3] 3	3] 3	
4] 4	4] 4	4] 4

STOCK TD 1.52 slower - no events::
Code:
Teensy :: TD 1.52

TD 1.52 setup Test:

loop's per sec = 301454
loop's per sec = 716458
loop's per sec = 716465
0] 0	0] 0	0] 0	
1] 1	1] 1	1] 1	
2] 2	2] 2	2] 2	
3] 3	3] 3	3] 3	
4] 4	4] 4	4] 4	
loop's per sec = 868641

Though with 5 serialEvents() TD 1.52 Release::
Code:
loop's per sec = 390058
0>> 0	1>> 1	2>> 2	3>> 3	4>> 4	loop's per sec = 868665
0>> 0	1>> 1	2>> 2	3>> 3	4>> 4	loop's per sec = 868709
0] 0	
1] 1	
2] 2	
3] 3	
4] 4	
loop's per sec = 868709

And with the edited PR code there is a speed loss:
Code:
Teensy :: CORES event PR

TD 1.52 setup Test:

loop's per sec = 295075
0>> 0	1>> 1	2>> 2	3>> 3	4>> 4	loop's per sec = 615973
0>> 0	1>> 1	2>> 2	3>> 3	4>> 4	loop's per sec = 615992
0] 0	
1] 1	
2] 2	
3] 3	
4] 4	
loop's per sec = 615922

PR Code A bit better when only Serial1 and 2 Event() code in use:
Code:
Teensy :: CORES event PR

TD 1.52 setup Test:

loop's per sec = 328881
0>> 0	1>> 1	loop's per sec = 944055
0>> 0	1>> 1	loop's per sec = 944138
0] 0	
1] 1	
2] 2	2] 2	2] 2	
3] 3	3] 3	3] 3	
4] 4	4] 4	4] 4

And the TD 1.52 gets marginally worse than when using all five::
Code:
Teensy :: TD 1.52

TD 1.52 setup Test:

loop's per sec = 397889
0>> 0	1>> 1	loop's per sec = 768488
0>> 0	1>> 1	loop's per sec = 768498
0] 0	
1] 1	
2] 2	2] 2	2] 2	
3] 3	3] 3	3] 3	
4] 4	4] 4	4] 4	
loop's per sec = 868670

and PRIVATE local void yield() still much better:
Code:
Teensy :: CORES event PR

TD 1.52 :: PRIVATE  yield() :: setup Test:

loop's per sec = 1105308
loop's per sec = 2901101
loop's per sec = 2901124
0] 0	0] 0	0] 0	
1] 1	1] 1	1] 1	
2] 2	2] 2	2] 2	
3] 3	3] 3	3] 3	
4] 4	4] 4	4] 4

Trivial code changes to handle just 5 ports:
Code:
const char szTeensy[] = "Teensy :: CORES event PR";
//const char szTeensy[] = "Teensy :: TD 1.52";
#if 0
void yield() {}
const char szTest[] = "TD 1.52 :: PRIVATE  yield() :: setup Test:";
#else
const char szTest[] = "TD 1.52 setup Test:";
#endif
//HardwareSerial *pSer[8] = { &Serial1, &Serial2, &Serial3, &Serial4, &Serial5, &Serial6, &Serial7, &Serial8 };
//HardwareSerial *pSer[] = { &Serial1, &Serial2, &Serial3, &Serial4, &Serial5, &Serial6, &Serial7, &Serial8 };

// T_3.6 // 
HardwareSerial *pSer[] = { &Serial1, &Serial2, &Serial3, &Serial4, &Serial5 };

#define NUM_SER_LOOP 5
#define DO_SE2  1
#define DO_SE1  1
#define DO_SE3  1
#define DO_SE4  1
#define DO_SE5  1
#if 0 // MOVE this to determine which are declared 
#define DO_SE6  1
#define DO_SE7  1
#define DO_SE8  1
#endif

elapsedMillis serWait;
elapsedMillis serWaitPrt;
elapsedMillis loopTime;
uint32_t loopCnt = 0;
void loop() {
  if ( loopTime >= 1000 ) {
    loopTime -= 1000;
    Serial.printf("loop's per sec = %lu\n", loopCnt);
    loopCnt = 0;
  }
  loopCnt++;
  if ( serWait >= 1000 ) {
    serWait = 0;
    for ( int ii = 0; ii < NUM_SER_LOOP; ii++ ) {
      pSer[ii]->print( ii );
    }
    if ( serWaitPrt > 2500 ) {
      serWaitPrt = 0;
      for ( int ii = 0; ii < NUM_SER_LOOP; ii++ ) {
        if ( pSer[ii]->available() ) {
          while ( pSer[ii]->available() ) {
            Serial.printf( "%u] %c\t", ii, pSer[ii]->read() );
          }
          Serial.print( "\n");
        }
      }
    }
  }
}

void setup() {
  // put your setup code here, to run once:
  for ( int ii = 0; ii < NUM_SER_LOOP; ii++ ) {
    pSer[ii]->begin( 2000000 );
  }
  while (!Serial) ; // wait
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  Serial.println(szTeensy);
  Serial.println(szTest);
  loopCnt = 0;
}


void readMe( int ii ) {
  if ( pSer[ii]->available() ) {
    while ( pSer[ii]->available() ) {
      Serial.printf( "%u>> %c\t", ii, pSer[ii]->read() );
    }
    //Serial.print( "\n");
  }
}

#if DO_SE1
void serialEvent1() {
  readMe( 0 );
}
#endif
#if DO_SE2
void serialEvent2() {
  readMe( 1 );
}
#endif
#if DO_SE3
void serialEvent3() {
  readMe( 2 );
}
#endif
#if DO_SE4
void serialEvent4() {
  readMe( 3 );
}
#endif
#if DO_SE5
void serialEvent5() {
  readMe( 4 );
}
#endif
#if DO_SE6
void serialEvent6() {
  readMe( 5 );
}
#endif
#if DO_SE7
void serialEvent7() {
  readMe( 6 );
}
#endif
#if DO_SE8
void serialEvent8() {
  readMe( 7 );
}
#endif
 
Thanks @defragster - I can imagine that using serialEventX on T3.x could now be slightly slower, which I could reduce or eliminate, with trade-offs.
That is, I could have the yield code go back to directly calling code like:

Code:
if (Serial1.available()) serialEvent1();
if (Serial2.available()) serialEvent2();
...
The trade-off is that this will, as before, pull in the code and data for all serial objects.

There may be a few shortcuts we can take. Again, I am not sure how far to take this. For example, we could maybe track whether any Serial object has any data in it and only continue in that case, or ...
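
To make the "know if any Serial object has any data" shortcut concrete, here is a minimal host-side sketch. It is not Teensy core code: every name in it (g_rx_pending, FakePort, checkSerialEvents) is hypothetical, and the RX interrupt is simulated by an ordinary function call. The idea is one shared bitmask, so the yield path can bail out after a single load and compare when nothing has arrived.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: one bit per serial port, set while that port has unread RX data.
// On real hardware the RX interrupt would write this, hence volatile.
static volatile uint32_t g_rx_pending = 0;

struct FakePort {
    uint8_t bit;        // bit position of this port in g_rx_pending
    int unread = 0;     // stand-in for the head/tail ring buffer state
    int events = 0;     // how many times the event handler ran

    explicit FakePort(uint8_t b) : bit(b) { assert(b < 32); }

    // Stand-in for the RX interrupt storing one byte.
    void simulateRx() {
        ++unread;
        g_rx_pending = g_rx_pending | (1u << bit);
    }

    // Stand-in for read(): clears the pending bit once the buffer is empty.
    // Real code would need interrupts masked around this read-modify-write.
    int read() {
        if (unread == 0) return -1;
        if (--unread == 0) g_rx_pending = g_rx_pending & ~(1u << bit);
        return 'x';
    }

    // Stand-in for a user serialEventN() that drains the port.
    void serialEvent() {
        ++events;
        while (read() >= 0) {}
    }
};

// Returns how many ports were examined, so the fast path is observable.
int checkSerialEvents(FakePort* ports, int count) {
    if (g_rx_pending == 0) return 0;   // fast path: nothing pending anywhere
    int examined = 0;
    for (int i = 0; i < count; ++i) {
        ++examined;
        if (g_rx_pending & (1u << ports[i].bit)) ports[i].serialEvent();
    }
    return examined;
}
```

The same trade-off mentioned above still applies: whatever sets the bits has to live in code that is linked in for each port actually used.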
 
Would it be too much to have each SerialX Rx code set a flag?

And a way to put the same flag in the weak code, where it would obviously always be false.

When the real code gets linked in on use, the flag goes true on an Rx event. It is safe to test without bringing in the 'real code and buffer alloc'. When true, it says the SerialX object has data.

It may seem like this could add extra code on every Rx byte, but even 200K dataSerX=true writes {at 2M baud} should be less overhead than 50K * {1-5} NON_STOP calls to SerialX.available(), which calculates a number that doesn't matter here:
Code:
int HardwareSerial::available(void)
{
	uint32_t head, tail;

	head = rx_buffer_head_;
	tail = rx_buffer_tail_;
	if (head >= tail) return head - tail;
	return rx_buffer_total_size_ + head - tail;
}

A quick glance shows the flag would be set in:
Code:
void HardwareSerial::IRQHandler() 
// ...
			} while (--avail > 0) ;
			rx_buffer_head_ = head;
			dataSerX=true;	// proposed addition

Then of course when the buffer is read down to empty (head==tail), set dataSerX=false:
Code:
int HardwareSerial::read(void)
{
	uint32_t head, tail;
	int c;

	head = rx_buffer_head_;
	tail = rx_buffer_tail_;
	if (head == tail) return -1;
	if (++tail >= rx_buffer_total_size_) tail = 0;
	if (tail < rx_buffer_size_) {
		c = rx_buffer_[tail];
	} else {
		c = rx_buffer_storage_[tail-rx_buffer_size_];
	}
	rx_buffer_tail_ = tail;
	if (rts_pin_baseReg_) {
		uint32_t avail;
		if (head >= tail) avail = head - tail;
		else avail = rx_buffer_total_size_ + head - tail;

		if (avail <= rts_low_watermark_) rts_assert();
	}
	if (head == tail) dataSerX=false;	// proposed addition
	return c;
}

Not sure that is everything/everywhere, but hopefully the idea is explained ... apart from a good variable name and where to add it.
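
Pulled together, the flag idea looks roughly like this. This is a host-side model only, not the Teensy core: the class RxRing, the member has_rx_data_ and the fixed 8-byte buffer are hypothetical, and isrPush() stands in for the tail end of the IRQ handler. It assumes the same "advance index, then access" ring convention as the read() quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a receive ring buffer carrying a "has data" flag.
class RxRing {
public:
    // Stand-in for the RX interrupt storing one byte. Returns false if full.
    bool isrPush(uint8_t c) {
        uint32_t head = head_ + 1;
        if (head >= kSize) head = 0;
        if (head == tail_) return false;    // buffer full: drop the byte
        buf_[head] = c;
        head_ = head;
        has_rx_data_ = true;                // the proposed flag
        return true;
    }

    // Same shape as read() above, plus clearing the flag when drained.
    // Caveat: on real hardware an interrupt could land between updating
    // tail_ and clearing the flag, so this would need interrupts masked
    // (or a re-check of head_) to avoid hiding a freshly received byte.
    int read() {
        uint32_t head = head_, tail = tail_;
        if (head == tail) return -1;
        if (++tail >= kSize) tail = 0;
        assert(tail < kSize);               // index stays in range
        int c = buf_[tail];
        tail_ = tail;
        if (head == tail) has_rx_data_ = false;
        return c;
    }

    int available() const {
        uint32_t head = head_, tail = tail_;
        if (head >= tail) return static_cast<int>(head - tail);
        return static_cast<int>(kSize + head - tail);
    }

    // Cheap test the yield code could use instead of available().
    bool hasRxData() const { return has_rx_data_; }

private:
    static constexpr uint32_t kSize = 8;
    uint8_t buf_[kSize] = {};
    volatile uint32_t head_ = 0;
    volatile uint32_t tail_ = 0;
    volatile bool has_rx_data_ = false;
};
```

Whether reading one bool is measurably cheaper than the two loads and a subtract in available() is exactly the open question here.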
 
Would it be too much to have each SerialX Rx code set a flag?

...

Tried the post #20 idea, and somehow it runs at the same speed. Replacing the if ( available() ) test with a check of rx_some_ (set on receive) works, but gives the same reduced loop()/sec:
Code:
	inline void doYieldCode()  {
		if (rx_some_) (*hardware->_serialEvent)();
	}
 
Thanks @defragster

Earlier I tried a few different approaches like that and many of them did not help much.

Other approaches I have thought about, but have not tried include:

Have each Serial object, when it receives a character and puts it into the software queue, remember something like the 32-bit microseconds value... We keep two 32-bit values: the one just mentioned, and a copy of what that first one was at the start of the last pass through the list in which nothing was available...
Something like:

Code:
uint32_t millis_last_empty = 0;
volatile uint32_t millis_last_data = 0;

... 

void check_hardware_serials() {
   if (millis_last_empty == millis_last_data ) return; // Nothing last time, nothing new
   bool event_called = false;
   uint32_t millis_last_data_start = millis_last_data;
   if (Serial1.available()) { event_called = true; serialEvent1(); }
   if (Serial2.available()) { event_called = true; serialEvent2(); }
...
   if (!event_called) millis_last_empty = millis_last_data_start;  // use the start time not current as something may have come in since
}
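
A host-side version of that gating logic that can actually be compiled and stepped through might look like the following. All names are hypothetical, and it is a sketch under assumptions rather than core code. One deliberate substitution: it uses a counter bumped on every received byte instead of a timestamp, so two bytes arriving within the same clock tick cannot look like "nothing new".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; the RX interrupt is simulated by rxByte().
static volatile uint32_t rx_seq = 0;   // bumped on every received byte
static uint32_t seq_last_empty = 0;    // rx_seq snapshot from the last scan that found nothing

struct SimPort {
    int unread = 0;   // stand-in for the ring buffer fill level
    int events = 0;   // number of times the event handler ran
    void rxByte() { ++unread; rx_seq = rx_seq + 1; }
    int available() const { return unread; }
    void serialEvent() { ++events; unread = 0; }   // handler drains the port
};

// Returns true if the ports were scanned, false if the scan was skipped.
bool checkHardwareSerials(SimPort* ports, int count) {
    assert(ports != nullptr || count == 0);
    if (seq_last_empty == rx_seq) return false;   // nothing last time, nothing new
    const uint32_t seq_at_start = rx_seq;         // snapshot before scanning
    bool event_called = false;
    for (int i = 0; i < count; ++i) {
        if (ports[i].available()) {
            event_called = true;
            ports[i].serialEvent();
        }
    }
    // Record the snapshot, not the current value: a byte may arrive mid-scan.
    if (!event_called) seq_last_empty = seq_at_start;
    return true;
}
```

Note that after each burst there is one extra full scan before the skip path kicks in, since the "empty" snapshot is only recorded on a scan that calls no events.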

Another approach I was thinking about: I currently have that call as part of HardwareSerial, which calls available() on the object to see if it has data and then calls the saved event...
Right now in T3.x all methods go through virtual functions, which then call off to an individual function for each serial class...
I thought about getting the T3.x HardwareSerial code somewhat closer to the T4, where there is a root class... I had a full implementation earlier, which is where the T4.x code started from.
But I could go halfway there... That is, maybe create a structure which has things like the head and tail pointers, and then have the HardwareSerial object hold a pointer to that data... The difficulty is that each one may be a different size, as the code in each of these source files makes rx_buffer_head, and likewise tail, 1 byte or 2 bytes or 4 bytes in length depending on the value of SERIAL1_RX_BUFFER_SIZE ... So I punted on this approach.
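
To illustrate why the differing index widths get in the way of a shared head/tail structure, here is a small standalone sketch. The names RxIndex and RxState are hypothetical, and the 255 / 65535 thresholds are an assumption chosen to give the 1, 2 or 4 byte choice described above. It is written with templates for brevity, where the per-port source files would presumably use preprocessor conditionals.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Pick the narrowest unsigned type for indexing a buffer of N bytes
// (thresholds are an assumption, see above).
template <std::size_t N>
using RxIndex = typename std::conditional<
    (N > 65535), uint32_t,
    typename std::conditional<(N > 255), uint16_t, uint8_t>::type>::type;

// Hypothetical per-port state block. Because RxIndex<N> differs per port,
// RxState<64> and RxState<1024> are unrelated types with different layouts,
// so one plain "pointer to head/tail struct" in a shared base class cannot
// point at all of them without widening the indexes or adding indirection.
template <std::size_t N>
struct RxState {
    volatile RxIndex<N> head = 0;
    volatile RxIndex<N> tail = 0;
    uint8_t buffer[N] = {};

    std::size_t available() const {
        std::size_t h = head, t = tail;
        assert(h < N && t < N);   // indexes must stay inside the buffer
        return (h >= t) ? (h - t) : (N + h - t);
    }
};
```

Widening every index to 32 bits would make the layout uniform, at the cost of a few bytes of RAM per port.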
 
It seemed odd that "if (rx_some_) (*hardware->_serialEvent)();" had the same overhead as "if ( available() )".

When all SerEvents are gone it was showing 'approx' just over 14M loops/sec instead of just under 15M - but enabling even one cuts it in half - enabling them all cuts that in half again.
 
I have been using serialEvent() with Teensy 4.x and it works great. Now, as I needed 5V IO, I am using a Teensy 3.2. With it, serialEvent() works, but serialEvent1() does not?

This works:

Code:
char cmd;
char response;

void setup() {
  Serial.begin(115200);
  Serial1.begin(9600, SERIAL_8N2);

}

void loop() {
  
  if (Serial1.available() > 0) { serialEvent1(); }

}

void serialEvent1() {
  if (Serial1.available() > 0) {
    response = Serial1.read();
    Serial.print(response);
    if (response < 0x20) Serial.println();
  }
}

void serialEvent() {
  if (Serial.available() > 0) {
    cmd = Serial.read();
    Serial1.write(cmd);
  }
}

But it does not work without the if (Serial1.available() > 0) { serialEvent1(); } line in loop()?
 