defragster
Senior Member+
Really glad you did. It's now on my list to fix for 1.55-beta3.
I will send you another 4.0 lockable beta board, and a lockable 4.1 early next week.
Nice! Glad I ran headlong into that to get it taken care of.
Some notes on this (from the Beta 1 thread):
Cool that you looked at and ran it! ... Code4Code.ino
I can give it a look. As posted it indeed runs forever - doing the same thing - and currently it starts the _isr() at a high rate, slowing the loop() test tremendously.
Primary goal was {shotgun spec} to stuff a lot of code and reference data into FLASH and see that it:
A: Worked at all - some repeated task that worked out a known result of 60 Pi digits.
B: Could deal with odd ptr[] boundary reads, sizes, and directions in the added _isr().
C: Dealt with the Code Cache being a bit too small for the ~100 KB of code.
D: Dealt with the Data Cache being a bit small for 4,000 copies of const szPi[64] == 256 KB.
E: Then, monitoring cycles used, gave some measure of the run rate.
F: Running forever meant it could show changing Temp.
G: Ran the 4,000 cascading func####()'s - or called the terminal Func0() 4,000 times.
I finished the spec'd task list and all seemed well; the first indication was ~5% overhead ...
Given the caveats [A-G], there are not many moving parts in that test code to measure, and they overlap.
The spewed data was only there to prove the code functioned while blowing out the 32KB cache - which meant easy-to-produce (repetitive) code that ended up FAT, and computation-heavy from making Pi digits. Taking that out leaves some 'net' time in the cascading call chain where the 'code' {encrypted or not} needs to be loaded into cache to run.
That, plus some teeth/hair pulling, suggests that the extreme case without the _isr() active is 9.6% faster:
Code:
>> No _isr Casc Net Direct Net Mult Factor [B]Rel Diff[/B]
10379960 nor ns: @600 29019 1969 14.73793804 [B]1.096993871[/B]
10379970 ENC SM: @600 31866 1971 16.1674277 0.910836813
6052840 nor ns: @600 29010 1970 14.72588832 [B]1.097891505[/B]
With the BIG _isr overhead added, the difference largely goes away {so not cleaning the tab formatting} - at least with the treatment applied.
Which makes me wonder how 9% faster and 6% faster get lost in the mix, going toward zero rather than at least adding together?
But the _isr() char[] compares really are a separate test - see below:
Code:
>> With _isr Casc Net Direct Net Mult Factor Rel Diff
10379960 nor ns: @600 35276 3617 9.75283384 0.102534301
10379970 ENC SM: @600 38537 4044 9.529426311 0.104938111
6052840 nor ns: @600 34608 3529 9.80674412 0.101970643
But looking at the time in the _isr alone, it shows a 5.8 to 6% improvement - running the _isr() at 3 rates.
Did the math for the two .hex builds versus the .eHex and they gave these numbers - so here is one set:
Code:
_isr Only CPU % @50 us @100 us @200 us
10379960 nor ns: @600 0.377891 0.192134 0.092131
10379970 ENC SM: @600 0.400139 0.203387 0.097733
[B]1.05887412 1.058568499 1.060804724[/B]
If I'm done tweaking, I could CODE that Excel work with fixed 'ref numbers' and spit out the 'Expected %' as a benchmark.
Here is the RAW data for the above:
Code:
Not enabled _isr() char[] Testing.
Cascading 4000 calls took 5131274 us [3061353591 piCycles] : net 29019 us
Direct calls took 5111882 us [3065947966 piCycles] : net 1969 us
ENABLED _isr() char[] Test @50 us _isr Cycles 226734696 of 600000060 : CPU =0.377891
ENABLED _isr() char[] Test @100 us _isr Cycles 115280639 of 600000058 : CPU =0.192134
ENABLED _isr() char[] Test @200 us _isr Cycles 55278436 of 600000059 : CPU =0.092131
Cascading 4000 calls took 5692838 us [5657562 piCycles] : net {less Pi} 35276 us
_isr Cycles 325468326 of 3415703076 : CPU =0.095286
Cascading 4000 calls took 5692838 us : net {less isr} 3394537780 us
Cascading 4000 calls took 5692838 us : Cycles/call 1423
Direct calls took 5673276 us [5669659 piCycles] : net {less Pi} 3617 us
_isr Cycles 326463030 of 3403965837 : CPU =0.095907
Direct calls took 5673276 us [3401795860 piCycles] : net {less isr} 5129171 us
Direct calls took 5673276 us [3401795860 piCycles] : Cycles/call 1418
Code:
Not enabled _isr() char[] Testing.
Cascading 4000 calls took 5135154 us [3061973078 piCycles] : net 31866 us
Direct calls took 5111885 us [3065948739 piCycles] : net 1971 us
ENABLED _isr() char[] Test @50 us _isr Cycles 240083719 of 600000060 : CPU =0.400139
ENABLED _isr() char[] Test @100 us _isr Cycles 122032171 of 600000057 : CPU =0.203387
ENABLED _isr() char[] Test @200 us _isr Cycles 58639895 of 600000057 : CPU =0.097733
Cascading 4000 calls took 5733775 us [5695238 piCycles] : net {less Pi} 38537 us
_isr Cycles 346975911 of 3440265018 : CPU =0.100857
Cascading 4000 calls took 5733775 us : net {less isr} 3417142899 us
Cascading 4000 calls took 5733775 us : Cycles/call 1433
Direct calls took 5710594 us [5706550 piCycles] : net {less Pi} 4044 us
_isr Cycles 347544584 of 3426355847 : CPU =0.101433
Direct calls took 5710594 us [3423930197 piCycles] : net {less isr} 5131354 us
Direct calls took 5710594 us [3423930197 piCycles] : Cycles/call 1427
Code:
Not enabled _isr() char[] Testing.
Cascading 4000 calls took 5131267 us [3061344199 piCycles] : net 29027 us
Direct calls took 5111882 us [3065948170 piCycles] : net 1969 us
ENABLED _isr() char[] Test @50 us _isr Cycles 226738784 of 600000060 : CPU =0.377898
ENABLED _isr() char[] Test @100 us _isr Cycles 115292902 of 600000058 : CPU =0.192155
ENABLED _isr() char[] Test @200 us _isr Cycles 55287032 of 600000060 : CPU =0.092145
Cascading 4000 calls took 5692739 us [5658131 piCycles] : net {less Pi} 34608 us
_isr Cycles 325396831 of 3415643876 : CPU =0.095267
Cascading 4000 calls took 5692739 us : net {less isr} 3394878992 us
Cascading 4000 calls took 5692739 us : Cycles/call 1423
Direct calls took 5673254 us [5669725 piCycles] : net {less Pi} 3529 us
_isr Cycles 326391744 of 3403952525 : CPU =0.095886
Direct calls took 5673254 us [3401835257 piCycles] : net {less isr} 5129268 us
Direct calls took 5673254 us [3401835257 piCycles] : Cycles/call 1418
"Revised Spew" code updated at Defragster/T4LockBeta/tree/main/Code4Code
Added a 4-deep cascade as .txt and a similar copy of the 4,000-deep cascade - I used the 4-deep version as "CodeMade.ino" to reduce the number of 3-minute builds while the output was tweaked.
GOOD NEWS: Lots of 'Verify' builds and TyComm SerMon GUI 'Bootloader' uploads with Teensy Loader and No Issues - pushing to THREE T_4.0's: one Locked, one Beta unlockable, one production. Except for the noted 12-second versus 6-second programming time, the Unlocked/Not Secure Beta produced the same times as the production T4.
One more note: it takes just over 9 ms to get 800 digits of Pi with the code I found. I settled on 60 digits, which takes about 1 ms. So as written it takes more time to run the code than to load it. The Pi code could be refined/replaced some, and settling for 12 digits would cascade faster, now that I know it works - which was Job #1. That might give a clearer perf picture with less finagling.
> Also I see I don't know how to printf a '%' :: printf("%%");