Teensy 4.0 First Beta Test

Status
Not open for further replies.
@mjs513 - the link failed: "Re: Benchmark STM32 vs ATMega328 (na...6 (teensy 3.2)" has no URL behind it.

Thanks Tim - not sure how I did that; looks like a copy-paste error. Anyway, the link has been updated, and here it is again: https://forum.arduino.cc/index.php?topic=431169.0

I'd be leery of the numbers at this point - the data is about 3 years old, a lot of changes in the compiler have occurred, and there are a lot of data elements to compare.
 
FWIW, running the current code on the same Teensys would give a feel for the adjustments needed? It might also be a cool indicator of the effects of the changes.

I noticed in the case of Telemetry - it boiled the Serial.print calls down to a single string print using snprintf() - which I saw make a huge difference on the T4.

Didn't find any old TelemView config files. They have a 'test' which would be a perfect base - and they give Arduino code for it - but I haven't figured out the graph drawing or gotten a saved config yet. Opened the YouTube video to watch the UPDATED TelemetryViewer_v0.5.jar version - but watched some WWII movies instead :)
 
@PaulStoffregen and others

While poking around STM32duino looking for stuff on CANFD I came across a couple of benchmarking sketches: Dhrystone, Whetstone double precision and Whetstone single precision. I copied them over and ran them on the T4B2; they compiled and ran with no issues. Just not sure what they all may mean - never saw them before except for PCs. I am attaching them for reference, but I would like to get a reading (Paul) on the validity of using them with the Teensies.

View attachment 16552

I also found this post on the Arduino Forum: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2). It also has data for the T3.5.

dhrystone has been discussed several times on teensy forum, but coremark is probably a better replacement. @MichaelMeissner has chatted about whetstone
https://forum.pjrc.com/threads/34808-K66-Beta-Test?p=119119&viewfull=1#post119119
https://forum.pjrc.com/threads/55481-Generated-Code-of-teensy3-6?p=201032&viewfull=1#post201032
thanks for porting whetstone to a teensy sketch. somewhere I have the old FORTRAN code ...

I provided the Teensy 3* benchmark data to the arduino forum you mention. I have the T4 data, but due to non-disclosure did not provide that info to that arduino thread. :D

I have some T4 floating point performance data at https://forum.pjrc.com/threads/54711-Teensy-4-0-First-Beta-Test?p=194187&viewfull=1#post194187
I added whetstone numbers.
 
@defragster - here's something to get you started - one of your old LPS sketches modified for use with Telemetry Viewer, and a config file to plot the data in Telemetry Viewer. Enjoy.
 

Attachments

  • lps_test_telem.zip (1.3 KB)

Thanks for the additional links. To be honest, I never looked too closely at who posted; it was late and this was pretty much the last thing I did before I hit the sack - I was going to look some more. I started reading about CoreMark vs Dhrystone this morning - Wikipedia has a nice list of issues with Dhrystone. Also, like you said, it seems CoreMark has taken over as the benchmark instead of Dhrystone :)

I saw you posted the numbers :)

PS>> can't take credit for porting the benchmarks - as I said, I was looking for something else and found them already ported.
 
@manitou
Just for fun I ran the Whetstone for the Due and the STM32H743:
Code:
Due (?)
C Converted Single Precision Whetstones: 7.80 MIPS
C Converted Double Precision Whetstones: 5.40 MIPS

STM32H743 (480 MHz?)
C Converted Single Precision Whetstones: 471.70 MIPS
C Converted Double Precision Whetstones: 67.16 MIPS
 
TelemetryViewer: Telemetry Viewer v0.5 Changelog (2018-08-20)

Made a sketch that works - need to integrate the lines/sec count out a Serial# port - claims to be running 100,000/sec updates with test.txt below and then update the Sample Rate.

Here is the TelemetryViewer file in use:: View attachment RealTest.txt

A bit fast to read with only a 10us delay on writes. Bumped the program to expect a '500000' Sample Rate (Hz) and put this as the last code line in the sketch :: delayMicroseconds(2); instead of (10)
TyComm and SerMon freeze after maybe 190 updates based on 'Wave c' value:
0,11,189,0.1564344615
0,11,190,0.17364

With 2us delay - Java JAR TelemetryViewer has run WAY more than a few cycles and is looking good.
Oops - lost the working short code that looked like this - it went bad when I changed to the following - now this is a mess too? I may need to reboot?
Code:
void setup() {
  Serial.begin(9600);
}

uint32_t count=0;
void loop() {
  count++;
  uint32_t waveform_a = ((count&0x3ff00) >>8);
  uint32_t waveform_b = ((count&0x3ffF0) >>4);
  uint32_t waveform_c = ((count&0x3ff));
  float sine_wave_1khz = sin(radians( count%90 ));

  char sine_wave_1khz_text[30];
  dtostrf(sine_wave_1khz, 10, 10, sine_wave_1khz_text);

  char text[124];
  snprintf(text, 124, "%d,%d,%d,%s", (int)waveform_a, (int)waveform_b, (int)waveform_c, sine_wave_1khz_text);
  Serial.println(text);
  delayMicroseconds(10);
}

First pass counting lps:
Code:
uint32_t prior_count;
uint32_t prior_msec;
uint32_t count_per_second;
uint32_t count = 0;
uint32_t msec;
void setup() {
  prior_count = count;
  count_per_second = 0;
  prior_msec = millis();
  msec = micros()/1000;
  Serial.begin(9600);
  Serial2.begin(2000000);
  Serial2.println(count_per_second);
  delay(1000);
}

void loop() {
  count++;
  uint32_t waveform_a = ((count & 0x3ff00) >> 8);
  uint32_t waveform_b = ((count & 0x3ffF0) >> 4);
  uint32_t waveform_c = ((count & 0x3ff));
  float sine_wave_1khz = sin(radians( count % 90 ));

  char sine_wave_1khz_text[30];

  dtostrf(sine_wave_1khz, 10, 10, sine_wave_1khz_text);

  char text[124];
  // %u expects unsigned arguments, so cast to unsigned rather than int
  snprintf(text, sizeof(text), "%u,%u,%u,%s", (unsigned)waveform_a, (unsigned)waveform_b, (unsigned)waveform_c, sine_wave_1khz_text);
  Serial.println(text);
  delayMicroseconds(10);

  msec = millis();
  if (msec - prior_msec > 1000) {
    prior_msec = prior_msec + 1000;
    count_per_second = count - prior_count;
    prior_count = count;
    Serial2.print(count_per_second);
    Serial2.println('.');
  }
}
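The once-per-second bookkeeping in that sketch can be pulled out and checked on a PC. This is a minimal sketch under assumptions: the struct name LpsCounter and its members are hypothetical, and it latches at >= 1000 ms where the posted code uses > 1000. The unsigned subtraction is what keeps it correct when the millisecond clock wraps.

```cpp
#include <cstdint>

// Hypothetical host-testable version of the lines-per-second bookkeeping.
struct LpsCounter {
  uint32_t priorCount = 0;  // line count at the last latch
  uint32_t priorMsec = 0;   // timestamp of the last latch
  uint32_t perSecond = 0;   // most recent lines-per-second value

  // Call with the current millisecond clock and the running line count.
  // Returns true when a new lines-per-second value was latched.
  bool update(uint32_t nowMsec, uint32_t count) {
    if (nowMsec - priorMsec < 1000) return false;  // unsigned math survives wraparound
    priorMsec += 1000;               // step by exactly one second to avoid drift
    perSecond = count - priorCount;  // lines sent since the previous latch
    priorCount = count;
    return true;
  }
};
```

In a sketch, update(millis(), count) would be called from loop() and perSecond printed on Serial2 whenever it returns true.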
 
@mjs513 - missed post #2730 above - sun was up so was past time to get to sleep.

Something very fragile in the balance … I can run current lps_test.exe against the TelemetryViewer output - it just reads blindly to a buffer - then syncs on '\n' to print a line segment per 'buffer'.
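As an illustration of "read blindly, then sync on '\n'", here is a minimal sketch of picking one complete line out of a raw block. It is not the lps_test.c source; the function name firstCompleteLine is hypothetical, and it assumes the sender terminates lines with "\r\n" as Serial.println() does.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the first complete line in a raw block, i.e.
// the text between the first '\n' and the next one. Anything before the
// first '\n' is treated as the tail of a line cut by the block boundary.
std::string firstCompleteLine(const char *data, std::size_t len) {
  std::size_t start = 0;
  while (start < len && data[start] != '\n') ++start;  // sync on first newline
  if (start == len) return std::string();              // no newline in this block
  ++start;                                             // step past the newline
  std::size_t end = start;
  while (end < len && data[end] != '\n') ++end;
  if (end == len) return std::string();                // line not finished in this block
  std::string line(data + start, end - start);
  if (!line.empty() && line.back() == '\r') line.pop_back();  // drop the CR of "\r\n"
  return line;
}
```

Printing only this one line per 32KB block keeps console output from becoming the bottleneck while still showing that the data is intact.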

EXE runs some - then it gets _brk_ - TelView and SerMon just stop too … Adjusting the delayMicroseconds() value up helps some - but I'm sure it was running at lower values - now the best code is dying at 33us with only 27K lps.

Have lps_test.exe die/exit after 4 _brk_ to free the Teensy. And have sketch pause in setup with a blink while(!Serial) to show it waiting, and LED blinks during send every 0x800 counts.
Putting out a toggle pin to debug T_3.5 #13 - to get a FreqCount rather than sending text - integrated into EchoBoth.

I think I have a usable Demo that should run - and does if there is a long enough delayMicroseconds(); in loop(). But that limits the update freq …

After lunch I'll put it in a Windows_CarePack.zip of some sort. Very easy repro in a matter of seconds and easy to reset … TeensyButton reload TelemViewFast.ino, Cmdline re-run lps_test.exe with T4 pin 14 to T_3.5 pin 13 running echoBoth (Serial# not used) and Sermon on T_3.5 showing FreqCount.
 
If you're running the usb_serial.c code from msg #2702, please remember it is known to have issues. That's why it isn't even on github yet.

If you edit startup.c, you can greatly lessen those issues by overclocking. I found 720 MHz makes it far more stable (but still does eventually lock up). Lowering to 450 or 396 MHz makes it far worse.

I'm continuing to look into this. Hope to have a real solution soon.
 
Understood - I haven't tried a faster speed yet. To see and test it working I put this together, as TelemetryViewer is a JAR that does pretty well with fast input. When it fails, though, the lps_test.exe I've been using gives better feedback.

https://github.com/Defragster/T4_demo/tree/master/TView

Once set up, it is easy to run and restart. It will be fun to see when it is all cleaned up. Until then it is easy to get the drop to happen.

There is a readme.txt on github for setup if interested.

Setting F_CPU_Actual to 720MHz ends about the same - at 800 MHz - it hangs on startup … prior testing seemed to work there

Test Package:
lps_test.exe : for Windows - runs to receive 100 blocks of 32KB
>> Param1 :: serial port like 'Com25'
>> Param2 :: 0 for 100000000 block loops
>> Param2 :: 1 for 1000000 block loops
>> Param2 :: 2-9 for # of 100 block loops
OUTPUT :: reads COM# block of data and shows one string per block

lps_test.c : source for precompiled EXE - could take mods for non-Windows

bld2.bat : cmdline compile for gcc

EchoBoth : Sketch for second Teensy.
>> Echoes Serial1 input at 115200 baud
>> Read FreqCount from pin13

TelemViewFast : sketch for T4 to demo Lines Per Second output
>> Serial2 connects 115200
>> pinMode(14, OUTPUT); // Toggle w/each output string for FreqCount
>> uint32_t usDelay = 10; // start us delay between transmit
>> //#define DELAY_DROP 1 // comment to keep a static delay


Usage: Wire Serial as noted above, and FreqCount pin
Put EchoBoth on a second Teensy
Put TelemViewFast on Target Teensy
From Windows cmdline run for device port:: lps_test COM25

OPTIONAL: when working the output is formatted for TelemetryViewer graphing. Download the JAR from 'http://www.farrellf.com/TelemetryViewer/' and get it running with proper JAVA runtime. 'Layout' is in file : "RealTest20.txt"
 
I've committed a (hopefully) working usb_serial.c, with fixes to the transfer scheduling and a first crude attempt at better transmit buffering.

https://github.com/PaulStoffregen/cores/commit/533a0c2c0026c7c328fb19b5b94b1a1926731cd3

Highly recommend saving any work you may have open in Arduino. This version runs faster. The lines/sec test might overwhelm the serial monitor on some computers.

Edit: this code also has a known problem where buffered data isn't automatically transmitted if you don't keep writing. I'll work on that tomorrow...
 
Awesome magic Paul!

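Took the delay out of the sketch - getting an lps frequency of 87.6 to 89 kHz with FreqCount on the T_3.5 - so about 178,000 lines per second, since pin 14 toggles once per output string and each counted cycle is therefore two lines - using lps_test.exe: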
Code:
fCnt=89799
fCnt=88009
fCnt=87766
fCnt=33997
fCnt=0
fCnt=78273
fCnt=89179
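The conversion from the FreqCount reading to lines per second is just a factor of two. A minimal sketch, assuming (per the TelemViewFast notes) that the sender flips the pin once per output string; the function name is hypothetical.

```cpp
#include <cstdint>

// Assumption: the sender toggles the pin once per line, so one full
// square-wave cycle (what a frequency counter reports) spans two lines.
constexpr uint32_t linesPerSecondFromToggleHz(uint32_t measuredHz) {
  return measuredHz * 2u;
}
```

With the readings above, 89 kHz works out to 178,000 lines per second and 87,766 Hz to 175,532.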

Lines are short:
Code:
#13944[32K] : __>> 1,16360,649,81 <<__
#13945[32K] : __>> 1,129,16,89 <<__

That '0' is when I stopped the Windows CMD line - DOS box - scroll to grab the text above.

As of now the current run has completed 33,000+ reads of 32KB buffers from the T4. I've stopped the EXE 3 times by selecting text in the DOS box
 
Added info to post #2738 with code from #2736:

Testing Sermon and TyComm alternating the T4 keeps sending messages - starting and stopping one for the other - or the GUI button in TyComm - the Sketch keeps working - unlike before this new release.

> Teensy SerMon can maintain the same speed as the EXE - until it gets messed up. It has devolved to this:
Code:
fCnt=87889
fCnt=27343
fCnt=0
fCnt=87459
fCnt=87814
fCnt=68860
fCnt=0

> TyComm quickly gets overwhelmed - and the LED flash goes from dim processing - to pulsing as the system copes with the speed fluctuating like this - where the '0' is off and on:
Code:
fCnt=17163
fCnt=24399
fCnt=26972
fCnt=8832
fCnt=0
fCnt=19187
fCnt=17665
fCnt=25228

So that looks very good and stable - AND FAST. For TyComm koromix said the display window is a text edit box - which is why it is so full featured to resize text [ctrl+mouse wheel] and cut to clipboard.

Going back to the exe that prints one line from each 32KB buffer it is back to 87-89K newlines per second based on the FreqCount Teensy bit toggle.
 
Not seeing why but my latency_test is getting 'x' into the Teensy - but nothing is coming out? Using the latest cores used in prior posts.

I am getting then sending that 'x' out Serial2 and if ('x') flashing the LED …

Paul: Then going back to this PJRC source sketch then using SerMon to send 'x' I continue to get nothing back?
Code:
void setup() {
  Serial.begin(115200);
  Serial.flush();
}

void loop() {
  if (Serial.available()) {
    byte c = Serial.read();
    if (c == 'x') {      // 'x' is end of input message
      Serial.write('0');
      Serial.write('1');
      Serial.write('2');
      Serial.write('x');
      Serial.send_now(); // comment out for non-Teensy boards
    }
  }
}

BTW: Indeed the IDE was really hosed … I had to kill it and even the " File \ Recent " memory is empty . . . it booted up on an Arduino board not remembering the last open board. It is 1.47b1 and has all the boards there - it just got Bashed.
 
Yep, the missing transmit flush will cause the latency test to fail. Will fix that soon.

On Linux, I'm seeing 375K lines/sec in the serial monitor. Well, at least for several seconds, until Java seems to get overwhelmed and freezes up (maybe garbage collection?) for a second or so, and then the process repeats over and over. It's been running that way for a couple hours. Line count is now up to 3396121472.

Running "top" to monitor processes shows Java consuming 40% to 50% CPU (of a single core) most of the time, then it jumps to 400% to 700% during the freezes.
 
Thanks for context - good it is known - missed the #2737 'edit'

I put up github.com/... /linux .c and an exe built under WSL. I attempted the same edits to the Linux branch that I made to the Windows one - it would be interesting to know if it runs as built against the lps_test sketch like it does on Windows.

It asks the system to fill a buffer - there as 64KB - then pulls out one transmission from each buffer to show on console to minimize output. I wonder how that does on Linux?

lps_test port 0 where opt 0 param says run 10M buffers full.

Also there is sketch github.com/.../lps_testr is the inverse transfer lps test. It counts \n and once per second shows Rx buffer and appends the count of the lines Rx's from lps_test port S out Serial2 with that "S" as param 2. It is printing better now - but only 304 lps. Just added pin 14 toggle on count to feed freqCount on pin 13 in EchoBoth debug monitor in that TView folder
 
That looks good Paul - no errors - and much better than it was - faster [ was 0.3 secs higher before IIRC ] and ~2% of transfers are at MAX value - before it was ~5% showing at MAX value

<edit>: Note instead of pushing '0'd buffer code below is sending repeating 'A-Z' as needed to fill the buffer, and any mismatch on sketch receive would be reported/shown as ERROR in reply string. Code Posted
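A repeating 'A-Z' fill makes corruption easy to spot because every byte's expected value follows from its position. Here is a minimal sketch of the fill and the receive-side check; the function names are hypothetical, and it assumes the pattern restarts at 'A' at the start of each message, which the post does not state.

```cpp
#include <cstddef>

// Fill buf with a repeating 'A'..'Z' pattern, starting at 'A'.
void fillAlphabet(char *buf, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i)
    buf[i] = static_cast<char>('A' + i % 26);
}

// Return the index of the first byte that breaks the pattern,
// or len if every byte matches.
std::size_t firstMismatch(const char *buf, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i)
    if (buf[i] != static_cast<char>('A' + i % 26)) return i;
  return len;
}
```

A receiver that gets a result other than len from firstMismatch() would report ERROR in its reply string, along the lines described above.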

This is using the 'latency_test.c:164:3: warning: 'gettimeofday' is deprecated':
Code:
T:\T_Downloads\pjrc_latency_test>latency_test COM25
port COM25 opened
waiting for board to be ready:
.ok
latency @    1 bytes: 0.00 ms average,   0 max hits,    100.00 2nd max,         0.00 maximum
latency @    2 bytes: 0.00 ms average,   0 max hits,    100.00 2nd max,         0.00 maximum
latency @   12 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.61 maximum
latency @   16 bytes: 0.00 ms average,   0 max hits,    100.00 2nd max,         0.00 maximum
latency @   30 bytes: 0.22 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @   31 bytes: 0.00 ms average,   0 max hits,    100.00 2nd max,         0.00 maximum
latency @   63 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.63 maximum
latency @   64 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @   65 bytes: 0.00 ms average,   0 max hits,    100.00 2nd max,         0.00 maximum
latency @   71 bytes: 0.00 ms average,   0 max hits,    100.00 2nd max,         0.00 maximum
latency @  126 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @  127 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @  128 bytes: 0.00 ms average,   0 max hits,    100.00 2nd max,         0.00 maximum
latency @  129 bytes: 0.22 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @  500 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @  512 bytes: 0.31 ms average,   2 max hits,    15.62 2nd max,  15.62 maximum
latency @  640 bytes: 0.31 ms average,   2 max hits,    15.62 2nd max,  15.67 maximum
latency @ 1000 bytes: 0.38 ms average,   2 max hits,    15.59 2nd max,  15.63 maximum
latency @ 1278 bytes: 0.47 ms average,   2 max hits,    15.65 2nd max,  15.65 maximum
latency @ 1279 bytes: 0.53 ms average,   3 max hits,    15.62 2nd max,  15.67 maximum
latency @ 1280 bytes: 0.62 ms average,   3 max hits,    15.63 2nd max,  15.70 maximum
latency @ 1281 bytes: 0.38 ms average,   2 max hits,    15.65 2nd max,  15.65 maximum
latency @ 2000 bytes: 0.69 ms average,   2 max hits,    15.65 2nd max,  15.74 maximum
latency @ 2047 bytes: 0.78 ms average,   1 max hits,    0.00 2nd max,   15.81 maximum
latency @ 2048 bytes: 0.84 ms average,   1 max hits,    0.00 2nd max,   15.76 maximum
latency @ 2049 bytes: 0.69 ms average,   3 max hits,    15.72 2nd max,  15.77 maximum
latency @ 4000 bytes: 1.31 ms average,   2 max hits,    15.82 2nd max,  15.83 maximum
latency @ 4095 bytes: 1.22 ms average,   3 max hits,    15.43 2nd max,  15.80 maximum
latency @ 4096 bytes: 1.47 ms average,   2 max hits,    15.81 2nd max,  15.81 maximum
latency @ 4097 bytes: 1.31 ms average,   2 max hits,    15.71 2nd max,  15.79 maximum
latency @ 8000 bytes: 2.39 ms average,   2 max hits,    6.65 2nd max,   15.85 maximum
 UP ----- pass #1        elapsed time 1.588 secs for 4106700 bytes
latency @ 8000 bytes: 2.31 ms average,   3 max hits,    15.80 2nd max,  15.84 maximum
latency @ 4097 bytes: 1.22 ms average,   3 max hits,    15.63 2nd max,  15.77 maximum
latency @ 4096 bytes: 1.32 ms average,   4 max hits,    15.65 2nd max,  15.71 maximum
latency @ 4095 bytes: 1.32 ms average,   3 max hits,    15.70 2nd max,  15.72 maximum
latency @ 4000 bytes: 1.22 ms average,   4 max hits,    15.68 2nd max,  15.75 maximum
latency @ 2049 bytes: 0.78 ms average,   2 max hits,    15.63 2nd max,  15.78 maximum
latency @ 2048 bytes: 0.85 ms average,   3 max hits,    15.63 2nd max,  15.63 maximum
latency @ 2047 bytes: 0.69 ms average,   3 max hits,    15.63 2nd max,  15.78 maximum
latency @ 2000 bytes: 0.69 ms average,   2 max hits,    15.46 2nd max,  15.66 maximum
latency @ 1281 bytes: 0.63 ms average,   4 max hits,    15.68 2nd max,  15.68 maximum
latency @ 1280 bytes: 0.53 ms average,   2 max hits,    15.48 2nd max,  15.70 maximum
latency @ 1279 bytes: 0.47 ms average,   2 max hits,    15.58 2nd max,  15.70 maximum
latency @ 1278 bytes: 0.53 ms average,   2 max hits,    15.62 2nd max,  15.63 maximum
latency @ 1000 bytes: 0.47 ms average,   3 max hits,    15.62 2nd max,  15.64 maximum
latency @  640 bytes: 0.31 ms average,   2 max hits,    15.60 2nd max,  15.71 maximum
latency @  512 bytes: 0.22 ms average,   2 max hits,    6.50 2nd max,   15.62 maximum
latency @  500 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.63 maximum
latency @  129 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @  128 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @  127 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @  126 bytes: 0.00 ms average,   0 max hits,    100.00 2nd max,         0.00 maximum
latency @   71 bytes: 0.07 ms average,   1 max hits,    0.00 2nd max,   6.50 maximum
latency @   65 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.63 maximum
latency @   64 bytes: 0.00 ms average,   0 max hits,    100.00 2nd max,         0.00 maximum
latency @   63 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @   31 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @   30 bytes: 0.00 ms average,   0 max hits,    100.00 2nd max,         0.00 maximum
latency @   16 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @   12 bytes: 0.16 ms average,   1 max hits,    0.00 2nd max,   15.62 maximum
latency @    2 bytes: 0.00 ms average,   0 max hits,    100.00 2nd max,         0.00 maximum
latency @    1 bytes: 0.07 ms average,   1 max hits,    0.00 2nd max,   6.51 maximum
 DOWN --- pass #1        elapsed time 1.572 secs for 4106700 bytes
>> 100.00 2nd max : means it was not changed from the compile set value.
 
@Paul - Looks great, other than Java exception: Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space

Now all you need to do is to rewrite the Java code ;)
Code:
count=86447975, lines/sec=451763
count=86447976, lines/sec=451763
count=86447977, lines/sec=451763
count=86447978, lines/sec=451763
count=86447979, lines/sec=451763
count=86447980, lines/sec=451763
 
I comment out first line in printf.h to disable debug printf, and now T4B2 sketches won't upload unless I press the program button. If I uncomment the first line in printf.h, then uploads/run work again.

Committed a patch that will hopefully fix this bug.

https://github.com/PaulStoffregen/cores/commit/93a2e12268e2f9fa1f08e68e5063a85c01f4ba89

I still don't understand why interrupting on the final ack was notifying too soon. Perhaps there is more complexity inside the USB controller than I had imagined, like maybe a write-back buffer or cache of some sort, so the data receive transfer actually completes later than the final transfer to send ack for the entire control transfer?
 