Teensy 4.0 First Beta Test

Status
Not open for further replies.
Figured it was time to put all this data into perspective

Haha - it's a bit premature to make pretty graphs!

The new code has absolutely no optimizations yet, and it still appears to have some serious bugs. I'm about to start looking into why the latency test is failing with 64 bytes.

Please, would you do me a small favor and keep the benchmark stuff text only for now? These sorts of graphs do make a strong impression on people who don't read the finer details. I definitely would like to see this sort of graph comparison in a few weeks. But please, let's only do this when comparing well tested and optimized code against the very mature and highly optimized Teensy 3.x USB code, and maybe other Arduino compatible boards with stable released code.

On benchmarking, here's another really simple speed test I wrote yesterday. Let it run for about 5 seconds, or until the lines/sec number becomes fairly stable.

uint32_t count, prior_count;
uint32_t prior_msec;
uint32_t count_per_second;

// Uncomment this for boards where SerialUSB needed for native port
//#define Serial SerialUSB

void setup() {
Serial.begin(1000000);
while (!Serial) ;
count = 10000000;
prior_count = count;
count_per_second = 0;
prior_msec = millis();
}

void loop() {
Serial.print("count=");
Serial.print(count);
Serial.print(", lines/sec=");
Serial.println(count_per_second);
count = count + 1;
uint32_t msec = millis();
if (msec - prior_msec > 1000) {
prior_msec = prior_msec + 1000;
count_per_second = count - prior_count;
prior_count = count;
}
}

On this test too, Teensy 3.x with years of careful optimization handily beats the no-optimization code on Teensy 4.0. But that situation won't last long. I will do quite a lot of optimization on T4 soon. Its USB is 480 Mbit/sec speed and the USB controller works at the transfer level rather than the token level as we have on Teensy 3.x, and of course we have a much faster CPU, so the potential for amazing speed is there. But with USB, even incredibly fast hardware does poorly without careful optimization work in the software.

If you run this speed test on some of Arduino's boards, it's pretty amazing how much the speed varies. Some are amazingly slow. It's also kind of interesting to compare to Teensy 2.0, where the code has a lot of optimization but the hardware is only 8 bits at 16 MHz without DMA.
 
Haha - it's a bit premature to make pretty graphs!

The new code has absolutely no optimizations yet, and it still appears to have some serious bugs. I'm about to start looking into why the latency test is failing with 64 bytes.

Please, would you do me a small favor and keep the benchmark stuff text only for now? These sorts of graphs do make a strong impression on people who don't read the finer details. I definitely would like to see this sort of graph comparison in a few weeks. But please, let's only do this when comparing well tested and optimized code against the very mature and highly optimized Teensy 3.x USB code, and maybe other Arduino compatible boards with stable released code.

On benchmarking, here's another really simple speed test I wrote yesterday. Let it run for about 5 seconds, or until the lines/sec number becomes fairly stable.



On this test too, Teensy 3.x with years of careful optimization handily beats the no-optimization code on Teensy 4.0. But that situation won't last long. I will do quite a lot of optimization on T4 soon. Its USB is 480 Mbit/sec speed and the USB controller works at the transfer level rather than the token level as we have on Teensy 3.x, and of course we have a much faster CPU, so the potential for amazing speed is there. But with USB, even incredibly fast hardware does poorly without careful optimization work in the software.

If you run this speed test on some of Arduino's boards, it's pretty amazing how much the speed varies. Some are amazingly slow. It's also kind of interesting to compare to Teensy 2.0, where the code has a lot of optimization but the hardware is only 8 bits at 16 MHz without DMA.

Sorry Paul
I went ahead and deleted the graphs, with all the numbers being thrown out I was loosing perspective :) I understand there is still a lot to do with USB including optimizations but I am a kind of visual person so I like pretty pictures but fully understand your point. To be honest was a bit leery about running any of the tests on USB but wanted to get a baseline of where we are now compared to where it ends up when you get done with USB. Like you said it, probably a bit premature for pictures.

Going to go ahead and run the tests and maybe just put it in a spreadsheet for now - no graphs I promise.
 
@PaulStoffregen

Ran your speed test on a variety of boards besides the T4 for later comparison. Wow you are right abou the Arduino boards. Even my Nucleo-H7 board has some issues. Couldn't find my Arduino Due board - forgot where I put it.

Code:
	T4 (9f4d02c)	T3.6	T.3.5	2.0++	Uno	mega	Nucleo144-H7	101    DUE
Avg	    3999	26873.5	26877.5	8577	1268	1267.5	2818.5         475.5   376.5
min	    3998	26841	26837	8568	1267	1266	2818	        475     376
max	    4000	26906	26918	8586	1269	1269	2819	        476     377
 
Last edited:
You can also expect to see different speed with versus without a USB hub. You might think a hub would slow things down, but in many cases you'll see better performance for 12 Mbit/sec devices due to how the hub's transaction translator works. The TT effectively creates another fully dedicated 12 Mbit channel to the downstream (from the hub) device. Many hubs have TTs that are surprisingly good at estimating what packets they can schedule before the SOF, which also translates into a little more speed in some cases.

Don't be shy about putting T4's numbers on the forum. That's perfectly ok. But please keep comparisons text only. Maybe also mention which version/commit you tested since we're in rapidly changing beta...
 
@PaulStoffregen and @mjs513, I tried a few boards....

T4 appeared to be about 3990
OpenCM9.04 2670
OpenCR1 15515
T3.6 26300

So T3... goes pretty fast :D
 
@KurtE and @PaulStoffregen

Just added T4 with the latest commit (as of 3hours ago) to post #2628.... lots of stuff left on USB.. don't envy you Paul
 
WOW - the flashcode text has 5 Views! Paul hopefully you are one of them?

Yes, one of those was me. :)

I looked at this briefly yesterday, doing a few quick visual checks of the hexdump (msg #2582) against the original .hex file (msg #2305). While it looked ok, nothing like 256 byte chunks all 00 or FF, I quickly realized I didn't have a copy of the original in hexdump format, and neither an easy way to turn the hexdump back into intel hex .HEX format.

I didn't want to compare against a hexdump of another board, in case there's a bug in my test fixture code which wrote those images into the top sector of the flash. In that case, every board might have the same corrupted restore program.

Just now I wrote a little program which turns the original .HEX file into the same hexdump as that code on msg #2576 creates. Here's the hexdump of the original code.

Code:
0000: 46 43 46 42 00 00 01 56 00 00 00 00 01 03 03 00
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0040: 00 00 00 00 01 04 02 00 00 00 00 00 00 00 00 00
0050: 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0080: EB 04 18 0A 06 32 04 26 00 00 00 00 00 00 00 00
0090: 05 04 04 24 00 00 00 00 00 00 00 00 00 00 00 00
00A0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00B0: 06 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00C0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00D0: 20 04 18 08 00 00 00 00 00 00 00 00 00 00 00 00
00E0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0100: D8 04 18 08 00 00 00 00 00 00 00 00 00 00 00 00
0110: 02 04 18 08 04 20 00 00 00 00 00 00 00 00 00 00
0120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0130: 60 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01A0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01B0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01C0: 00 01 00 00 00 10 00 00 01 00 00 00 00 00 00 00
01D0: 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
01E0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0200: D1 00 20 40 2C 10 00 60 00 00 00 00 00 00 00 00
0210: 20 10 00 60 00 10 00 60 00 00 00 00 00 00 00 00
0220: 00 00 00 60 10 16 00 00 00 00 00 00 00 60 00 20
0230: 91 10 00 60 F0 B5 15 4B 40 20 4F F4 80 31 4F F4
0240: 80 54 4F F4 00 55 4F F4 40 46 43 F2 42 07 1A 69
0250: 12 F0 02 0F 03 D0 9E 61 59 61 9F 61 F7 E7 12 F4
0260: 00 5F 01 D1 5D 61 F2 E7 12 F4 80 5F 01 D1 5C 61
0270: ED E7 00 2A EB DA 12 F4 80 3F 01 D0 99 61 E6 E7
0280: 52 06 01 D4 58 61 E2 E7 F0 BD 00 BF 00 80 0D 40
0290: 70 B5 72 B6 00 23 35 4A 35 4C 36 49 98 18 88 42
02A0: 03 D2 18 59 98 50 04 33 F8 E7 33 4A 00 20 33 49
02B0: 8A 42 4F F0 00 03 02 D2 10 60 04 32 F8 E7 30 49
02C0: 10 22 30 48 30 4C 03 60 0A 60 30 4A 14 60 30 4C
02D0: 0C 60 30 4C 14 60 30 4C 0C 60 30 4C 14 60 30 4C
02E0: 0C 60 30 49 11 60 01 21 01 60 BF F3 4F 8F BF F3
02F0: 6F 8F C2 F8 B0 31 BF F3 4F 8F BF F3 6F 8F 50 F8
0300: 80 2C 6F F0 7F 44 A0 F5 58 60 27 4E 42 F4 40 32
0310: C0 F8 00 2D 25 4A 13 60 04 60 25 4C 23 60 11 60
0320: FF F7 88 FF 00 F0 B2 F8 22 4A 93 6E 43 F0 40 43
0330: 93 66 05 22 20 4B C3 F8 48 21 1C 22 C3 F8 38 23
0340: A3 F5 70 33 08 22 5A 60 C3 F8 84 20 25 68 00 F0
0350: 1B F9 22 68 AB 1A 23 F0 7F 43 B3 42 F7 D9 08 21
0360: 16 4B 15 46 C3 F8 8C 10 F1 E7 00 BF 00 00 00 20
0370: 10 16 00 60 60 01 00 20 00 10 00 20 C0 32 00 20
0380: 9C ED 00 E0 94 ED 00 E0 25 00 08 03 A0 ED 00 E0
0390: 12 00 00 20 25 00 08 13 14 00 00 40 33 00 10 13
03A0: 15 00 00 60 2F 00 0B 07 A0 86 01 00 10 E0 00 E0
03B0: 18 E0 00 E0 00 C0 0F 40 00 80 1F 40 00 C0 1B 40
03C0: 30 B5 23 4C E9 B9 00 2A 22 4B 48 F2 80 02 4F F0
03D0: 01 01 08 BF 80 22 A3 64 19 60 18 BF 4F F4 80 31
03E0: 5A 60 4F F0 00 02 08 BF 00 21 9A 60 1A 4B E2 64
03F0: D3 F8 B0 21 42 F4 80 32 C3 F8 B0 21 17 4A 11 60
0400: 21 E0 09 04 16 4B 01 25 41 F0 80 01 98 60 A3 60
0410: 59 60 00 F5 80 51 1D 60 D9 60 00 F5 00 51 19 61
0420: 00 F5 40 51 00 F5 80 40 59 61 98 61 00 23 E3 60
0430: 09 4B D3 F8 B0 11 29 43 C3 F8 B0 11 D3 F8 B0 11
0440: 00 29 FB D1 BF E7 D3 F8 B0 21 00 2A FB D1 30 BD
0450: 00 30 00 20 20 20 00 20 00 00 2E 40 00 10 00 20
0460: 00 20 00 20 08 4B 00 22 6F F0 7F 41 1A 60 64 22
0470: 50 43 1A 46 13 68 C3 F1 FF 13 03 F5 7F 43 83 42
0480: F8 D3 8B 42 F6 D0 70 47 18 E0 00 E0 35 4A 00 20
0490: D2 F8 80 30 43 F0 03 03 70 B5 C2 F8 80 30 03 23
04A0: 31 4A 53 71 02 23 04 32 98 54 01 33 2F 49 16 2B
04B0: FA D1 2F 4B 0A 24 2F 4A D3 F8 20 32 23 F0 7F 43
04C0: 93 42 9C BF 0A 22 53 43 02 22 B3 FB F4 F0 04 FB
04D0: 10 33 30 33 8B 54 03 46 90 1C 0B B1 02 46 F4 E7
04E0: 21 4B 18 71 02 23 93 42 06 D2 C8 5C 8C 5C CC 54
04F0: 02 33 88 54 02 3A F6 E7 40 F6 61 72 1E 4B 1F 4C
0500: C3 F8 20 21 4F F0 00 43 63 63 1D 4B D3 F8 40 21
0510: 42 F0 02 02 C3 F8 40 21 D3 F8 40 51 18 4E 15 F0
0520: 02 05 F9 D1 4F F0 00 43 32 20 A3 63 FF F7 9A FF
0530: 4F F0 80 43 4F F4 20 72 29 46 A3 63 0A 23 25 60
0540: 10 4C C6 F8 A8 31 20 46 00 F0 5A F9 4F F4 81 03
0550: 23 60 4F F4 80 03 23 64 01 23 C6 F8 58 41 C6 F8
0560: 40 31 70 BD 00 C0 0F 40 00 10 00 20 04 10 00 20
0570: 00 44 1F 40 7F 96 98 00 00 80 0D 40 00 90 0D 40
0580: 00 00 2E 40 00 30 00 20 2D E9 F8 43 90 4D D5 F8
0590: 44 61 F0 07 C5 F8 44 61 0C D4 71 06 40 F1 10 81
05A0: 8B 4B D3 F8 AC 21 C3 F8 AC 21 D3 F8 BC 21 C3 F8
05B0: BC 21 FA E0 D5 F8 BC 31 BB B1 86 4C C5 F8 BC 31
05C0: 22 68 13 42 11 D0 84 4B 00 27 1A 68 83 4B 27 60
05D0: 9A 42 08 D1 01 20 FF F7 45 FF C5 F8 40 71 32 20
05E0: FF F7 40 FF FB BE 00 23 A3 76 79 4C D4 F8 AC 31
05F0: 25 46 27 46 00 2B D0 D0 79 49 C4 F8 AC 31 8B 6A
0600: D1 F8 2C E0 D4 F8 40 21 42 F4 00 52 C4 F8 40 21
0610: D4 F8 40 21 92 04 F5 D5 D5 F8 40 21 22 F4 00 52
0620: C5 F8 40 21 4F F0 01 12 C5 F8 B4 21 D7 F8 B4 21
0630: 12 F0 01 18 FA D1 9A B2 40 F2 81 60 82 42 0D D8
0640: B2 F5 D0 6F 3C D2 B2 F5 A0 6F 40 F0 A5 80 1B 0C
0650: 5B 06 43 F0 80 73 C5 F8 54 31 97 E0 40 F6 21 10
0660: 82 42 00 F0 85 80 40 F6 21 20 82 42 00 F0 8E 80
0670: B2 F5 10 6F 40 F0 90 80 C3 F3 07 43 DF F8 54 91
0680: 58 4A 40 46 89 F8 1B 30 57 4B C5 F8 C4 21 C5 F8
0690: C8 31 C5 F8 CC 21 4F F4 00 12 C5 F8 D0 31 4F F4
06A0: 80 03 C1 F8 00 21 42 46 C1 F8 C0 30 C1 F8 C0 31
06B0: C1 F8 00 32 41 46 FF F7 83 FE 89 F8 1A 80 71 E0
06C0: 1B 0C 4A 4A 1F FA 8E FC 90 68 00 28 64 D0 B2 F8
06D0: 00 80 98 45 4A D1 B2 F8 02 80 E0 45 46 D1 1B 0A
06E0: 03 2B 4F EA 1E 43 0C BF 02 78 92 88 93 42 28 BF
06F0: 13 46 93 B9 3E 4B 01 22 4F F0 80 08 8B 60 83 E8
0700: 04 01 00 22 9A 60 CA 60 D4 F8 B0 31 43 F0 01 03
0710: C4 F8 B0 31 2F 4B 1A 60 23 E0 1B 04 35 4A 4F F0
0720: 01 0E 43 F0 80 03 90 60 C2 F8 00 E0 53 60 00 F5
0730: 80 53 8A 64 D3 60 00 F5 00 53 13 61 00 F5 40 53
0740: 00 F5 80 40 53 61 00 23 90 61 CB 64 D4 F8 B0 31
0750: 43 F4 80 33 C4 F8 B0 31 D5 F8 B0 31 00 2B FB D1
0760: C8 E7 D5 F8 B0 31 00 2B FB D1 1B E0 0C 32 AB E7
0770: 21 4A 93 42 10 D1 21 4B 9E 45 0D D1 15 4B 9A 7E
0780: 52 B9 01 22 04 21 14 48 9A 76 02 E0 42 46 41 46
0790: 40 46 FF F7 15 FE 05 E0 D4 F8 C0 31 43 F0 01 13
07A0: C4 F8 C0 31 D4 F8 AC 31 24 E7 D3 F8 B0 11 08 4A
07B0: 00 29 FA D1 4F F0 FF 33 C2 F8 B4 31 D2 F8 84 31
07C0: 73 07 02 D5 02 4B D3 F8 84 31 BD E8 F8 83 00 BF
07D0: 00 00 2E 40 00 10 00 20 80 32 00 20 A9 45 C2 6B
07E0: 00 30 00 20 02 00 CC 00 CC 00 02 00 00 00 00 20
07F0: 20 20 00 20 00 20 00 20 21 09 00 03 01 00 04 00
0800: 02 44 03 46 93 42 02 D0 03 F8 01 1B FA E7 70 47
0810: 00 01 00 00 12 00 00 00 D1 00 00 20 00 02 00 00
0820: 49 00 00 00 88 00 00 20 00 22 01 00 21 00 00 00
0830: 21 01 00 20 00 21 01 00 09 00 00 00 BA 00 00 20
0840: 00 22 00 00 1C 00 00 00 42 01 00 20 00 21 00 00
0850: 09 00 00 00 9A 00 00 20 00 03 00 00 04 00 00 00
0860: 84 00 00 20 01 03 09 04 18 00 00 00 09 01 00 20
0870: 02 03 09 04 26 00 00 00 E3 00 00 20 03 03 09 04
0880: 16 00 00 00 04 10 00 20 00 00 00 00 00 00 00 00
0890: 00 00 00 00 04 03 09 04 09 02 49 00 02 01 00 C0
08A0: 32 09 04 00 00 02 03 00 00 00 09 21 11 01 00 01
08B0: 22 1C 00 07 05 83 03 40 00 01 07 05 04 03 40 00
08C0: 01 09 04 01 00 02 03 00 00 00 09 21 11 01 00 01
08D0: 22 21 00 07 05 81 03 40 00 01 07 05 02 03 20 00
08E0: 02 12 01 10 01 00 00 00 40 C0 16 86 04 79 02 01
08F0: 02 03 01 26 03 54 00 65 00 65 00 6E 00 73 00 79
0900: 00 64 00 75 00 69 00 6E 00 6F 00 20 00 52 00 61
0910: 00 77 00 48 00 49 00 44 00 18 03 54 00 65 00 65
0920: 00 6E 00 73 00 79 00 64 00 75 00 69 00 6E 00 6F
0930: 00 06 C9 FF 09 04 A1 5C 75 08 15 00 26 FF 00 95
0940: 40 09 75 81 02 95 20 09 76 91 02 95 04 09 76 B1
0950: 02 C0 06 AB FF 0A 00 02 A1 01 75 08 15 00 26 FF
0960: 00 95 40 09 01 81 02 95 40 09 02 91 02 C0 00 00

It is indeed a perfect match to the first 2416 bytes of hexdump from your board. So it seems something is going wrong when the restore process copies it back to the first sector of the flash chip. At least your board and probably all the other beta #2 boards have a correctly written restore image in their top flash sector. :) That's really good news, since I set the permanent sector lock in the flash chips, so those top 4096 bytes can never be erased or altered.

The problem Frank described in msg #2579 is probably also the restore process failing to copy the entire image correctly. Looks like I'll be making a bootloader update in the next week or two....
 
T3.6 26300

So T3... goes pretty fast :D

Yup, a *lot* of optimization work when into that.

Those lines are 33 bytes, when the speed prints as a 5 digit number. So 26300 lines/sec is 867.9 kbytes/sec.

When I run it here on a fast Linux desktop and through a good quality multi-TT hub, I see 31030 lines/sec. At 33 bytes/line, that's 1023.9 kbytes/sec. When everything is running very efficiently, you tend to see the bandwidth in a multiple of 64000 bytes/sec. The packets are 64 bytes. So 1024 kbytes/sec means the system is managing to fit 16 max size packets into each 1ms USB frame. The absolute theoretical USB maximum (not accounting for bit stuffing overhead) is 19 packets per frame.

This analysis and admittedly most of my device-side experience is based on 12 Mbit speed where the max packet size for bulk protocol is 64 bytes. On T4 we have 480 Mbit speed. The frames are 125 us. We're currently still using only 64 byte packets, but the protocol allows 512. On my to-do list is moving the different USB descriptors depending on the speed, so we can go up to 512 but still be compatible if 12 Mbit speed is used.

But before I really start on optimizations, I need to figure out why the latency test is failing in certain sizes. Probably a bug on my part...
 
@PaulStoffregen

Ran your speed test on a variety of boards besides the T4 for later comparison. Wow you are right abou the Arduino boards. Even my Nucleo-H7 board has some issues. Couldn't find my Arduino Due board - forgot where I put it.

Code:
	T4 (9f4d02c)	T3.6	T.3.5	2.0++	Uno	mega	Nucleo144-H7	101    DUE
Avg	    3999	26873.5	26877.5	8577	1268	1267.5	2818.5         475.5   376.5
min	    3998	26841	26837	8568	1267	1266	2818	        475     376
max	    4000	26906	26918	8586	1269	1269	2819	        476     377

With T4B2 1.47-beta1 and latest github core with printf's in USB core commented out, i get (lines/sec)
Code:
       T4B2  T3.6
avrg  25299  29083
min   23277  28982
max   25587  29092
host is linux laptop Ubuntu (16.04), and I cut/paste from Sermon.

Code:
                latency @ 1 bytes: 1.00 ms average, 1.11 maximum
                latency @ 2 bytes: 1.00 ms average, 1.34 maximum
                latency @ 12 bytes: 1.00 ms average, 1.08 maximum
                latency @ 30 bytes: 1.00 ms average, 1.04 maximum
                latency @ 62 bytes: 1.00 ms average, 1.08 maximum
                latency @ 71 bytes: 1.00 ms average, 1.05 maximum
                error reading result, r=0
                error reading result, r=0
                error reading result, r=0
                latency @ 128 bytes: 30.24 ms average, 1000.00 maximum
                latency @ 500 bytes: 0.97 ms average, 1.13 maximum
                latency @ 1000 bytes: 1.97 ms average, 2.08 maximum
                latency @ 2000 bytes: 3.94 ms average, 4.13 maximum
                latency @ 4000 bytes: 7.88 ms average, 10.26 maximum
                latency @ 8000 bytes: 15.52 ms average, 16.10 maximum

On an older Ubuntu 14.04 with USB hub (part of Dell monitor)
Code:
       T4B2  T3.6
avrg  25463  34738
min   25290  34612
max   25595  34910
redirected kermit output to a file for 5 seconds or do
stty -F /dev/ttyACM0 115200
cat /dev/ttyACM0 > /tmp/tmp

and ctrl-c after 5 seconds or so, and get rid of first 100000 lines or so.
Code:
T3.6 to linux64
tail -n 300000 /tmp/tmp | awk -F = '{print $3}' | stats
300000 pts min 35626.000000 max 35709.000000 avrg 35656.383903 cov 0.000697 rmse 24.860356
 
Last edited:
@manitou
I just commented all the printf's in usb_serial and still get the same lines/sec on the Windows10x64 machine. Is there some place else I need to comment out printf's? Your numbers are quite a bit different than what I am getting for the T4B2
 
@mjs513 - Look in usb.c line 597

I think I mentioned the output in post #2620... His results loook similar to what I was seeing except I did not update yet to continue after error...
 
@mjs513 - Look in usb.c line 597

I think I mentioned the output in post #2620... His results loook similar to what I was seeing except I did not update yet to continue after error...

Commented that line out as well but no change in lines/sec. Now here is something strange. I left all the printf's in and got the same lines/sec.
 
Your numbers are quite a bit different than what I am getting for the T4B2

Pretty sure this is a Linux vs Windows difference. I'm seeing ~11000 lines/sec here, regardless of the debug printing.

Which chip your computer's motherboard has could also make quite a difference. T4 is extremely fast but the code has no optimization at all. You can expect widely varying results depending on small USB details that normally wouldn't be very important!

To completely turn off all debug printing, comment out the first line in this file.

https://github.com/PaulStoffregen/cores/blob/master/teensy4/debug/printf.h

However, looks like some direct calls to printf_debug() slipped into the code some time ago. Fixed it just now. Grab the latest commit to be able to use this to turn off all Serial4 printing.
 
@manitou
I just commented all the printf's in usb_serial and still get the same lines/sec on the Windows10x64 machine. Is there some place else I need to comment out printf's? Your numbers are quite a bit different than what I am getting for the T4B2

Actually, I'm not sure how one is supposed to capture the output from the sketch? I was copying and pasting from Sermon on Ubuntu, but that seems like the speed of the display might be a factor ?? There seems to a be some jitter in lines/second. I'm not sure best way to capture free-running /dev/ttyACM0 ...
 
Pretty sure this is a Linux vs Windows difference. I'm seeing ~11000 lines/sec here, regardless of the debug printing.

Which chip your computer's motherboard has could also make quite a difference. T4 is extremely fast but the code has no optimization at all. You can expect widely varying results depending on small USB details that normally wouldn't be very important!

To completely turn off all debug printing, comment out the first line in this file.

https://github.com/PaulStoffregen/cores/blob/master/teensy4/debug/printf.h

However, looks like some direct calls to printf_debug() slipped into the code some time ago. Fixed it just now. Grab the latest commit to be able to use this to turn off all Serial4 printing.

I just downloaded the latest core change and no difference to speak of, instead of 3999 I am seeing 4000. I did try to comment that line out but like you said some got through so got errors ;) thank your for fixing.

I don't have any machines that run Linux so have nothing to compare to. Never realized that there would be that much variation. But feel better now that its not a problem on my end with USB. Its a Linux vs Windows issue from what you are indicating. Still some difference between the two OS's.
 
got this error
Code:
C:\Users\Walter\Documents\arduino-1.8.8\hardware\teensy\avr\cores\teensy4\usb_desc.h:212:28: error: 'ENDPOINT_RECEIVE_ONLY' undeclared (first use in this function)

   #define ENDPOINT2_CONFIG ENDPOINT_RECEIVE_ONLY
 
@mjs513 @Paul... -
If it Run it on my Linux box, (My Old dev machine I7 about gen 2) with those lines commented out... Have not picked up the last changes yet...

I think I am seeing the T4 output about 27500 lines/sec...

Now Same sketch moved to my Windows machine (did not reprogram T4)...
On Windows 10 machine running on Hub: 4000

Without hub still 4000.
Using Same HUb that was on Linux box still 4K

Using Same HUB with TyCommander: still 4K...

So it does look like Windows...

And for what it is worth if I connect it up to an RPI3 board and do cat /dev/ttyACM0
Which is going through my putty (KiTTy) window.

I am getting about 6950 lines/sec
 
Last edited:
@mjs513 @Paul... -
If it Run it on my Linux box, (My Old dev machine I7 about gen 2) with those lines commented out... Have not picked up the last changes yet...

I think I am seeing the T4 output about 27500 lines/sec...

Now Same sketch moved to my Windows machine (did not reprogram T4)...
On Windows 10 machine running on Hub: 4000

Without hub still 4000.
Using Same HUb that was on Linux box still 4K

Using Same HUB with TyCommander: still 4K...

So it does look like Windows...

@KurtE

Have no doubt that its a windows problem - just surprising since I never ran Linux before - sometime this week going to look for one of old laptops and actually install Linux but that is going to be a whole nother adventure for me :)
 
Hopefully the edited added feedback and stat points in the updated latency_test.exe will allow confirming functionality and give a quick way to watch perf changes by comaprison.

...

The new code has absolutely no optimizations yet, and it still appears to have some serious bugs. I'm about to start looking into why the latency test is failing with 64 bytes.

On benchmarking, here's another really simple speed test I wrote yesterday. Let it run for about 5 seconds, or until the lines/sec number becomes fairly stable.
...

Mods to that into lps_test.c { made for Windows Only } takes the SerMon out of the picture:

This is clearly a Windows limitation - it reads blocks of 8K chars and prints limited view and repeats for 50 times::
** NOT See following post - same code runs MUCH faster on T_3.x :: instead of 4K it gets 55K to 71K
Code:
T:\T_Downloads\pjrc_latency_test>lps_test.exe COM25
port COM25 opened

 #0 :
>>, lines/sec=3956
count=10229436, lines/sec=1991
count=10229437, lines/sec=1
c<<

// …

 #49 :
>>
count=10242165, lines/sec=3957
count=10242166, lines/sec=3957
count=10242167,<< ------         elapsed time 3.277 secs for 404800 bytes

Leave the provided #2626 sketch run on Teensy,
 

Attachments

  • lps_test.zip
    23.9 KB · Views: 93
Last edited:
Should have tried that on T_3.2 before posting:

#49 :
>>c=55520
count=10712847, lines/sec=55520
count=10712848, lines/sec=55520
count<<
TyCommander Sermon T_3.1:
Code:
count=20561130, lines/sec=26814
count=20561131, lines/sec=26814
count=20561132, lines/sec=26814

T_3.6 at 180 MHz:
#49 :
>>t=11525506, lines/sec=71516
count=11525507, lines/sec=71516
count=11525508, li<<
TyCommander Sermon T_3.6:
Code:
count=22780433, lines/sec=26862
count=22780434, lines/sec=26862
count=22780435, lines/sec=26862

TyCommander Sermon T_4:
Code:
count=10019464, lines/sec=3995
count=10019465, lines/sec=3995
count=10019466, lines/sec=3995
 
got this error
Code:
C:\Users\Walter\Documents\arduino-1.8.8\hardware\teensy\avr\cores\teensy4\usb_desc.h:212:28: error: 'ENDPOINT_RECEIVE_ONLY' undeclared (first use in this function)

So far, Serial is the only USB Type setting which compiles. Not going to work on the others until more issues are resolved.
 
I don't understand this. For 12 Mbit/sec USB, 36848 lines/sec should be the theoretical maximum.

I can't speak to that ... unless 50 iterations aren't enough to stabilize … Teensy sketch is printing that information? Perhaps a bug there in calc as on some occasions SerMon will show numbers faster than that on T_3.x. It will show numbers of 134K at startup and 44K T_3.1

Unless there is LOSS of data in the process? where buffered data is dumped?

The posted ZIP has the mods I made to get this from Windows. Somebody with Linux could examine my code and make it work on Linux. There may be settings on the port that don't preclude data dumping to bit bucket.

I'll go up the count from 50 to 100 ...
 
opps - it just was not stabilized with 50 8K reads. Here is the same with 200 64K reads:


T_3.1 Com 19:
Code:
 #199 :
>> lines/sec=25258
count=112147896, lines/sec=25258
count=112147897, lines/sec=2<< ------    elapsed time 15.180 secs for 1619200 bytes

T_3.6 Com 8:
Code:
 #199 :
>>299982629, lines/sec=25946
count=299982630, lines/sec=25946
count=299982631, l<< ------      elapsed time 15.034 secs for 1619200 bytes

No change for T_4-2 COM 25:
Code:
  #199 :
>>count=10500344, lines/sec=3967
count=10500345, lines/sec=3967
count=10500346, << ------        elapsed time 104.117 secs for 1619200 bytes

Source for :: lps_test.c - LINUX/MAC code not updated to above in receive_bytes_X()
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdarg.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

// One of these must be defined, usually via the Makefile
//#define MACOSX
//#define LINUX
#define WINDOWS

#if defined(MACOSX) || defined(LINUX)
#include <termios.h>
#include <sys/select.h>
#define PORTTYPE int
#define BAUD B115200
#if defined(LINUX)
#include <sys/ioctl.h>
#include <linux/serial.h>
#endif
#elif defined(WINDOWS)
#include <windows.h>
#define PORTTYPE HANDLE
#define BAUD 115200
#define gettimeofday gettimeofday_
#else
#error "You must define the operating system\n"
#endif

// function prototypes
PORTTYPE open_port_and_set_baud_or_die(const char *name, long baud);
void close_port(PORTTYPE port);
int receive_bytes_X(PORTTYPE port, int cnt);
void delay(double sec);
void die(const char *format, ...) __attribute__ ((format (printf, 1, 2)));

int gettimeofday(struct timeval * tp, struct timezone * tzp);


int receive_bytes_X(PORTTYPE port, int cnt)
{
	int count=0;
	char buf[65537];
	int len;
	len = sizeof(buf)-1;
#if defined(MACOSX) || defined(LINUX)
	int r;
	int retry=0;

	// non-blocking read mode
	fcntl(port, F_SETFL, fcntl(port, F_GETFL) | O_NONBLOCK);
	while (count < len) {
		r = read(port, buf + count, len - count);
		//printf("read, r = %d\n", r);
		if (r < 0 && errno != EAGAIN && errno != EINTR) return -1;
		else if (r > 0) count += r;
		else {
			// no data available right now, must wait
			fd_set fds;
			struct timeval t;
			FD_ZERO(&fds);
			FD_SET(port, &fds);
			t.tv_sec = 1;
			t.tv_usec = 0;
			r = select(port+1, &fds, NULL, NULL, &t);
			//printf("select, r = %d\n", r);
			if (r < 0) return -1;
			if (r == 0) return count; // timeout
		}
		retry++;
		if (retry > 1000) return -100; // no input
	}
	fcntl(port, F_SETFL, fcntl(port, F_GETFL) & ~O_NONBLOCK);
#elif defined(WINDOWS)
	COMMTIMEOUTS timeout;
	DWORD n;
	BOOL r;
	int waiting=0;

	GetCommTimeouts(port, &timeout);
	timeout.ReadIntervalTimeout = MAXDWORD; // non-blocking
	timeout.ReadTotalTimeoutMultiplier = 0;
	timeout.ReadTotalTimeoutConstant = 0;
	SetCommTimeouts(port, &timeout);
	for ( int ii=0; ii<cnt; ii++ ) {
		count = 0;
		while (count < len) {
			r = ReadFile(port, buf + count, len - count, &n, NULL);
			if (!r) die("read error\n");
			if (n > 0) count += n;
			else {
				if (waiting) break;  // 1 sec timeout
				timeout.ReadIntervalTimeout = MAXDWORD;
				timeout.ReadTotalTimeoutMultiplier = MAXDWORD;
				timeout.ReadTotalTimeoutConstant = 1000;
				SetCommTimeouts(port, &timeout);
				waiting = 1;
			}
		}
		buf[80]=0;
		if ( !(ii%8) || (ii>cnt-8) ) printf( "\n #%d : \n>>%s<<", ii, buf );
	}
#endif
	return count;
}

int main(int argc, char **argv)
{
	PORTTYPE fd;
	int blkcnt = 200;

	if (argc < 2) die("Usage: :) latency_test3 <comport>\n");
	fd = open_port_and_set_baud_or_die(argv[1], BAUD);
//	if (argc > 2 && argv[2][0] >'0' && argv[2][0] <= '9' ) loopNum = argv[2][0] - '0';
//	if (argc > 3 && argv[3][0] >'0' && argv[3][0] <= '9' ) numLengths = ( numLengths / (argv[3][0] - '0'));
	printf("port %s opened\n", argv[1]);

	struct timeval begin, end;
	double elapsed;
	int totBytes;

	int r;
	// test begin
	gettimeofday(&begin, NULL);
	r = receive_bytes_X( fd, blkcnt );
	gettimeofday(&end, NULL);
	elapsed = (double)(end.tv_sec - begin.tv_sec) * 1000.0;
	elapsed += (double)(end.tv_usec - begin.tv_usec) / 1000.0;
	printf(" ------\t elapsed time %.3f secs for %d bytes\n", elapsed/1000.0, blkcnt*8096 );
	close_port(fd);
	return 0;
}


/**********************************/
/*  Serial Port Functions         */
/**********************************/


PORTTYPE open_port_and_set_baud_or_die(const char *name, long baud)
{
	PORTTYPE fd;
#if defined(MACOSX)
	struct termios tinfo;
	fd = open(name, O_RDWR | O_NONBLOCK);
	if (fd < 0) die("unable to open port %s\n", name);
	if (tcgetattr(fd, &tinfo) < 0) die("unable to get serial parms\n");
	if (cfsetspeed(&tinfo, baud) < 0) die("error in cfsetspeed\n");
	tinfo.c_cflag |= CLOCAL;
	if (tcsetattr(fd, TCSANOW, &tinfo) < 0) die("unable to set baud rate\n");
	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) & ~O_NONBLOCK);
#elif defined(LINUX)
	struct termios tinfo;
	struct serial_struct kernel_serial_settings;
	int r;
	fd = open(name, O_RDWR);
	if (fd < 0) die("unable to open port %s\n", name);
	if (tcgetattr(fd, &tinfo) < 0) die("unable to get serial parms\n");
	if (cfsetspeed(&tinfo, baud) < 0) die("error in cfsetspeed\n");
	if (tcsetattr(fd, TCSANOW, &tinfo) < 0) die("unable to set baud rate\n");
	r = ioctl(fd, TIOCGSERIAL, &kernel_serial_settings);
	if (r >= 0) {
		kernel_serial_settings.flags |= ASYNC_LOW_LATENCY;
		r = ioctl(fd, TIOCSSERIAL, &kernel_serial_settings);
		if (r >= 0) printf("set linux low latency mode\n");
	}
#elif defined(WINDOWS)
	COMMCONFIG cfg;
	COMMTIMEOUTS timeout;
	DWORD n;
	char portname[256];
	int num;
	if (sscanf(name, "COM%d", &num) == 1) {
		sprintf(portname, "\\\\.\\COM%d", num); // Microsoft KB115831
	} else {
		strncpy(portname, name, sizeof(portname)-1);
		portname[n-1] = 0;
	}
	fd = CreateFile(portname, GENERIC_READ | GENERIC_WRITE,
		0, 0, OPEN_EXISTING, 0, NULL);
	if (fd == INVALID_HANDLE_VALUE) die("unable to open port %s\n", name);
	GetCommConfig(fd, &cfg, &n);
	//cfg.dcb.BaudRate = baud;
	cfg.dcb.BaudRate = 115200;
	cfg.dcb.fBinary = TRUE;
	cfg.dcb.fParity = FALSE;
	cfg.dcb.fOutxCtsFlow = FALSE;
	cfg.dcb.fOutxDsrFlow = FALSE;
	cfg.dcb.fOutX = FALSE;
	cfg.dcb.fInX = FALSE;
	cfg.dcb.fErrorChar = FALSE;
	cfg.dcb.fNull = FALSE;
	cfg.dcb.fRtsControl = RTS_CONTROL_ENABLE;
	cfg.dcb.fAbortOnError = FALSE;
	cfg.dcb.ByteSize = 8;
	cfg.dcb.Parity = NOPARITY;
	cfg.dcb.StopBits = ONESTOPBIT;
	cfg.dcb.fDtrControl = DTR_CONTROL_ENABLE;
	SetCommConfig(fd, &cfg, n);
	GetCommTimeouts(fd, &timeout);
	timeout.ReadIntervalTimeout = 0;
	timeout.ReadTotalTimeoutMultiplier = 0;
	timeout.ReadTotalTimeoutConstant = 1000;
	timeout.WriteTotalTimeoutConstant = 0;
	timeout.WriteTotalTimeoutMultiplier = 0;
	SetCommTimeouts(fd, &timeout);
#endif
	return fd;

}



void close_port(PORTTYPE port)
{
#if defined(MACOSX) || defined(LINUX)
	close(port);
#elif defined(WINDOWS)
	CloseHandle(port);
#endif
}


/**********************************/
/*  Misc. Functions               */
/**********************************/

void delay(double sec)
{
#if defined(MACOSX) || defined(LINUX)
	usleep(sec * 1000000);
#elif defined(WINDOWS)
	Sleep(sec * 1000);
#endif
}


void die(const char *format, ...)
{
	va_list args;
	va_start(args, format);
	vfprintf(stderr, format, args);
	exit(1);
}

#if defined(WINDOWS)
static const unsigned __int64 epoch = ((unsigned __int64) 116444736000000000ULL);
int
gettimeofday_(struct timeval * tp, struct timezone * tzp)
{
    FILETIME    file_time;
    SYSTEMTIME  system_time;
    ULARGE_INTEGER ularge;

    GetSystemTime(&system_time);
    SystemTimeToFileTime(&system_time, &file_time);
    ularge.LowPart = file_time.dwLowDateTime;
    ularge.HighPart = file_time.dwHighDateTime;

    tp->tv_sec = (long) ((ularge.QuadPart - epoch) / 10000000L);
    tp->tv_usec = (long) (system_time.wMilliseconds * 1000);

    return 0;
}
#endif
 
I see you're using Windows GetSystemTime() as a replacement for Unix gettimeofday(). Not good. Use GetSystemTimePreciseAsFileTime(). Though for this sort of use, redesigning the code to use QueryPerformanceCounter() is probably the most correct way, and that would work on Windows 7 and earlier.

Still, I don't understand how the results could be so wrong. If this is still unresolved *much* later after the USB code has been pretty well optimized, remind me and maybe I'll look into it. But for now, trying to focus on debugging Teensy rather than debugging benchmarks.
 
Last edited:
Status
Not open for further replies.
Back
Top