Help understanding odd USB-Serial test results (Teensy 4.0)

Status
Not open for further replies.

sleeper

Member
I'm working on a project which involves a signal chain through multiple Teensy4's on a usb hub, so I started by testing to see if I could meet my bandwidth and latency requirements.

I wrote a usb-serial echo sketch for the teensy, deployed it to 4 teensies on a usb hub, and then wrote a low-level mac program to send N bytes of data to each serial port and also receive N bytes of last-cycle's data in a round-robin fashion on a precise 1ms interval.

I then went to tune N-bytes per cycle to find the limits. I was finding that the buffer was actually being corrupted in transit before any byte shortages/underflows. It was also a very intermittent corruption, sometimes not showing up until the 900th cycle, often much sooner however. The corruption was of the form of 64-byte chunks being rearranged or copied. I sent a buffer of 32 1's followed by 32 2's etc and what I'd get back would be made of the same values but in the wrong order, or with duplicated chunks.

So with buffer "N" up to 384 there we no such complications, it just worked. I tried increasing NUM_USB_BUFFERS to 24 (24*64 = 1536 bytes) but no improvement. I also switched form Serial.read to usb_serial_read as it looks like a more efficient call.

I went through many iterations of the serial-echo sketch before getting to this simple change which makes the problem go away:
To get stability with sizes up to 960, I have to insert this "if bread < N" into the code.
I was testing multiples of 192 so 960 was my upper limit before seeing underflows at the mac serial read.
That bandwidth is actually adequate for my needs, but it bugs me not understanding what's going on.

Code:
#include "usb_serial.h"

const int len = 4096;
int bread = 0, bwrite = 0;
char buff[len];

void setup()
{   Serial.begin(9600); // fake baud
}
void loop()
{   bread += usb_serial_read(buff+bread, len-bread);
    if(bread < 960) { return; }
    while(bwrite < bread)
    {   bwrite += usb_serial_write(buff+bwrite, bread-bwrite);
    }
    bwrite = bread = 0;
}
 
On Teensy 4.0 the 480 Mbps USB-Serial will be using 512B transfers AFAIK.

Not sure how that relates except the edit above for 24*64 wouldn't be right?

The T_4 USB stack has been updated and improved - but perhaps not yet perfected. It has been some weeks and a couple of TDuino releases since changes - before that I recall something about large transfers.

Not sure if those thread notes might be found for reference or if they still apply - it is also possible the high throughput is messing up on the MAC end?

If a sample can be made that exhibits the problem - especially on MAC I'm sure Paul would happily put it on his list when he can get back to that.

Is this done with TeensyDuino 1.51b1? No recent USB changes as noted - but as of now that is what the test and change would be against for a fix.
 
This is teensyduino 1.50, downloaded fresh 2 days ago.

Seems very unlikely (to me) to be an issue on the mac end as I'm using posix read() and write() in a very trivial manner and it's reporting sending and receiving the correct number of bytes.

The chunk-wise re-arranging/copying looks like some form of ring-buffer corruption with overflow or underflow.

One speculation I had was the while loop to synchronously write back was taking so long the internal input fifo was overflowing, and my if condition causes it to wait for a lull in input (in between full blocks sent by mac) to then send back.

I was looking at the 480 high speed mode, which seems to automatically enable in usb.c based on some runtime information:

Code:
if (status & USB_USBSTS_PCI) {
	if (USB1_PORTSC1 & USB_PORTSC1_HSP) {
		//printf("port at 480 Mbit\n");
		usb_high_speed = 1;
	} else {
		//printf("port at 12 Mbit\n");
		usb_high_speed = 0;
	}
}

Based on my data starting to underflow at 1100 bytes / ms, that feels a lot more like 12Mbit serial mode was active.
 
MAC should be fine if net throughput is under 15 MB/sec and running high speed.

Surprising if not running high_speed you can check that value in sketch it seems with :: extern "C" volatile uint8_t usb_high_speed; // non-zero if running at 480 Mbit/sec speed
Code:
extern "C" volatile uint8_t usb_high_speed; // non-zero if running at 480 Mbit/sec speed
setup() {
        while (!Serial);
	if (0!=usb_high_speed) {
		Serial.print("port at 480 Mbit\n");
	} else {
		Serial.print("port at 12 Mbit\n");
	}
// …
}

Paul works on MAC so your source there should work for him.

TD_1.50 is not old - just that 1.51 has to come out for updated IDE 1.8.12.
 
but it bugs me not understanding what's going on.

Me too.

Any chance you can give me enough info & code to reproduce this here. I have 5 macs here that I use for testing and developing Teensy's support for MacOS. They run MacOS versions 10.15, 10.14, 10.13, 10.12 and 10.7. I do not have any way to test on old MacOS 10.8 to 10.11.

There's probably no point using 1.51-beta1 if you're already on 1.50. The only significant change is support for Arduino 1.8.12.
 
But.. are you sure you are using POSIX read() and write() correctly?

A very common error I see is using if (read(ttyfd, buffer, bytes) < 0) error(); , assuming that read() will always read the specified amount of bytes or fail; it does not. Character devices and sockets can return a short count, and often do. If you too made this error, you'd see stale content at the end of your buffer from time to time.

Also, I'd put the ttys into nonblocking mode (via fcntl()) and use select() to see which ones are readable and/or writable; and noting that an attempt that fails with -1 with errno matching EAGAIN, EWOULDBLOCK, or possibly EINTR (if an interrupt was delivered to a handler using this thread that was not installed with the SA_RESTART flag) is not an error; it just meant that the kernel thought a read/write should succeed, but was (temporarily!) incorrect; just ignore these as if select() had not marked them readable/writable. I'd then attempt a read/write to each descriptors that were returned as readable/writable (separately, i.e. allowing a read and a write to the same descriptor in the same pass).

This said, while I do have two Teensy 4.0's, I haven't tested them yet in this manner. So, I am not claiming your POSIX C is at fault. I am only saying that I have seen way too many programmers make the above mistake about POSIX low-level I/O, and only want to make sure this is not the case here, as it could explain the behaviour you are seeing.
 
Modified defrag's sketch a bit, and it seems it is in fact going into 480 mode, so thats good I guess.

A little more about my setup: MBP 16 OSX 10.15
Tried with a usb-c dock-style usb adapter to just one T4.0, tried with a mini usb adapter, and tried with the mini adapter to a 4-port usb3 hub.
Don't have adapter-less option, but all my adapter behave the same.

The code compiles to a binary you call like:

Code:
ChainTest 6 /dev/cu.usbmodemWHATEVER /dev/cu.usbmodemWHATEVER2

The buffer size is 6 * 192 = 1152 in that example, which fails the test cases in the mac code.
If you pass it a single serial port, it runs the test with just one teensy, which I'm still seeing the same bandwidth cap somehow.

Read/Write drop outs mean the posix IO did not copy all bytes, so when that is non-0 the buffer error is expected sympathetic error.
The weird cases are when there are no read/write drops logged but the buffer is still corrupted.

So I'll just dump all the code on ya, it's fairly readable.

Here's a few hundred lines of code...

Code:
#pragma once

#include <stddef.h>
#include <string>

class SerialPort
{
public:
    // leaving baud at 0 leaves the port's baud rate unchanged
    SerialPort(char const* path, size_t baud = 0);

    ~SerialPort();

    SerialPort(SerialPort const& o) = delete;
    SerialPort & operator=(SerialPort const& o) = delete;

    SerialPort(SerialPort && o);
    SerialPort & operator=(SerialPort && o);

    int write(char const* data, size_t length);

    int read(char * data, size_t length);

    bool isOpen();
    void close();

private:
    static constexpr int INVALID = -1;
    
    void printErr(char const* prefix);

    int _fd = INVALID;
    std::string _path = "";
};

Code:
#include "Serial.h"

#include <string>
#include <iostream>
#include <map>
#include <stdexcept>

#include <fcntl.h> //open()
#include <unistd.h> //write(), read(), close()
#include <errno.h> //errno
#include <termios.h>

static std::map<size_t, speed_t> const& baudMap();
static speed_t getBaudId(size_t baud);

SerialPort::SerialPort(char const* path, size_t baud)
:   _path(path)
{
    _fd = ::open(path, (O_RDWR | O_NOCTTY | O_NDELAY));
    if(_fd < 0)
    {   std::cerr << "Unable to open serial port: " << path << "\n";
        return;
    }

    struct termios options;
    tcgetattr(_fd, &options);

    if(baud != 0)
    {   speed_t baudId = getBaudId(baud);

        if(baudId == B0)
        {   std::cerr << "Unsupported baud rate: " << baud << "\n";
            std::cerr << "Supported rates: [";
            std::string sep = "";
            for(auto const& p : baudMap())
            {   std::cerr << sep << p.second;
                sep = ", ";
            }
            std::cerr << "]\n";
            close();
            return;
        }

        cfsetispeed(&options, baudId);
        cfsetospeed(&options, baudId);
    }

    // Configure other settings
    // from: https://github.com/Marzac/rs232/blob/master/rs232-linux.c
    options.c_iflag &= ~(INLCR | ICRNL);
    options.c_iflag |= IGNPAR | IGNBRK;
    options.c_oflag &= ~(OPOST | ONLCR | OCRNL);
    options.c_cflag &= ~(PARENB | PARODD | CSTOPB | CSIZE | CRTSCTS);
    options.c_cflag |= CLOCAL | CREAD | CS8;
    options.c_lflag &= ~(ICANON | ISIG | ECHO);
    options.c_cc[VTIME] = 1;
    options.c_cc[VMIN]  = 0;
    
    if(tcsetattr(_fd, TCSANOW, &options) < 0)
    {   std::cerr << "Failed to update port settings.\n";
        close();
        return;
    }
}

SerialPort::~SerialPort()
{   close();
}

SerialPort::SerialPort(SerialPort && o)
{   *this = std::move(o);
}

SerialPort & SerialPort::operator=(SerialPort && o)
{   _fd = o._fd;
    _path = std::move(o._path);
    o._fd = INVALID;
    return *this;
}

int SerialPort::write(char const* data, size_t length)
{   if(!isOpen()) { return -1; }
    int ret = ::write(_fd, data, length);
    if(ret < 0) { printErr("SerialPort::write()"); }
    return ret;
}

int SerialPort::read(char * data, size_t length)
{   if(!isOpen()) { return -1; }
    int ret = ::read(_fd, data, length);
    if(ret < 0) { printErr("SerialPort::read()"); }
    return ret;
}

void SerialPort::close()
{   if(isOpen()) { ::close(_fd); }
    _fd = INVALID;
}

bool SerialPort::isOpen()
{   return _fd != INVALID;
}

void SerialPort::printErr(char const* prefix)
{   std::cerr << prefix
        << " fd: " << _fd 
        << " path: " << _path
        << " error: " << strerror(errno) << "\n";
}




static speed_t getBaudId(size_t baud)
{   auto iter = baudMap().find(baud);
    if(iter == baudMap().end()) { return B0; }
    return iter->second;
}

static std::map<size_t, speed_t> const& baudMap()
{   static std::map<size_t, speed_t> map = {
        {9600, B9600},
        {19200, B19200},
        {38400, B38400},
        {57600, B57600},
        {115200, B115200},
        {230400, B230400}
    };
    return map;
}

Code:
#include "Serial.h"

#include <string>
#include <iostream>
#include <vector>
#include <chrono>

#include <unistd.h>

using Vec = std::vector<char>;

// pass it number-of-streams and then a list of serial port paths
// payload size is no-streams * 192
int main(int argc, char ** argv)
{
    std::vector<SerialPort> ports;
    for(int i=2 ; i<argc ; i++)
    {   ports.emplace_back(argv[i]);
    }

    size_t streams = std::stoi(argv[1]);
    size_t sampleRate = 48000;
    size_t sampleBytes = 4;
    size_t frameRate = 1000;
    size_t bytes = sampleRate * sampleBytes * streams / frameRate;
    size_t frames = 2000;
    auto frameTime = std::chrono::microseconds(1000000 / frameRate);

    std::cout << "Sending " << bytes << " bytes per frame at " 
        << frameRate << " frames per second\n";

    Vec dump (bytes);

    Vec buff (bytes);
    for(size_t i=0 ; i<bytes ; i++)
    {   buff[i] = char(i/32);
    }
    Vec buff2 = buff;

    size_t readDrops = 0;
    size_t writeDrops = 0;
    size_t timeDrops = 0;

    auto target = std::chrono::high_resolution_clock::now();

    for(size_t f=0 ; f<frames ; f++)
    {
        for(SerialPort & p : ports)
        {
            int writeStat = p.write(buff.data(), bytes);
            writeDrops += (writeStat < (int)bytes);

            if(f == 0)
            {   while(p.read(dump.data(), bytes) > 0)
                { /* just drain on first frame */ }
                continue;
            }
            // else if(f == 1)
            // {   // create an extra frame of buffering/delay in the read path
            //     continue;
            // }

            int readStat = p.read(buff.data(), bytes);
            readDrops += (readStat < (int)bytes);

            if(buff != buff2)
            {   break;
            }
        }

        if(buff != buff2)
        {   std::cout << "buffer broke at frame: " << f << "\n";
            break;
        }

        target += frameTime;
        auto diff = target - std::chrono::high_resolution_clock::now();
        int usec = std::chrono::duration_cast<std::chrono::microseconds>(diff).count();
        if(usec > 0) { usleep(usec); }
        else { timeDrops ++; }
    }

    bool correct = (buff == buff2);

    std::cout << "write drop-outs: " << writeDrops << "\n";
    std::cout << "read drop-outs: " << readDrops << "\n";
    std::cout << "time slips: " << timeDrops << "\n";
    std::cout << "buff correct: " << correct << "\n";

    if(!correct)
    {   for(size_t i=0 ; i<bytes ; i+=32)
        {   std::cout << int(buff[i]) << " ";
        }
        std::cout << "\n";
    }
}
 
Last edited:
Update:

Switched the mac code to use blocking IO and loop over each read/write until all bytes copy, and it seems I'm getting substantially better bandwidth, up to 13*192=2496 bytes in and out per cycle over 4 T4's

Previously was just logging failure if non-blocking IO did not copy all bytes, and there's probably a mac buffer size like 1024 that was capping me.

But I'm still running into the mysterious case where the fd IO is reporting everything good but my buffer is still getting corrupted in transit.
 
Using
Code:
#include "usb_serial.h"
#define  MSGLEN  512

static unsigned char  msg[MSGLEN];
static uint32_t       msg_read;
static uint32_t       msg_written;

void setup() {
  Serial.begin(9600);
  msg_read = 0;
  msg_written = 0;
}

void loop() {
  int  bytes;
  
  if (Serial) {
    
    if (msg_read < MSGLEN) {
      bytes = usb_serial_read(msg + msg_read, MSGLEN - msg_read);
      if (bytes > 0) {
        msg_read += bytes;
      }
    }

    if (msg_written < msg_read) {
      bytes = usb_serial_write(msg + msg_written, msg_read - msg_written);
      if (bytes > 0) {
        msg_written += bytes;
      }
    }

    if (msg_read >= MSGLEN && msg_written >= MSGLEN) {
      msg_read = 0;
      msg_written = 0;
    }    
  }
}
on two Teensy 4.0's (using just-installed Teensyduino 1.50 on Arduino 1.8.10), in parallel on the same USB host port (hanging off a cheap USB 2.0 hub), using 512-byte messages and my own async benchmarker, I get 2.5 - 2.8 Mbytes/sec bandwidth on each, with a quarter-millisecond median round-trip latency on Linux (HP EliteBook 840 G4, running Linux 5.3.0-28-generic #30~18.04.1-Ubuntu SMP on x86_64), and no errors in packet contents. The host side generates a new pattern for each message using Xorshift64*.

There is obviously host overhead involved, since using 1024-byte messages on the host side (but the above sketch using 512), I get 3.6 - 4.0 Mbytes/sec bandwidth, still with quarter-millisecond median round-trip latency, but occasionally one of the Teensy 4.0's will glitch, returning incorrect content for some of the messages, and faster than normal.

Using even longer messages causes one or both Teensies to miss some incoming data, so the benchmark fails. This is almost consistent at 2048-byte messages, and quite consistent at 4096-byte messages on the host side (but the above sketch still using 512).

Note that "message" is kind of a misnomer here, because both the sketch responds immediately with any data it receives, and the host program does nonblocking I/O, and writes and reads simultaneously; the host program does not send the full message and then read, but reads during writes also:
Code:
#define _POSIX_C_SOURCE  200809L
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/select.h>
#include <termios.h>
#include <inttypes.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#include <errno.h>

/* Maximum number of devices to be benchmarked in parallel. */
#ifndef  MAX_DEVS
#define  MAX_DEVS  32
#endif

/* Microseconds to wait for I/O per round */
#ifndef  WAIT_US
#define  WAIT_US  1000
#endif

/* Signal handler */
static volatile sig_atomic_t  done = 0;

static inline void handle_done(int signum)
{
    done = 1;
}

static int install_done(int signum)
{
    struct sigaction  act;

    memset(&act, 0, sizeof act);
    sigemptyset(&act.sa_mask);
    act.sa_handler = handle_done;
    act.sa_flags = 0;

    return sigaction(signum, &act, NULL);
}

/* Device stages. */
enum {
    STAGE_READY = 0,
    STAGE_IO = 1,
    STAGE_DONE = 3,
};

/* Device data. */
static int              dev_count = 0;
static const char      *dev_path[MAX_DEVS];
static int              dev_fd[MAX_DEVS];
static struct termios   dev_termios[MAX_DEVS];

/* Device ping-pong state */
static int              dev_stage[MAX_DEVS];
static int              dev_errs[MAX_DEVS];    /* Errors */

/* Input buffers */
static uint64_t        *dev_inbuf[MAX_DEVS];
static size_t           dev_inlen[MAX_DEVS];   /* Bytes */

/* Output buffers */
static uint64_t         *dev_outbuf[MAX_DEVS];
static size_t            dev_outlen[MAX_DEVS];  /* Bytes */

/* Message roundtrip times per device */
static size_t            dev_rtts[MAX_DEVS];
static uint64_t         *dev_rtt[MAX_DEVS];
static uint64_t          dev_ns[MAX_DEVS];      /* Total duration */

/* Helper: Open device. */
static int dev_open(const char *path, struct termios *term)
{
    struct termios  oldterm;
    int             fd, saved_errno;

    /* Unnecessary sanity check... */
    if (!path || !*path) {
        errno = ENOENT;
        return -1;
    }

    /* Open device. */
    fd = open(path, O_RDWR | O_NONBLOCK);
    if (fd == -1)
        return -1;

    do {
        if (tcgetattr(fd, &oldterm) == -1)
            break;

        /* Save original terminal settings */
        if (term)
            memcpy(term, &oldterm, sizeof oldterm);

        oldterm.c_iflag = IGNBRK | IGNPAR;
        oldterm.c_oflag = 0;
        oldterm.c_cflag = CS8 | CREAD;
        oldterm.c_lflag = 0;
        oldterm.c_cc[VMIN] = 1;
        oldterm.c_cc[VTIME] = 0;

        cfsetispeed(&oldterm, B230400);
        cfsetospeed(&oldterm, B230400);

        /* Technically, this is not guaranteed to have
           set all the flags we asked; close enough for us. */
        if (tcsetattr(fd, TCSAFLUSH, &oldterm) == -1)
            break;

        /* Success. */
        return fd;

    } while (0);
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return -1;
}

/* Helper: Wall clock "nanosecond" timer. */
static inline uint64_t  nanoseconds(void)
{
    struct timespec  now;
    if (clock_gettime(CLOCK_REALTIME, &now) == 0)
        return ((uint64_t)now.tv_sec * UINT64_C(1000000000)) + (uint64_t)now.tv_nsec;
    else
        return 0;
}

/* Helper: Compare elapsed nanoseconds. */
static int compare_nanoseconds(const void *p1, const void *p2)
{
    const uint64_t  ns1 = *(const uint64_t *)p1;
    const uint64_t  ns2 = *(const uint64_t *)p2;

    return (ns1 < ns2) ? -1 :
           (ns1 > ns2) ? +1 :
                          0 ;
}

/* Xorshift64* PRNG for filling in message payloads with pseudo-random data. */
static uint64_t  prng_state = 1;

static uint64_t  prng_u64(void)
{
    uint64_t  state = prng_state;
    state ^= state >> 12;
    state ^= state << 25;
    state ^= state >> 27;
    prng_state = state;
    return state * UINT64_C(2685821657736338717);
}

/* Helper: Fill buffer with pseudorandom data.
   Assumes at least an uint64_t of padding at end. */
static void  prng_fill(uint64_t *msg, size_t bytes)
{
    uint64_t *const end = (uint64_t *)((unsigned char *)msg + bytes);
    uint64_t       *u64 = (uint64_t *)msg;

    /* end may not be aligned, but that's okay; we never dereference it. */
    while (u64 < end)
        *(u64++) = prng_u64();
}

int main(int argc, char *argv[])
{
    size_t  msglen, msgcount, nsent, nreceived, nidle;
    time_t  updated;
    int     arg, i;
    char    dummy;

    if (argc < 4) {

        /* In case executed from another program without argv[0]. */
        if (argc < 1 || !argv[0])
            return EXIT_FAILURE;

        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s MSGLEN COUNT DEVICE [ DEVICE ... DEVICE ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "Where:\n");
        fprintf(stderr, "    MSGLEN  is the length of each message in bytes,\n");
        fprintf(stderr, "    COUNT   is the number of messages per device, and\n");
        fprintf(stderr, "    DEVICE  is one or more serial devices (ttys) to use.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "This program sends a message to each DEVICE, measuring the\n");
        fprintf(stderr, "wall-clock roundtrip time, COUNT times, also verifying\n");
        fprintf(stderr, "the integrity of the message payload.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "Up to %d DEVICEs can be benchmarked in parallel.\n", MAX_DEVS);
        fprintf(stderr, "\n");
        fprintf(stderr, "The program will run until interrupted with INT (Ctrl+C), TERM, or HUP signal.\n");
        fprintf(stderr, "\n");

        return EXIT_SUCCESS;
    }

    /* Parse message length. */
    if (sscanf(argv[1], " %zu %c", &msglen, &dummy) != 1 || msglen < 1) {
        fprintf(stderr, "%s: Invalid message length.\n", argv[1]);
        return EXIT_FAILURE;
    }

    /* Parse message count. */
    if (sscanf(argv[2], " %zu %c", &msgcount, &dummy) != 1 || msgcount < 1) {
        fprintf(stderr, "%s: Invalid number of messages.\n", argv[2]);
        return EXIT_FAILURE;
    }

    /* Install signal handlers, for aborting the benchmark. */
    if (install_done(SIGINT) == -1 ||
        install_done(SIGHUP) == -1 ||
        install_done(SIGTERM) == -1) {
        fprintf(stderr, "Cannot use signal handlers: %s.\n", strerror(errno));
        return EXIT_FAILURE;
    }

    /* Open each device. */
    for (arg = 3; arg < argc; arg++) {

        /* Skip empty device names. */
        if (!argv[arg] || !*argv[arg])
            continue;

        /* Too many devices? */
        if (dev_count >= MAX_DEVS) {
            fprintf(stderr, "Too many devices!\n");
            return EXIT_FAILURE;
        }

        /* Open device. */
        dev_path[dev_count] = argv[arg];
        dev_errs[dev_count] = 0;
        dev_fd[dev_count] = dev_open(argv[arg], &dev_termios[dev_count]);
        if (dev_fd[dev_count] == -1) {
            fprintf(stderr, "%s: %s.\n", argv[arg], strerror(errno));
            return EXIT_FAILURE;
        }

        if (dev_fd[dev_count] >= FD_SETSIZE) {
            fprintf(stderr, "%s: Too many devices for select() to support.\n", argv[arg]);
            return EXIT_FAILURE;
        }

        /* Pad the buffers so we can fill them with full uint64_t's. */
        dev_inbuf[dev_count] = malloc(msglen + sizeof (uint64_t));
        dev_outbuf[dev_count] = malloc(msglen + sizeof (uint64_t));
        if (!dev_inbuf[dev_count] || !dev_outbuf[dev_count]) {
            fprintf(stderr, "%s: Not enough memory for message buffers.\n", argv[arg]);
            return EXIT_FAILURE;
        }
        dev_inlen[dev_count] = 0;
        dev_outlen[dev_count] = 0;

        /* Allocate roundtrip time measurements. */
        dev_rtt[dev_count] = malloc((msgcount + 1) * sizeof dev_rtt[dev_count][0]);
        if (!dev_rtt[dev_count]) {
            fprintf(stderr, "%s: Not enough memory for roundtrip timing measurements.\n", argv[arg]);
            return EXIT_FAILURE;
        }
        dev_rtts[dev_count] = 0;

        dev_count++;
    }

    nidle = 0;
    nsent = 0;
    nreceived = 0;
    updated = (time_t)0;

    /* Benchmarking loop. */
    while (!done) {
        ssize_t  bytes;
        time_t   now;
        int      dones = 0, reads = 0, writes = 0;

        now = time(NULL);
        if (now != updated) {
            fprintf(stdout, "\r%zu sent, %zu received; %zu idle ", nsent, nreceived, nidle);
            fflush(stdout);
            updated = now;
        }

        for (i = 0; i < dev_count; i++) {

            if (dev_stage[i] == STAGE_READY) {
                /* Prepare a new message. */
                prng_fill(dev_outbuf[i], msglen);

                /* Record timestamp. */
                dev_rtt[i][dev_rtts[i]] = nanoseconds();

                /* If the very first message, record start time also. */
                if (dev_rtts[i] == 0)
                    dev_ns[i] = dev_rtt[i][0];

                /* Be ready to send this message immediately. */
                dev_stage[i] = STAGE_IO;
                dev_outlen[i] = 0;

                /* No data received yet from this device. */
                dev_inlen[i] = 0;
            }

            if (dev_stage[i] == STAGE_IO && dev_outlen[i] < msglen) {
                /* Try sending more data. */
                bytes = write(dev_fd[i], (char *)dev_outbuf[i] + dev_outlen[i], msglen - dev_outlen[i]);
                if (bytes > 0) {
                    dev_outlen[i] += bytes;
                    nsent += bytes;
                    writes++;
                } else
                if (bytes == -1 && (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)) {
                    fprintf(stderr, "Warning: %s: %s.\n", dev_path[i], strerror(errno));
                }
            }

            if (dev_stage[i] == STAGE_IO) {
                /* Try receiving more data. */
                if (dev_inlen[i] < msglen) {
                    bytes = read(dev_fd[i], (char *)dev_inbuf[i] + dev_inlen[i], msglen - dev_inlen[i]);
                    if (bytes > 0) {
                        dev_inlen[i] += bytes;
                        nreceived += bytes;
                        reads++;
                    }
                }
                /* All data received? */
                if (dev_inlen[i] >= msglen) {
                    /* Record roundtrip duration, */
                    dev_rtt[i][dev_rtts[i]] = nanoseconds() - dev_rtt[i][dev_rtts[i]];
 
                    /* Compare input and output buffers. */
                    if (memcmp(dev_inbuf[i], dev_outbuf[i], msglen))
                        dev_errs[i]++;

                    dev_rtts[i]++;

                    if (dev_rtts[i] < msgcount) {
                        /* Rinse and repeat. */
                        dev_stage[i] = STAGE_READY;
                    } else {
                        /* Record full exchange duration. */
                        dev_ns[i] = nanoseconds() - dev_ns[i];
                        /* This device is done; close it. */
                        dev_stage[i] = STAGE_DONE;
                        close(dev_fd[i]);
                        dev_fd[i] = -1;
                    }
                }
            }

            /* Count number of devices that have completed the benchmark. */
            if (dev_stage[i] == STAGE_DONE)
                dones++;
        }

        /* All done? */
        if (dones >= dev_count)
            break;

        /* If we did any I/O, there is no need to sleep. */
        if (reads > 0 || writes > 0)
            continue;

        nidle++;

#if defined(WAIT_US) && (WAIT_US-0 > 0)
        /* Sleep until I/O ought to be possible. */
        {
            struct timeval  waittime;
            fd_set  readfds, writefds, exceptfds;
            int     maxfd = -1;

            FD_ZERO(&readfds);
            FD_ZERO(&writefds);
            FD_ZERO(&exceptfds);

            waittime.tv_sec = (WAIT_US / 1000000);
            waittime.tv_usec = (WAIT_US % 1000000);

            for (i = 0; i < dev_count; i++) {
                if (dev_stage[i] == STAGE_IO) {
                    if (dev_inlen[i] < msglen) {
                        const int  fd = dev_fd[i];
                        FD_SET(fd, &readfds);
                        if (maxfd < fd)
                            maxfd = fd;
                    }
                    if (dev_outlen[i] < msglen) {
                        const int  fd = dev_fd[i];
                        FD_SET(fd, &writefds);
                        if (maxfd < fd)
                            maxfd = fd;
                    }
                }
            }

            if (maxfd > 0) {
                /* We don't actually care which devices are readable/writable,
                   since we do tentative reads and writes on each device. */
                (void)select(maxfd + 1, &readfds, &writefds, &exceptfds, &waittime);
            }
        }
#endif
    }

    /* Make sure all devices are closed. */
    for (i = 0; i < dev_count; i++) {
        if (dev_fd[i] != -1) {
            close(dev_fd[i]);
            dev_fd[i] = -1;
        }
    }

    /* Compile statistics. */
    for (i = 0; i < dev_count; i++) {
        /* Copy first roundtrip time to end of buffer. */
        dev_rtt[i][msgcount] = dev_rtt[i][0];
        /* Sort existing roundtrip times in ascending order. */
        if (dev_rtts[i] > 1) {
            qsort(dev_rtt[i], dev_rtts[i], sizeof dev_rtt[i][0], compare_nanoseconds);
        }
    }

    fprintf(stdout, "\r%zu sent, %zu received; %zu idle.\n", nsent, nreceived, nidle);

    /* Print statistics. */
    printf("#\tErrors\tBandwidth [bytes/sec]\tFirst [ms]\tRTTmin [ms]\tRTTmedian [ms]\tRTTmax [ms]\n");
    for (i = 0; i < dev_count; i++) {
        printf("%s\t%d\t", dev_path[i], dev_errs[i]);

        /* Total bandwidth known? */
        if (dev_stage[i] == STAGE_DONE && dev_ns[i] > 0) {
            printf("%.3f", (double)msgcount * (double)msglen * 1000000000.0 / (double)dev_ns[i]);
        } else {
            printf("N/A");
        }

        /* Roundtrips measured? */
        if (dev_rtts[i] > 0) {
            printf("\t%.3f\t%.3f\t%.3f\t%.3f\n",
                    (double)dev_rtt[i][msgcount] / 1000000.0,       /* First roundtrip in ms */
                    (double)dev_rtt[i][0] / 1000000.0,              /* Fastest roundtrip in ms */
                    (double)dev_rtt[i][dev_rtts[i]/2] / 1000000.0,  /* Median */
                    (double)dev_rtt[i][dev_rtts[i]-1] / 1000000.0); /* Maximum */
        } else {
            printf("\n");
        }
    }

    return EXIT_SUCCESS;
}
I used gcc-7.4.0 (gcc -Wall -O2 benchmark.c -o bench) to compile the host-side program. Run without arguments to see usage; the test cases I mentioned above are via ./bench 512 10000 /dev/ttyACM* et cetera.

My opinion is that the USB serial stack on Teensy 4.0 misses some incoming packets (or parts of the incoming stream), and/or fails to send some outgoing packets (or parts of the outgoing stream).
I'd need to wire a small display (OLED, maybe) to one or both to see the received/sent byte counts on the Teensy 4.0 end to verify, but I'm lazy, and because I don't actually need this to work right now myself, I'm hoping @PaulStoffregen will have time to look through/debug the USB serial stack on the Teensy 4.0 instead.
 
Status
Not open for further replies.
Back
Top