comms over USB serial

saaji

I'm trying to send packets between a Teensy 4.1 and a Linux system. If I understand how USB serial works on the Teensy 4, Serial.write should use high-speed USB directly and allow (in theory) a 480 Mbps data transfer rate. On the Linux side, I'm interfacing with the Teensy via a virtual COM port (ttyACM0) and measuring write speeds using code based on the benchmark at this link (https://www.pjrc.com/teensy/benchmark_usb_serial_receive.html). Timing writes to the VCP gives me results of around 4 Mbit/s, which is not remotely close to 480 Mbps. I know 480 Mbps is not feasible, but I would expect at least several tens of Mbit/s for writing. I am pretty stuck at the moment. The only lead I have is that the maximum baud rate defined in asm/termbits.h is 4 Mbps. But communication over the VCP should use native USB speeds, right? (The Linux system and the Teensy are wired USB to USB directly.) Any input would be much appreciated.
C:
// excerpt of benchmark code based on the teensy benchmark

int port = open("/dev/ttyACM0", O_RDWR | O_NOCTTY);
init_uart(port); // raw mode
config_uart_bitrate(port, 480000000); // uses termios2 to set bitrate, but shouldn't be necessary?

int bytes_sent = 0;
int payload = atoi(argv[1]); // desired payload in bytes, e.g. 1000000

int bytes[100000];
int size = 30000;

int loop_freq = 1000;
int cycle_time_micros = 1000000 / loop_freq;

struct timeval start;
gettimeofday(&start, NULL);

while (bytes_sent < payload) {
    struct timeval loop_start;
    gettimeofday(&loop_start, NULL);
    
    int num_written = write(port, bytes, size);
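    if (num_written < 0) {
        // write failed, log error and stop instead of spinning forever
        break;
    } else {
        // a short write is not an error; count what was actually accepted
        bytes_sent += num_written;
    }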
    // stall until reach cycle time
    while (elapsed_micros(loop_start) < cycle_time_micros);
}

int time = elapsed_micros(start);
double time_s = (double)time / 1000000.0; // keep the fractional seconds

// print bitrate in bits per second, typically get results around 4mil
printf("rate: %f\n", (8.0 * bytes_sent) / time_s);
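
One thing to watch in a loop like the one above: write() on a tty may accept fewer bytes than requested, and that is not an error. Here is a minimal sketch of a helper that keeps writing until the whole buffer has been accepted, assuming a blocking descriptor. The name write_all is hypothetical and not part of the PJRC benchmark.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes to fd, continuing after short writes and
 * retrying when a signal interrupts the call (EINTR).
 * Returns 0 on success, -1 with errno set on failure. */
static int write_all(int fd, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n > 0) {
            /* Short or full write: advance past what was accepted. */
            p += n;
            len -= (size_t)n;
        } else if (n == -1 && errno == EINTR) {
            /* Interrupted before anything was written; try again. */
            continue;
        } else {
            /* Real error, or an unexpected zero-length write. */
            if (n != -1)
                errno = EIO;
            return -1;
        }
    }
    return 0;
}
```

With a helper like this, the timing loop only has to add len to bytes_sent when the call returns 0.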
 
The problem is on the linux side. Yes, the Teensy 4.1 processor is capable of accessing the bus at full 480mbps speed. It won't ever really be able to saturate 480mbps for various reasons but it could do better than 4mbps by far. The problem is that ttyACM goes through the linux TTY system which was never meant for such fast interfaces. So, it tends to be a bottleneck at very high serial speeds. The solution, unfortunately, is likely to create a raw USB connection between the two so that the TTY system doesn't bottleneck you.
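
To put a rough number on "won't ever saturate 480mbps": the sketch below computes the payload ceiling for bulk transfers on a high-speed link. It assumes the USB 2.0 figures of 512-byte bulk packets, 8000 microframes per second (125 µs each) and at most 13 bulk packets per microframe. The function name is made up for illustration, and real devices and hosts land below this bound.

```c
#include <stdint.h>

/* Upper bound on bulk payload over a USB 2.0 high-speed link, in bytes
 * per second.  The three constants are the assumed USB 2.0 limits. */
static uint64_t usb_hs_bulk_ceiling(void)
{
    const uint64_t microframes_per_second = 8000;   /* one every 125 us */
    const uint64_t packets_per_microframe = 13;     /* bulk packet limit */
    const uint64_t bytes_per_packet       = 512;    /* max bulk packet size */

    /* 8000 * 13 * 512 = 53,248,000 bytes/s, about 426 Mbit/s of payload */
    return microframes_per_second * packets_per_microframe * bytes_per_packet;
}
```

So even a perfect bulk pipe tops out near 53 MB/s of payload, not the 60 MB/s that 480 Mbps would suggest.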
 
From my understanding (and inspecting the USB endpoints via lsusb), CDC/ACM devices use bulk USB transfers under the guise of an emulated COM port (ttyACM) for ease of use. if both devices are wired USB to USB, could I not expect native USB speeds? or am I misunderstanding how ACM works? Would I have to use something like libusb if tty is not actually feasible?
 

I wish I could remember where I saw discussion about this but I can't find it. Anyway, I believe you're correct about how it is implemented on a physical and low level. However, what happens is that your bulk transfers eventually have to go into the linux tty system itself. Then you can use read() write() and such to do I/O on the port as if it were a real serial port. But, this is where the slowness comes in. A system like that was built around serial interfaces that usually were not above 1Mbps. So, really the tty system itself was not meant for streaming 20MB/s through a serial port. If you want to transfer a lot of data in a hurry you are indeed better off doing direct transfers and skipping TTY altogether. This is more complicated but it lets you remove bottlenecks that usually are no big deal. After all, you can only type so fast, you can only read so fast. Serial ports are only so fast. The fact that modern MCUs can do native USB and pretend to be 480Mbps serial ports was not something that the writers of kernel serial port emulation seem to have foreseen.
 
Perhaps the tty device handles one character at a time rather than block transfers. It's designed to interface to a human or a modem...
 
It's not that. The 4 to 5 MB/s limit occurs when you call Serial.write() for each byte to be sent, instead of filling a buffer (of up to Serial.availableForWrite() bytes) and sending the filled buffer using Serial.write(buffer, bytes).

If you use a 32 byte or larger buffer, you can expect to reach 25 MB/s to 28 MB/s (25,000,000 to 28,000,000 bytes per second; corresponds to 200,000,000 to 224,000,000 bits per second of data).

Those numbers are from my XorShift64* test, which generates pseudorandom numbers on the Teensy, sending the 32 high bits of each generated number via USB Serial for as long as the serial connection is open (after receiving the 64-bit seed to use for the sequence). On the Linux side, the data is verified by comparing to the same sequence generated locally from the same seed. I've run this test for hours on end without glitches, and the abovementioned throughput is maintained each second. Thus, this is very much a real world test. I do believe using bulk transfers one can squeeze a bit more bandwidth, but haven't tested exactly how much.

The Linux TTY layer (see man 3 termios) does limit the bandwidth to somewhere above 30 MB/s or so depending on the processor and bus implementation, as one can verify by simply creating a pseudoterminal master-slave pair: it acts like a pipe, except it goes through the same TTY layer, and exhibits similar bandwidth limitations.
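
The pseudoterminal check mentioned above can be sketched roughly as follows. This is only an illustration, not the program used for the numbers in this thread: the function name pty_throughput and the 4096-byte chunk size are arbitrary choices, and it assumes Linux with glibc. A child process writes into the master side while the parent reads from the slave side in raw mode.

```c
#define _GNU_SOURCE   /* posix_openpt(), ptsname(), cfmakeraw() on glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <termios.h>
#include <time.h>
#include <unistd.h>

/* Push 'total' bytes through a pseudoterminal pair (master to slave) and
 * return the observed rate in bytes per second, or -1.0 on any error. */
static double pty_throughput(size_t total)
{
    struct termios   raw;
    struct timespec  t0, t1;
    char             chunk[4096];
    size_t           got = 0;
    int              status = 0;

    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master == -1)
        return -1.0;
    if (grantpt(master) == -1 || unlockpt(master) == -1) {
        close(master);
        return -1.0;
    }
    const char *name = ptsname(master);
    int slave = name ? open(name, O_RDWR | O_NOCTTY) : -1;
    if (slave == -1) {
        close(master);
        return -1.0;
    }

    /* Raw mode: no echo and no line editing in the line discipline. */
    if (tcgetattr(slave, &raw) == -1)
        goto fail;
    cfmakeraw(&raw);
    if (tcsetattr(slave, TCSANOW, &raw) == -1)
        goto fail;

    memset(chunk, 'x', sizeof chunk);
    clock_gettime(CLOCK_MONOTONIC, &t0);

    pid_t child = fork();
    if (child == -1)
        goto fail;
    if (child == 0) {
        /* Child: write everything into the master side, then exit. */
        size_t left = total;
        while (left > 0) {
            size_t  want = left < sizeof chunk ? left : sizeof chunk;
            ssize_t n = write(master, chunk, want);
            if (n <= 0)
                _exit(1);
            left -= (size_t)n;
        }
        _exit(0);
    }

    /* Parent: read from the slave side until all bytes have arrived. */
    while (got < total) {
        ssize_t n = read(slave, chunk, sizeof chunk);
        if (n <= 0)
            break;
        got += (size_t)n;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    waitpid(child, &status, 0);
    close(slave);
    close(master);
    if (got != total)
        return -1.0;

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1000000000.0;
    return secs > 0.0 ? (double)got / secs : -1.0;

fail:
    close(slave);
    close(master);
    return -1.0;
}
```

Calling pty_throughput() with a few hundred megabytes and dividing the result by 1000000 gives a MB/s figure for the TTY layer alone, with no USB involved. Expect it to vary with the machine and kernel.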

If you want, I can dig out my USB Serial test sketch and corresponding Linux C program, and post them here (under CC0-1.0).
 
Here is the sketch I use for Teensy USB Serial benchmarking:
C:
// SPDX-License-Identifier: CC0-1.0

// Size of outgoing buffer
constexpr size_t  packet_size = 32;

constexpr size_t  packet_words = packet_size / 4;
uint32_t          packet[packet_words];

// Xorshift64* pseudo-random number generator state
static uint64_t  prng_state;

// Return only the upper 32 bits; this passes BigCrush tests
static inline uint32_t  prng_u32(void)
{
  uint64_t  x = prng_state;
  x ^= x >> 12;
  x ^= x << 25;
  x ^= x >> 27;
  prng_state = x;
  return (x * UINT64_C(2685821657736338717)) >> 32;
}

void setup() {
  // Zero state is invalid, causing only zeroes to be generated.
  prng_state = 0;
}

void loop() {
  if (!Serial) {
    // No serial connection.  Abort anything ongoing.
    prng_state = 0;
  
  } else
  if (prng_state) {
    size_t  n = Serial.availableForWrite() / 4;
#if 0
    while (n-->0) {
      uint32_t  u = prng_u32();
      Serial.write(u & 255);
      u >>= 8;
      Serial.write(u & 255);
      u >>= 8;
      Serial.write(u & 255);
      u >>= 8;
      Serial.write(u & 255);
    }
#else
    if (n > packet_words)
      n = packet_words;
    if (n > 0) {
      for (size_t i = 0; i < n; i++)
        packet[i] = prng_u32();
      Serial.write((char *)packet, n * 4);
    }
#endif
  } else
  if (Serial.available() >= 8) {
    char  buf[8];
    if (Serial.readBytes(buf, 8) == 8) {
      prng_state = ((uint64_t)((unsigned char)buf[0])      )
                 | ((uint64_t)((unsigned char)buf[1]) <<  8)
                 | ((uint64_t)((unsigned char)buf[2]) << 16)
                 | ((uint64_t)((unsigned char)buf[3]) << 24)
                 | ((uint64_t)((unsigned char)buf[4]) << 32)
                 | ((uint64_t)((unsigned char)buf[5]) << 40)
                 | ((uint64_t)((unsigned char)buf[6]) << 48)
                 | ((uint64_t)((unsigned char)buf[7]) << 56);
    }
  }
}
If you replace the #if 0 above with #if 1, the code will use Serial.write(value), and you'll only get 4-5 MB/s. With #if 0, it uses a buffer of up to packet_size bytes, and on my machine consistently transfers 27,500,000 – 29,100,000 bytes per second, depending on the load of my ye olde laptop (Intel Core i5-7200U).

Note that the Xorshift64* is a fast-but-good generator. Computing the sequence of pseudorandom numbers does take a bit of time, so it is quite possible that transferring data from e.g. a DMA buffer via USB serial can do slightly more bytes per second. This is just the simplest "real world case" test that I myself think provides useful data.

Here is the corresponding C program to compile and run on Linux:
C:
// SPDX-License-Identifier: CC0-1.0
// gcc -Wall -O2 this.c -lrt -o this && ./this /dev/ttyACM0
#define  _POSIX_C_SOURCE  200809L
#define  _DEFAULT_SOURCE
#include <stdlib.h>
#include <inttypes.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <termios.h>
#include <signal.h>
#include <time.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>

/* Input buffer size.  Use a power of two, 512 or larger; I'd recommend 4096 - 65536.
*/
#ifndef  BUFFER_SIZE
#define  BUFFER_SIZE  512
#endif

/* Units of MB: 1000000 or 1048576
*/
#ifndef  MEGA
#define  MEGA  1000000
#endif

/* Xorshift64* pseudo-random number generator.  Zero state is invalid.
*/
static uint64_t  prng_state = 0;

/* Return only the upper 32 bits of XorShift64*. This passes BigCrush tests.
*/
static inline uint32_t  prng_u32(void)
{
  uint64_t  x = prng_state;
  x ^= x >> 12;
  x ^= x << 25;
  x ^= x >> 27;
  prng_state = x;
  return (x * UINT64_C(2685821657736338717)) >> 32;
}

/* Generate a 64-bit seed for the pseudo-random number generator.
*/
static uint64_t  prng_seed(void)
{
    struct timespec  now, boot;
    pid_t            pid;
    uint64_t         seed;

    pid = getpid();

    do {
        clock_gettime(CLOCK_REALTIME, &now);
        clock_gettime(CLOCK_BOOTTIME, &boot);

        seed = (uint64_t)pid * UINT64_C(21272683)
             + (uint64_t)now.tv_sec * UINT64_C(113090255381)
             + (uint64_t)now.tv_nsec * UINT64_C(2470424306258081)
             + (uint64_t)boot.tv_sec * UINT64_C(9526198845596401)
             + (uint64_t)boot.tv_nsec * UINT64_C(9454161880954729);
    } while (!seed);

    prng_state = seed;
    return seed;
}

/* When 'done' becomes nonzero, it is time to stop the measurement.
*/
static volatile sig_atomic_t  done = 0;

/* Signal handler for 'done'.
*/
static void handle_done(int signum)
{
    // Silence unused variable warning; generates no code.
    (void)signum;

    done = 1;
}

static int install_done(int signum)
{
    struct sigaction  act;
    memset(&act, 0, sizeof act);
    sigemptyset(&act.sa_mask);
    act.sa_handler = handle_done;
    act.sa_flags = 0;  // Specifically, NO SA_RESTART flag.
    return sigaction(signum, &act, NULL);
}

/* One-second interval timer.  Simply sets 'update' to nonzero.
*/
#ifndef  UPDATE_SIGNAL
#define  UPDATE_SIGNAL  (SIGRTMIN+0)
#endif

static timer_t                update_timer;
static volatile sig_atomic_t  update = 0;

static void handle_update(int signum)
{
    (void)signum;
    update = 1;
}

static int install_update(void)
{
    struct itimerspec spec;
    struct sigevent   ev;
    struct sigaction  act;

    memset(&act, 0, sizeof act);
    sigemptyset(&act.sa_mask);
    act.sa_handler = handle_update;
    act.sa_flags = SA_RESTART;
    if (sigaction(UPDATE_SIGNAL, &act, NULL) == -1)
        return -1;

    ev.sigev_notify = SIGEV_SIGNAL;
    ev.sigev_signo  = UPDATE_SIGNAL;
    ev.sigev_value.sival_ptr = NULL;
    if (timer_create(CLOCK_BOOTTIME, &ev, &update_timer) == -1)
        return -1;

    spec.it_value.tv_sec = 1;       // One second to first update
    spec.it_value.tv_nsec = 0;
    spec.it_interval.tv_sec = 1;    // Repeat at one second intervals
    spec.it_interval.tv_nsec = 0;
    if (timer_settime(update_timer, 0, &spec, NULL) == -1)
        return -1;

    return 0;
}

/* USB serial port device handling.
*/
static struct termios   tty_settings;
static int              tty_descriptor = -1;
static const char      *tty_path = NULL;

static void tty_cleanup(void)
{
    if (tty_descriptor != -1) {
        if (tcsetattr(tty_descriptor, TCSANOW, &tty_settings) == -1)
            fprintf(stderr, "Warning: %s: Cannot reset original termios settings: %s.\n", tty_path, strerror(errno));

        tcflush(tty_descriptor, TCIOFLUSH);

        if (close(tty_descriptor) == -1)
            fprintf(stderr, "Warning: %s: Error closing device: %s.\n", tty_path, strerror(errno));

        tty_descriptor = -1;
    }
}

static int tty_open(const char *path)
{
    struct termios  raw;
    int             fd;

    // NULL or empty path is invalid, and yields "no such file or directory" error.
    if (!path || !*path) {
        errno = ENOENT;
        return -1;
    }

    // Fail if tty is already open.
    if (tty_descriptor != -1) {
        errno = EALREADY;
        return -1;
    }

    // Open the tty device.
    do {
        fd = open(path, O_RDWR | O_NOCTTY | O_CLOEXEC);
    } while (fd == -1 && errno == EINTR);
    if (fd == -1)
        return -1;

    // Set exclusive mode, so that others cannot open the device while we have it open.
    if (ioctl(fd, TIOCEXCL) == -1)
        fprintf(stderr, "Warning: %s: Cannot get exclusive access on tty device: %s.\n", path, strerror(errno));

    // Drop any already pending data.
    tcflush(fd, TCIOFLUSH);

    // Obtain current termios settings.
    if (tcgetattr(fd, &raw) == -1 || tcgetattr(fd, &tty_settings) == -1) {
        fprintf(stderr, "%s: Cannot get termios settings: %s.\n", path, strerror(errno));
        close(fd);
        errno = 0; // Already reported
        return -1;
    }

    // Raw 8-bit mode: no post-processing or special characters, 8-bit data.
    raw.c_iflag &= ~( IGNBRK | BRKINT | PARMRK | INPCK | ISTRIP | INLCR | IGNCR | ICRNL | IXON | IUCLC | IUTF8 );
    raw.c_oflag &= ~( OPOST );
    raw.c_lflag &= ~( ECHO | ECHONL | ICANON | ISIG | IEXTEN );
    raw.c_cflag &= ~( CSIZE | PARENB | CLOCAL );
    raw.c_cflag |= CS8 | CREAD | HUPCL;
    // Blocking reads.
    raw.c_cc[VMIN] = 1;
    raw.c_cc[VTIME] = 0;
    if (tcsetattr(fd, TCSANOW, &raw) == -1) {
        fprintf(stderr, "%s: Cannot set termios settings: %s.\n", path, strerror(errno));
        close(fd);
        errno = 0; // Already reported
        return -1;
    }

    // Drop any already pending data, again.  Just to make sure.
    tcflush(fd, TCIOFLUSH);

    // Everything seems to be in order.  Update state for tty_cleanup(), and return success.
    tty_descriptor = fd;
    tty_path = path;
    return 0;
}

static inline double  seconds_between(const struct timespec after, const struct timespec before)
{
    return (double)(after.tv_sec - before.tv_sec)
         + (double)(after.tv_nsec - before.tv_nsec) / 1000000000.0;
}

int main(int argc, char *argv[])
{
    const size_t     buffer_size = BUFFER_SIZE;
    size_t           buffer_have = 0;
    unsigned char   *buffer_data = NULL;
    struct timespec  started, mark;
    uint64_t         received_before = 0;   // Received till mark
    uint64_t         received = 0;          // Received after mark
    uint64_t         sequence = 0;
    uint64_t         seed;

    if (argc != 2 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        const char *arg0 = (argc > 0 && argv && argv[0] && argv[0][0] != '\0') ? argv[0] : "(this)";
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", arg0);
        fprintf(stderr, "       %s DEVICE\n", arg0);
        fprintf(stderr, "\n");
        fprintf(stderr, "This writes a 64-bit XorShift64* seed to DEVICE, then reads the\n");
        fprintf(stderr, "sequence of the generated numbers, 32 high bits each, verifying\n");
        fprintf(stderr, "and reporting the transfer rate.  Press CTRL+C or send SIGHUP,\n");
        fprintf(stderr, "SIGINT, or SIGTERM signal to exit.\n");
        fprintf(stderr, "\n");
        return (argc == 1 || argc == 2) ? EXIT_SUCCESS : EXIT_FAILURE;
    }

    if (install_done(SIGHUP) ||
        install_done(SIGINT) ||
        install_done(SIGTERM)) {
        fprintf(stderr, "Cannot install signal handlers: %s.\n", strerror(errno));
        return EXIT_FAILURE;
    }

    buffer_data = malloc(buffer_size + 4);
    if (!buffer_data) {
        fprintf(stderr, "Not enough memory available for a %zu-byte input buffer.\n", buffer_size);
        return EXIT_FAILURE;
    }

    seed = prng_seed();

    if (tty_open(argv[1])) {
        if (errno)
            fprintf(stderr, "%s: Cannot open device: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }

    {
        unsigned char  request[8] = {   seed        & 255,
                                       (seed >>  8) & 255,
                                       (seed >> 16) & 255,
                                       (seed >> 24) & 255,
                                       (seed >> 32) & 255,
                                       (seed >> 40) & 255,
                                       (seed >> 48) & 255,
                                       (seed >> 56) & 255 };
        const unsigned char *const q = request + 8;
        const unsigned char       *p = request;
        ssize_t                    n;

        while (p < q) {
            n = write(tty_descriptor, p, (size_t)(q - p));
            if (n > 0) {
                p += n;
            } else
            if (n != -1) {
                fprintf(stderr, "%s: Invalid write (%zd)\n", tty_path, n);
                break;
            } else
            if (errno != EINTR) {
                fprintf(stderr, "%s: Write error: %s.\n", tty_path, strerror(errno));
                break;
            }
        }

        if (p != q) {
            tty_cleanup();
            return EXIT_FAILURE;
        }
    }

    if (install_update()) {
        fprintf(stderr, "Cannot create a periodic update signal: %s.\n", strerror(errno));
        tty_cleanup();
        return EXIT_FAILURE;
    }

    if (clock_gettime(CLOCK_BOOTTIME, &started) == -1) {
        fprintf(stderr, "Cannot read BOOTTIME clock: %s.\n", strerror(errno));
        tty_cleanup();
        return EXIT_FAILURE;
    } else
        mark = started;

    while (!done) {
        if (update) {
            struct timespec  now;

            if (clock_gettime(CLOCK_BOOTTIME, &now) == -1) {
                fprintf(stderr, "Cannot read BOOTTIME clock: %s.\n", strerror(errno));
                tty_cleanup();
                return EXIT_FAILURE;
            }

            const double  sec_last = seconds_between(now, mark);
            const double  mib_last = (double)received / (double)(MEGA);
            const double  sec_total = seconds_between(now, started);
            const double  mib_total = (double)(received + received_before) / (double)(MEGA);

            if (sec_last > 0.0 && sec_total > 0.0) {
                printf("%.3f MB in %.0f seconds (%.3f MB/s on average); %.3f MB in last %.3f seconds (%.3f MB/s); %" PRIu64 " numbers verified\n",
                       mib_total, sec_total, mib_total/sec_total,
                       mib_last, sec_last, mib_last/sec_last,
                       sequence);
                fflush(stdout);
            }

            received_before += received;
            received = 0;
            mark = now;
            update = 0;
        }

        // Receive more data?
        if (buffer_have < buffer_size) {
            ssize_t  n = read(tty_descriptor, buffer_data + buffer_have, buffer_size - buffer_have);
            if (n > 0) {
                buffer_have += n;
                received += n;
            } else
            if (n != -1) {
                fprintf(stderr, "%s: Unexpected read error (%zd).\n", tty_path, n);
                tty_cleanup();
                return EXIT_FAILURE;
            } else
            if (errno != EINTR) {
                fprintf(stderr, "%s: Read error: %s.\n", tty_path, strerror(errno));
                tty_cleanup();
                return EXIT_FAILURE;
            }
        }

        // Verify all full words thus far received.
        if (buffer_have > 3) {
            const unsigned char       *next = buffer_data;
            const unsigned char *const ends = buffer_data + buffer_have;

            while (next + 4 <= ends) {
                uint32_t  u =  (uint32_t)(next[0])
                            | ((uint32_t)(next[1]) << 8)
                            | ((uint32_t)(next[2]) << 16)
                            | ((uint32_t)(next[3]) << 24);
                if (u == prng_u32()) {
                    sequence++;
                    next += 4;
                } else {
                    fprintf(stderr, "Data mismatch at %" PRIu64 ". generated number.\n", sequence + 1);
                    tty_cleanup();
                    return EXIT_FAILURE;
                }
            }

            if (next < ends) {
                memmove(buffer_data, next, (size_t)(ends - next));
                buffer_have = (size_t)(ends - next);
            } else {
                buffer_have = 0;
            }
        }
    }

    tty_cleanup();
    return EXIT_SUCCESS;
}
 
@Nominal Animal : I'll give this a try later today. I have one fairly powerful laptop (modern gaming machine . . . I don't game much, but before retirement, I did lots of linux software development, & typically tested with anywhere from three to fifteen virtual machines all running on this same laptop, exchanging lots of large messages over ethernet - very resource intensive), and one not-so-powerful laptop, so it will be interesting to compare relative results.

Mark J Culross
KD5RXT
 
[ Please see post #11 below for a more complete report, including results for both block writes & single writes ]

OK, I've gathered my local USB serial benchmark testing results for analysis/comparison.

Teensy hardware setup:

Teensy 4.0 running sketch given in post #7 above (as posted, unmodified)

PC hardware setup #1:

ASUS N56VM
Intel(R) Core(TM) i7-3610Q CPU @2.30GHz (3 cores)
8.00 GB RAM
Has one internal serial port, so used /dev/ttyACM1 for testing on the PC end
PC executing C-code given in post #7 above (as posted, unmodified, built as instructed at the top of the source file)
PC booting/running RHEL8.6 in rescue (text) mode from a 512GB bootable SanDisk Extreme Pro USB 3.1 stick built as a "live CD"

Results with PC hardware setup #1:

Code:
25.686 MB in 1 seconds (25.685 MB/s on average); 25.686 MB in last 1.000 seconds (25.685 MB/s); 6421376 numbers verified
52.164 MB in 2 seconds (26.081 MB/s on average); 26.479 MB in last 1.000 seconds (26.477 MB/s); 13041024 numbers verified
78.524 MB in 3 seconds (26.174 MB/s on average); 26.360 MB in last 1.000 seconds (26.361 MB/s); 19630976 numbers verified
104.876 MB in 4 seconds (26.219 MB/s on average); 26.352 MB in last 1.000 seconds (26.352 MB/s); 26218880 numbers verified

PC hardware setup #2:

ASUS FX507Z
12th Gen Intel(R) Core(TM) i7-12700H 2.30 GHz (13 cores)
64.0 GB RAM
Has no internal serial ports, so used /dev/ttyACM0 for testing on the PC end
PC executing C-code given in post #7 above (as posted, unmodified, built as instructed at the top of the source file)
PC booting/running RHEL8.6 in rescue (text) mode from a 512GB bootable SanDisk Extreme Pro USB 3.1 stick built as a "live CD"

Results with PC hardware setup #2:

Code:
29.423 MB in 1 seconds (29.425 MB/s on average); 29.423 MB in last 1.000 seconds (29.425 MB/s); 7355776 numbers verified
59.280 MB in 2 seconds (29.642 MB/s on average); 29.857 MB in last 1.000 seconds (29.858 MB/s); 14819968 numbers verified
89.148 MB in 3 seconds (29.716 MB/s on average); 29.868 MB in last 1.000 seconds (29.866 MB/s); 22286976 numbers verified
118.972 MB in 4 seconds (29.744 MB/s on average); 29.824 MB in last 1.000 seconds (29.826 MB/s); 29742976 numbers verified

Let me know if I left anything pertinent out, and/or if I can provide more info.

Mark J Culross
KD5RXT
 
If you want, you could change the Teensy sketch (replacing #if 0 with #if 1), with no changes to the Linux program. You should see the transfer rate drop down to 4-5 MB/s.

(To repeat, the only difference is that the #if 1 version uses Serial.write() for each byte separately on Teensy; the current #if 0 version uses a buffer of up to packet_size = 32 bytes.)

My claim is that on Teensy 4.x, Serial.write() has to be used with buffers (of at least 32 bytes each) to achieve 25+ MB/s. When used to write individual bytes, the rate drops down to 4-5 MB/s. The exact rates depend on hardware and kernel, but all OSes should exhibit similar behaviour (buffers needed to achieve maximum transfer rate). It is unlikely but possible that the Linux TTY layer makes it particularly sensitive to this, with the difference in rates somewhat smaller on MacOS and Windows. I only use Linux myself so I haven't verified that.
 
I've now gathered a more complete set of results from my local USB serial benchmark testing for analysis/comparison.

Teensy hardware setup #1 (block writes):

Teensy 4.0 running sketch given in post #7 above (as posted, unmodified i.e. using block writes, with the #if 0 in place)

PC hardware setup #1:

ASUS N56VM
Intel(R) Core(TM) i7-3610Q CPU @2.30GHz (3 cores)
8.00 GB RAM
Has one internal serial port, so used /dev/ttyACM1 for testing on the PC end
PC executing C-code given in post #7 above (as posted, unmodified, built as instructed at the top of the source file)
PC booting/running RHEL8.6 in rescue (text) mode from a 512GB bootable SanDisk Extreme Pro USB 3.1 stick built as a "live CD"

Results with PC hardware setup #1 (block writes):

Code:
25.686 MB in 1 seconds (25.685 MB/s on average); 25.686 MB in last 1.000 seconds (25.685 MB/s); 6421376 numbers verified
52.164 MB in 2 seconds (26.081 MB/s on average); 26.479 MB in last 1.000 seconds (26.477 MB/s); 13041024 numbers verified
78.524 MB in 3 seconds (26.174 MB/s on average); 26.360 MB in last 1.000 seconds (26.361 MB/s); 19630976 numbers verified
104.876 MB in 4 seconds (26.219 MB/s on average); 26.352 MB in last 1.000 seconds (26.352 MB/s); 26218880 numbers verified

PC hardware setup #2 (block writes):

ASUS FX507Z
12th Gen Intel(R) Core(TM) i7-12700H 2.30 GHz (13 cores)
64.0 GB RAM
Has no internal serial ports, so used /dev/ttyACM0 for testing on the PC end
PC executing C-code given in post #7 above (as posted, unmodified, built as instructed at the top of the source file)
PC booting/running RHEL8.6 in rescue (text) mode from a 512GB bootable SanDisk Extreme Pro USB 3.1 stick built as a "live CD"

Results with PC hardware setup #2 (block writes):

Code:
29.423 MB in 1 seconds (29.425 MB/s on average); 29.423 MB in last 1.000 seconds (29.425 MB/s); 7355776 numbers verified
59.280 MB in 2 seconds (29.642 MB/s on average); 29.857 MB in last 1.000 seconds (29.858 MB/s); 14819968 numbers verified
89.148 MB in 3 seconds (29.716 MB/s on average); 29.868 MB in last 1.000 seconds (29.866 MB/s); 22286976 numbers verified
118.972 MB in 4 seconds (29.744 MB/s on average); 29.824 MB in last 1.000 seconds (29.826 MB/s); 29742976 numbers verified


Teensy hardware setup #2 (single writes):

Teensy 4.0 running sketch given in post #7 above (as posted, modified for single writes by replacing the #if 0 with #if 1)

PC hardware setup #1 (single writes):

ASUS N56VM
Intel(R) Core(TM) i7-3610Q CPU @2.30GHz (3 cores)
8.00 GB RAM
Has one internal serial port, so used /dev/ttyACM1 for testing on the PC end
PC executing C-code given in post #7 above (as posted, unmodified, built as instructed at the top of the source file)
PC booting/running RHEL8.6 in rescue (text) mode from a 512GB bootable SanDisk Extreme Pro USB 3.1 stick built as a "live CD"

Results with PC hardware setup #1 (single writes):

Code:
4.719 MB in 1 seconds (4.718 MB/s on average); 4.719 MB in last 1.000 seconds (4.718 MB/s); 1179776 numbers verified
9.509 MB in 2 seconds (4.754 MB/s on average); 4.790 MB in last 1.000 seconds (4.790 MB/s); 2377344 numbers verified
14.300 MB in 3 seconds (4.766 MB/s on average); 4.790 MB in last 1.000 seconds (4.790 MB/s); 3574912 numbers verified
19.090 MB in 4 seconds (4.772 MB/s on average); 4.790 MB in last 1.000 seconds (4.790 MB/s); 4772480 numbers verified

PC hardware setup #2 (single writes):

ASUS FX507Z
12th Gen Intel(R) Core(TM) i7-12700H 2.30 GHz (13 cores)
64.0 GB RAM
Has no internal serial ports, so used /dev/ttyACM0 for testing on the PC end
PC executing C-code given in post #7 above (as posted, unmodified, built as instructed at the top of the source file)
PC booting/running RHEL8.6 in rescue (text) mode from a 512GB bootable SanDisk Extreme Pro USB 3.1 stick built as a "live CD"

Results with PC hardware setup #2 (single writes):

Code:
4.717 MB in 1 seconds (4.717 MB/s on average); 4.717 MB in last 1.000 seconds (4.717 MB/s); 1179264 numbers verified
9.507 MB in 2 seconds (4.754 MB/s on average); 4.790 MB in last 1.000 seconds (4.790 MB/s); 2376832 numbers verified
14.298 MB in 3 seconds (4.766 MB/s on average); 4.790 MB in last 1.000 seconds (4.790 MB/s); 3574400 numbers verified
19.088 MB in 4 seconds (4.772 MB/s on average); 4.790 MB in last 1.000 seconds (4.790 MB/s); 4771968 numbers verified

Let me know if I can provide anything else.

Mark J Culross
KD5RXT
 
Thanks! This backs up my claim and understanding, because the only difference between the two (25+ MB/s vs. under 5 MB/s) is how Serial.write() is called in the Teensy sketch.

Based on this, we can recommend Teensy developers use a buffer (buf) of suitable size (len, for example 32), and
Code:
if (Serial.availableForWrite() >= len)
    Serial.write(buf, len);
for maximum throughput (25+ MB/s) via USB Serial. Calling Serial.write(byte) in a loop limits the throughput to 4-5 MB/s, at least on Linux.
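
If the data is produced one byte at a time, the same idea can be wrapped in a small batching helper. The following is a plain C sketch under stated assumptions: the names batcher, batcher_put, batcher_flush and counting_sink are hypothetical, and the sink callback merely stands in for a block write such as Serial.write(buf, len). It is not Teensy library code.

```c
#include <stddef.h>

#define BATCH_SIZE 32   /* same size as packet_size in the sketch above */

/* Hypothetical sink: stands in for a block write like Serial.write(buf, len). */
typedef void (*batch_sink)(const unsigned char *buf, size_t len, void *ctx);

struct batcher {
    unsigned char buf[BATCH_SIZE];
    size_t        len;    /* number of bytes currently buffered */
    batch_sink    sink;   /* called once per full (or flushed) batch */
    void         *ctx;    /* passed through to the sink unchanged */
};

/* Hand any buffered bytes to the sink in a single call. */
static void batcher_flush(struct batcher *b)
{
    if (b->len > 0) {
        b->sink(b->buf, b->len, b->ctx);
        b->len = 0;
    }
}

/* Queue one byte; the sink is only called when the buffer fills up. */
static void batcher_put(struct batcher *b, unsigned char byte)
{
    b->buf[b->len++] = byte;
    if (b->len == BATCH_SIZE)
        batcher_flush(b);
}

/* Example sink that only counts how often it is called and with how much. */
struct sink_stats {
    size_t calls;
    size_t bytes;
};

static void counting_sink(const unsigned char *buf, size_t len, void *ctx)
{
    struct sink_stats *stats = ctx;
    (void)buf;
    stats->calls++;
    stats->bytes += len;
}
```

Pushing 100 bytes through batcher_put() followed by one batcher_flush() reaches the sink in four calls (three full 32-byte batches plus a 4-byte remainder) instead of 100 single-byte calls.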

(y)
 