HardwareSerial read() suggestion for large reads

rodv92

Member
I was experimenting with the Teensy4 implementation of the HardwareSerial read ring buffer.
So I noticed that there was a primary buffer that is always used, and an additional
buffer, which size is set using addMemoryForRead.
As for the rest of the implementation, it does no seems to change much compared to Arduino,
byte level functions are in HardwareSerialX, And higher level read functions are in Stream.cpp

Since there may be cases where reads on the serial ports are infrequent and the read ring buffer is provisioned to large values, such as > 1KB
It could be nice to have higher level functions (Stream.cpp) that do not rely on a loop that calls read() n times to fetch all available bytes,
But to get n to all available bytes in one call, or get bytes until a provided separator character is found, and compare the performance.

I made some tests about such a "get_" function, using HardwareSerial read() as a basis.

  • It uses memcpy / memchr for memory copy and separator scanning in a single function.
  • Disables interrupt over all encompassing scopes / branches that use "memchrcpy", since the buffers are volatile, and we don't want FIFO RX interrupt to mess with the ring buffer while we copy. This may be the limiting factor based on the performance of the function for memcpy of large buffers.
  • Takes into account both buffers (the primary with rx_buffer_size bytes) and extended buffer (rx_total_buffer_size) for memcpy / memchr operations
  • No change in flow control pin (cts/rts) behaviour

Currently testing for dropped chars using the new get_ function, using variable window size to ensure all branches of the memchrcpy logic are accessed and free of bug,
As for performance : For now, it seems that the performance gains are significant already at low tens of bytes read get_ vs read() loop, This will be the object of the follow up.

C++:
// This is a utility function to copy from source to destination but also look for a separator character and stop copying if it is found.  It returns the position of the separator character if found or 0 if not found.  if use_sep is false, then it just does a normal memcpy and returns 0.
inline int HardwareSerialIMXRT::memchrcpy(uint8_t* dst, uint8_t* src, int length, bool use_sep = false, const char* sep = "\x0A")
{
    if(use_sep)
    {
        uint8_t* result = (uint8_t*) memchr(src, *sep, length);
        
        size_t position = (result != nullptr) ? (result - src + 1) : length;
        memcpy(dst, src, position);
        return (result != nullptr) ? position : 0;
    }
    else
    {
        memcpy(dst, src, length);
        return 0;
    }

}

// Get up to n bytes of data from the receive buffer and place them into the provided buffer.  If use_sep is true, then look for the separator character and stop copying if it is found.  Return the number of bytes copied into the buffer which could be less than length if the separator character was found or if there was not that much data available.
int HardwareSerialIMXRT::get_(uint8_t* bytes, int length, bool use_sep = false, const char* sep = "\x0A")
{
    IMXRT_LPUART_t *port = (IMXRT_LPUART_t *)port_addr;
    uint32_t head, tail;
    int c;
    int newlength = 0;
    __disable_irq();
    uint32_t cycles = ARM_DWT_CYCCNT;
    head = rx_buffer_head_;
    tail = rx_buffer_tail_;
    //debug = rx_buffer_tail_;
    if (head == tail) {
        // empty Now check for stuff in FIFO Queue.
        c = -1;    // assume nothing to return
        if (port->WATER & 0x7000000) {
            uint32_t newhead;
            c = port->DATA & 0x3ff;        // Use only up to 10 bits of data
            newhead = head + 1;

            if (newhead >= rx_buffer_total_size_) newhead = 0;
            if (newhead != rx_buffer_tail_) {
                head = newhead;
                if (newhead < rx_buffer_size_) {
                    rx_buffer_[head] = c;
                } else {
                    rx_buffer_storage_[head-rx_buffer_size_] = c;
                }
            }
        rx_buffer_head_ = newhead;
        head = newhead;
        }
    }

    //if (++tail >= rx_buffer_total_size_) tail = 0;
    
    int avail;
    if (head >= tail)
    {
        avail = head - tail;
        length = (avail > length) ? length : avail;

        if (tail + length <= rx_buffer_size_)
        {
            newlength = memchrcpy(bytes, (uint8_t*) &(rx_buffer_[tail]), length, use_sep, sep);
            if(newlength) {length = newlength; goto func_exit;}
            //c= -2; // debug   
        }
        else if(tail < rx_buffer_size_ )
        {
            if(tail + length > rx_buffer_size_)
            {
                uint8_t first_part_length = rx_buffer_size_ - tail;
                newlength = memchrcpy(bytes, (uint8_t*) &(rx_buffer_[tail]), first_part_length, use_sep, sep);
                if(newlength) {length = newlength; goto func_exit;}
                newlength = memchrcpy(bytes + first_part_length, (uint8_t*) &(rx_buffer_storage_[0]), length - first_part_length, use_sep, sep);
                if(newlength) {length = first_part_length + newlength; goto func_exit;}
                //c = -3; // debug   
            }
            else
            {
                newlength = memchrcpy(bytes, (uint8_t*) &(rx_buffer_[tail]), length, use_sep, sep);
                if(newlength) {length = newlength; goto func_exit;}
                //c = -4; // debug   
            }
        }
        else // tail >= rx_buffer_size_
        {
            newlength = memchrcpy(bytes, (uint8_t*) &(rx_buffer_storage_[tail - rx_buffer_size_]), length, use_sep, sep);
            if(newlength) {length = newlength; goto func_exit;}
            //c = -5; // debug   
        }
        
    }
    else // head < tail
    {
        avail = rx_buffer_total_size_ + head - tail;
        length = (avail > length) ? length : avail;
        if(tail >= rx_buffer_size_)
        {
            if(tail + length <= rx_buffer_total_size_)
            {
                newlength = memchrcpy(bytes, (uint8_t*) &(rx_buffer_storage_[tail - rx_buffer_size_]), length, use_sep, sep);
                if(newlength) {length = newlength; goto func_exit;}
                //c = -6; // debug   
            }
            else // tail + length > rx_buffer_total_size_
            {
                
                newlength = memchrcpy(bytes, (uint8_t*) &(rx_buffer_storage_[tail - rx_buffer_size_]), rx_buffer_total_size_ - tail, use_sep, sep);
                if(newlength) {length = newlength; goto func_exit;}
                
                if(length - (rx_buffer_total_size_ - tail) > rx_buffer_size_)
                {
                    
                    newlength = memchrcpy(bytes + (rx_buffer_total_size_ - tail), (uint8_t*) &(rx_buffer_[0]), rx_buffer_size_, use_sep, sep);
                    if(newlength) {length = rx_buffer_total_size_  - tail + newlength; goto func_exit;}
                    newlength = memchrcpy(bytes + (rx_buffer_total_size_ - tail) + rx_buffer_size_, (uint8_t*) &(rx_buffer_storage_[0]), length - (rx_buffer_total_size_ - tail) - rx_buffer_size_, use_sep, sep);
                    if(newlength) {length = (rx_buffer_total_size_ - tail) + rx_buffer_size_ + newlength; goto func_exit;}
                    //c = -7; // debug   
                }
                else
                {
                    newlength = memchrcpy(bytes + (rx_buffer_total_size_ - tail), (uint8_t*) &(rx_buffer_[0]), length - (rx_buffer_total_size_ - tail), use_sep, sep);
                    if(newlength) {length = rx_buffer_total_size_ - tail + newlength; goto func_exit;}           
                    //c = -8; // debug   
                }
            }
        }
        else // tail < rx_buffer_size_
        {
            if(tail + length <= rx_buffer_size_)
            {   
                newlength = memchrcpy(bytes, (uint8_t*) &(rx_buffer_[tail]), length, use_sep, sep);
                if(newlength) {length = newlength; goto func_exit;}
                //c = -9; // debug   
            }
            else // tail + length > rx_buffer_size_
            {
                if(tail + length > rx_buffer_total_size_)
                {
                    newlength = memchrcpy(bytes, (uint8_t*) &(rx_buffer_[tail]), rx_buffer_size_ - tail, use_sep, sep);
                    if(newlength) {length = newlength; goto func_exit;}
                    newlength = memchrcpy(bytes + (rx_buffer_size_ - tail), (uint8_t*) &(rx_buffer_storage_[0]), rx_buffer_total_size_ - rx_buffer_size_, use_sep, sep);
                    if(newlength) {length = (rx_buffer_size_ - tail) + newlength; goto func_exit;}
                    newlength = memchrcpy(bytes + (rx_buffer_total_size_ - tail), (uint8_t*) &(rx_buffer_[0]), length - (rx_buffer_total_size_ - tail), use_sep, sep);
                    if(newlength) {length = (rx_buffer_total_size_ - tail) + newlength; goto func_exit;}
                    //c = -10; // debug

                }
                else // tail + length <= rx_buffer_total_size_
                {
                    
                    newlength = memchrcpy(bytes, (uint8_t*) &(rx_buffer_[tail]), rx_buffer_size_ - tail, use_sep, sep);
                    if(newlength) {length = newlength; goto func_exit;}
                    newlength = memchrcpy(bytes + (rx_buffer_size_ - tail), (uint8_t*) &(rx_buffer_storage_[0]), length - (rx_buffer_size_ - tail), use_sep, sep);
                    if(newlength) {length = (rx_buffer_size_ - tail) + newlength; goto func_exit;}
                    //c = -11; // debug

                }

            }

        }

    }
    func_exit:
    //debug2 = rx_buffer_tail_;
    //debug = ARM_DWT_CYCCNT - cycles;
    //debug2 = ARM_DWT_CYCCNT - cycles;
    rx_buffer_tail_ = (rx_buffer_tail_ + length >= rx_buffer_total_size_) ? rx_buffer_tail_ + length - rx_buffer_total_size_ : rx_buffer_tail_ + length;
    __enable_irq();
    return length;
    //return c; // debug   
}

Higher level functions that use the low level get_ would live inside Stream.cpp

Such as :

C++:
int Stream::getuntil(uint8_t* bytes, int length, const char* sep = "\x0A")
{
    return get_(bytes, length, true, sep);
}

int Stream::get(uint8_t* bytes, int length)
{
    return get_(bytes, length, false, nullptr);
}
 
Hi Paul, yes, I used the following methodology for benchmarking :

The sender is an Arduino Due using HardwareSerial at 9600 bps, sleep 1s, send a burst of 256 bytes (the circular buffer size has been tweaked to 2048 instead of 64) and repeat
The receiver is the Teensy 4.1

I prompted a test tool that would compare ReadBytes vs memcpy based 'get' using a geometric progression from 2 bytes up to 2048 bytes read
Here are the results, and the tools used :
HardwareSerial.cpp : the patched Serial (fixed a bug for a variable using uint8_t instead of size_t)
tx.ino = the serial TX sketch at 9600 bps uses Serial1 pins (Arduino)
perf_rx.ino = the serial RX sketch (teensy4.1), with performance benchmarking, Uses Serial2 pins

┌──────────┬──────────────────────────┬──────────────────────────────┐
│ size (B) │ get() │ readBytes() │
│ ├────────────┬─────────────┼────────────┬─────────────────┤
│ │ cyc/byte │ ns/byte │ cyc/byte │ ns/byte x SPEEDUP RATIO |
├──────────┼────────────┼─────────────┼────────────┼─────────────────┤
│ 2 │ 182.31 │ 303.85 │ 110.50 │ 184.17 x 0.6 │
│ 4 │ 58.06 │ 96.77 │ 55.88 │ 93.12 x 1.0 │
│ 8 │ 48.34 │ 80.57 │ 44.44 │ 74.06 x 0.9 │
│ 16 │ 17.09 │ 28.49 │ 34.06 │ 56.77 x 2.0 │
│ 32 │ 13.54 │ 22.56 │ 32.27 │ 53.79 x 2.4 │
│ 64 │ 5.31 │ 8.85 │ 30.30 │ 50.50 x 5.7 │
│ 128 │ 3.61 │ 6.02 │ 29.74 │ 49.56 x 8.2 │
│ 256 │ 2.11 │ 3.52 │ 29.26 │ 48.76 x13.9 │
│ 512 │ 1.58 │ 2.64 │ 29.11 │ 48.51 x18.4 │
│ 1024 │ 1.23 │ 2.06 │ 29.02 │ 48.37 x23.5 │
│ 2048 │ 0.97 │ 1.61 │ 28.99 │ 48.32 x29.9 │
└──────────┴────────────┴─────────────┴────────────┴─────────────────┘
F_CPU = 600 MHz


Looks promising, but there is still some work to do on the getuntil code on my part and the getString that i did not code yet.
As for reliability, i prompted a test tool that would hit all the branches, and i would need to test it at higher baud rates, some hardware errors are expected, but the total error should not get significantly above the hardware error rate already published, or it is a software bug.
 

Attachments

  • HardwareSerial.cpp
    32.9 KB · Views: 13
  • tx.ino
    357 bytes · Views: 11
  • HardwareSerial.h
    16.9 KB · Views: 26
  • perf_rx.ino
    7.2 KB · Views: 20
Thanks for sharing the test code! I can't work with this right now, but I am really interested to look at the speedup when I have some time.

One quick question though. If some or all of this code gets merged to the core library, can you confirm MIT license?
 
I did a little searching and found there is a popular (1500 stars on github) open-source ring buffer called "lwrb" (lightweight ring buffer) that provides these capabilities. The features list says it is thread-safe and interrupt-safe for Cortex-M, and is MIT licensed. Would it make sense to use this instead?

https://github.com/MaJerle/lwrb

Features​

  • Written in C (C11), compatible with size_t for size data types
  • Platform independent default code - with restrictions for smaller CPU architectures (< sizeof(size_t))
  • FIFO (First In First Out) buffer implementation
  • No dynamic memory allocation, data is static array
  • Uses optimized memory copy instead of loops to read/write data from/to memory
  • Thread safe when used as pipe with single write and single read entries - when CPU read/write operation for size_t are single instruction (ARM Cortex-M for instance)
  • Interrupt safe when used as pipe with single write and single read entries - when CPU read/write operation for size_t are single instruction (ARM Cortex-M for instance)
  • For CPU systems with smaller architecture than sizeof(size_t) (AVR for instance), atomic protection is required for read-write operation of buffer writes
  • Suitable for DMA transfers from and to memory with zero-copy overhead between buffer and application memory
  • Supports data peek, skip for read and advance for write
  • Implements support for event notifications
  • User friendly MIT license
 
In my opinion,
lwrb looks really decent, but integrating a feature rich library at the teensyduino core level just for serial operations seems a bit too far fetched, maybe that could make sense if other parts of the teensyduino core would benefit from it, but in all cases that would be a significant overhaul.
Moreover, The teensyduino uses two buffers (the standard one, 64 bytes size hardcoded in define + an optional additional buffer through the addMemoryForRead() call) Using an external library should preserve that mechanism or it would break the existing code base.

What's nice however is that if lwrb uses memcpy on volatile buffers in MCU context with interrupts that means it is a viable option and has been demonstrated to work in practice. I will look at the memcpy parts and see how the transfers are hardened against IRQ interference.

What is missing in my code however is probably some kind of memory barrier : asm("dmb" ::: "memory") after disable_irq() and before enable_irq().
That memory barrier would ensure that all memcpy operations are 100% separated from IRQ interference on the ring buffer memory.

Besides that, I think that getuntil() should be made homogeneous not to return the separator as in readbytesuntil(), but these are not ring buffer considerations, I am still working on it, also making the rest of the string vs byte functions.
 
It's more or less done. The new functions getBytes, GetBytesUntil, GetString, GetStringUntil are defined in Stream,cpp and get() inside HardwareSerial.cpp
Note that the terminator is part of the return string / buffer, to discard it modify in HardwareSerial.cpp
Code:
size_t position = (result != nullptr) ? (result - src + 1) : length;

usb_serial.h is added just because get() should be defined there and overriden or the compiler complains.

Note that the functions are non-blocking. They return the lesser of available bytes or requested length.
For the "until" functions, if the separator is not found, the buffer is filled with the lesser of available bytes or requested length. This may be a problem if the terminator is not part of the filled buffer when found, as it would be impossible for the caller to know if the separator was found or not.

Another option would be to return nothing if the separator is not found, which seems to be the canonical way of dealing with "until" functions, and not update the buffer tail.
 

Attachments

  • Stream.cpp
    8.3 KB · Views: 13
  • HardwareSerial.cpp
    32.4 KB · Views: 10
  • HardwareSerial.h
    16.6 KB · Views: 12
  • usb_serial.h
    18.3 KB · Views: 8
  • Stream.h
    4.6 KB · Views: 9
Last edited:
Back
Top