Fast Data Logger

Good question. With SdFat example bench, I see ~22 us as the fastest 512-byte write, but with SdFat example TeensySdioLogger, I see ~5 us. I'll look into it further.
 
@UhClem, the answer to this question came to me just today as I was answering another question and looking back at some notes from my SD benchmarking.

EDIT: this is what I think is happening; I'm wondering if others agree.

An SD write() of 512 bytes can return in 4-5 us because the actual SDIO transfer and write to the physical SD flash medium occur after write() returns. The write() function just sets up the 512 bytes in a buffer, hands it off to SDIO, and returns. It does not wait for the SDIO transfer or the SD's write to the physical medium to complete.

The 22 us reported by the SdFat "bench" example reflects the actual SDIO transfer time, which is a little over 20 us for 512 bytes @ 50 MHz * 4 bits. So, you get 22 us per write for back-to-back 512-byte writes, with each write having to fully complete before the next one can begin. That's not what happens, though, in a typical data logging scenario. In my applications, for example, if the total data rate to SD is 1 MB/s, a call to write() occurs only once every 488 us, long after the previous SDIO transfer and actual SD write have completed, so write() almost always returns in 4-5 us. It's only when write() has to wait for the SDIO to be available or for the SD to be ready to receive more data that it takes longer for write() to return.

For not-fully-understood reasons, the SD will occasionally remain busy for as long as 40 ms, so the application needs a buffer that can hold something like 50 ms worth of data to avoid losing data during those delays. Applications can avoid blocking during these long delays by always writing in 512-byte chunks and only calling write() if file.isBusy() is false.
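That last sentence can be sketched as a tiny decision helper around the (assumed) SdFat calls; SECTOR_SIZE and shouldWriteSector are names I made up for illustration, not SdFat API:

```cpp
#include <cstddef>

// Only drain the buffer when a full sector is ready and the card is idle.
// SECTOR_SIZE and shouldWriteSector are illustrative names, not SdFat API.
constexpr std::size_t SECTOR_SIZE = 512;

bool shouldWriteSector(std::size_t bytesBuffered, bool cardBusy) {
  return !cardBusy && bytesBuffered >= SECTOR_SIZE;
}

// In an SdFat sketch (not compiled here), the loop would look like:
//
//   if (shouldWriteSector(rb.bytesUsed(), file.isBusy())) {
//     rb.writeOut(SECTOR_SIZE);  // returns in ~5 us; SDIO finishes later
//   }
```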
 
I've been trying to eliminate occasional long blocking in SdFat for a while now, but I still see it every 20 s or so with these steps in place.

I'm preallocating the file, using a ring buffer, writing only 512-byte chunks, checking isBusy(), etc., but I still get lengthy blocking (20 ms+) fairly regularly. My limited understanding is that using FIFO_SDIO with those steps should mean that writeOut() only ever writes to an SDIO buffer, but for some reason that appears not always to be the case.
 
SdFat cannot block on a write of 512 bytes unless you call write() while isBusy() is true. Can’t say what’s wrong without seeing code.
 
The regular long delays when writing to SD cards are due to the SD card's internal admin activities. SD cards have a small CPU that handles bad blocks and wear leveling. I have observed delays of up to 100 ms, depending on the SD brand and model. The only way to overcome this is to buffer the data before writing to disk. I use all available RAM for this and write in blocks of 8 to 16 kB.
 
There are two keys to avoiding blocking:
  • ALWAYS call write in blocks of 512 bytes (1 sector)
  • ALWAYS check isBusy() before calling write()
If you do BOTH of those, your code will NEVER block inside SdFat write().

Conversely:
  • If you don't check isBusy(), your code can block.
  • If you check isBusy(), but write more than 512 bytes, your code can block.
Why? When you write without checking isBusy(), it may be that the previous write caused the SD to become busy, so your call can block. When you write, say, 8K bytes at a time, SdFat simply executes a loop internally that writes 512 bytes (1 sector) at a time, and any one of those sector writes can cause the SD to become busy. If that happens, your code will be stuck inside SdFat write even though you checked isBusy() before the write.

When you write 512 bytes with isBusy() == false, the write will ALWAYS return after about 5 us. You may find that isBusy() is true before your next write, and therefore must wait 40 ms or more to write again, but your code won't be stuck in SdFat and it can do other things.

Am I sure about this? Yes. 100% sure. I use this method in control systems with control cycles of 1 ms or less, and this method allows me to log data to SD and still maintain the 1 kHz execution of the control code.

The sketch below works as I describe. Data is written to a RingBuf from an IntervalTimer handler, to simulate reading from ADC, etc., and data is extracted from the RingBuf and written to SD in loop(). By default, the sketch writes to SD at 1 MB/s for 10 seconds, but you can increase/decrease those for your own tests. The logic for measuring execution times makes it look a little more complex than it really is. It prints the min/max write times, max RingBuf usage, total CPU time spent in calls to write, and percent of CPU. For 1 MB/s, less than 1% of CPU is used. That goes up with higher data rates of course, but that's plenty for my purposes. It works well to at least 5 MB/s, though the CPU usage is higher.

I use a SanDisk Extreme 32 GB card, and I recommend reformatting via the SdFat example before you run tests.

Code:
// Test Teensy SDIO with write busy in a data logger demo.
//
// The driver writes to the uSDHC controller's FIFO then returns while the
// controller writes the data to the SD.  The first sector puts the controller
// in write mode and takes about 11 usec on a Teensy 4.1. About 5 usec is
// required to write a sector when the controller is in write mode.

#include <SdFat.h>
#include <RingBuf.h>

//******************************************************************************
// configuration constants
//******************************************************************************
#define SD_CONFIG    (SdioConfig(FIFO_SDIO))        // use Teensy SDIO
#define LOGGING_TIME_S    (10)                // s
#define DATA_RATE_BPS    ((1)*(1024*1024))        // bytes/s
#define LOG_FILE_SIZE    (DATA_RATE_BPS * LOGGING_TIME_S)// total bytes
#define RING_BUF_SIZE    (DATA_RATE_BPS / 20)        // 50-ms buffer at BPS
#define BUFLEN        (256)                // bytes per write
#define C2US(c)        ((c)*(1E6/F_CPU))        // CPU cycles to us

//******************************************************************************
// global variables
//******************************************************************************
IntervalTimer dataTimer;        // IntervalTimer for ISR-level writes
RingBuf<FsFile,RING_BUF_SIZE> rb;       // ISR --> RingBuf --> loop --> SD file
SdFs     sd;                // SdFat type
FsFile   file;                // SdFat file type
size_t   rbMaxUsed = 0;            // RingBuf max bytes (useful diagnostic)
char     buf[BUFLEN];            // test buffer
uint32_t error;                // RingBuf/file error code

//******************************************************************************
// IntervalTimer callback -- write BUFLEN bytes to RingBuf
//******************************************************************************
void dataTimerCallback( void )
{
#if (SD_FAT_VERSION == 20102)        // #if SdFat 2.1.2
  rb.memcpyIn( buf, BUFLEN );        //   write to RingBuf via rb.memcpyIn()
#elif (SD_FAT_VERSION >= 20202)        // #elif SdFat >= 2.2.2
  rb.beginISR();            //   begin interrupt access
  rb.write( buf, BUFLEN );        //   write to RingBuf via rb.write()
  rb.endISR();                //   end interrupt access
#endif                    // #endif
  if (rb.getWriteError())        // if write error occurred
    error = 1;                //   set global error code
} 

//******************************************************************************
// setup()
//******************************************************************************
void setup()
{
  Serial.begin(9600);
  while (!Serial && millis() < 3000) {}

  Serial.printf( "%s  %s\n", __DATE__, __TIME__ );
  Serial.printf( "Teensy %s\n",
#if defined(ARDUINO_TEENSY35)
  "3.5" );
#elif defined(ARDUINO_TEENSY41)
  "4.1" );
#endif
  Serial.printf( "Teensyduino version %1lu\n", TEENSYDUINO );
  Serial.printf( "SdFat version %s\n", SD_FAT_VERSION_STR );

  // Initialize the SD.
  if (!sd.begin(SD_CONFIG)) {
    sd.initErrorHalt(&Serial);
  }
 
  // these 2 lines are necessary to enable cycle counting on T3.x
  #if (defined(KINETISL) || defined(KINETISK))        // if Teensy LC or 3.x
  ARM_DEMCR    |= ARM_DEMCR_TRCENA;            //   enable debug/trace
  ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;        //   enable cycle counter
  #endif
}

//******************************************************************************
// loop()  open file, preAllocate, init RingBuf, log data, print results/stats
//******************************************************************************
void loop()
{
  while (Serial.available()) { Serial.read(); }
  Serial.println( "Type any character to begin" );
  while (!Serial.available()) {}
 
  Serial.printf( "Log for %1lu seconds at %1.2f MB/s (%1lu bytes per interrupt)\n",
        LOGGING_TIME_S, (float)(DATA_RATE_BPS/(1024*1024.0)), BUFLEN );
  Serial.printf( "Pre-allocated file %1lu bytes\n", LOG_FILE_SIZE );
  Serial.printf( "RingBuf %1lu bytes\n", RING_BUF_SIZE );

  // Open or create file - truncate existing file.
  if (!file.open( "logfile.txt", O_RDWR | O_CREAT | O_TRUNC )) {
    Serial.println( "open failed\n" );
  }
  // File must be pre-allocated to avoid huge delays searching for free clusters.
  else if (!file.preAllocate( LOG_FILE_SIZE )) {
     Serial.println( "preAllocate failed\n" );
     file.close();
  }
  // init the file and RingBuf
  else {
    rb.begin(&file);
  }
 
  // Init data buffer with random data
  randomSeed( micros() );
  for (int i=0; i<BUFLEN; i++)
    buf[i] = 0x30+random( 10 );
  buf[BUFLEN-1] = '\n';
 
  uint32_t timer_period_us = 1E6 * BUFLEN / DATA_RATE_BPS;
  Serial.printf( "Start dataTimer (period = %1lu us)\n", timer_period_us );
  dataTimer.begin( dataTimerCallback, timer_period_us );

  uint32_t count = 0, start_ms = millis(), start_busy = 0;
  bool busy = false;
  error = 0;
  rbMaxUsed = 0;
  elapsedMillis ms = 0;
  uint32_t sum_busy=0, min_busy=0xFFFFFFFF, max_busy=0;
  uint32_t sum_write=0, min_write=0xFFFFFFFF, max_write=0;
  while (error == 0 && millis() - start_ms < LOGGING_TIME_S*1000) {
      
    if (ms >= 1000) { Serial.print( "." ); ms -= 1000; }
    
    // number of bytes in RingBuf
    size_t n = rb.bytesUsed();
    if (n > rbMaxUsed) {
      rbMaxUsed = n;
    }

    // bytes in RingBuf now will fit, but any more will exceed file size
    if ((n + file.curPosition()) > (LOG_FILE_SIZE - BUFLEN)) {
      error = 2; // file full
    }

    // write one sector (512 bytes) from RingBuf to file
    // Not busy only allows one sector before possible busy wait
    if (file.isBusy()) {
      if (!busy) {
        busy = true;
        start_busy = ARM_DWT_CYCCNT;
      }
    }
    else if (busy) {
      busy = false;
      uint32_t busy_cyc = ARM_DWT_CYCCNT - start_busy;
      sum_busy += busy_cyc;
      if (busy_cyc < min_busy) min_busy = busy_cyc;
      if (busy_cyc > max_busy) max_busy = busy_cyc;
    }
    if (n >= 512 && !busy) {
      uint32_t start_write = ARM_DWT_CYCCNT;
      if (512 != rb.writeOut(512)) {
        error = 3; // write from RingBuf failed
      }
      uint32_t write_cyc = ARM_DWT_CYCCNT - start_write;
      sum_write += write_cyc;
      if (count > 0) {
        if (write_cyc < min_write) min_write = write_cyc;
        if (write_cyc > max_write) max_write = write_cyc;
      }
      count++;
    }
  }
 
  uint32_t duration_ms = millis() - start_ms;
  Serial.printf( "\nStop dataTimer\n" );
  dataTimer.end();
 
  double duration_s = duration_ms/1000.0;
  double write_s = C2US(sum_write)/1E6;
  double write_percent = 100*(write_s/duration_s);
  Serial.printf( "%1lu writes in %1.3lf s (%1.3lf s writing to file = %1.3lf %c)\n",
        count, duration_s, write_s, write_percent, '%' );
  switch (error) {
    case 0:   Serial.printf( "No error\n" );            break;
    case 1:   Serial.printf( "Not enough space in RingBuf\n" );    break;
    case 2:   Serial.printf( "File is full\n" );        break;
    case 3:   Serial.printf( "Write from RingBuf failed\n" );    break;
    default:  Serial.printf( "Undefined error %1lu\n", error );    break;
  }

  // write any remaining RingBuf data to file
  Serial.printf( "fileSize     = %10lu before sync()\n", (uint32_t)file.fileSize() );
  Serial.printf( "rb.bytesUsed = %10lu before sync()\n", (uint32_t)rb.bytesUsed() );
  rb.sync();
 
  // file and buffer stats
  Serial.printf( "fileSize     = %10lu after sync()\n", (uint32_t)file.fileSize() );
  Serial.printf( "rb.bytesUsed = %10lu after sync()\n", (uint32_t)rb.bytesUsed() );
  Serial.printf( "rbMaxUsed    = %10lu\n", (uint32_t)rbMaxUsed );
  Serial.printf( "min write us = %10.2lf\n", C2US(min_write) );
  Serial.printf( "max write us = %10.2lf\n", C2US(max_write) );
  Serial.printf( "min busy  us = %10.0lf\n", C2US(min_busy) );
  Serial.printf( "max busy  us = %10.0lf\n", C2US(max_busy) );
 
  // print first N line(s) of file.
  int lines=0;
  if (lines > 0) {
    Serial.printf( "First %1d line(s) of file\n", lines );
    file.truncate();
    file.rewind();
  }
  for (int n=0; n < lines && file.available();) {
    int c = file.read();
    if (c < 0) break;
    Serial.write(c);
    if (c == '\n') n++;
  }
 
  // close file
  file.close();
  Serial.printf( "file.close()\n\n" );
}
 
I would add a 'third option' to handle latency issues of SD cards, which is the one I'm using in my high speed data logger (mega-samples/second).
- allow blocking
How?
Use the leveled interrupt system (as RTOSes have done since the last millennium). In particular:
- run ONLY the code that accesses the SD disk in loop().
- run all other code (acquisition, processing) at interrupt level: HW interfaces at higher interrupt priorities, processing at software-triggered lower priorities, but still at interrupt level.
- if needed, use the 16 MB PSRAM on the T4.1 as a buffer.

This way you exploit the ARM architecture.
 

Yes, that's an alternative, though it doesn't allow any higher data rate to SD, and it's in contradiction to the usual embedded system advice of making ISRs as short as possible. Wouldn't that make it difficult or impossible to use libraries that are not designed to run at interrupt level?
 
I'm finding that my need for long runs at 18 MB/s is pushing it. I believe there is a need for a 50 ms buffer, but that is 900 KB, and I can get only about 400 KB, which is not enough. I'll run a good 7 seconds or so before it fails, and the failure time seems very consistent. I am not doing a full erase each time, but using the SdFat bench example, it doesn't seem to make much difference. Though I guess if it shortens those 'housekeeping' times, it would not show up in the average speed reported by the test.
I may have to reduce the number of channels I save or, perhaps, note dropouts and log them. I can lose a bit of data every now and then.
Dang.
 
Been a while, but I think the highest sustained rate I could achieve was about 10 MB/s, with a 500KB buffer in RAM. I tried buffering in PSRAM, but it wasn’t fast enough, and I’m pretty sure I was using the higher clock frequency discussed in some forum threads.

Can you scale your data to fewer bits per value?
 
That makes sense: a 50 ms buffer at 500 KB is 10 MB/s. Thanks; not happy about it, but I know I need a backup plan.
Yeah, three options:
1. Reduce the resolution as you suggest
2. Reduce the number of channels. I have 384 and most of them are not used generally, so that’s viable
3. Catch when the buffer is close to full and just ignore data for however long it takes. This may be the easiest to implement and maximizes my data capture. I’ll need to track the gaps.

I need to check what class SD card I’m using. It is a good one. I wonder if there might be a better class that minimizes lost time. Or is it just a given that one has to have a large data buffer? I don’t have room to add that. My board is only 17x21mm and there just ain’t no more room.
 
We are interfacing to a Neuropixels probe. It can digitize up to 384 channels of ephys data: 12 bits per channel at 30 kHz, plus frame syncs.
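As a quick sanity check on that rate (my arithmetic, using only the figures quoted above): 384 channels at 30 kHz and 12 bits works out to roughly the 18 MB/s mentioned earlier:

```cpp
#include <cstdint>

// 384 channels x 30 kHz x 12 bits, ignoring frame syncs and overhead
constexpr uint64_t kChannels      = 384;
constexpr uint64_t kSampleRateHz  = 30000;
constexpr uint64_t kBitsPerSample = 12;
constexpr uint64_t kBytesPerSec   = kChannels * kSampleRateHz * kBitsPerSample / 8;
// about 17.3 MB/s before sync/overhead
```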

I’ll post the whole design and all when it’s all completed.

I’m not using RAM2. Can it be used as a buffer for this?
 
I'll look into that. I made up my own buffer management because I want it to be really simple and fast; for example, it's a multiple of the SD buffer length, so rollover is simple.

I think I'll see if I can have buffers in both RAM areas. When the first one fills up, jump to the second one; then when that fills, jump back to the start of the first. Might that work? It's not clear to me that the access times of the two RAMs are equally fast, but that would give me around 900 KB.

I appreciate all the feedback!
 
RAM1 is "TCM" (tightly coupled memory) and is faster than RAM2. In my applications, having the buffer in RAM2 made no noticeable difference. I'm sure you could split your buffer across RAM1 and RAM2. I always use RingBuf from SdFat, which I like because the file is associated with the buffer, so splitting the buffer in two is something I've never considered.
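A minimal sketch of that two-region wrap logic (my own illustration, not from any library; on a Teensy 4.1 the second array would carry the DMAMEM attribute so it lands in RAM2, but here both are plain arrays so the indexing can be shown and tested):

```cpp
#include <cstddef>
#include <cstdint>

// Two fixed regions treated as one logical ring: writes fill region a,
// spill into region b, then wrap back to the start of a.
// On a Teensy 4.1, b would be a separate DMAMEM array to live in RAM2.
template <std::size_t N1, std::size_t N2>
struct SplitBuffer {
  uint8_t a[N1] = {};
  uint8_t b[N2] = {};
  std::size_t head = 0;  // next linear write position in [0, N1+N2)

  void put(const uint8_t* src, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
      std::size_t p = (head + i) % (N1 + N2);
      if (p < N1) a[p] = src[i];
      else        b[p - N1] = src[i];
    }
    head = (head + len) % (N1 + N2);
  }
};
```

The reader side (draining to SD) would track its own tail index the same way; fullness checking is left to the caller.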
 
I would assume that your data are somewhat correlated and would try some compression. I assume that you already pack 8 channels into 3 int32 words.
As most samples typically need fewer than 12 bits, you could pack more densely by inspecting each (large) data buffer for the maximum bit depth required to describe the data and packing accordingly. Also, you may first shift right by 1 or 2 bits, as the lowest bits are usually ADC noise. If the data are correlated, you could first apply a first-order difference (keeping the first value) before packing.
You would add a header to each data block with a timestamp (millis), the bit depth, the number of samples, and the first (uncompressed) sample. This is pure integer compression, but I do not know its name.
I do it all the time, and it is easy and quick. The actual implementation may depend on the existing ADC-DMA processing chain.
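A hypothetical sketch of that scheme (names and layout are mine, not a known format): keep the first sample verbatim, then pack first-order differences at the minimum bit depth needed for the block. The header fields (timestamp, etc.) are omitted to keep it short:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative block format: first sample, per-block bit depth, packed
// two's-complement differences. Not from any library.
struct PackedBlock {
  int16_t first;                 // first sample, uncompressed
  uint8_t bits;                  // bit depth of each difference
  std::vector<uint32_t> words;   // packed differences, LSB-first
  std::size_t count;             // number of differences
};

PackedBlock packBlock(const std::vector<int16_t>& s) {
  PackedBlock b{s.empty() ? int16_t(0) : s[0], 1, {}, s.empty() ? 0 : s.size() - 1};
  // find the smallest signed bit depth that holds every difference
  for (std::size_t i = 1; i < s.size(); ++i) {
    int32_t d = s[i] - s[i - 1];
    while (d < -(1 << (b.bits - 1)) || d >= (1 << (b.bits - 1))) b.bits++;
  }
  uint64_t acc = 0; unsigned used = 0;
  for (std::size_t i = 1; i < s.size(); ++i) {
    uint32_t d = uint32_t(s[i] - s[i - 1]) & ((1u << b.bits) - 1);
    acc |= uint64_t(d) << used; used += b.bits;
    if (used >= 32) { b.words.push_back(uint32_t(acc)); acc >>= 32; used -= 32; }
  }
  if (used) b.words.push_back(uint32_t(acc));
  return b;
}

std::vector<int16_t> unpackBlock(const PackedBlock& b) {
  std::vector<int16_t> out{b.first};
  uint64_t acc = 0; unsigned have = 0; std::size_t w = 0;
  for (std::size_t i = 0; i < b.count; ++i) {
    while (have < b.bits) { acc |= uint64_t(b.words[w++]) << have; have += 32; }
    uint32_t d = uint32_t(acc) & ((1u << b.bits) - 1);
    acc >>= b.bits; have -= b.bits;
    // sign-extend the b.bits-wide difference and accumulate
    int32_t sd = int32_t(d << (32 - b.bits)) >> (32 - b.bits);
    out.push_back(int16_t(out.back() + sd));
  }
  return out;
}
```

A real implementation would pack straight from the DMA buffers and prepend the header described above; this just shows the bit-depth scan and the pack/unpack round trip.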
 
It's too bad that you are space-constrained. If this is a pure store-now/analyze-later system, you could always try doubling your storage rate by adding a second T4.1 and having the two boards take turns capturing samples and writing them to SD. That way you get twice the SD write bandwidth, twice the time to collect, and twice the buffer space. Most of the software on the two would be the same; one Teensy would have a bit more code to tell the other when to start and stop.

Incremental bill-of-materials cost: bigger PCB: $2; a few logic chips: $1; USB 2-port hub: $5; Teensy 4.1: $35.

If you really must fit the packaging constraints, you could hire a really good engineer and layout person to reduce each T4.1 to a 0.9 x 0.9-inch two-PCB stack 0.5" thick. You can't get much smaller and still use the same MPU package while keeping connectivity for 8- or 16-bit internal and external connectors and a micro-SD card.
 
Great ideas, thanks!

The problem with compression is that it takes time and resources, though I do have an FPGA (which does the packing of the 12-bit samples out to 32-bit words). That might be a good backup plan.

I actually did a data logger about 20 years ago with similar capabilities. I had to stripe two SCSI drives since one was too slow. It's amazing to me that that design was about the size of a desktop computer and now we can make it so small. I had thought about striping two SD cards, as the spec does provide for that, though I imagine exFAT does not deal with it. Doubling up also doubles power, and I need to run at least 30 minutes (hopefully 1 hour) on a couple of small lithium batteries.

But - some good news!
I took the advice about the RAM2 area. I made two buffers, one in each memory area. I was able to get 439,296 bytes in RAM1 and 505,856 bytes in RAM2 for a total of 945,152 bytes. When I hit the end of one buffer I jump to the start of the other. I track the maximum buffer usage. I did a 150 second run and max buffer usage was 132,227. However I did have some data errors. I tried a different card and it worked, no errors. Max buffer usage 134,904.

I also realized my SD Cards are only class I, which are only rated at 10MB/s (oops!). I have some high speed cards on order.
 

That's good. Your total buffer is 945,152 bytes, so when you say max buffer usage was 132,227, does that mean you don't need both buffers after all?

Not a short-term fix, but if you want to get the RAM limit out of the way entirely, see the link below and the work by @Dogbone06 on DIY boards with 32 MB of SDRAM.

https://forum.pjrc.com/index.php?threads/diy-teensy-sdram-solder-yourself.76887/#post-357855
 