Onboard flash can be faster...

jmarsh

Currently, reads from the onboard flash on Teensy 4.x are slower than they need to be.
The flash chip supports a "continuous read" mode where successive reads can skip sending the command byte (0xEB for reads). Since this byte is sent in SPI mode on a single data line, it costs 8 FlexSPI clocks (60ns / 36 CPU cycles @ 600MHz) on every read. FlexSPI supports this mode by using JUMP_ON_CS in the read LUT entry; the only tricky part is leaving continuous read mode before sending a different command to the flash (e.g. erasing/writing), but that's pretty easy to handle.
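As a rough sketch of where the 60ns / 36 cycle figures come from (the helper names are made up for illustration, and the FlexSPI clock is assumed to be the nominal 133.33MHz, i.e. 400/3 MHz):

```cpp
#include <cassert>
#include <cstdint>

// Assumption: FlexSPI runs at 400/3 MHz (nominal 133.33MHz), so one clock is 7.5ns.
// Integer arithmetic is used throughout to avoid rounding surprises.

// Duration of `clocks` FlexSPI clocks, in nanoseconds (rounded down).
constexpr uint32_t flexspi_clocks_to_ns(uint32_t clocks) {
  return clocks * 3000u / 400u;
}

// The same duration expressed in CPU cycles for a core running at `cpu_mhz`.
constexpr uint32_t flexspi_clocks_to_cpu_cycles(uint32_t clocks, uint32_t cpu_mhz) {
  return clocks * cpu_mhz * 3u / 400u;
}

// One command byte sent on a single data line takes one clock per bit.
constexpr uint32_t kCommandByteClocks = 8;

static_assert(flexspi_clocks_to_ns(kCommandByteClocks) == 60, "8 clocks = 60ns");
static_assert(flexspi_clocks_to_cpu_cycles(kCommandByteClocks, 600) == 36,
              "8 clocks = 36 CPU cycles at 600MHz");
```

That is the overhead continuous read mode removes from each read after the first one.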

PR to implement it is here: https://github.com/PaulStoffregen/cores/pull/785
 
Another idea (I haven't tested this one) that would allow interrupts to stay enabled during erasing/programming is to reorder the LUT as follows:
- entry X: the erase/program suspend command (0x75) including the timing delay for tSUS
- entry X+1: the regular read command (without continuous reading enabled)
- entry X+2: the erase/program resume command (0x7A)

When programming begins, the ARDSEQID and ARDSEQNUM fields of the FLSHA1CR2 register are pointed at the 3 LUT entries beginning at X. So if a read occurs during programming, the operation automatically gets suspended, then resumed once the read has been performed. When programming finishes, FLSHA1CR2 gets restored to its regular values.

(I guess you could try and fit all the commands into one LUT entry, I'm just not sure they would fit.)
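For illustration only, pointing the AHB read sequence at "3 entries starting at X" could be sketched as plain bit packing. The bit positions here are an assumption taken from my reading of the i.MX RT1060 reference manual (ARDSEQID in bits 3:0, ARDSEQNUM in bits 7:5, stored as count minus one), and `with_ahb_read_sequence` is a made-up helper, not something from the Teensy core:

```cpp
#include <cassert>
#include <cstdint>

// Assumed FLSHxCR2 layout (check against the reference manual before relying on it):
//   bits 3:0  ARDSEQID  - index of the first LUT sequence used for AHB reads
//   bits 7:5  ARDSEQNUM - number of sequences to run, minus one
constexpr uint32_t kArdSeqIdMask = 0xFu;
constexpr uint32_t kArdSeqNumShift = 5;
constexpr uint32_t kArdSeqNumMask = 0x7u << kArdSeqNumShift;

// Return `cr2` with the AHB read sequence fields replaced; all other bits are kept.
uint32_t with_ahb_read_sequence(uint32_t cr2, uint32_t first_entry, uint32_t entry_count) {
  assert(first_entry <= 15);
  assert(entry_count >= 1 && entry_count <= 8);
  cr2 &= ~(kArdSeqIdMask | kArdSeqNumMask);
  cr2 |= first_entry;
  cr2 |= (entry_count - 1) << kArdSeqNumShift;
  return cr2;
}
```

The idea would be to save the original register value when programming starts, write the modified value, and write the saved value back once programming finishes.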
 
@jmarsh - seemed interesting - grabbed the two files for IDE 1 building.
Three sketches worked.
#1 - current ST display sketch - maybe two of them - nothing special with FLASHMEM

#2 - LittleFS/Integrity/PROG - runs LittleFS with PROG FLASH storage - edited the code in 'functions' to be FLASHMEM to force more flash read.
Did not do any speed test of file read versus write - but the file data is error checked with reads after write.

#3 - the CODE4CODE.ino where FLASH is filled with 4000 FLASHMEM functions created to call down into each other. This was done to test effect of LOCKING on speed and function. It still works and <PR> runs a bit faster, based on the hinky math attempted to compare one set in RAM1 to those in FLASHMEM for the same function.

Current release code - more us:
Code:
    Not enabled _isr() char[] Testing.
Cascading 4000 calls took 5254362 us [3044818683 piCycles] : net 22715 us
Direct calls took 5227382 us [3041138942 piCycles] : net 5227377 us

    ENABLED _isr() char[] Test @50 us    _isr Cycles 347619320 of 582000082 : CPU %=0.597284
    ENABLED _isr() char[] Test @100 us    _isr Cycles 177148619 of 582000080 : CPU %=0.304379
    ENABLED _isr() char[] Test @200 us    _isr Cycles 87408130 of 582000080 : CPU %=0.150186
Cascading 4000 calls took 6172637 us [6131896 piCycles] : net {less Pi} 40741 us
    _isr Cycles 520992150 of 3592475233 : CPU %=0.145023
Cascading 4000 calls took 6172637 us : net {less isr} 3568763930 us
Cascading 4000 calls took 6172637 us : Cycles/call 1543

Direct calls took 6178916 us [6176388 piCycles] : net {less Pi} 2528 us
    _isr Cycles 542448320 of 3596129045 : CPU %=0.150842
Direct calls took 6178916 us [3594658348 piCycles] : net {less isr} 5246875 us
Direct calls took 6178916 us [3594658348 piCycles] : Cycles/call 1544

Altered LUT usage per <PR> - fewer us:
Code:
    Not enabled _isr() char[] Testing.
Cascading 4000 calls took 5253589 us [3044763703 piCycles] : net 22037 us
Direct calls took 5227378 us [3041136902 piCycles] : net 5227373 us

    ENABLED _isr() char[] Test @50 us    _isr Cycles 344390335 of 582000082 : CPU %=0.591736
    ENABLED _isr() char[] Test @100 us    _isr Cycles 175504696 of 582000081 : CPU %=0.301554
    ENABLED _isr() char[] Test @200 us    _isr Cycles 86591319 of 582000080 : CPU %=0.148782
Cascading 4000 calls took 6161680 us [6123445 piCycles] : net {less Pi} 38235 us
    _isr Cycles 515081061 of 3586098205 : CPU %=0.143633
Cascading 4000 calls took 6161680 us : net {less isr} 3563845279 us
Cascading 4000 calls took 6161680 us : Cycles/call 1540

Direct calls took 6168667 us [6166350 piCycles] : net {less Pi} 2317 us
    _isr Cycles 536511642 of 3590164013 : CPU %=0.149439
Direct calls took 6168667 us [3588815854 piCycles] : net {less isr} 5246826 us
Direct calls took 6168667 us [3588815854 piCycles] : Cycles/call 1542

So, the hinky math seems to be doing something right in detecting less run time.
This T_4.1 is not Locked - but both builds ran on the same unit without any errors detected, so that shouldn't be relevant.

Code:
Memory Usage on Teensy 4.1:
  FLASH: code:1214588, data:328112, headers:8656   free for files:6575108
   RAM1: variables:83424, code:45992, padding:19544   free for local variables:375328
 
I tested the idea of using the suspend/resume commands instead of keeping interrupts disabled while polling the busy status. It doesn't seem to work... the FlexSPI documentation is quite vague about how executing multiple LUT sequences as a single command works, and I suspect it stops as soon as it sees the first STOP instruction - which makes the whole idea of the ARDSEQNUM field pointless when it could just run sequentially until it reaches a STOP. There's no other method to force CS high during a sequence so it's impossible to execute suspend followed by a read followed by a resume.

My fallback idea is to use interrupt priority 0 for "safe" interrupts that are absolutely essential (can't be blocked by flash writing) and set BASEPRI to 16 instead of disabling interrupts during busy polling.
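To make the BASEPRI idea concrete, here is a tiny host-side model of the ARMv7-M masking rule (it doesn't touch any hardware, and `blocked_by_basepri` is just an illustrative name). As far as I know the Teensy 4.x NVIC implements 4 priority bits, so 16 is the lowest non-zero priority value and BASEPRI = 16 leaves only priority 0 running:

```cpp
#include <cassert>
#include <cstdint>

// ARMv7-M rule: BASEPRI == 0 disables masking; otherwise every interrupt whose
// priority value is numerically >= BASEPRI is held off until BASEPRI is lowered.
constexpr bool blocked_by_basepri(uint8_t irq_priority, uint8_t basepri) {
  return basepri != 0 && irq_priority >= basepri;
}

// With BASEPRI = 16, a priority 0 ("safe") interrupt still fires during busy polling...
static_assert(!blocked_by_basepri(0, 16), "priority 0 is never masked by BASEPRI");
// ...while priority 16 and anything less urgent is deferred.
static_assert(blocked_by_basepri(16, 16), "priority 16 is masked");
static_assert(blocked_by_basepri(128, 16), "default-ish priorities are masked");
```

The catch is that every priority 0 handler, and everything it calls or reads, has to live in RAM rather than FLASHMEM.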
 
Actually... something strange is going on. I have a flexspi_setup() function that gets called first thing from setup() that unlocks the LUT, sets the required entries then locks it again. When the Teensy has been freshly programmed, this function works as expected. But when it's restarted without programming, it fails and leaves the flash inaccessible causing the program to die with an IBUSERR at the first FLASHMEM instruction.

In both cases (programming+restart vs. restart only) the initial state of the LUT is LOCKED. So the unlocking seems to work... except when it doesn't?

Edit: of course I figured it out immediately after posting... all of the unused LUT instructions have to be zeroed, not just the first unused one... which raises more possibilities!
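A minimal sketch of that fix, written against a plain array so it can be tried on a PC. It assumes each LUT sequence is four 32-bit words (eight 16-bit instructions) and that an all-zero instruction acts as STOP; `write_lut_sequence` is a hypothetical helper, not part of the Teensy core:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: one LUT sequence = 4 words = 8 instructions of 16 bits each.
constexpr size_t kWordsPerSequence = 4;

// Write `used` words into one LUT sequence slot and zero every remaining word,
// so no stale instruction from an earlier configuration is left behind.
void write_lut_sequence(volatile uint32_t *slot, const uint32_t *words, size_t used) {
  assert(used <= kWordsPerSequence);
  for (size_t i = 0; i < kWordsPerSequence; ++i) {
    slot[i] = (i < used) ? words[i] : 0u;
  }
}
```

On the real hardware `slot` would point at the first word of the sequence inside the unlocked FlexSPI LUT; here it can be any 4-word buffer.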
 
Well I got it working... but the results aren't great. Here's the reason why:
[Attached image: flash_suspend.png]

This is the timing for the flash memory's suspend command. tSUS is defined as a "maximum" (meaning it has to be implemented as a minimum delay) of 20 microseconds, with a small footnote: "Value guaranteed by design and/or characterization, not 100% tested in production." Well guess what, it's actually closer to 25 microseconds - waiting only 20 means the flash chip won't respond to a subsequent read command. The same delay (tSUS) must be observed after a resume command, for two reasons: the datasheet explicitly requires it before suspending again, and the resume command immediately clears the SUS status bit while the BUSY status bit is not restored until tSUS has passed. This makes the chip look like it has finished the programming command if you read the status registers too soon after resuming...

The solution I found is setting CSINTERVAL (in FLSHA1CR1) to 3500 - this ensures CS will stay high for 26.25us (based on the FlexSPI clock of 133MHz) between commands, which is really the only way to ensure it happens for the suspend and resume commands. But it also slows down the flash operation in general and if it does have to execute code from flash (which is the whole point of this exercise) the CPU stalls, which is basically the same thing as having the interrupts disabled. I'm pretty sure I've also seen the FlexSPI AHB timeouts get triggered although I haven't logged the interrupts to confirm it.
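The CSINTERVAL arithmetic can be checked with a couple of helpers. This is only a sketch: the names are made up, one CSINTERVAL count is assumed to equal one FlexSPI clock, and the clock is assumed to be the nominal 133.33MHz (7.5ns per clock):

```cpp
#include <cassert>
#include <cstdint>

// Assumption: FlexSPI clock = 400/3 MHz, so N clocks last N * 3000 / 400 nanoseconds.

// How long CS stays high for a given CSINTERVAL count, in nanoseconds.
constexpr uint64_t cs_interval_to_ns(uint64_t counts) {
  return counts * 3000u / 400u;
}

// Smallest CSINTERVAL count that keeps CS high for at least `delay_ns`.
constexpr uint64_t min_cs_interval_for_ns(uint64_t delay_ns) {
  return (delay_ns * 400u + 2999u) / 3000u;  // ceiling division
}

static_assert(cs_interval_to_ns(3500) == 26250, "3500 counts = 26.25us");
static_assert(min_cs_interval_for_ns(20000) == 2667, "datasheet tSUS of 20us");
static_assert(min_cs_interval_for_ns(25000) == 3334, "observed ~25us");
```

So 3500 leaves roughly 5% of margin over the ~25 microseconds actually observed, rather than over the 20 microseconds in the datasheet.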
 
Modifying LittleFS so it only erases a block when it actually needs it helps a lot:
Code:
int LittleFS_Program::static_erase(const struct lfs_config *c, lfs_block_t block)
{
  //Serial.printf("   prog er: block=%d\n", block);
  uint8_t *p = (uint8_t *)(baseaddr + block * SECTOR_SIZE);
  // Scan the block and erase only if some byte is not already erased (0xFF).
  for (size_t i = 0; i < SECTOR_SIZE; i++) {
    if (p[i] != 0xFF) {
#if SECTOR_SIZE == 4096
      eepromemu_flash_erase_sector(p);
#elif SECTOR_SIZE == 32768
      eepromemu_flash_erase_32K_block(p);
#elif SECTOR_SIZE == 65536
      eepromemu_flash_erase_64K_block(p);
#else
      #error "Program SECTOR_SIZE must be 4096, 32768, or 65536"
#endif
      break; // one erase clears the whole block, so stop scanning
    }
  }
  return 0;
}

I call formatUnused(0,0) directly after the filesystem is mounted so all of the empty blocks are pre-erased.
 
Good! :: formatUnused() was written with that 'Pre-Erase' in mind!

Odd that there is a perf diff, as the code already has a blockIsBlank() check before erase. Maybe there is another path?
Code:
int LittleFS_SPIFlash::erase(lfs_block_t block)
{
    if (!port) return LFS_ERR_IO;
    void *buffer = malloc(config.read_size);
    if ( buffer != nullptr) {
        if ( blockIsBlank(&config, block, buffer)) {
            free(buffer);
            return 0; // Already formatted exit no wait
        }
        free(buffer);
    }
...

Here and in the NAND code:
T:\T_Drive\arduino-1.8.19\hardware\teensy\avr\libraries\LittleFS\src\LittleFS.cpp:
278: static bool blockIsBlank(struct lfs_config *config, lfs_block_t block, void *readBuf, bool full=true );
 
Look at the class: that is used for SPIFlash, not the onboard flash (LittleFS_Program).
Except for RAM they all call out to that?

static int static_erase(const struct lfs_config *c, lfs_block_t block) {
    //Serial.printf(" flash er: block=%d\n", block);
    return ((LittleFS_SPIFlash *)(c->context))->erase(block);
}
 
No they don't. Each class has its own static_erase function. Think about it: they all have their own different erase procedures, they couldn't use the same function.
 
Each class
For each?
Code:
T:\T_Drive\arduino-1.8.19\hardware\teensy\avr\libraries\LittleFS\src\LittleFS.cpp:
  276  }
  277 
  278: static bool blockIsBlank(struct lfs_config *config, lfs_block_t block, void *readBuf, bool full=true );
  279: static bool blockIsBlank(struct lfs_config *config, lfs_block_t block, void *readBuf, bool full )
  280  {
  281      if (!readBuf) return false;
  ...
  342          uint8_t jjbit = 1<<(block%8);
  343          if ( !(checkused[iiblk] & jjbit) ) { // block not in use
  344:             if ( !blockIsBlank(&config, block, buffer, false )) {
  345                  (*config.erase)(&config, block);
  346                  jj++;
  ...
  369      for (unsigned int block=0; block < config.block_count; block++) {
  370          if (pr && progressChar && (0 == block%ii) ) pr->write(progressChar);
  371:         if (!blockIsBlank(&config, block, buffer)) {
  372              (*config.erase)(&config, block);
  373          }
  ...
  451      void *buffer = malloc(config.read_size);
  452      if ( buffer != nullptr) {
  453:         if ( blockIsBlank(&config, block, buffer)) {
  454              free(buffer);
  455              return 0; // Already formatted exit no wait
  ...
  550      void *buffer = malloc(config.read_size);
  551      if ( buffer != nullptr) {
  552:         if ( blockIsBlank(&config, block, buffer)) {
  553              free(buffer);
  554              return 0; // Already formatted exit no wait
  ...
  862      void *buffer = malloc(config.read_size);
  863      if ( buffer != nullptr) {
  864:         if ( blockIsBlank(&config, block, buffer)) {
  865              free(buffer);
  866              return 0; // Already formatted exit no wait

T:\T_Drive\arduino-1.8.19\hardware\teensy\avr\libraries\LittleFS\src\LittleFS_NAND.cpp:
  203  }
  204 
  205: static bool blockIsBlank(struct lfs_config *config, lfs_block_t block, void *readBuf, bool full=true );
  206: static bool blockIsBlank(struct lfs_config *config, lfs_block_t block, void *readBuf, bool full)
  207  {
  208      if (!readBuf) return false;
  ...
  360      void *buffer = malloc(config.read_size);
  361      if ( buffer != nullptr) {
  362:         if ( blockIsBlank(&config, block, buffer)) {
  363              free(buffer);
  364              return 0; // Already formatted exit no wait
  ...
 1172      void *buffer = malloc(config.read_size);
 1173      if ( buffer != nullptr) {
 1174:         if ( blockIsBlank(&config, block, buffer)) {
 1175              free(buffer);
 1176              return 0; // Already formatted exit no wait
 
I don't see what you're getting at. I posted the static_erase function for the LittleFS_Program class above (with a couple of extra lines added) and it never calls blockIsBlank.
 
One other thing:
During testing I have managed to lock up the Teensy dozens of times while an erase/programming operation is in progress. Very often, this resulted in completely corrupting the LittleFS filesystem causing it to be reformatted on the next mount. In one case (which I've been able to reliably reproduce) it ends up with no free space while also containing no files. The only way I was able to recover from that was to add an explicit call to reformat it.
This all seems contrary to the claims made by LittleFS to be "fail-safe" and protected from errors caused by sudden power loss. I highly recommend that anyone using it because of these "features" does their own extensive testing first.
 
managed to lock up the Teensy dozens of times
Yikes, that is unexpected and unfortunate!

Anything in particular going on at the time? Is that specific to Program/PCB Flash?
The 'Test Integrity' example ran a series of read/write/verify passes with file create and delete for hours and never came across that, though I'd guess it was run least often on the PROG media.
 
Yikes, that is unexpected and unfortunate!
Not really, considering I am purposefully enabling interrupts during erase/program and then triggering interrupts that have handlers located in FLASHMEM.

Since post #3 this thread has been about fixing the problem that writing to onboard flash requires having interrupts disabled for long periods of time, which can cause all sorts of problems if the Teensy needs to stay responsive to external signals.
 
I've decided to ditch LittleFS for the current project I'm working on; the lack of speed is just atrocious. I was seeing it take literally minutes just to rewrite 1KB of data. I don't know how it's taking so long, given it would probably take less time to erase the entire chip and write it entirely from scratch.

Since all I need is a linear 128KB block of space, I've switched to using SPIFTL. I give it a 2MB chunk of flash to work with and it abstracts it into a 512-byte-sector device (like a typical SD card or USB drive) with automatic wear levelling. The longest update times I've seen are around 500ms.
 
Any chance you could make a test case that shows this LittleFS slowness? Or if not a complete program, at least describe the usage which led to this terrible result?

I've recently had a few conversations with people who were surprised to hear LittleFS can be so slow. Some were even in disbelief. My hope is to eventually create a web page that shows the various ways to do non-volatile data storage and what the benefits and trade-offs are for each.
 
Or if not a complete program, at least describe the usage which led to this terrible result?
I'm using the Teensy to emulate a PlayStation 1 memory card. These hold 128KB of memory arranged as 1024 sectors of 128 bytes each. The Teensy communicates with the console using SPI(-ish) over FlexIO and loads/stores the data to a 128KB memory array, with each store resetting a timer to trigger an EventResponder event 6 seconds later. When triggered, that event writes dirty data from the memory array to a backing file (on flash in the LittleFS filesystem, which had 2MB total space).
I've scrubbed out all the LittleFS code now so I don't really want to add it back in, but the sticking point is that the writes were modifying the file in random locations rather than simply appending data (or rewriting the entire file), which results in LittleFS relocating all the file contents from that point onwards.
This has been a loooong running issue with LittleFS: https://github.com/littlefs-project/littlefs/issues/27
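The access pattern described above can be sketched roughly like this. It is not the actual project code: `CardImage` and its methods are invented for illustration, the 6 second timer is left out, and the storage backend is passed in as a callback so it runs anywhere:

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kSectorSize = 128;    // bytes per memory card sector
constexpr size_t kSectorCount = 1024;  // 1024 * 128 bytes = 128KB

// RAM copy of the card plus one dirty flag per sector.
class CardImage {
 public:
  // Store one sector in RAM and remember that it needs writing back.
  void store(size_t sector, const uint8_t *data) {
    assert(sector < kSectorCount);
    std::memcpy(&bytes_[sector * kSectorSize], data, kSectorSize);
    dirty_.set(sector);
  }

  // Call write(sector, data) for every dirty sector, clearing each flag as it goes.
  // Returns the number of sectors written.
  template <typename WriteFn>
  size_t flush(WriteFn write) {
    size_t written = 0;
    for (size_t s = 0; s < kSectorCount; ++s) {
      if (!dirty_.test(s)) continue;
      write(s, &bytes_[s * kSectorSize]);
      dirty_.reset(s);
      ++written;
    }
    return written;
  }

 private:
  uint8_t bytes_[kSectorSize * kSectorCount] = {};
  std::bitset<kSectorCount> dirty_;
};
```

Each flush produces small writes scattered through the backing store, which is exactly the in-place modification pattern that the linked issue is about.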

I would also note that simply opening a file with mode = FILE_WRITE or FILE_WRITE_BEGIN updates the modification time attribute which triggers a flash write. That's horrible for a flash filesystem, especially since the date/time used will likely be wrong unless the Teensy has a battery or was just programmed, but also because that should only be done on flush or close if the file was actually modified.

Another issue was when I tested using a 4KB block size for the filesystem, then switched back to 64K, LittleFS still recognized the old filesystem as valid but with no free space - it writes the blocksize into the superblock but never bothers to check that it matches the current configuration.
 
Something I did wonder about before I ditched it: LittleFS allows using a statically assigned programming buffer rather than dynamically allocating one. FlexSPI apparently allows mapping the TX FIFO to AHB memory, although I'm a bit unclear on exactly how this works: the region it gets mapped to is 4MB but the FIFO is only 128 bytes. Is it readable as well as writable? Do the contents automatically "move" as the FIFO advances? And so on. But if it does work as a "generic" mapping, it might be worth enabling so LittleFS can effectively prime the TX FIFO directly when preparing to program a page.
 