Onboard flash can be faster...

jmarsh

Currently the read speed for onboard flash on Teensy 4.x is slower than it could be.
The flash supports a "continuous read" mode where successive reads can skip sending the command byte (0xEB for reads). Since this byte is sent in SPI mode on a single data line, it costs 8 FlexSPI clocks (60 ns, or 36 CPU cycles @ 600 MHz) on every read. FlexSPI supports this mode by using JMP_ON_CS in the read LUT entry; the only tricky part is disabling continuous read mode when we want to send a different command to the flash (e.g. erasing/writing), but that's pretty easy to handle.
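
For anyone who wants to see what this looks like at the LUT level, here is a minimal host-side sketch of a quad read sequence that ends in JMP_ON_CS. It is not copied from the PR. The opcode numbers and the 16-bit instruction packing (6-bit opcode, 2-bit pad field, 8-bit operand) are as I understand them from the i.MX RT1060 reference manual. The helper names are made up for illustration, and the mode byte (0x20) and the 4 dummy cycles are assumptions for a Winbond-style part, so check the flash datasheet before relying on them.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// FlexSPI LUT opcodes (values as listed in the i.MX RT1060 reference manual).
constexpr uint32_t kStop     = 0x00;  // end of sequence
constexpr uint32_t kCmdSdr   = 0x01;  // send a command byte
constexpr uint32_t kRaddrSdr = 0x02;  // send the row address
constexpr uint32_t kMode8Sdr = 0x07;  // send 8 mode bits
constexpr uint32_t kReadSdr  = 0x09;  // read data
constexpr uint32_t kDummySdr = 0x0C;  // dummy clocks
constexpr uint32_t kJmpOnCs  = 0x1F;  // stop, and start the next run at instruction <operand>

// Pad field encoding: 0 = 1 data line, 2 = 4 data lines.
constexpr uint32_t kPads1 = 0;
constexpr uint32_t kPads4 = 2;

// ASSUMPTION: mode bits M5-4 = (1,0) keep a Winbond-style part in continuous read mode.
constexpr uint32_t kModeContinuous = 0x20;

// Cost of the command byte, using the figures from the post: 60 ns at 600 MHz.
constexpr uint32_t kCpuCyclesPerCommandByte = 60 * 600 / 1000;

// Pack one 16-bit LUT instruction.
constexpr uint32_t lutInstr(uint32_t opcode, uint32_t pads, uint32_t operand) {
  assert(opcode < 64 && pads < 4 && operand < 256);
  return (opcode << 10) | (pads << 8) | operand;
}

// Each 32-bit LUT word holds two instructions, the first one in the low half.
constexpr uint32_t lutWord(uint32_t first, uint32_t second) {
  return first | (second << 16);
}

// One LUT sequence (4 words = 8 instruction slots) for a quad read with continuous mode.
constexpr std::array<uint32_t, 4> continuousReadSequence() {
  return {
      // 0: command 0xEB on one line, 1: 24-bit address on four lines
      lutWord(lutInstr(kCmdSdr, kPads1, 0xEB), lutInstr(kRaddrSdr, kPads4, 24)),
      // 2: mode byte (2 clocks on four lines), 3: remaining dummy clocks (assumed 4)
      lutWord(lutInstr(kMode8Sdr, kPads4, kModeContinuous), lutInstr(kDummySdr, kPads4, 4)),
      // 4: read data, 5: on the next chip select start at instruction 1 (skips 0xEB)
      lutWord(lutInstr(kReadSdr, kPads4, 4), lutInstr(kJmpOnCs, kPads1, 1)),
      // 6, 7: unused slots
      lutWord(lutInstr(kStop, kPads1, 0), lutInstr(kStop, kPads1, 0)),
  };
}
```

The first read runs instructions 0 to 5. JMP_ON_CS then saves index 1 as the start pointer, so every later read begins at the address phase and the 8 clocks for the command byte are not spent again.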

PR to implement it is here: https://github.com/PaulStoffregen/cores/pull/785
 
Another idea (I haven't tested this one) would allow interrupts to stay enabled during erasing/programming. Reorder the LUT as follows:
- entry X: the erase/program suspend command (0x75) including the timing delay for tSUS
- entry X+1: the regular read command (without continuous reading enabled)
- entry X+2: the erase/program resume command (0x7A)

When programming begins, ARDSEQID and ARDSEQNUM in the FLSHA1CR2 register are pointed at the 3 LUT entries beginning at X. So if a read occurs during programming, the operation automatically gets suspended, the read is performed, and the operation is resumed. When programming finishes, FLSHA1CR2 gets restored to its regular values.

(I guess you could try and fit all the commands into one LUT entry, I'm just not sure they would fit.)
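
The register side of that idea can be sketched as below. This is an untested, host-side illustration only: the field positions (ARDSEQID in bits 3:0, ARDSEQNUM in bits 7:5, stored as count minus one) are as I read them in the i.MX RT1060 reference manual, and `withAhbReadSequences` and `AhbReadRedirect` are hypothetical names, not anything from the Teensy core. It works on any `volatile uint32_t`, so a plain variable stands in for FLSHA1CR2 here.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMED FLSHxCR2 field layout, per the i.MX RT1060 reference manual.
constexpr uint32_t kArdSeqIdMask   = 0x0Fu;        // bits 3:0, first LUT sequence for AHB reads
constexpr uint32_t kArdSeqNumShift = 5;
constexpr uint32_t kArdSeqNumMask  = 0x07u << 5;   // bits 7:5, number of sequences minus one

// Return cr2 with the AHB read fields pointing at `count` sequences starting at `firstSeq`.
// All other bits (write sequence fields and so on) are left untouched.
constexpr uint32_t withAhbReadSequences(uint32_t cr2, uint32_t firstSeq, uint32_t count) {
  assert(count >= 1 && count <= 8 && firstSeq + count <= 16);
  uint32_t cleared = cr2 & ~(kArdSeqIdMask | kArdSeqNumMask);
  return cleared | firstSeq | ((count - 1) << kArdSeqNumShift);
}

// Hypothetical scope guard: redirect AHB reads to suspend/read/resume at X, X+1, X+2
// while an erase or program is running, then restore the saved register value.
struct AhbReadRedirect {
  volatile uint32_t &reg;
  uint32_t saved;

  AhbReadRedirect(volatile uint32_t &r, uint32_t x) : reg(r), saved(r) {
    reg = withAhbReadSequences(saved, x, 3);
  }
  ~AhbReadRedirect() { reg = saved; }

  AhbReadRedirect(const AhbReadRedirect &) = delete;
  AhbReadRedirect &operator=(const AhbReadRedirect &) = delete;
};
```

On real hardware the guard would wrap the erase/program call, with the register reference bound to FLSHA1CR2. Whether the controller picks up the new fields cleanly while code is executing from flash is exactly the part that has not been tested.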
 
@jmarsh - seemed interesting - grabbed the two files to build with IDE 1.
Three sketches worked.
#1 - current ST display sketch - maybe two of them - nothing special with FLASHMEM

#2 - LittleFS/Integrity/PROG - runs LittleFS with PROG FLASH storage - edited the code in 'functions' to be FLASHMEM to force more flash reads.
Did not do any speed test of file read versus write - but the file data is error checked with reads after each write.

#3 - the CODE4CODE.ino where FLASH is filled with 4000 FLASHMEM functions created to call down into each other. This was done to test the effect of LOCKING on speed and function. It still works, and <PR> runs a bit faster, going by the hinky math that tries to compare one set of functions in RAM1 against the same functions in FLASHMEM.

Current release code - more us:
Code:
    Not enabled _isr() char[] Testing.
Cascading 4000 calls took 5254362 us [3044818683 piCycles] : net 22715 us
Direct calls took 5227382 us [3041138942 piCycles] : net 5227377 us

    ENABLED _isr() char[] Test @50 us    _isr Cycles 347619320 of 582000082 : CPU %=0.597284
    ENABLED _isr() char[] Test @100 us    _isr Cycles 177148619 of 582000080 : CPU %=0.304379
    ENABLED _isr() char[] Test @200 us    _isr Cycles 87408130 of 582000080 : CPU %=0.150186
Cascading 4000 calls took 6172637 us [6131896 piCycles] : net {less Pi} 40741 us
    _isr Cycles 520992150 of 3592475233 : CPU %=0.145023
Cascading 4000 calls took 6172637 us : net {less isr} 3568763930 us
Cascading 4000 calls took 6172637 us : Cycles/call 1543

Direct calls took 6178916 us [6176388 piCycles] : net {less Pi} 2528 us
    _isr Cycles 542448320 of 3596129045 : CPU %=0.150842
Direct calls took 6178916 us [3594658348 piCycles] : net {less isr} 5246875 us
Direct calls took 6178916 us [3594658348 piCycles] : Cycles/call 1544

Altered LUT usage per <PR> - fewer us:
Code:
    Not enabled _isr() char[] Testing.
Cascading 4000 calls took 5253589 us [3044763703 piCycles] : net 22037 us
Direct calls took 5227378 us [3041136902 piCycles] : net 5227373 us

    ENABLED _isr() char[] Test @50 us    _isr Cycles 344390335 of 582000082 : CPU %=0.591736
    ENABLED _isr() char[] Test @100 us    _isr Cycles 175504696 of 582000081 : CPU %=0.301554
    ENABLED _isr() char[] Test @200 us    _isr Cycles 86591319 of 582000080 : CPU %=0.148782
Cascading 4000 calls took 6161680 us [6123445 piCycles] : net {less Pi} 38235 us
    _isr Cycles 515081061 of 3586098205 : CPU %=0.143633
Cascading 4000 calls took 6161680 us : net {less isr} 3563845279 us
Cascading 4000 calls took 6161680 us : Cycles/call 1540

Direct calls took 6168667 us [6166350 piCycles] : net {less Pi} 2317 us
    _isr Cycles 536511642 of 3590164013 : CPU %=0.149439
Direct calls took 6168667 us [3588815854 piCycles] : net {less isr} 5246826 us
Direct calls took 6168667 us [3588815854 piCycles] : Cycles/call 1542

So, the hinky math seems to be doing something right in detecting less run time.
This T_4.1 is not Locked - but both builds ran on the same unit with no errors detected, so that should not matter.
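
To put a number on "a bit faster", the totals from the two ISR-enabled runs above can be compared directly. A small sketch, with `relativeSaving` being a made-up helper name; the inputs are the "took ... us" figures copied from the two logs. It works out to roughly 0.18% for the cascading calls and 0.17% for the direct calls.

```cpp
#include <cassert>
#include <cstdint>

// Fraction of run time saved going from `beforeUs` to `afterUs`.
constexpr double relativeSaving(uint64_t beforeUs, uint64_t afterUs) {
  assert(beforeUs > 0 && afterUs <= beforeUs);
  return static_cast<double>(beforeUs - afterUs) / static_cast<double>(beforeUs);
}

// Totals with the _isr() test enabled: current release code vs. altered LUT usage.
constexpr double kCascadingSaving = relativeSaving(6172637, 6161680);
constexpr double kDirectSaving    = relativeSaving(6178916, 6168667);
```

This is a single run of each build on one board, so treat the figures as an indication rather than a benchmark.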

Code:
Memory Usage on Teensy 4.1:
  FLASH: code:1214588, data:328112, headers:8656   free for files:6575108
   RAM1: variables:83424, code:45992, padding:19544   free for local variables:375328
 