Teensyduino 1.54 Beta #5

Status
Not open for further replies.
https://forum.pjrc.com/threads/60021-Program-Flash-W25Q-speed

bootdata.c,
0x00030401, // lutCustomSeqEnable,serialClkFreq,sflashPadType,deviceType

Change that "3" to a higher number up to "8".

If want to test that for a benchmark or similar, you must keep in mind that the cache is enabled. It's not easy to test.
It plays a role for non-sequential random accesses. For data, for example.
I'd see it from the other side: Why put on a brake that is not necessary? This makes no sense.

Agreed that it is not quite right to artificially limit it - if testing show it stable at closer to SPEC speeds.
LittleFS will process some megabytes in a matter of seconds.
Test just done on QSPI chip did "Bytes read 219461632, written 49676288" in 3.27 minutes or 1,300,183 I/O Bytes/sec so the 32KB cache will only be of 'some' help, and a quick 'n'o Verify will drop the read back after write so it would almost never be in the same place twice for the cache to help.

So :: 0x00060401, // lutCustomSeqEnable,serialClkFreq,sflashPadType,deviceType
or :: 0x00080401, // lutCustomSeqEnable,serialClkFreq,sflashPadType,deviceType

Would change the speed from ? to ?? and ??.
- maybe it is in the link ... but ... time for Z's
 
Good night :)

A benchmark would involve reading some megabytes of random adresses in the the program flash area.
A simple test with coremark running in flash showed 10% improvement - but that was very quick and dirty.

But: A benchmark is not needed if you switch off an artificial brake.
 
...

So :: 0x00060401, // lutCustomSeqEnable,serialClkFreq,sflashPadType,deviceType
or :: 0x00080401, // lutCustomSeqEnable,serialClkFreq,sflashPadType,deviceType
...

From the linked post ::
0x00030401, // lutCustomSeqEnable,serialClkFreq,sflashPadType,dev iceType << 60 MHz
0x00080401, // lutCustomSeqEnable,serialClkFreq,sflashPadType,deviceType << 133 MHz

Will do a baseline at 60 MHz then repeat at 133 MHz on LittleFS _Program for 'reference' and see it there is a measurable change worth looking into
 
What is the thinking on a release date for 1.54 (not beta, but actual release)?

The inclusion of the Greiman SD in teensyduino is a game changer for me with my Tympan stuff. It's a huge leap forward to have it in teensyduino cpared to my own kludge of stuffing it into the Tympan library.

I've now pulled it all out of the Tympan library so that it relies upon the version in 1.54 Beta. But, I don't feel that I can release my own updated library until the Teensyduino stuff leaves Beta.

Is there a schedule (even a notional one) of when this SD stuff will be part of a full Teensyduino release?

Excited!

Chip

And, any chance of the non-emulated MTP Serial option (direct from the WMXZ repo) being folded into Teensyduino instead of the emulated serial? There are certain serial interactions that break when using emulated serial.

Chip
 
Good night :)

A benchmark would involve reading some megabytes of random adresses in the the program flash area.
A simple test with coremark running in flash showed 10% improvement - but that was very quick and dirty.

But: A benchmark is not needed if you switch off an artificial brake.

@Frank B - just see look for obvious benefit in something somehow testable with the work in progress for LittleFS - results are here : pjrc.com/threads/58033-LittleFS-port-to-Teensy-SPIFlash
Running Program Flash at 133 versus current 60 MHz seems to make no significant/detectable difference.

Run time for both was 2.75 Min doing file create, extend, delete in some pattern across 26 files with writes of ~5K to 310K and reads that big or larger on each write and read again on delete: Bytes read 155,625,472, written 49,676,288

There should have been SOME change ... it almost seems like the QSPI is being held back as similar external FLASH timing changes have little effect and SPI is not always half the speed of QSPI.
 
And, any chance of the non-emulated MTP Serial option (direct from the WMXZ repo) being folded into Teensyduino instead of the emulated serial? There are certain serial interactions that break when using emulated serial.

Chip

Actually I hope it is not directly... i.e. that it has its' own Product ID and not use the all of everything PID...

Hopefully at some point in the near future, would be great if we had the MTP stuff as part of Teensyduino.
 
MTP per se does not need Serial connection, but MTP is usually integrated in other programs that need serial connections.

Seremu without Arduino monitor is a little bit complicated.
Consider headless hosts in embedded remote devices.
AFAIK, there is no standard terminal program on standard hosts that is capable of Seremu. Sure there are work around for Linux (cat / echo), but for me that is not a good replacement of a terminal program.
For USB_MTPDISK_SERIAL it first needs a own PID, then download needs to be aware of this PID.
At that point we could test and eventually generate PR to cores.
 
There should have been SOME change ...

It just says
a) the test is not good and perhaps measures the cache instead or something else.
or
b) There was a additional change that makes something slow, as my test back in march showed a speedup.

I don't care. If you want 60MHz for the flash, just leave it as it it. The reason why I mentioned it here because there was no reaction, back in march. That's all.
Don't want to pollute this thread here with unwanted discussions.

Edit: If you use even higher numbers, you'll see the Teensy crashing. I take this as a proof that the clock changes.
Edit2: Nope, i re-read the manual. The highest allowed number is 9.
 
Last edited:
I'm trying to implement a callstack-print - as a first step, for hardfaults only.
Paul, during the Teensy 4.0 beta phase you said we can have more startup_hooks - would you mind to add one, for this purpose?
( after usb_init() )
I know there is the late hook - but it would be better to add a new one, reserved for this.

My idea is to copy the needed information to OCRAM, which survives a soft reset.
The hardfault just copies some data to OCRAM, then resets.
After reset, when the CPU is in a clean state again and USB is initialized, the hook would check if there is any information and print it.
 
That seems like a worthy adventure Frank! You can certainly start with the late hook. I suggested those two spots because the processor arrived at those two spots only to wait in the while( time... ) - so adding short code there for user would have no impact for the parts of the system that are usable.

Having another weak debug_hook() there to do the check for fault_exit would be nice - though not much you can do there - unless UART output or configured for I/O activity - until USB online in setup() given the RAM2 (OCRAM?) will still be safe? It almost seems the startup_late_hook() would be too early? If called just before main() then everything is usable (once serial is online) and you could immediately precede setup() and act before any code there without the user needing to add any setup() code?

Note: Paul already reserved a HAB boot storage area in the .ld for RAM2 for recording a failure in that - you probably saw that though.
 
It detects and prints a mpu nullptr trap already, after reset. So, I think I can get the rest working, too. I'll open a new thread when it is time for it.
Code:
Callstack Data found:
Size: 46 CSFR: 0x82 XPSR: 0x0 
non usage fault
      (DACCVIOL) Data Access Violation
      (MMARVALID) MemMange Fault Address Valid: 0x0
Have to pause now, time for a family visit.

Note: Paul already reserved a HAB boot storage area in the .ld for RAM2 for recording a failure in that - you probably saw that though.
Yup!
Can I use that area for my data? I think the HAB boot storage area is useful only after Power-on - is that right? If so, can i overwrite it?
 
Last edited:
Run time for both was 2.75 Min doing file create, extend, delete in some pattern across 26 files with writes of ~5K to 310K and reads that big or larger on each write and read again on delete: Bytes read 155,625,472, written 49,676,288

There should have been SOME change ... it almost seems like the QSPI is being held back as similar external FLASH timing changes have little effect and SPI is not always half the speed of QSPI.

Your LittleFS benchmarks are measuring the flash chip's page write and sector erase time.

To test this, you would need to write a large file to the chip, then measure the speed of many seek-then-read operations at various locations within the file. Separate those seek positions by at least 2K (the FlexSPI's internal buffer).


Edit: If you use even higher numbers, you'll see the Teensy crashing. I take this as a proof that the clock changes.
Edit2: Nope, i re-read the manual. The highest allowed number is 9.

The original reason for 60 MHz was Table 37 in the electrical specs.

sc.png

Also, in the early days I had imagined we would use QPI & DTR mode, which only supports about half the maximum clock speed. But I abandoned that plan after I discovered QPI mode confuses NXP's ROM. I considered making the bootloader able to detect & recover, but ultimately decided that would be too risky - and it would utterly stymie people from running their code on boards without the bootloader chip.

But we are indeed running with RXCLKSRC = 01. I double checked just now. And as far as I know, we're using only the features Winbond rates to run at max speed. So it looks like we can go up to 133 MHz.

However, we've been running with less than the recommended value of 3 in the CS signals setup and hold times.

sc2.png

Let's give 133 MHz a try, but with those at the recommended value.

https://github.com/PaulStoffregen/cores/commit/c346fc36ed97dcaed2fa1d70626fbd80cf35586d


And just to be clear, there's no way I will consider making 166 MHz the default, since both Table 38 and Winbond's datasheet say 133 MHz is the maximum. 166 MHz can only be used (safely within spec) on special flash chips providing a dedicated DQS signal.
 
Last edited:
PaulStoffregen said:
To test this, you would need to write a large file to the chip, then measure the speed of many seek-then-read operations at various locations within the file. Separate those seek positions by at least 2K (the FlexSPI's internal buffer).
First thanks for the explanation. Really does help.

Have one question what is your definition of "MANY", have a feeling my definition would be a lot less than yours :)
 
Even 1 seek+read from the middle of a 16kbybe file might give useful timing if measured with the ARM cycle counter.

My gut feeling is the timing should be pretty consistent, assuming reads are done far enough apart to avoid the FlexSPI buffer and access to PSRAM isn't competing for the QSPI bandwidth.

Like developing any benchmark, experimentation and careful thought is needed when results aren't as expected. But if the timing is consistent, I'd just pick a number of iterations that lets the benchmark complete in a few seconds. No point wasting human time for diminishing returns.
 
Your LittleFS benchmarks are measuring the flash chip's page write and sector erase time.

To test this, you would need to write a large file to the chip, then measure the speed of many seek-then-read operations at various locations within the file. Separate those seek positions by at least 2K (the FlexSPI's internal buffer).

...

Let's give 133 MHz a try, but with those at the recommended value.

https://github.com/PaulStoffregen/cores/commit/c346fc36ed97dcaed2fa1d70626fbd80cf35586d


And just to be clear, there's no way I will consider making 166 MHz the default, since both Table 38 and Winbond's datasheet say 133 MHz is the maximum. 166 MHz can only be used (safely within spec) on special flash chips providing a dedicated DQS signal.

Made the linked 133 MHz changes in bootdata.c and it worked for LittleFS test as done before.

That LFSintegrity was designed to assure data integrity, and test the exposed LFS interface - was just easy to run those commands and get a general feel for 'Speed in use' in the LFS case. It does a series of general operations , writing 3MB to formatted media, deleting that, then reFormatted for first use of a series of file I/O - writing 49MB and reading 155 MB in 2.83 M

Posted more on the LittleFS thread but in rerunning Post #719 series Summary no speed change for better or worse.
 
Last edited:
Even 1 seek+read from the middle of a 16kbybe file might give useful timing if measured with the ARM cycle counter.

My gut feeling is the timing should be pretty consistent, assuming reads are done far enough apart to avoid the FlexSPI buffer and access to PSRAM isn't competing for the QSPI bandwidth.

Like developing any benchmark, experimentation and careful thought is needed when results aren't as expected. But if the timing is consistent, I'd just pick a number of iterations that lets the benchmark complete in a few seconds. No point wasting human time for diminishing returns.

Did a modification of the Bench sketch but using LittleFS_Program, and added in a random read timing loop. Write/Read blocksizes are 2048bytes and I used random to jump around the file (16k) in 2048 byte increments:
Code:
--------------------------------------------
LittleFS Test
Disk Stats:
Bytes Used: 8192, Bytes Total:6291456
Benchmark:
FILE_SIZE = 16384
BUF_SIZE = 2048 bytes

Starting write test, please wait.
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
496.48,5214,3447,4086
512.00,5006,3446,3987
512.00,4018,3428,3987
496.48,4657,3557,4112
512.00,4554,3357,3987

Starting sequential read test, please wait.
read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
115380.28,30,8,17
303407.41,8,6,6
309132.06,7,6,6
309132.06,7,6,6
309132.06,8,6,6
Done

Starting random read test, please wait.
Number of random reads: 1
Number of blocks: 8

read speed and latency
speed,max,min,avg
[B]Position (bytes), Block: 0, 0
Read Time (ARM Cycle Delata): 3959[/B]
KB/Sec,usec,usec,usec
1820444.50,0,0,1
[B]Position (bytes), Block: 8192, 4
Read Time (ARM Cycle Delata): 4223[/B]
KB/Sec,usec,usec,usec
1820444.50,0,0,1
[B]Position (bytes), Block: 10240, 5
Read Time (ARM Cycle Delata): 4707[/B]
KB/Sec,usec,usec,usec
1638400.00,0,0,1
[B]Position (bytes), Block: 2048, 1
Read Time (ARM Cycle Delata): 3953[/B]
KB/Sec,usec,usec,usec
1820444.50,0,0,1
[B]Position (bytes), Block: 6144, 3
Read Time (ARM Cycle Delata): 5068[/B]
KB/Sec,usec,usec,usec
1638400.00,0,0,1

Done
Ran it with and without the 133mhz change to core and really no difference. Without is below:
Code:
Starting random read test, please wait.
Number of random reads: 1
Number of blocks: 8

read speed and latency
speed,max,min,avg
Position (bytes), Block: 0, 0
Read Time (ARM Cycle Delata): 3959
KB/Sec,usec,usec,usec
1820444.50,0,0,1
Position (bytes), Block: 8192, 4
Read Time (ARM Cycle Delata): 4223
KB/Sec,usec,usec,usec
1638400.00,0,0,1
Position (bytes), Block: 10240, 5
Read Time (ARM Cycle Delata): 4707
KB/Sec,usec,usec,usec
1638400.00,0,0,1
Position (bytes), Block: 2048, 1
Read Time (ARM Cycle Delata): 3953
KB/Sec,usec,usec,usec
1820444.50,0,0,1
Position (bytes), Block: 6144, 3
Read Time (ARM Cycle Delata): 4929
KB/Sec,usec,usec,usec
1489454.50,0,0,1
Just is case I did something wrong heres the modified sketch:
Code:
#include <LittleFS.h>
#include <Streaming.h>
LittleFS_Program myfs;

#define cout Serial

char szDiskMem[] = "PRO_DISK";

// File size in bytes.
const uint32_t FILE_SIZE = 16 * 1024;

// Set SKIP_FIRST_LATENCY true if the first read/write to the SD can
// be avoid by writing a file header or reading the first record.
const bool SKIP_FIRST_LATENCY = true;

// Size of read/write.
const size_t BUF_SIZE = 2048;

// Write pass count.
const uint8_t WRITE_COUNT = 5;

// Read pass count.
const uint8_t READ_COUNT = 5;

//Block size for qspi
#define MYBLKSIZE 2048 // 2048

// Insure 4-byte alignment.
uint32_t buf32[(BUF_SIZE + 3)/4];
uint8_t* buf = (uint8_t*)buf32;

//Number of random reads
#define randomReads 1

File file, file1;



void setup() {
  while (!Serial) ; // wait
  Serial.println("LittleFS Test"); delay(5);
  if (!myfs.begin(1024 * 1024 * 6)) {
    Serial.printf("Error starting %s\n", szDiskMem);
  }

  //myfs.lowLevelFormat('.');
  
  float s;
  uint32_t t;
  uint32_t maxLatency;
  uint32_t minLatency;
  uint32_t totalLatency;
  bool skipLatency;
  
  myfs.remove("bench.dat");
  //for(uint8_t cnt=0; cnt < 10; cnt++) {
  
    // fill buf with known data
    if (BUF_SIZE > 1) {
      for (size_t i = 0; i < (BUF_SIZE - 2); i++) {
        buf[i] = 'A' + (i % 26);
      }
      buf[BUF_SIZE-2] = '\r';
    }
    buf[BUF_SIZE-1] = '\n';

    Serial.println("Disk Stats:");
    Serial.printf("Bytes Used: %llu, Bytes Total:%llu\n", myfs.usedSize(), myfs.totalSize());
    Serial.printf("Benchmark:\n");
    cout << F("FILE_SIZE = ") << FILE_SIZE << endl;
    cout << F("BUF_SIZE = ") << BUF_SIZE << F(" bytes\n");
    cout << F("Starting write test, please wait.") << endl << endl;

   // do write test
    uint32_t n = FILE_SIZE/BUF_SIZE;
    cout <<F("write speed and latency") << endl;
    cout << F("speed,max,min,avg") << endl;
    cout << F("KB/Sec,usec,usec,usec") << endl;

    // open or create file - truncate existing file.
    file = myfs.open("bench.dat", FILE_WRITE);
    
    for (uint8_t nTest = 0; nTest < WRITE_COUNT; nTest++) {
      file.seek(0);
  
      maxLatency = 0;
      minLatency = 9999999;
      totalLatency = 0;
      skipLatency = SKIP_FIRST_LATENCY;
      t = millis();
      
      for (uint32_t i = 0; i < n; i++) {
        uint32_t m = micros();
        if (file.write(buf, BUF_SIZE) != BUF_SIZE) {
          Serial.println("write failed");
        }
        m = micros() - m;
        totalLatency += m;
        if (skipLatency) {
          // Wait until first write to SD, not just a copy to the cache.
          skipLatency = file.position() < 512;
        } else {
          if (maxLatency < m) {
            maxLatency = m;
          }
          if (minLatency > m) {
            minLatency = m;
          }
        }
      }
  
      t = millis() - t;
      s = file.size();
      cout << s/t <<',' << maxLatency << ',' << minLatency;
      cout << ',' << totalLatency/n << endl;
    }
    cout << endl << F("Starting sequential read test, please wait.") << endl;
    cout << endl <<F("read speed and latency") << endl;
    cout << F("speed,max,min,avg") << endl;
    cout << F("KB/Sec,usec,usec,usec") << endl;

   // do read test
    for (uint8_t nTest = 0; nTest < READ_COUNT; nTest++) {
      file.seek(0);
      maxLatency = 0;
      minLatency = 9999999;
      totalLatency = 0;
      skipLatency = SKIP_FIRST_LATENCY;
      t = micros();
      for (uint32_t i = 0; i < n; i++) {
        buf[BUF_SIZE-1] = 0;
        uint32_t m = micros();
        int32_t nr = file.read(buf, BUF_SIZE);
        if (nr != BUF_SIZE) {
          Serial.println("read failed");
        }
        m = micros() - m;
        totalLatency += m;
        if (buf[BUF_SIZE-1] != '\n') {
          Serial.println("data check error");
        }
        if (skipLatency) {
        skipLatency = false;
        } else {
        if (maxLatency < m) {
          maxLatency = m;
        }
        if (minLatency > m) {
          minLatency = m;
        }
      }
     }
    
      s = file.size();
      
      t = micros() - t;
      cout << s*1000/t <<',' << maxLatency << ',' << minLatency;
      cout << ',' << totalLatency/n << endl;
    }
    
    cout << endl << F("Done") << endl;

   cout << endl << F("Starting random read test, please wait.") << endl;

      Serial.printf("Number of random reads: %d\n", randomReads);
      Serial.printf("Number of blocks: %d\n", n);
    
    cout << endl <<F("read speed and latency") << endl;
    cout << F("speed,max,min,avg") << endl;

    
    uint32_t tt;
    // do read test
    for (uint8_t nTest = 0; nTest < READ_COUNT; nTest++) {
      file.seek(0);
      maxLatency = 0;
      minLatency = 0;
      totalLatency = 0;
      skipLatency = SKIP_FIRST_LATENCY;
      t = micros();
      for (uint32_t i = 0; i < randomReads; i++) {
        buf[BUF_SIZE-1] = 0;
        uint32_t m = micros();
        uint32_t block_pos = random(0, (n-1));
        uint32_t random_pos = block_pos* MYBLKSIZE;
        cout << "Position (bytes), Block: " << random_pos << ", ";
        cout << block_pos << endl;
        uint32_t startCNT = ARM_DWT_CYCCNT;
        file.seek(random_pos);
        int32_t nr = file.read(buf, BUF_SIZE);
        uint32_t endCNT = ARM_DWT_CYCCNT;
        cout << F("Read Time (ARM Cycle Delata): ") << endCNT-startCNT << endl;
        if (nr != BUF_SIZE) {
          Serial.println("read failed");
        }
        m = micros() - m;
        totalLatency += m;
        if (buf[BUF_SIZE-1] != '\n') {
          Serial.println("data check error");
        }
        if (skipLatency) {
        skipLatency = false;
        } else {
        if (maxLatency < m) {
          maxLatency = m;
        }
        if (minLatency > m) {
          minLatency = m;
        }
      }
     }
    
    s = file.size();
    
    
    t = micros() - t;
    cout << F("KB/Sec,usec,usec,usec") << endl;
    cout << s*1000/t <<',' << maxLatency << ',' << minLatency;
    cout << ',' << totalLatency/n << endl;
    }
    cout << endl << F("Done") << endl;

    
    file.close();
  
}

void loop() {
  // put your main code here, to run repeatedly:

}
 
Actually now that I am thinking about it doesn’t safari have a streaming class. The cout came from the original bench sketch
 
Ick? - maybe will just suffer and change them to prinf()'s - luckily I didn't do that already with RAMDOM'ly updated code :)

Ok. I was right SdFat allows you do cout stuff.

Replace Streaming.h and #define cout with the following 2 lines:
Code:
#include <sdios.h>
// Serial output stream
ArduinoOutStream cout(Serial);
Now no need for Streaming.h. Its part of the SdFat library.
 
Ok. I was right SdFat allows you do cout stuff.

Replace Streaming.h and #define cout with the following 2 lines:
Code:
#include <sdios.h>
// Serial output stream
ArduinoOutStream cout(Serial);
Now no need for Streaming.h. Its part of the SdFat library.

Seems better but now is wants to do this ... have to kill the adafruit:
Code:
 Used: T:\tCode\libraries\SdFat_-_Adafruit_Fork
 Not used: T:\arduino-1.8.13_t54\hardware\teensy\avr\libraries\SdFat

All good and done - see LittleFS thread ...
 
Ok, seems my "hardfault/callstack" project died sooner than i thought.
gcc -unwind tables produces a linker error. So, it is not possible to walk through the stack, because there is no way to identify the location of the return adresses. But even gcc -mpoke function names produces a not-working binary, so I could'nt print the function names even if i could find the addresses..
All I can do is to print the type of the hardfault. Not too useful.. have to stop here.
 
Last edited:
Status
Not open for further replies.
Back
Top