Teensy 4.0 First Beta Test

Status
Not open for further replies.
SPI.cpp work to do:

I ran an SPI test with SerialFlash on T3.2 and T4. Despite the T4 clock running at 37 MHz and the T3.2 only at 30 MHz, the T3.2 did faster page reads than the T4 (83 us vs 129 us). The T3 SPI lib is optimized to use the FIFO, so there are 16-bit transfers with a small interframe gap (76 ns). The T4 SPI still needs to be optimized, since it does simple byte transfers with a 260 ns interframe gap. So T4 SPI needs to use its FIFO and/or a 16- or 32-bit frame size. Also, SPI.cpp needs T4 support for async/DMA transfers.

EDIT: Kurt had pending pull-request (now merged) to use FIFO. I need to re-run SerialFlash test ...
OK, I fetched the latest SPI lib with Kurt's update (imxrt.h does not have his additions yet, so some hacking was required). Here is a summary of SPI performance with the FIFO (1000-byte transfer, MISO jumpered to MOSI):

Code:
         SPICLOCK 4 MHz   CCR freq 4.0 MHz
         2060 us  3.88 mbs
         SPICLOCK 8 MHz   CCR freq 7.5 MHz
         1100 us  7.27 mbs
         SPICLOCK 13 MHz   CCR freq 12.6 MHz
         680 us  11.76 mbs 
         SPICLOCK 16 MHz   CCR freq 15.1 MHz
         570 us  14.04 mbs
         SPICLOCK 20 MHz   CCR freq 18.9 MHz
         470 us  17.02 mbs 
         SPICLOCK 30 MHz   CCR freq 25.1 MHz
         350 us  22.86 mbs
         SPICLOCK 40 MHz   CCR freq 37.7 MHz
         270 us  29.63 mbs
Data rate is getting close to SPI clock rate (actual rate is CCR frequency). Checked CLK with scope, interframe gap alternates between 48ns and 88ns @37.7 MHz. Time to send 256-byte page would be 69 us. Now T4 outperforms T3.2 ;)

Here is SPI CLK @37.7mhz on scope (you have to zoom scope in on sawtooth to see it report 37 mhz). You can see the reduced Vpp and the interframe gap variation.
t4spi.png

SPI DMA data rates from post #717
Code:
      SPI CLOCK 4000000 CCR freq 4.0 MHz
      tx 1024 samples 1990 us  4.1 mbs   scope clock 3.97 mhz
      SPI CLOCK 8000000 CCR freq 7.5 MHz
      tx 1024 samples 1080 us  7.6 mbs  scope 7.57 mhz
      SPI CLOCK 16000000 CCR freq 15.1 MHz  
      tx 1024 samples 570 us  14.4 mbs   scope 14.9mhz
      SPI CLOCK 30000000 CCR freq 25.1 MHz
      tx 1024 samples 370 us  22.1 mbs  ? scope 25 mhz  Vpp 2.74 v
      SPI CLOCK 40000000 CCR freq 37.7 MHz
      tx 1024 samples 260 us  31.5 mbs  scope 37.9  Vpp 2v
 
ADAFRUIT ILI9341 Libraries

Was back at it this morning, fresh with the ILI9341. As a test I benchmarked it against a T3.5.

Code:
                                      Time (microseconds)
Benchmark                T3.5(120)        T4               T4 PODF(2)    T4 PODF(6)
Screen fill              993332           885880           888130        993390      
Text                     59506            43820            44620         48940
Lines                    551647           416280           424020        464270
Horiz/Vert Lines         81940            72440            72680         81230
Rectangles (outline)     52453            46120            46300         51700
Rectangles (filled)      2061753          1838640          1843310       2061770
Circles (filled)         252396           208710           211020        233870
Circles (outline)        240152           182130           184900        203190
Triangles (outline)      124360           94820            96410         105790
Triangles (filled)       683284           597780           600630        670140
Rounded rects (outline)  109180           86970            87990         97200
Rounded rects (filled)   2055773          1828310          1833510       2050110
Done!

ILI9341_t3 lib on a T3.6 @120Mhz (Post #559)
Code:
				_t3		_t3n		_t3n (FB)
Screen fill              	224906		224954
Text                     	11223		11400
Lines                    	58387		58377
Horiz/Vert Lines         	18382		18398
Rectangles (outline)    	11683		11695
Rectangles (filled)      	462091		462246		63073
Circles (filled)         	69017		70225
Circles (outline)        	53691		55378
Triangles (outline)      	14104		14115
Triangles (filled)       	153623		154156
Rounded rects (outline)  	24579		25029
Rounded rects (filled)   	504445		504954		73661

Also I retested the bitmap loading example and it successfully loaded and displayed on the T4. So I would say the NEW Adafruit ILI9341 library works with the T4.

Do not try this test with the version that ships with Teensyduino. Grab the new version from Adafruit's GitHub.

EDIT: Forgot to mention something. The T4 performance is with Frank B's recommended clock changes.

EDIT2: Updated to the latest SPI lib changes from KurtE (not PR39), as well as his PRs for the avr_emulation and imxrt.h changes. :) Also decided to test the effect of changing PODFs on the T4, based on FrankB's suggestion. Funny, with the changes it's a little slower at PODF(2)? At PODF(6) it's about the same as the T3.5.

EDIT3: Added ili9341_t3 results from https://forum.pjrc.com/threads/2630...splay)-library?p=176865&viewfull=1#post176865
 
ADAFRUIT ILI9341 Libraries

Was back at it this morning, fresh with the ILI9341. As a test I benchmarked it against a T3.5.

Did you use the latest SPI lib with Kurt's FIFO mods? Unless imxrt.h has been updated, you'll need to hack a bit to get Kurt's SPI definitions. I think others have been working on the "optimized" ILI9341. How do your Adafruit numbers compare with that? I'll have to search this thread to see if benchmark data was provided ...
 
IntervalTimer Performance

I did a quick check of the IntervalTimer performance. Unfortunately results do not look very good :-/

itimer.PNG


I estimated the processor load by measuring the number of cycles a simple loop takes with the IntervalTimer running in the background relative to the number of cycles the same loop takes without the IntervalTimer running. The ISR is an empty function. If I don't have a systematic error in the approach, it looks like the IntervalTimer in the T4 is significantly less performant than the T3 one. Can it be that the new "one ISR" workaround decreases the performance that much?

Here the test code (hopefully somebody finds an error in it...)


Code:
IntervalTimer t1; 

void test()  // dummyfunction
{
  //digitalWriteFast(0, !digitalReadFast(0));  
}

volatile int dummy;

void setup()
{  
  while(!Serial);

  // required for T3.6
  // ARM_DEMCR |= ARM_DEMCR_TRCENA;
  // ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;

  constexpr unsigned loops = 100000;

  // Measure cycles required for loop without any interrupts
  noInterrupts();
  uint32_t start = ARM_DWT_CYCCNT;
  for (unsigned i = 0; i < loops; i++)
  {
    dummy++;
  }
  uint32_t end = ARM_DWT_CYCCNT;
  float base = end - start;  
  
  // activate interrupts and IntervalTimer
  interrupts();      
  t1.begin(test, 5);    

  start = ARM_DWT_CYCCNT;
  for (unsigned i = 0; i < loops; i++)
  {
    dummy++;
  }
  end = ARM_DWT_CYCCNT;
  
  float load = base/(end - start);
  Serial.printf("Load: %.3f",100.0*(1-load));
}

void loop()
{
}
 
I did a quick check of the IntervalTimer performance. Unfortunately results do not look very good :-/

Interesting. How does the behavior change with compiler optimizations? Presumably the effect on the T3.6 would be about the same as on the T4. Also, if you print the value of dummy in your printf, you may see the numbers change. The compiler may ignore dummy++ in the loop if the value is never used. (T4 goes from 18.718 to 24.45 with Faster.) You might also print out the value of base just as a sanity check.

maybe we need to do some ISR latency measurements ....
 
That was done with -O3 (fastest). Did check with -O2 (faster) first and found basically the same (T3.6 much better than T4, didn't note the numbers). The compiler can not ignore dummy++ since dummy is volatile. Direct measurement with a logic analyzer of the execution time of the two loops gives 1.08ms and 1.31ms which also gives 17.5% load for the T4, so the cycle counter code seems to work ok...
 
Originally Posted by mjs513
ADAFRUIT ILI9341 Libraries

Was back at it this morning, fresh with the ILI9341. As a test I benchmarked it against a T3.5.
Did you use the latest SPI lib with Kurt's FIFO mods? Unless imxrt.h has been updated, you'll need to hack a bit to get Kurt's SPI definitions. I think others have been working on the "optimized" ILI9341. How do your Adafruit numbers compare with that? I'll have to search this thread to see if benchmark data was provided ...

Was just getting the link he posted for his SPI lib to load it up. Was looking at the ili9341_t3 as well this morning and most of the changes involve the T3.x functions pushr and popr. I will post updated numbers after I update to KurtE's changes.

Will have to do a search for the optimized benchmarks.

Mike
 
Sorry I did not get a chance to respond much up here yesterday. Was having fun pulling down some drywall in our shop building to figure out why... Still more to go...

SPI Speeds - @Paul pulled in the SPI library changes yesterday, but it looks like he has not yet pulled in the PR (https://github.com/PaulStoffregen/cores/pull/320) in Cores which defined the SPI hardware data structure, so you may need to manually pull this in... Or potentially I could put a check in the SPI file for it existing and if not, define it locally in there as well..

The changes were pulled in for SPI.transfer(buf, retbuf, cnt), to fill the FIFO... However, I have not yet added packing of the data into 16-bit entries on the queue, like I did on T3.x. It might help slightly. Likewise, on the T4 we could maybe pack 32-bit entries. So far I have kept it simple, so as not to need two code paths to handle normal and reversed bytes, plus the cases where you are sending an odd number of bytes...

Adafruit_ili9341 library - I had it running as mentioned in #655.
But again it used the changes in PR 320 (mentioned above), which added some emulation for the SPSR and SPDR registers that the library uses.

As for code like:
Code:
    clkport     = portOutputRegister(digitalPinToPort(_sclk));
    clkpinmask  = digitalPinToBitMask(_sclk);
    mosiport    = portOutputRegister(digitalPinToPort(_mosi));
    mosipinmask = digitalPinToBitMask(_mosi);
    *clkport   &= ~clkpinmask;
    *mosiport  &= ~mosipinmask;
As Paul mentioned, yes, faster. On T3.x it is also safe, as these use the bitband region, so you did not have to worry about other pin states being tromped on: the operation only updated one logical byte... Whereas on the T4, without bitband and assuming it is actually talking to the data register (DR), doing something like: *clkport &= ~clkpinmask
will first read the state of the DR, mask off the appropriate bit, and then write it back out. The problem: suppose an interrupt happened between the time you fetched the DR and the time you wrote the data back out, and it changed the state of one or more of the other pins in that port... The write-back operation would tromp on those changes...

As Paul mentioned, the M7 does not have bitband... Instead, they sometimes added Set, Clear, and Toggle registers (DR_SET, DR_CLEAR, DR_TOGGLE) for the GPIO ports. Writing a 1 to a bit in one of those registers updates only that bit in the port (or several, if more bits are set to 1) to, as you can guess, set, clear, or toggle it, which again is safe.
As an Example: Paul has used this in the OneWire library, which I copied into, HardwareSerial.cpp (for directly setting/clearing) direction signal...

Now back to playing:

Question: Does anyone have working examples of SPI Dma code I could look at and/or some FlexIO code... Would be fun to try out making either an extra SPI or Wire or... And see how well it works
 
That was done with -O3 (fastest). Did check with -O2 (faster) first and found basically the same (T3.6 much better than T4, didn't note the numbers). The compiler can not ignore dummy++ since dummy is volatile. Direct measurement with a logic analyzer of the execution time of the two loops gives 1.08ms and 1.31ms which also gives 17.5% load for the T4, so the cycle counter code seems to work ok...

ooops, i didn't notice the volatile. but i did notice base count didn't change with dummy in printf, BUT why did 2nd loop slow with dummy in printf ??
Code:
         Load: 18.718  base 650006.000000  799692 537001948
          with dummy in print
          Load: 24.586  base 650006.000000  861921 200000
on T3.6
            Load: 3.829  base 800233.000000  832092 1522224803
            with dummy
            Load: 14.530  base 800245.000000  936293 200000
IDE using Faster
Code:
IntervalTimer t1;

void test()  // dummyfunction
{
  //digitalWriteFast(0, !digitalReadFast(0));
}

volatile int dummy;

void setup()
{
  while (!Serial);

  // required for T3.6
 // ARM_DEMCR |= ARM_DEMCR_TRCENA;
 // ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;

  constexpr unsigned loops = 100000;

  // Measure cycles required for loop without any interrupts
  noInterrupts();
  uint32_t start = ARM_DWT_CYCCNT;
  for (unsigned i = 0; i < loops; i++)
  {
    dummy++;
  }
  uint32_t end = ARM_DWT_CYCCNT;
  float base = end - start;

  // activate interrupts and IntervalTimer
  interrupts();
  t1.begin(test, 5);

  start = ARM_DWT_CYCCNT;
  for (unsigned i = 0; i < loops; i++)
  {
    dummy++;
  }
  end = ARM_DWT_CYCCNT;

  float load = base / (end - start);
  Serial.printf("Load: %.3f  base %f  %u %d", 100.0 * (1 - load), base, end - start, dummy);
}

void loop()
{
}
 
Was just getting the link he posted for his SPI lib to load it up. Was looking at the ili9341_t3 as well this morning and most of changes involve the T3.x functions pushr and popr.
Mike

As was mentioned earlier in this thread, most of the speed gains in this library came from the ability of the T3.6 to encode the state of the DC pin (and to a smaller extent the CS pin) into the FIFO queue for the SPI. Without this, each time you need to change the DC state you have to wait for everything to be fully output, then change the state, and then start the output again. It was these waiting periods that caused the slower output... Note: I am not looking at the sources currently, but if my memory is correct, each time you do something like draw a pixel, the output is something like: <SET ROW COMMAND> XX XX <Set Column COMMAND> YY YY <Write Mem> COLOR COLOR.
Again I may be wrong if it sets Row first or Column first. But for each of these commands it has to assert and deassert the DC pin... As I mentioned the CS pin is involved as well, but this is often done at the beginning and ending of an SPI transaction so not as often...

So with T4, the LPSPI FIFO writes to TDR do not encode CS data... So not as easy. However, the transmit FIFO can hold both TDR data and TCR (Transmit Command Register) entries, and the TCR can control one CS pin (PCS field). Unclear how much control we could get with this. But we might be able to hack it: set the only CS pin defined for LPSPI3 (10) into SPI mode and use it for DC. Before a command byte we set the PCS value to 0 to assert the pin during the next transfer (for one byte), then change the PCS value to something other than 0...

So for the above example: We might be able to encode this data to FIFO like:
(TCR: PCS=0, FRAMESIZE=7) <SET Row Command> (TCR: PCS=3, FRAMESIZE=15) X16 bits (TCR: PCS=0, FS=7) <COL COMMAND> (TCR 3 15) Y16 ...

Not sure if that makes sense and/or if it would work... But this is something that at some point I would like to experiment with. Assuming someone else does not beat me to it ;)
 
Morning Kurt

Had drywalling and spackling!!!!

Adafruit_ili9341 library - I had it running as mentioned in #655 But it again used the changes in the PR 320 (mentioned above). In that it added some emulation for SPSR, SPDR registers, which was used
Yep, I pulled it in but still got those errors I mentioned. Seems to be overtaken by events: once I updated the Adafruit libraries it really became a non-issue since they worked - again, that has your SPI changes in the core.

But instead they sometimes instead added Set and Clear and Toggle registers (DR_SET, DR_CLEAR, DR_TOGGLE)
Now I know what those things really do :)

Thanks for the additional info on the constructs I mentioned in the old Adafruit library.
 
@KurtE and @manitou

I updated the performance numbers for the Adafruit ILI9341 library in post #827.

In the ILI9341_t3 lib there are a lot of pushr and popr's, it looks like there is no direct equivalent in the T4, closest I saw was using TDR and TCR.

Mike
 
EDIT2: Updated to the latest SPI lib changes from KurtE (not PR39), as well as his PRs for the avr_emulation and imxrt.h changes. :) Also decided to test the effect of changing PODFs on the T4, based on FrankB's suggestion. Funny, with the changes it's a little slower at PODF(2)? At PODF(6) it's about the same as the T3.5.

EDIT3: Added ili9341_t3 results from https://forum.pjrc.com/threads/2630...splay)-library?p=176865&viewfull=1#post176865

As a sanity check, maybe print LPSPI4_CCR to see what divisor really is? what value does ILI9341 lib use to set SPI CLK frequency?
 
Thanks, gives me a few more hints. Trying to add in the SPI.transfer(tx, rx, cnt, event) API for T4... Will be interesting to see how all of these things interact.
The structures do look similar to T3.x stuff. So started from there. But helps to see how you handled some of the differences in the SPI registers. In particular the DER register versus on T3.x you had to logically enable interrupt and then say that that was handled by DMA...

Also on T3.x - even for TX only requests Transfer(buf, NULL, cnt, event), I still had an RX transfer where it went to a single byte RX buffer. Found we needed this as the transfer had not fully completed, when the TX ISR was called, so if the user did something like clear the CS pin, it was done before the transfer had completed.

@KurtE and @manitou

I updated the performance numbers for the Adafruit ILI9341 library in post #827.

In the ILI9341_t3 lib there are a lot of pushr and popr's, it looks like there is no direct equivalent in the T4, closest I saw was using TDR and TCR.

Mike
Yep - as I mentioned in #839, there may be some real hacking available to try out. Also forgot to mention that I believe @Frank may be working on a DMA version, like his ili9341_dma...
 
@KurtE

So with T4, the LPSPI FIFO writes to TDR do not encode CS data... So not as easy. However, the transmit FIFO can hold both TDR data and TCR (Transmit Command Register) entries, and the TCR can control one CS pin (PCS field). Unclear how much control we could get with this. But we might be able to hack it: set the only CS pin defined for LPSPI3 (10) into SPI mode and use it for DC. Before a command byte we set the PCS value to 0 to assert the pin during the next transfer (for one byte), then change the PCS value to something other than 0...

So for the above example: We might be able to encode this data to FIFO like:
(TCR: PCS=0, FRAMESIZE=7) <SET Row Command> (TCR: PCS=3, FRAMESIZE=15) X16 bits (TCR: PCS=0, FS=7) <COL COMMAND> (TCR 3 15) Y16 ...
Looks like our posts are crossing somehow. That's actually further than I got with the T4 SPI so far. So you are ahead of me - at least I got the right registers :)
 
I did a quick check of the IntervalTimer performance. Unfortunately results do not look very good :-/
If one prints out the number of calls to the ISR, one also realizes that the T3.6 makes at least 3 times as many ISR calls as the T4 (T3.6 at 180 MHz, Faster).
 
added it to post #835

Had a look at the generated code. Due to the printf it optimizes the second loop slightly different from the first loop (probably since it needs dummy outside of the loop for the printf). I copied the printf from below the second loop to the first one, now there is no difference in the load if you print dummy or not. (Stays at 18.7%)

So question still is what is the reason for the significant difference in the processor load.
 
@manitou
Did something better than check the register. Put a scope on the SCK line:
PODF(6) = 7.547 MHz
PODF(2) = 8.000 MHz

SPI setting in the lib is 8 MHz.
 
Did another check and plotted LDVAL against the measured period of the timer

itimer2.PNG

The T4 "saturates" at about LDVAL=30, i.e. about 1.2 µs -> 830kHz.
The T3.6 goes down to about LDVAL = 10, i.e. about 0.5 µs -> 2MHz.
 
So question still is what is the reason for the significant difference in the processor load.

Sounds like we need an ISR latency study for the T4 .... ISRs in the NXP SDK often had memory barrier (dsb) at end of ISR. On many of my T4 peripheral ISR's i had to spin to wait for ISR register to clear or I would get multiple interrupts.
NXP paper on 1050 ISR latency https://www.nxp.com/docs/en/application-note/AN12078.pdf

earlier post https://forum.pjrc.com/threads/54711-Teensy-4-0-First-Beta-Test?p=194818&viewfull=1#post194818
 