SPI speedup question

DrM

Well-known member
Hi,

I am asking about speeding up a loop that calls transfer16() with some other operations between the calls. The essential point is that there seem to be about 150 nsec between calling transfer16() and the start of the transfer. In a loop, I would like to reduce that as much as is feasible.

For a little bit of context, here is an excerpt, with some simplification, from some working code. It synchronizes to a clock, emits a pulse, and then calls SPI.transfer16(0xFFFF).

There are about 150 nsec between the pulse and the start of the SPI clock. The goal is to reduce that.

Code:
#include "Arduino.h"

#include <SPI.h>

#define READPINA (CORE_PIN5_PINREG & CORE_PIN5_BITMASK)
#define SETPINB  (CORE_PIN6_PORTSET = CORE_PIN6_BITMASK)
#define CLEARPINB  (CORE_PIN6_PORTCLEAR = CORE_PIN6_BITMASK)

SPISettings spi_settings( 30000000, MSBFIRST, SPI_MODE0);   // 30 MHz, reads 

void setup() {

  pinMode(4,     OUTPUT);    // Clock for the CCD
  pinMode(5,     INPUT);        // Jumpered to pin 4
  pinMode(6,     OUTPUT);    // CNVRT to the ADC

  analogWriteResolution(4);          // pwm range 4 bits, i.e. 2^4
  analogWriteFrequency(4, 600000);
  analogWrite(4,8);              // dutycycle 50% for 2^4

  SPI.begin();                       // ADC connected to SPI
  SPI.beginTransaction(spi_settings);

}

void loop() {

   // other stuff, not relevant


   // the readout 

    for (i=0; i<NREADOUT; i++){ 

      while ( READPINA ) {}   // wait while high
      while ( !READPINA ) {}  // wait while low

      SETPINB;
      delayNanoseconds( 710 );
      CLEARPINB;

      *p16++ = SPI.transfer16(0xFFFF);      
    }

}



For the Teensy 4.0, the source for SPI.transfer16() is as follows (arduino-1.8.9):

Code:
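	uint16_t transfer16(uint16_t data) {
		uint32_t tcr = port().TCR;	// save the current transmit command register
		port().TCR = (tcr & 0xfffff000) | LPSPI_TCR_FRAMESZ(15);  // turn on 16 bit mode
		port().TDR = data;		// output 16 bit data.
		while ((port().RSR & LPSPI_RSR_RXEMPTY)) ;	// wait while the RSR fifo is empty...
		port().TCR = tcr;	// restore back
		return port().RDR;
	}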

What I want to ask is, can I do the following (i.e. is it likely to work), and where (in source code) does port() come from?

Code:
void loop() {

  uint32_t tcr;
 

     tcr = port().TCR;
     port().TCR = (tcr & 0xfffff000) | LPSPI_TCR_FRAMESZ(15);  // turn on 16 bit mode 

    for (i=0; i<NREADOUT; i++){ 

      while ( READPINA ) {}   // wait while high
      while ( !READPINA ) {}  // wait while low

      SETPINB;
      delayNanoseconds( 710 );
      CLEARPINB;

      port().TDR = 0xFFFF;
      while ((port().RSR & LPSPI_RSR_RXEMPTY));

      *p16++ = port().RDR;
    }

    port().TCR = tcr;	// restore back
}


Thank you
 
What I want to ask is, can I do the following (i.e. is it likely to work)

Have you tried it? I don't think it will even compile because port() is a private member function of the SPI class. It has no meaning when called without referencing the class instance.

and where (in source code) does port() come from?

port() is defined on line 1374 of SPI.h
 
@joepasquariello Great! Easy enough. Where do I find a value for port_addr?

When you're looking for the value of a private data member of a class, it's always good to start with the constructor. In this case, the class is SPIClass, and the first argument gets assigned to port_addr. Then, on line 1505 of SPI.cpp, you will see the definition of SPI, the instance of SPIClass you are using. So, if I'm right, and I'm not guaranteeing that I am because it's up to you to try it, the value you are looking for is IMXRT_LPSPI4_S. You still may need to understand the type of this object and how to reference its fields. The library does all of that for you.

Code:
SPIClass SPI((uintptr_t)&IMXRT_LPSPI4_S, (uintptr_t)&SPIClass::spiclass_lpspi4_hardware);
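
To illustrate the pattern, here is a minimal host-side sketch of how a class can store a register block's address as an integer and turn it back into a reference in a private port() accessor. Everything in it is made up for illustration (FakeRegs, FakeSpi, fake_hw are hypothetical names, and FakeRegs is not the real IMXRT_LPSPI_t layout); an ordinary object stands in for the memory-mapped hardware.

```cpp
#include <cstdint>

// Hypothetical stand-in for a peripheral register block.
// This is NOT the real IMXRT_LPSPI_t layout.
struct FakeRegs {
    volatile uint32_t TCR;
    volatile uint32_t TDR;
    volatile uint32_t RDR;
};

// An ordinary object plays the role of the memory-mapped hardware.
FakeRegs fake_hw = {0, 0, 0};

class FakeSpi {
public:
    // The constructor receives the register block's address as an integer.
    explicit FakeSpi(uintptr_t addr) : port_addr(addr) {}

    void set_tcr(uint32_t value) { port().TCR = value; }
    uint32_t get_tcr() { return port().TCR; }

private:
    // Private accessor: reinterpret the stored address as the register block.
    FakeRegs &port() { return *reinterpret_cast<FakeRegs *>(port_addr); }

    uintptr_t port_addr;
};

// The class instance, built from the address of the register block.
FakeSpi fake_spi(reinterpret_cast<uintptr_t>(&fake_hw));

// Code outside the class cannot call port(), but it can reach the same
// registers through a plain pointer to the same block.
FakeRegs *const regs = &fake_hw;
```

A write through fake_spi.set_tcr() is visible through regs->TCR and vice versa, because both refer to the same block. That is the sense in which a pointer to IMXRT_LPSPI4_S should give a sketch the same registers that the library reaches through port().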
 
@joepasquariello Thank you, that's a huge time saver.

For a dedicated embedded system, I usually prefer to understand the part and have at its registers. But that does look like a time sink in this instance (actually, it is almost always a time sink).

Another option would be to add the specialized routine to the library, say as code inserted immediately after the code for transfer16(). But then I have to maintain that personal version of the library at each update.

And perhaps a third option, a derived class with a new function.

Somehow I feel like the first is the easiest and least problematic unless I can convince the maintainers to keep an ADC block read (toggle a convert line before each word) in the library.
 
... unless I can convince the maintainers to keep an ADC block read (toggle a convert line before each word) in the library.

Maybe not. It really wouldn't make sense to turn the SPI library into a driver for a specific type of SPI device. The whole point is for SPI to hold what is common to all SPI, and to build libraries for various devices from there.

If I understand you correctly, your ADC begins its conversion on the rising edge produced by your SETPINB macro, so as long as your call to SPI.transfer16() completes before the next rising edge detected by READPINA, your loop will be okay, and speeding it up a bit more won't have any effect on the data. At 600 kHz PWM, your period is 1667 ns. You've got 710+ for the ADC conversion, then a bit over 500 ns for 16 bits over SPI at 30 MHz. That's 1210+, so you have 400-450 ns slack. This loop consumes 100% of the CPU. If you reduce the time between CLEARPINB and the first SPI clock, you'll just spend that time waiting for the next edge, so why not just use the SPI library as is? Are you trying to sample faster than 600 kHz?
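
The budget above can be checked with a few lines of integer arithmetic. This is only a sketch: the function names are made up for illustration, the inputs are the figures quoted in this thread (600 kHz PWM, 710 ns conversion delay, 16 bits at 30 MHz), and the startup latency before the first SPI clock is left as a parameter.

```cpp
#include <cstdint>

// Work in picoseconds so the arithmetic stays in integers.
constexpr int64_t PS_PER_SECOND = 1000000000000LL;

// Duration of one period of a clock running at freq_hz.
constexpr int64_t period_ps(int64_t freq_hz) {
    return PS_PER_SECOND / freq_hz;
}

// Time needed to shift `bits` bits at an SPI clock of sck_hz.
constexpr int64_t spi_transfer_ps(int64_t bits, int64_t sck_hz) {
    return bits * PS_PER_SECOND / sck_hz;
}

// Idle time left in one sample period after the conversion delay,
// the SPI startup latency, and the SPI transfer itself.
// A negative result would mean the loop cannot keep up.
constexpr int64_t slack_ps(int64_t sample_hz, int64_t convert_ps,
                           int64_t startup_ps, int64_t bits, int64_t sck_hz) {
    return period_ps(sample_hz) - convert_ps - startup_ps
           - spi_transfer_ps(bits, sck_hz);
}
```

With no startup latency, slack_ps(600000, 710000, 0, 16, 30000000) comes to about 423 ns, which is the 400-450 ns quoted above. Charging the 150 ns startup latency against the budget as well leaves about 273 ns.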

Code:
    for (i=0; i<NREADOUT; i++){ 

      while ( READPINA ) {}   // wait while high
      while ( !READPINA ) {}  // wait while low

      SETPINB;
      delayNanoseconds( 710 );
      CLEARPINB;

      *p16++ = SPI.transfer16(0xFFFF);      
    }
 
@joepasquariello Yes, I want to run it as fast as possible.

Re the chip, that is close. The conversion indeed starts on the rising edge. But it takes about 700 nsecs to complete, and then it is available on the SPI.

The problem is the 150 nsec from CLEARPINB until the clock appears on the SPI clock pin. I want to get this going as fast as it can possibly go. As it turns out, this also sets the clock for the CCD, and there a little faster would be better.

But, I think I have the answer.

The CONVERT for this chip does not actually have to stay high. The leading edge starts the conversion. After that it finishes when it finishes. So, simply shorten the delayNanoseconds and everything should be okay. I can do that empirically on the scope until I get the SPI clock starting at about 750 nsecs, a nice safe margin on this time scale.


Thanks, that was helpful. Good to talk it over
 
The CONVERT for this chip does not actually have to stay high. The leading edge starts the conversion. After that it finishes when it finishes. So, simply shorten the delayNanoseconds and everything should be okay. I can do that empirically on the scope until I get the SPI clock starting at about 750 nsecs, a nice safe margin on this time scale.

Do you mean there is a "conversion complete" output from the ADC that you can use to trigger the SPI read? That's typical.
 
If the chip is wired as in the datasheet, or in a similar fashion, why bother with SPI? Use I2S; it is easier and runs continuously.
I have done that with other chips that have a similar SPI-type connection.
Note that the datasheet says "SPI-Compatible Serial Communication", not that you must use SPI.
 
@WMXYZ That is interesting.

One issue is the 710-750 nsec for conversion. Readout commences on the first clock after the convert pin is low.

So maybe that is enough. Sync with the clock in code, clear convert and start reading on the next clock. My first thought was use a logic gate to hold the clock until it is ready, but maybe it is not necessary.

Either way, the clock can be PWM or I2S or whatever can run at 50MHz or closest to it.


(Aside, the curious thing though, for me anyway, is why SPI takes 150 nsec to get started. The call to transfer16() is just a few register accesses.)
 
@joepasquariello So.... my shower thought this morning.... three new calls for the API, set16bit(), specialtransfer() and restore(). What do you think?

The other thought is that what I should do first is time the different parts of transfer16() and see where the 150 nsecs is going.

Meanwhile, I shortened the convert pulse as we discussed above, and that does seem to work. But you are also right that the biggest part of it is the 710-750 nsec for the conversion. Shaving 100 nsec off the transfer is not going to make a big difference. What would make more of a difference is getting to the full 50 MHz that the chip is spec'd for. Meanwhile, it works and it is fast enough, so, thank G-d, I think that is that for now.
 
@WMXYZ That is interesting.

One issue is the 710-750 nsec for conversion. Readout commences on the first clock after the convert pin is low.

As per the documentation, the conversion (sync) clock can go high during clocking-out of data, so its length is not an issue, as long as it is longer than 1 bit clock.
 
(Aside, the curious thing though, for me anyway, is why SPI takes 150 nsec to get started. The call to transfer16() is just a few register accesses.)
Have you had a look at SPI.cpp?
This might have the answer to your question/conundrum.
 
Assuming you are still trying to use SPI...
I could be wrong, but I am not sure the special transfer16 will gain you a whole lot. Maybe the save, set, restore eats some cycles, but as compared to the rest of the stuff...

You might want to look at the CCR register in the RM PDF to see what some of the timings are set to, in particular the SCKPCS and PCSSCK settings.
Each transfer call logically does:
<ASSERT PCS pin><DELAY PCSSCK clocks>[DO THE transfer]<DELAY SCKPCS clocks><Unassert the PCS pin>

The length of each delay is that field's setting + 1 SPI clocks.
I think we default to 1/2 of a bit output time for both of these...
You might see what happens if you set both of these to 0...
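
To make the field arithmetic concrete, here is a minimal host-side sketch. It assumes the CCR layout as I recall it from the i.MX RT1060 reference manual (SCKDIV in bits 7:0, DBT in 15:8, PCSSCK in 23:16, SCKPCS in 31:24). The function names are made up for illustration and the code only manipulates a plain integer, so please check the RM for the actual layout, and for the conditions under which CCR may be written, before trying this on hardware.

```cpp
#include <cstdint>

// Assumed CCR field positions (verify against the reference manual):
// SCKDIV [7:0], DBT [15:8], PCSSCK [23:16], SCKPCS [31:24].
constexpr uint32_t CCR_PCSSCK_SHIFT = 16;
constexpr uint32_t CCR_SCKPCS_SHIFT = 24;
constexpr uint32_t CCR_FIELD_MASK   = 0xFFu;

// Return a copy of `ccr` with the two PCS delay fields replaced,
// leaving SCKDIV and DBT untouched.
constexpr uint32_t ccr_with_pcs_delays(uint32_t ccr, uint32_t pcssck, uint32_t sckpcs) {
    return (ccr & ~((CCR_FIELD_MASK << CCR_PCSSCK_SHIFT) |
                    (CCR_FIELD_MASK << CCR_SCKPCS_SHIFT)))
           | ((pcssck & CCR_FIELD_MASK) << CCR_PCSSCK_SHIFT)
           | ((sckpcs & CCR_FIELD_MASK) << CCR_SCKPCS_SHIFT);
}

// Extract the current field values.
constexpr uint32_t ccr_pcssck(uint32_t ccr) { return (ccr >> CCR_PCSSCK_SHIFT) & CCR_FIELD_MASK; }
constexpr uint32_t ccr_sckpcs(uint32_t ccr) { return (ccr >> CCR_SCKPCS_SHIFT) & CCR_FIELD_MASK; }

// Per the description above, a field value of n gives a delay of n + 1 clocks.
constexpr uint32_t delay_clocks(uint32_t field_value) { return field_value + 1; }
```

Note that a field value of 0 still gives a delay of 1 clock, so setting both fields to 0 shortens the delays but does not remove them.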
 
@Kurte Thank you, that sounds like a good lead. It would be preferable, I think, compared to what I am doing about it now.
 
@joepasquariello Fantastic!!! Thank you so much. So then the hypothesis was right, the setup for 16 bits takes a big part of the 150 nsecs.

BTW those times are from the call to the start of the transfer, yes? The actual transfer is presumably still 16 bit/30 MHz ~ 533nsecs.
 
P/S Really, in a dedicated system with one device, even more so one that transfers in blocks, it hardly makes sense to have to save, set, transfer and restore every word.

And for a small MCU system like the Teensy, this kind of dedicated single-purpose design has got to be a not so uncommon use case.

So I think, therefore, that this idea of a separate set16 call (and presumably its complement, a set8 call) makes a lot of sense.
 
P/S Really, in a dedicated system with one device, even more so one that transfers in blocks, it hardly makes sense to have to save, set, transfer and restore every word.

And for a small MCU system like the Teensy, this kind of dedicated single-purpose design has got to be a not so uncommon use case.

So I think, therefore, that this idea of a separate set16 call (and presumably its complement, a set8 call) makes a lot of sense.

Might depend on the device - the 'generic' device following the protocol versus some other?

@tonton81 did a SPI Master/Slave Teensy to Teensy library with err check and other control elements where he could program both ends ... not sure what that looks like on a scope.

And IIRC: Paul reading the MCU device manual insists it says the SPI bus hardware is spec'd to work up to only 25 MHz ...

Oops <edit> 30 MHz in data sheet:
3 peripherals have SPI capability: LPSPI, FlexIO, FlexSPI

LPSPI is the normal SPI ports. Officially the datasheet says 30 MHz is the maximum, on page 67. Anything over 30 MHz is considered overclocking. Many people have reported success with SPI displays around 50 to 60 MHz.

FlexIO is complicated. Probably not much faster than LPSPI in practice.

FlexSPI is the interface used for PSRAM and flash on the bottom side of Teensy 4.1. Default speed is 88 MHz, usable range is about 49 to 132 MHz. It is designed for memory chips and probably not usable for an ADC chip.
 
I recall, in another discussion, somebody citing 30MHz as the "do not exceed", and that someone had it running faster than that even.
 
@joepasquariello Fantastic!!! Thank you so much. So then the hypothesis was right, the setup for 16 bits takes a big part of the 150 nsecs.

BTW those times are from the call to the start of the transfer, yes? The actual transfer is presumably still 16 bit/30 MHz ~ 533nsecs.

Whoops. Compiler optimizations got me. The actual results are transfer16() = 465 cycles (775 ns) and specialtransfer16() = 404 cycles (673 ns), so the savings is 100 ns. The additional savings from reducing the CCR delays to 0 was very small, maybe 1-2 ns. The test program and the modified code in SPI.h are shown below.
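
For anyone checking the numbers, the conversion from cycle counts to nanoseconds can be sketched as below. The function names are made up for illustration, and the 600 MHz figure is an assumption (the default F_CPU on a Teensy 4.0, which is consistent with 465 cycles coming out as 775 ns).

```cpp
#include <cstdint>

// Average of `sum` over `n` samples, rounded to the nearest integer
// (adding n/2 before dividing turns truncation into rounding).
constexpr uint32_t rounded_average(uint64_t sum, uint32_t n) {
    return static_cast<uint32_t>((sum + n / 2) / n);
}

// Convert a CPU cycle count to nanoseconds, truncating any fraction.
// The multiplication is done in 64 bits to avoid overflow.
constexpr uint32_t cycles_to_ns(uint32_t cycles, uint32_t f_cpu_hz) {
    return static_cast<uint32_t>(static_cast<uint64_t>(cycles) * 1000000000ULL / f_cpu_hz);
}

// Assumed CPU clock: 600 MHz.
constexpr uint32_t ASSUMED_F_CPU_HZ = 600000000u;
```

At 600 MHz, cycles_to_ns(465, ASSUMED_F_CPU_HZ) is 775 and cycles_to_ns(404, ASSUMED_F_CPU_HZ) is 673, so the difference is about 100 ns.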

I'm not suggesting that SPI should get this modification, because it would break the paradigm of being able to manage multiple SPI devices via begin/endTransaction().

SPI.h

Code:
	void set16(void) {
		uint32_t tcr = port().TCR;
		port().TCR = (tcr & 0xfffff000) | LPSPI_TCR_FRAMESZ(15);  // turn on 16 bit mode 
	}
	uint16_t specialtransfer16(uint16_t data) {
		port().TDR = data;		// output 16 bit data.
		while ((port().RSR & LPSPI_RSR_RXEMPTY)) ;	// wait while the RSR fifo is empty...
		return port().RDR;
	}


sketch

Code:
#include "Arduino.h"

#include <SPI.h>

#define READPINA (CORE_PIN5_PINREG & CORE_PIN5_BITMASK)
#define SETPINB  (CORE_PIN6_PORTSET = CORE_PIN6_BITMASK)
#define CLEARPINB  (CORE_PIN6_PORTCLEAR = CORE_PIN6_BITMASK)

#define NREADOUT (1024)
uint16_t data[NREADOUT];

SPISettings spi_settings( 30000000, MSBFIRST, SPI_MODE0 );   // 30 MHz, reads 

void setup() {

  Serial.begin(9600);
  while (!Serial) {}

  pinMode( 4, OUTPUT );    // Clock for the CCD
  pinMode( 5, INPUT  );    // Jumpered to pin 4
  pinMode( 6, OUTPUT );    // CNVRT to the ADC

  analogWriteResolution( 4 );        // pwm range 4 bits, i.e. 2^4
  analogWriteFrequency( 4, 600000 ); // 600 kHz
  analogWrite( 4, 8 );               // dutycycle 50% for 2^4

  SPI.begin();                       // ADC connected to SPI
}

#define SPECIAL (1)

void loop()
{
    uint16_t *p16 = data;
    SPI.beginTransaction(spi_settings);
    if (SPECIAL)
      SPI.set16();
    
    uint32_t sum = 0;
    for (uint16_t i=0; i<NREADOUT; i++){ 

      while ( READPINA ) {}   // wait while high
      while ( !READPINA ) {}  // wait while low

      SETPINB;
      delayNanoseconds( 710 );
      CLEARPINB;

      uint32_t start = ARM_DWT_CYCCNT;
      *p16++ = SPECIAL ? SPI.specialtransfer16(0xFFFF) : SPI.transfer16(0xFFFF);
      sum += (ARM_DWT_CYCCNT - start);
    }
    uint32_t avg = (sum + NREADOUT/2) / NREADOUT;
    Serial.printf( "Average = %1lu Cycles = %1lu ns\n", avg, (uint32_t)(avg*1E9/F_CPU) );
    SPI.endTransaction();
    delay( 1000 );
}
 
Is there anything wrong with supporting a dedicated or user managed paradigm alongside of the multiple device paradigm?

After all, an MCU on a few centimeters of embedded PCB is not a desktop. Should so much effort be spent on reducing and slowing it so that it can look like one?
 
@joepasquariello Perhaps you are right, I should have stopped at the first line. Let me rephrase.

My sincere feeling is that in an embedded environment, the default API should be close to the hardware and something like shared access, that inevitably costs extra cycles and serves a subset of the possible use cases, should be a convenience layer.
 
@joepasquariello Perhaps you are right, I should have stopped at the first line. Let me rephrase.

My sincere feeling is that in an embedded environment, the default API should be close to the hardware and something like shared access, that inevitably costs extra cycles and serves a subset of the possible use cases, should be a convenience layer.

Paul and the many contributors here do an amazing job of balancing simplicity for new users with access to all of the capability for the more advanced. Perhaps they will take your suggestion, but in the meantime, you could use the help and solutions that have been offered.

Here is an update to your test program that should provide the speedup that I documented yesterday. Instead of changes to the SPI library source, there are built-in LPSPI helper functions. I haven't run this one yet, but I think it will work, and I can test it in an hour or so. The calls to the helper functions are within the begin/endTransaction calls. You can switch between the "special" and "standard" SPI functions via the SPECIAL macro. The average timings are printed to USB Serial.

Code:
#include "Arduino.h"

#include <SPI.h>

#define READPINA (CORE_PIN5_PINREG & CORE_PIN5_BITMASK)
#define SETPINB  (CORE_PIN6_PORTSET = CORE_PIN6_BITMASK)
#define CLEARPINB  (CORE_PIN6_PORTCLEAR = CORE_PIN6_BITMASK)

SPISettings spi_settings( 30000000, MSBFIRST, SPI_MODE0 );   // 30 MHz, reads 

//*************************************************************************
// LPSPI helper functions
//*************************************************************************
static inline uint16_t get_framesz( IMXRT_LPSPI_t *port ) {
  return (port->TCR & 0x00000fff) + 1;
}
//*************************************************************************
static inline void set_framesz( IMXRT_LPSPI_t *port, uint16_t nbits ) {
  port->TCR = (port->TCR & 0xfffff000) | LPSPI_TCR_FRAMESZ(nbits-1); 
}
//*************************************************************************
static inline uint16_t transfer16( IMXRT_LPSPI_t *port, uint16_t data ) {
  port->TDR = data;                         // output 16 bit data
  while (port->RSR & LPSPI_RSR_RXEMPTY) {}  // wait while RSR fifo is empty
  return port->RDR;                         // return data read
}
//*************************************************************************

void setup() {

  Serial.begin(9600);
  while (!Serial) {}

  pinMode( 4, OUTPUT );    // Clock for the CCD
  pinMode( 5, INPUT  );    // Jumpered to pin 4
  pinMode( 6, OUTPUT );    // CNVRT to the ADC

  analogWriteResolution( 4 );        // pwm range 4 bits, i.e. 2^4
  analogWriteFrequency( 4, 600000 ); // 600 kHz
  analogWrite( 4, 8 );               // dutycycle 50% for 2^4

  SPI.begin();                       // ADC connected to SPI
}

#define SPECIAL (1)
#define NREADOUT (1024)
uint16_t data[NREADOUT];
IMXRT_LPSPI_t *lpspi = &IMXRT_LPSPI4_S;

void loop()
{
  uint16_t *p16 = data;
  uint16_t saved_framesz;

  SPI.beginTransaction(spi_settings);
  if (SPECIAL) {
    saved_framesz = get_framesz( lpspi );
    set_framesz( lpspi, 16 );
  }
    
  uint32_t sum = 0;
  for (uint16_t i=0; i<NREADOUT; i++){ 

    while ( READPINA ) {}   // wait while high
    while ( !READPINA ) {}  // wait while low

    SETPINB;
    delayNanoseconds( 710 );
    CLEARPINB;

    uint32_t start = ARM_DWT_CYCCNT;
    *p16++ = SPECIAL ? transfer16( lpspi, 0xFFFF ) : SPI.transfer16( 0xFFFF );
    sum += (ARM_DWT_CYCCNT - start);
  }
  if (SPECIAL) {
    set_framesz( lpspi, saved_framesz ); // restore original
  }
  SPI.endTransaction();

  uint32_t cycles = (sum + NREADOUT/2) / NREADOUT;
  uint32_t ns = cycles * (1E9/F_CPU);
  Serial.printf( "Average = %1lu Cycles = %1lu ns\n", cycles, ns );
  delay( 1000 );
}
 