SPI: important difference between Arduino Atmel and ARM

Status
Not open for further replies.

NikoTeen

hi,
there is an important difference between Arduino Atmel (AVR) controllers and ARM controllers.
When I ported software that uses the SPI interface for an SD card from a Mega2560 to a Teensy 3.2, the data transfer on the Teensy at 96 MHz was much slower than on the Mega2560 at 16 MHz.
The reason is that the SPCR register on the Teensy (MK20DX256) cannot be read back; it is write-only.

The initial software to configure the SPI interface:
Code:
SPCR = (0 << SPIE) | /* SPI Interrupt Enable */
           (1 << SPE)  | /* SPI Enable */
           (0 << DORD) | /* Data Order: MSB first */
           (1 << MSTR) | /* Master mode */
           (0 << CPOL) | /* Clock Polarity: SCK low when idle */
           (0 << CPHA) | /* Clock Phase: sample on rising SCK edge */
           (1 << SPR1) | /* Clock Frequency: f_OSC / 128  */ 
           (1 << SPR0);
    SPSR &= ~(1 << SPI2X); /* No doubled clock frequency */
...
void SD_L0_SpiSetHighSpeed(void)
{
    SPCR &= ~((1 << SPR1) | (1 << SPR0)); /* Clock Frequency: f_OSC / 4 */
    SPSR |= (1 << SPI2X);         /* Doubled Clock Frequency: f_OSC / 2 */
}
At first the SPCR register is set to 0x53, which selects a very low clock frequency (f_OSC / 128) for the SPI interface.
After some initialization steps, the function SD_L0_SpiSetHighSpeed(void) is called to increase the clock frequency.
On the Teensy 3.2, the SPCR register had the same value after calling SD_L0_SpiSetHighSpeed() as before. The value of SPSR was changed according to the code. As a result, the SPI interface kept running at quite a low frequency.
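
For reference, the divider that these bits select on a classic AVR can be worked out with a few lines of host-side C++. This is only a sketch: the function names `avrSpiDivider` and `avrSckHz` are made up for illustration, the bit positions (SPR1:SPR0 in SPCR bits 1:0, SPI2X in SPSR bit 0) are those of the ATmega SPI peripheral, and it says nothing about how the Teensy emulation layer maps the same bits onto the ARM hardware.

```cpp
#include <stdint.h>

// Base dividers selected by SPR1:SPR0 (SPCR bits 1:0) on an ATmega.
// SPI2X (SPSR bit 0) halves the selected divider.
inline unsigned avrSpiDivider(uint8_t spcr, uint8_t spsr) {
    static const unsigned kBaseDivider[4] = {4, 16, 64, 128};
    unsigned divider = kBaseDivider[spcr & 0x03];
    return (spsr & 0x01) ? divider / 2 : divider;
}

// Resulting SCK frequency in Hz for a given oscillator frequency.
inline unsigned long avrSckHz(unsigned long fOscHz, uint8_t spcr, uint8_t spsr) {
    return fOscHz / avrSpiDivider(spcr, spsr);
}
```

With a 16 MHz oscillator, `avrSckHz(16000000UL, 0x53, 0x00)` gives 125000 (f_OSC / 128), and `avrSckHz(16000000UL, 0x50, 0x01)` gives 8000000 (f_OSC / 2), which is the change SD_L0_SpiSetHighSpeed() is meant to make.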

Saving the content of SPCR to a global variable, changing its value there and then writing it to the register solved the problem.
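
This workaround is the usual shadow-register pattern. The following host-side sketch models it with a hypothetical `WriteOnlyReg` type whose reads always return 0. That is a deliberate simplification to show why read-modify-write is unreliable on a register that does not read back; it is not a description of how the Teensy core actually emulates SPCR. `ShadowedReg` is likewise an invented name.

```cpp
#include <stdint.h>

// Hypothetical write-only register: writes are latched, reads always return 0.
struct WriteOnlyReg {
    uint8_t latched = 0;
    void write(uint8_t value) { latched = value; }
    uint8_t read() const { return 0; }
};

// Keeps a copy of the last written value in RAM, so bit updates never
// depend on reading the hardware register.
struct ShadowedReg {
    WriteOnlyReg &reg;
    uint8_t shadow;

    explicit ShadowedReg(WriteOnlyReg &r) : reg(r), shadow(0) {}

    void set(uint8_t value) {
        shadow = value;
        reg.write(shadow);
    }
    void clearBits(uint8_t mask) {
        shadow = static_cast<uint8_t>(shadow & ~mask);
        reg.write(shadow);
    }
    void setBits(uint8_t mask) {
        shadow = static_cast<uint8_t>(shadow | mask);
        reg.write(shadow);
    }
};
```

In this model, writing 0x53 and then clearing the two low bits through `ShadowedReg` leaves 0x50 in the register, whereas `reg.write(reg.read() & ~0x03)` on the bare register would write 0 and lose the enable and master bits.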
 
The Teensy-specific SPI library code for ARM is here: {Arduino ...}\hardware\teensy\avr\libraries\SPI\SPI.cpp.

The only time I worked with it was in this code, which shows how it is used: {Arduino ...}\hardware\teensy\avr\libraries\XPT2046_Touchscreen\XPT2046_Touchscreen.cpp
There, look at bool XPT2046_Touchscreen::begin() and void XPT2046_Touchscreen::update().

Other libraries use it in more elaborate ways, and you may get feedback on those, but this is the one I know and have successfully modified.
 
Saving the content of SPCR to a global variable, changing its value there and then writing it to the register solved the problem.
I believe the Teensy way of doing this is to use beginTransaction, which allows you to set the SPI speed, big/little bit ordering, and SPI mode with every SPI transaction.

Due to the differences in the underlying machine, it is probably better not to rely on Paul's re-implementation of the AVR registers, and instead either use the higher-level functions or write completely new Teensy-specific machine code.
 
Agreed, the proper way is to use the SPI library. But there's still quite a lot of very old AVR code out there....

I've added these missing AVR SPCR emulation features. Hopefully this makes everything work for you?

https://github.com/PaulStoffregen/cores/commit/8f137ff3fff2f230de02e466d45cc86a3f52f507

Here's the complete sketch I used for testing.

Code:
void setup() {
  pinMode(20, OUTPUT); // for comparison to Teensy++ 2.0
  pinMode(21, OUTPUT);
  pinMode(22, OUTPUT);
  digitalWrite(20, HIGH);
  SPCR = (0 << SPIE) | /* SPI Interrupt Enable */
         (1 << SPE)  | /* SPI Enable */
         (0 << DORD) | /* Data Order: MSB first */
         (1 << MSTR) | /* Master mode */
         (0 << CPOL) | /* Clock Polarity: SCK low when idle */
         (0 << CPHA) | /* Clock Phase: sample on rising SCK edge */
         (1 << SPR1) | /* Clock Frequency: f_OSC / 128  */ 
         (1 << SPR0);
  SPSR &= ~(1 << SPI2X); /* No doubled clock frequency */
  
  SPCR &= ~((1 << SPR1) | (1 << SPR0)); /* Clock Frequency: f_OSC / 4 */
  SPSR |= (1 << SPI2X);         /* Doubled Clock Frequency: f_OSC / 2 */
}

void loop() {
  SPDR = 0x5A;
  while (!(SPSR & _BV(SPIF))) ; // wait
  SPDR;
  delay(10);
}

If there's still anything not properly emulated, please post a complete sketch for testing.
 
That is exactly the code I also use, with some minor changes:
- The code within setup() is contained in a function SD_L0_Init(void).
- The last 2 lines of setup() are moved to a separate function SD_L0_SpiSetHighSpeed(void).
- The content of SPCR is saved to a global variable REGspcr; changes are made in this variable first and then copied to SPCR. Without this save-and-copy step the speed was very slow.
So the overall code for my tests is:
Code:
uint8_t SD_L0_CSPin = 10;   // for Teensy 3.2
uint8_t REGspcr;            // shadow copy of SPCR (8-bit type assumed)

void SD_L0_SetCSHigh(void) { digitalWrite(SD_L0_CSPin, HIGH); }

void SD_L0_Init(void)
{
    /* Setup ports */
    pinMode(SD_L0_CSPin, OUTPUT);
    SD_L0_SetCSHigh();

    pinMode(MISO, INPUT);
    pinMode(SCK, OUTPUT);
    pinMode(MOSI, OUTPUT);
    pinMode(SS, OUTPUT);

    digitalWrite(SCK, LOW);
    digitalWrite(MOSI, LOW);
    digitalWrite(SS, HIGH);

    /* Powering up takes at least 500us for capacitors to charge */
    // not for arduino

    /* initialize SPI with lowest frequency; max. 400kHz during identification mode of card */
    REGspcr = (0 << SPIE) | /* SPI Interrupt Enable */
              (1 << SPE)  | /* SPI Enable */
              (0 << DORD) | /* Data Order: MSB first */
              (1 << MSTR) | /* Master mode */
              (0 << CPOL) | /* Clock Polarity: SCK low when idle */
              (0 << CPHA) | /* Clock Phase: sample on rising SCK edge */
              (1 << SPR1) | /* Clock Frequency: f_OSC / 128 */
              (1 << SPR0);
    SPCR = REGspcr;         /* write the shadow copy to the register */
    SPSR &= ~(1 << SPI2X);  /* No doubled clock frequency */
}

void SD_L0_SpiSetHighSpeed(void)
{
    REGspcr &= ~((1 << SPR1) | (1 << SPR0)); /* Clock Frequency: f_OSC / 4 */
    SPCR = REGspcr;
    SPSR |= (1 << SPI2X);         /* Doubled Clock Frequency: f_OSC / 2 */
}

void setup(void) {
    SD_L0_Init();
    SD_L0_SpiSetHighSpeed();
}

void loop(void) {
    /* the same code as above */
    SPDR = 0x5A;
    while (!(SPSR & _BV(SPIF))) ; // wait
    SPDR;
    delay(10);
}
 
Other people might disagree with me but ... from my experience ... even fiddling with those register values, you're still not going to get SPI data transfers as fast as on a 16-MHz Arduino. At least if you are doing "single" byte transfers. I don't think it's necessarily an ARM problem per se, but rather the fact that the lone SPI port on the T3.1,3.2 has a 4-byte FIFO, and there is some weird business going on in the internal hardware of the port.

What I found, albeit about 2 years ago (so my memory may not be perfect), is that whenever you do a "single-byte" xfer there is always a latency of approx 500-nsec before the xfer takes place. I measured this on an oscilloscope, and it means that, no matter what, the throughput will not be better than about 1 byte every 1 usec. Other people may have had different results, but that was my experience.

OTOH, if you are using software that supports the FIFO xfer mode, you can get **phenomenally** fast SPI updates that make it look like the Arduino is asleep, see here:

https://dorkbotpdx.org/blog/paul/display_spi_optimization

There is another issue, which is trying to operate multiple SPI peripherals from the single SPI port on the T3.1,3.2 modules. I had absolutely ZERO luck with this, myself. I had code from the mega2560 that was successfully controlling 4 SPI peripherals simultaneously, but could never get it to work on the T3.1. You will note that Paul implemented SPI transactions to help deal with the issue of multiple peripherals (see below), but I never had much luck with that myself. (I admit I'm really sort of stupid with some things).

https://www.pjrc.com/teensy/td_libs_SPI.html
- SPI.beginTransaction(SPISettings(clockspeed, MSBFIRST, SPI_MODE0))
- SPI.endTransaction()

Basically, I pretty much gave up on trying to do multiple things with the single SPI port on the T3.1,3.2 modules, and waited till the T3.5,3.6 arrived to do so. Multiple SPI ports and a dedicated SD socket make life much easier.
 
Other people might disagree with me but ... from my experience ... even fiddling with those register values, you're still not going to get SPI data transfers as fast as on a 16-MHz Arduino.

As a quick sanity check, I ran this code on both Arduino Uno and Teensy 3.2.

Code:
#include <SPI.h>

void setup() {
  SPI.begin();
  pinMode(10, OUTPUT);
}

void loop() {
  while (1) {
    SPI.beginTransaction(SPISettings(8000000, MSBFIRST, SPI_MODE0));
    //digitalWrite(10, LOW);
    PORTB &= ~(1<<2);
    SPI.transfer(0x5A);
    SPI.transfer(0x20);
    SPI.transfer(0x99);
    //digitalWrite(10, HIGH);
    PORTB |= (1<<2);
    SPI.endTransaction();
  }
}

As you can see, I used register writes for the chip select to try to be "fair" to Uno. I tried this initially with digitalWrite(), but the result on Uno is incredibly slow. Fortunately, Teensy has a software layer to emulate these AVR registers, so the exact same code can run on both.

Here is the result for Arduino Uno:

file.png

And here is the result for Teensy 3.2:

file.png

I believe it's pretty easy to see 32 bit 96 MHz Teensy 3.2 is substantially faster than 8 bit 16 MHz Arduino Uno. When avoiding digitalWrite, Uno does pretty well with its slow CPU, but it's certainly not faster than Teensy.

Of course, Teensy 3.2 can use much faster SPI clock speeds. The overhead of doing 1-byte-at-a-time does become pretty significant. Here's the exact same test, but with the SPI clock speed increased to 24 MHz. The FIFO really is needed to fully leverage such faster SPI speeds.

file.png
 
Other people might disagree with me but ... from my experience ... even fiddling with those register values, you're still not going to get SPI data transfers as fast as on a 16-MHz Arduino. At least if you are doing "single" byte transfers. I don't think it's necessarily an ARM problem per se, but rather the fact that the lone SPI port on the T3.1,3.2 has a 4-byte FIFO, and there is some weird business going on in the internal hardware of the port.
What I have found is that it does not necessarily have anything to do with using the queue, but with how you set up and use the PUSHR register. That is, each push can encode what happens with the chip select pins associated with the SPI bus (even if you do not actually use any of these CS pins).

With this ability, the hardware has built-in delays between when the CS pin(s) are asserted and/or deasserted (even if your push does not touch any CS pins). In the default case, these delays are actually the time you are seeing between bytes when you do multiple PUSHR writes.

You can avoid these delays by using the CONT bit in the PUSHR register. That is why, for example, SPI.transfer(buffer, cnt) and SPI.transfer(buffer, rxbuffer, cnt) set up CONT for all but the last byte/word sent. We also try to make better use of the queue by packing byte transfers into words.

Edit: Forgot to mention that if you are using SPI.transfer(x) type calls, the code has to run in lock step: it must wait for the transfer to complete before it returns, so that it can return the received value.
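
To illustrate the difference between the two call styles, here is a small sketch. The helper names `sendByteByByte` and `sendBlock` are invented for this example, and the port is a template parameter only so the code can also be compiled off-target against a stand-in; on a Teensy you would pass the global `SPI` object. It assumes the port offers the `transfer(uint8_t)` and in-place `transfer(void *buf, size_t count)` overloads found in the Arduino and Teensy SPI libraries.

```cpp
#include <stddef.h>
#include <stdint.h>

// Byte-at-a-time: every call must wait for its own transfer to finish
// before it can return the received byte (discarded here).
template <typename SpiPort>
void sendByteByByte(SpiPort &spi, const uint8_t *data, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        spi.transfer(data[i]);
    }
}

// Block transfer: a single call covers the whole buffer, which leaves the
// library free to keep the hardware queue fed. The transfer is in place,
// so the received bytes overwrite the contents of data.
template <typename SpiPort>
void sendBlock(SpiPort &spi, uint8_t *data, size_t count) {
    spi.transfer(data, count);
}
```

On a Teensy this would be called as `sendBlock(SPI, buffer, sizeof(buffer));` between `SPI.beginTransaction(...)` and `SPI.endTransaction()`. Keep a copy of the outgoing data if you still need it afterwards, since the block form overwrites the buffer.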
 
Wondering how slowly the Teensy can run and match the UNO? 24 MHz? 16 MHz?

Good question. Turns out 24 MHz pretty closely matches Uno's performance at 16 MHz. The time between the individual transfers is still a little faster, but Teensy has more overhead in the transactions functions, which makes the net speed slightly slower but still very close.

file.png

Here's the Teensy 3.2 performance at only 16 MHz.

file.png

It definitely is slower than Uno when running at the same 16 MHz clock. Most of the difference is in the more complex transaction functions on Teensy.
 
Cool, I always love it when I create an instant firestorm, :). I was in court this week, and managed to create one there too (audible gasp from the adverse party!!).

Paul,
1. I'm sorry, but I'm not sure what the time base is on your figures. How long between major grid lines?
2. I did my stuff a couple of years ago. Could newer updates to the SPI library be more efficient for single-byte xfers?
3. Also, as I recall, I was **not** using your SPI.beginTransaction() code but the basic code shown in the example on your page, as below. Could it be using transactions goes faster?

All in all, I do remember that 500-nsec latency each time a 1-byte xfer was initiated, and it wouldn't seem to go away.

- the sort of simple code that I used:
https://www.pjrc.com/teensy/td_libs_SPI.html
Code:
#include <SPI.h>  // include the SPI library:

const int slaveSelectPin = 20;

void setup() {
  // set the slaveSelectPin as an output:
  pinMode (slaveSelectPin, OUTPUT);
  // initialize SPI:
  SPI.begin(); 
}

void loop() {
  // go through the six channels of the digital pot:
  for (int channel = 0; channel < 6; channel++) { 
    // change the resistance on this channel from min to max:
    for (int level = 0; level < 255; level++) {
      digitalPotWrite(channel, level);
      delay(10);
    }
    // wait a second at the top:
    delay(100);
    // change the resistance on this channel from max to min:
    for (int level = 0; level < 255; level++) {
      digitalPotWrite(channel, 255 - level);
      delay(10);
    }
  }
}

void digitalPotWrite(int address, int value) {
  // take the SS pin low to select the chip:
  digitalWrite(slaveSelectPin,LOW);
  //  send in the address and value via SPI:
  SPI.transfer(address);
  SPI.transfer(value);
  // take the SS pin high to de-select the chip:
  digitalWrite(slaveSelectPin,HIGH); 
}

///////////////////////////////////////////////

Kurt:
I am not surprised at what you say. I was just using the standard library commands and did not try to delve directly into the Teensy3.1 registers, like PUSHR, as you mentioned. I do remember looking at Paul's library code where he does that stuff, but did not play with it myself.

https://raw.githubusercontent.com/PaulStoffregen/SPI/master/SPI.h
https://raw.githubusercontent.com/PaulStoffregen/SPI/master/SPI.cpp
 
1. I'm sorry, but I'm not sure what the time base is on your figures. How long between major grid lines?

All those screenshots are 1 us/div.

The scope shows this (and perhaps too much) other info at the top of the screen. It also shows "3.000us" right below "1.000us/", which I personally find to be kinda confusing. The top number is the time per division and the bottom is the time offset of the trigger event from the center of the screen.

2. I did my stuff a couple of years ago. Could newer updates to the SPI library be more efficient for single-byte xfers?

Lots of stuff has improved in the library, but the single byte transfers should be pretty similar.

Kurt contributed a pretty incredible speedup for SPI.transfer(buffer, length) within the last 2 years.

3. Also, as I recall, I was **not** using your SPI.beginTransaction() code but the basic code shown in the example on your page, as below. Could it be using transactions goes faster?

No. The transaction code doesn't affect the speed of transfers, unless of course you use it to set a different SPI clock speed. But it does add some overhead. Things are simpler on AVR, so this overhead is less.
 
Might be worth noting the code in message #11 never calls any functions to set up the SPI clock speed, data order or format.

Something that *has* changed is the default configuration if you don't set up anything.
 
Cool - I managed to read the SCOPE screen right as far as time scale being the same! The difference in appearance is the High Powered UNO pushing 5V right?

Good question. Turns out 24 MHz pretty closely matches Uno's performance at 16 MHz. The time between the individual transfers is still a little faster, but Teensy has more overhead in the transactions functions, which makes the net speed slightly slower but still very close.
...

Does the PORTB write drop all the way down to the more efficient digitalWriteFast? If not, would using that get Teensy over the hump, at least at 24 MHz?

Code:
#include <SPI.h>

void setup() {
  SPI.begin();
  pinMode(10, OUTPUT);
}

void loop() {
  while (1) {
    SPI.beginTransaction(SPISettings(8000000, MSBFIRST, SPI_MODE0));
#ifdef TEENSYDUINO
    digitalWriteFast(10, LOW);
#else
    PORTB &= ~(1<<2);
#endif
    SPI.transfer(0x5A);
    SPI.transfer(0x20);
    SPI.transfer(0x99);
#ifdef TEENSYDUINO
    digitalWriteFast(10, HIGH);
#else
    PORTB |= (1<<2);
#endif
    SPI.endTransaction();
  }
}
 
I repeated these tests on Arduino's 32 bit boards, but I used digitalWrite since I don't have the direct register code for those boards. Still, the dead time between each SPI transfer should be comparable.

Both Arduino Zero and Arduino Due are slightly slower than Arduino Uno!

Here is Arduino Zero:

file.png

Here is Arduino Due:

file.png

It's interesting that Due is the slowest by a slight margin, despite having the fastest processor (very similar in speed to Teensy 3.2).
 
Paul, thanks, I wondered about the 1-usec vs 3-usec thing at the top of the screen.

I think your figures illustrate my point about the latency in the T3.1,3.2 SPI xfer, although it's not quite as bad as I had remembered. However, my comment about the UNO was off-base, as it's by far the slow-poke here, and no doubt 8-MHz is the fastest SPI clock you can get with a 16-MHz xtal. However, looking at the xfer rates in your 3 figures,

- 16-MHz UNO (8-MHz clock): 3bytes/6usec = 0.50 bytes/usec
- 96-MHz T3.2 (8-MHz clock): 3bytes/4usec = 0.75 bytes/usec
- 96-MHz T3.2 (24-MHz clock): 3bytes/2.25usec = 1.33 bytes/usec

So, the 96-MHz T3.2 with 8-MHz clock is only 50% faster than the 16-MHz UNO.

Also, for the T3.2, you sped up the clock by 3X, but the xfer rate only sped up by 1.78X (1.33 / 0.75). As I see it, there is basically a latency of approx 300-nsec in the SPI port operation that is limiting the maximum xfer rate. I thought I had remembered 500-nsec, but it looks more like 300-nsec. I'm sure that latency is not in the software, as 96-MHz is flying, so it's no doubt in the SPI port hardware.
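
The latency estimate can be checked from the figures in the list above. As a rough sketch (the function name is invented, and the totals are the approximate values read off the scope screenshots, so the results are only as good as those readings): the ideal time to shift one byte is 8 bits divided by the SPI clock, and whatever remains of the measured time per byte is overhead.

```cpp
// Average per-byte overhead in nanoseconds: measured time per byte minus the
// ideal shift time of 8 bits at the given SPI clock.
inline double perByteOverheadNs(double totalUs, int byteCount, double sckMHz) {
    const double measuredNsPerByte = totalUs * 1000.0 / byteCount;
    const double idealNsPerByte = 8.0 * 1000.0 / sckMHz;
    return measuredNsPerByte - idealNsPerByte;
}
```

With the numbers above this gives about 333 nsec for the T3.2 at 8 MHz (3 bytes in 4 usec), about 417 nsec at 24 MHz (3 bytes in 2.25 usec), and about 1000 nsec for the UNO at 8 MHz (3 bytes in 6 usec). The totals include whatever else falls inside the measured window, so treat these as rough estimates rather than a pure hardware latency.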

So, that's the issue with single-byte xfers. Luckily, however, we can use the FIFO and the thing just rips.

EDIT:
Paul: And as your figures show, when the FIFO is used, then the latency I have been referring to essentially goes away.
http://dorkbotpdx.org/files/images/scope2.png
https://dorkbotpdx.org/blog/paul/display_spi_optimization
 