Really fast SPI - maybe try assembly?

Status
Not open for further replies.

ehab

Trying to get the fastest possible throughput using an ADS8331 ADC with Teensy.
The main loop looks like this right now:

Code:
digitalWriteFast(ADS8331_CONVSTB_pin, LOW);  // "conversion start" pulse tells ADC to do its thing, needs to be at least 40ns wide
digitalWriteFast(ADS8331_CONVSTB_pin, HIGH);
digitalWriteFast(CS_adc, LOW);
// read a 16-bit data word from the ADC
//   Note, this actually reads data from the previous iteration of the loop.
//   The A-to-D conversion initiated in the current iteration of this loop is happening while we read out the last result
ch0_raw = SPI.transfer16(0x0000);
digitalWriteFast(CS_adc,  HIGH);

I'm clocking the Teensy at 96MHz and it's not fast enough.
From the falling edge of CONVSTB to the rising edge of CS_adc takes just a tad over 1.5us; I need to get that down to about 1.4us.
Here's the time breakdown, measured on the scope:
- width of the negative-going CONVSTB pulse: 80ns
- rising edge of CONVSTB to falling edge of CS_adc: 250ns
- falling edge of CS_adc to first SPI clock edge: 200ns
- 16 SPI clocks: 640ns (using 24MHz SPI clock speed, defining a higher clock speed produces the same result)
- last SPI clock edge to CS_adc rising edge: 350ns
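For reference, the five measured intervals can be tallied against the 1.4us goal. The sketch below is plain C++ that can be checked on a PC rather than on the Teensy; the struct and function names are made up for illustration, and the numbers are simply the scope readings listed above.

```cpp
#include <cstdint>

// Scope measurements from the list above, in nanoseconds.
// All names here are illustrative, not from any Teensy library.
struct TimingBudget {
    uint32_t convstPulse;   // width of the negative-going CONVSTB pulse
    uint32_t convstToCs;    // CONVSTB rising edge to CS_adc falling edge
    uint32_t csToFirstClk;  // CS_adc falling edge to first SPI clock edge
    uint32_t spiClocks;     // 16 SPI clocks
    uint32_t lastClkToCs;   // last SPI clock edge to CS_adc rising edge
};

constexpr uint32_t totalNs(const TimingBudget& t) {
    return t.convstPulse + t.convstToCs + t.csToFirstClk
         + t.spiClocks + t.lastClkToCs;
}

constexpr TimingBudget measured{80, 250, 200, 640, 350};
constexpr uint32_t targetNs = 1400;
constexpr uint32_t overNs   = totalNs(measured) - targetNs;  // amount to shave off
```

The intervals add up to 1520ns, so roughly 120ns has to go. The three gaps around the SPI clocks (250 + 200 + 350 = 800ns) make up more than half of the total.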

Where's the low-hanging fruit where I can shave off a couple of hundred nanoseconds?
- Replace the digitalWriteFast commands with something faster? Inline assembly maybe? (never tried that but up for it)
- Try to rewrite the SPI.transfer16 function to make it faster? (my C/C++ is extremely rusty but maybe this is my opportunity to scrape off the rust)
- Something else?

I'm relatively new to Teensy and Arduino, any pointers much appreciated.
 
If you don't need USB to work, you can overclock T3.1 faster, 120 MHz or more. Some discussion here:
https://forum.pjrc.com/threads/25755-Teensy-3-1-overclock-to-120MHz
You can get 120 MHz by uncommenting the relevant lines in boards.txt
It is not available by default because 1) USB will not work and 2) it is an overclock, which "usually" works but is not guaranteed.

EDIT: Apparently I'm wrong about USB. I was sure I read that somewhere, oh well.
 
I don't believe you can do better than digitalWriteFast unless your use case has multiple pins being toggled, in which case a lookup table may shave off a bit. I'd say rather that you need to configure SPI for the highest possible clock speed, which may not in fact be obtained at the highest CPU speed. What I can't find at the moment are the forum posts discussing that.

There is also
https://github.com/xxxajk/spi4teensy3

Which is now widely used as a basis for Teensy SPI code and I understand is making full use of the onboard hardware.
 
Where's the low-hanging fruit where I can shave off a couple of hundred nanoseconds?

Overclocking to 120 MHz will allow a 30 MHz SPI clock, and it'll speed everything else up by maybe 10 to 20%.
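As a rough check of what the faster SPI clock buys, here is a back-of-the-envelope calculation in plain C++ (checkable on a PC). It assumes the maximum SPI clock is the CPU clock divided by 4, which matches the two data points in this thread (96 MHz gives 24 MHz, 120 MHz gives 30 MHz) but is otherwise an assumption; the function names are invented for illustration.

```cpp
#include <cstdint>

// Assumption: max SPI clock = CPU clock / 4 (fits 96 -> 24 MHz and 120 -> 30 MHz).
constexpr uint32_t maxSpiMHz(uint32_t cpuMHz) { return cpuMHz / 4; }

// Time for `bits` SPI clock periods, in ns (integer division, rounds down).
constexpr uint32_t transferNs(uint32_t bits, uint32_t spiMHz) {
    return bits * 1000u / spiMHz;
}

constexpr uint32_t at96    = transferNs(16, maxSpiMHz(96));   // 16 clocks at 24 MHz
constexpr uint32_t at120   = transferNs(16, maxSpiMHz(120));  // 16 clocks at 30 MHz
constexpr uint32_t savedNs = at96 - at120;
```

Under that assumption the 16 clocks alone shrink from about 666ns to about 533ns, a saving of roughly 130ns, which is close to the amount the original post is trying to shave off, before counting any speed-up in the surrounding code.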

Using FASTRUN will put the entire function into RAM, which may or may not be any faster. It avoids the wait states for accessing flash, but the flash has a cache and prefetch buffer that eliminates most waiting anyway. Still, FASTRUN sometimes helps. The more you overclock, the more FASTRUN matters, because the RAM always runs at CPU speed. Testing has shown the flash doesn't overclock well. 120 MHz mode overclocks the CPU but not the flash.

Those are the easy low-hanging fruit.

Here's a couple less easy ideas:

You might shave some time off by using the SPI port to generate the CS signal automatically. You'll need to use SPIFIFO or manipulate the registers directly. That will eliminate the need to use digitalWriteFast() twice for the CS pin. Just write to the appropriate bits in the PUSHR register, and of course configure the pin ahead of time to be controlled by SPI instead of GPIO. The SPI port will then automatically drive the pin low as the transfer begins, and drive it high at the end. There's little overhead, other than writing different data to the PUSHR register (which might be a net loss in some cases, depending on the surrounding code and how the compiler obtains the constant).
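To make "the appropriate bits in the PUSHR register" a little more concrete, here is a sketch of how such a command word could be assembled. It is plain C++ so it can be checked on a PC. The bit positions are assumed from the K20 reference manual's PUSHR layout (CTAS in bits 30-28, PCS chip-select bits in 21-16, transmit data in bits 15-0) and should be verified against the manual before use. The helper name and the choice of chip select 0 are hypothetical.

```cpp
#include <cstdint>

// Assumed PUSHR field positions (verify against the K20 reference manual).
constexpr uint32_t PUSHR_CTAS_SHIFT = 28;  // selects which CTAR (frame format) to use
constexpr uint32_t PUSHR_PCS_SHIFT  = 16;  // one bit per hardware chip select

// Hypothetical helper: build one PUSHR word for a single frame.
constexpr uint32_t pushrWord(uint32_t ctas, uint32_t pcsMask, uint16_t data) {
    return ((ctas & 0x7u) << PUSHR_CTAS_SHIFT)
         | ((pcsMask & 0x3Fu) << PUSHR_PCS_SHIFT)
         | data;
}

// One frame using CTAR0, hardware chip select 0, dummy data 0x0000.
constexpr uint32_t adcReadWord = pushrWord(0, 0x01, 0x0000);
```

On the Teensy, a word like this would be written to the SPI port's PUSHR register once the pin has been switched to its SPI chip-select function and the selected CTAR has been set up for 16-bit frames. SPIFIFO wraps these steps.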

If the start pulse only needs rising or falling edge to cause the conversion, and if you're going to end up accessing the SPI registers anyway, you might try doing only a single digitalWriteFast() before you begin sending the SPI transfer. Then you can do the other edge, which presumably the chip ignores, while you're waiting for the SPI port to churn out those 16 clocks.

If you're storing the data to RAM or doing something else with it, maybe you can process the previous reading while you wait for the SPI.
 
Can I just mention that the online technical support for Teensy is the best I have ever experienced, for any hardware or software product of any kind?
 
Wow, thanks!
Lots of good stuff to try out.

One thing that caught my attention was the varying delays between successive calls to digitalWriteFast.
Between the first two lines of the code snippet I posted, it's 80ns, but 250ns between the 2nd and 3rd lines.
If both were 80ns I'd have reached my goal already - loop speed would be limited by the ADC conversion time, not the Teensy.
Wondering why it takes longer between lines 2 and 3?

Also, browsing the MK20DX256VLH7 documentation, it mentions "bit banding" where certain register addresses serve as aliases for single bits of other registers (including GPIOs I assume) so that changing a single output pin can be done in a single write instruction, rather than a read-modify-write. Does digitalWriteFast already take advantage of this feature?
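For readers unfamiliar with bit banding, the alias address follows the standard ARM Cortex-M formula: alias = alias_base + byte_offset * 32 + bit * 4. The sketch below (plain C++, checkable on a PC) applies it to the peripheral region. The two region bases are the architectural Cortex-M values; the example register address 0x400FF080 and bit 4 are only assumed to correspond to a GPIO output register and the pin of interest, and should be checked against the MK20DX256 manual and the Teensy pin map. Whether digitalWriteFast uses this mechanism is not settled by this example.

```cpp
#include <cstdint>

constexpr uint32_t PERIPH_BASE    = 0x40000000u;  // start of peripheral bit-band region
constexpr uint32_t PERIPH_BB_BASE = 0x42000000u;  // start of its alias region

// Each bit in the region gets its own 32-bit word in the alias region.
constexpr uint32_t bitBandAlias(uint32_t regAddr, uint32_t bit) {
    return PERIPH_BB_BASE + (regAddr - PERIPH_BASE) * 32u + bit * 4u;
}

// Assumed example: bit 4 of a GPIO output register at 0x400FF080.
constexpr uint32_t exampleAlias = bitBandAlias(0x400FF080u, 4);
```

Writing 1 or 0 to the 32-bit word at the alias address sets or clears just that one bit, with no read-modify-write sequence in software.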

If you're storing the data to RAM or doing something else with it, maybe you can process the previous reading while you wait for the SPI.
To clarify, the ADS8331 has 4 input pins but only 1 sample & hold and ADC, and I need to grab the values for the 4 channels "simultaneously", or as close as I can get to simultaneous. So the main loop has 4 iterations of what I posted (and the delay between those is most critical to minimize) followed by processing.

mortonkopf: I've been looking for a tutorial like that link you posted, not just for this project, thanks!
 
Just to report back from the orchard of low hanging fruit:
Using FASTRUN reduces the time from rising edge of CONVSTB to falling edge of CS_adc, from ~220ns (yes I know I wrote 250 before but 220 is more accurate) to 180ns, and has no effect on other timing.

But also... replacing
Code:
int CS_adc      = 10; // pin 10 used as ADC chip select
with
Code:
#define CS_adc 10
further reduces that to 80ns.
Doing the same for ADS8331_CONVSTB_pin makes things even faster.

Yeah that was probably bleeding obvious to all here except me - did I mention I'm just a big noob? :D
 
Noob, eh? ... which is why you have a 'scope and are measuring intervals down to 80ns. Uh-huh. ;)

I can spot a ringer! :D

Besides, if you're a Noob, what does that make me? A sub-noob?
 
One of the benefits of C++ is that fewer preprocessor #define directives are needed than back in the old days. In this case, it's so simple it doesn't matter, but as a style point, you might try declaring CS_adc as a const int instead of using the preprocessor. That should give the compiler enough information to produce the same code. It's worth moving away from preprocessor directives because they are error-prone.
Code:
const int CS_adc = 10;
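One way to see that a const int initialized with a literal is a compile-time constant, just like the #define, is to use it where the language demands one. This sketch is plain C++ and compiles on a PC; PinTag is a made-up helper for the demonstration, not part of any Teensy API.

```cpp
const int CS_adc = 10;  // typed, scoped, and still a constant expression

// Hypothetical helper: a template argument must be known at compile time.
template <int Pin>
struct PinTag {
    static constexpr int value = Pin;
};

// Both lines compile only because CS_adc is a compile-time constant.
static_assert(CS_adc == 10, "CS_adc is usable in constant expressions");
constexpr int csPin = PinTag<CS_adc>::value;
```

A plain `int CS_adc = 10;` would be rejected in both places, which is presumably also why the non-const version kept the compiler from generating the fastest pin-write code.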
 
Thanks again for your help so far.
Working OK on the breadboard now, until I got to my next challenge: add 2 metres of cable between the Teensy and the ADC.
All the SPI signals to and from the ADC, as well as chip select, reset, and "conversion start" signal, need to go through 2m cable, and analog and digital supply (5V / 3.3V respectively) and corresponding grounds are supplied from the Teensy thru that same cable.
It messes things up pretty good :D
Time to learn about signal integrity, I guess!

Here's what I tried so far:
- looked at the signals on the scope - they look recognizable but in pretty bad shape; no doubt there are glitches
- added 1kOhm load resistors as "termination" to the SPI signals - does not help
- looked at ground bounce (scope probe "ground" goes to "ground" on one side of the cable, scope probe tip to the "same" "ground" on the other side of the cable) - yeah it's horrendous
- added a nice short, low-resistance wire between the 2 "grounds" - does eliminate ground bounce but data coming back is still corrupted
- ruled out the cheapo shielding method where you use a flat cable and make sure that any 2 signals are separated by a ground wire - can't use flat cable, need something more robust
- looked at some datasheets for line drivers / receivers on Digikey; but got the feeling that adding something like this may not be necessary, or worse, may not even solve the problem, because the real problem here is that I don't know what I'm doing (OK I have a vague idea about the theory of signal integrity, but no practical experience to speak of)
- concluded that "shielded" cable with one common shield over all the conductors probably won't help because it won't reduce crosstalk between the conductors

SPI clock speed is only 24MHz, so I reckon this would not be a very hard problem for someone with some actual experience in signal integrity. If you happen to be such a person, I shall be glad to hear your suggestion(s). Thanks!
 
Yikes, SPI at 24 MHz over TWO METERS. That's just crazy talk.

Honestly, I'd be pretty amazed if *anything* could make this work reliably. Two meters is far too long.
 
You're probably going to have to locate the Teensy close to that ADC chip. There are ways to transmit data 2 meters. USB can do 2 meters pretty well. SPI can't.
 
Thanks all.
After reading the app note that Headroom linked to, I'm looking at RS-422 transceivers. Any opinions re. 24MHz SPI protocol over 2m using an RS-422 physical layer? Still unsure about cable type - generic "printer cable" (DB-25 connector at each end) would be nice but maybe it would need a bunch of twisted pairs inside a common shield (not sure what those are called)

You're probably going to have to locate the Teensy close to that ADC chip
The analog sensors and ADC sit on a board that has to be very small because it needs to go into a very small space - a Teensy won't fit there. I guess my other option is the opposite - move the ADC off the small board, onto the bigger board where the Teensy is located. Not very attractive because then I need to keep 4 analog signals nice and clean over 2m in a noisy environment. Although, if SPI over 2m gets too painful I might still go that way.
 
Any opinions re. 24MHz SPI protocol over 2m using an RS-422 physical layer?

If you only transmit, using MOSI, SCK and CS, but *not* MISO, then maybe you can get the propagation delay of 3 transmitters and 3 receivers matched closely enough.

When you throw MISO into the mix, propagation delays become much harder.
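A rough calculation shows why the return path is the hard part. MISO data is launched by a clock edge that first has to travel down the cable, and the data then has to travel back, so the master sees it a full round trip late. The sketch below is plain C++ (checkable on a PC). Every number in it is an assumed, typical value for illustration only: about 5 ns per metre of cable and a hypothetical 10 ns per driver or receiver stage. Real figures come from the cable and transceiver datasheets.

```cpp
#include <cstdint>

constexpr uint32_t CABLE_NS_PER_M = 5;   // assumed typical cable delay
constexpr uint32_t STAGE_NS       = 10;  // assumed delay per driver or receiver

// One-way delay: driver + cable + receiver.
constexpr uint32_t oneWayNs(uint32_t metres) {
    return STAGE_NS + metres * CABLE_NS_PER_M + STAGE_NS;
}

// SCK out to the ADC, then MISO back to the master.
constexpr uint32_t roundTripNs(uint32_t metres) { return 2 * oneWayNs(metres); }

// Half an SPI clock period in picoseconds (integer math, rounds down).
constexpr uint32_t halfPeriodPs(uint32_t spiMHz) { return 500000u / spiMHz; }

constexpr uint32_t rtNs     = roundTripNs(2);    // 2 m cable
constexpr uint32_t halfPs   = halfPeriodPs(24);  // 24 MHz SPI clock
constexpr bool     misoLate = rtNs * 1000u > halfPs;
```

With these assumed numbers the round trip (60 ns) is longer than a whole 24 MHz clock period (about 42 ns), so the returning bits would miss the edge on which the master samples them. Signals that all travel in the same direction share the same delay, which is why closely matched transmitters and receivers can be enough for MOSI, SCK and CS.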
 
Can I just mention that the online technical support for Teensy is the best I have ever experienced, for any hardware or software product of any kind?
+10^6 :)

BTW, I lied....
The analog sensors and ADC sit on a board that has to be very small because it needs to go into a very small space - a Teensy won't fit there
A Teensy WILL in fact fit ABOVE the circuitry that's already there. So we'll go vertical, put a Teensy there to talk to the ADC and do the fast signal processing, but also leave the Teensy on the other board, and have them talk to each other over UART or such at a much lower data rate. Moar Teensies!
 
I'm also interested in bursts of fast (> 1 Msps) AD on a teensy 3. Some ideas: use an external ADC chip with a parallel interface. Or a small, low cost, fast ADC MCU that could act as a co-processor and buffer the data (PIC24FJ64GC006?).
 
Hi jonr,

use an external ADC chip with a parallel interface
I also looked at that option but after browsing Digikey I found that all the ADCs with parallel interfaces were either too slow for my needs, or crazy fast so the Teensy would have no chance of keeping up, or unsuitable for me in some other way. If you do go down that route it would be interesting to see which ADC you choose.
 
....crazy fast so the Teensy would have no chance of keeping up,..

Of course a fast ADC can always be operated at slower speeds. As Paul once mentioned, if you pick the right input lines, the teensy can read 12 parallel bits in a single read. So anything less than perhaps 10 Msps should be fine if you just put the data into an array. Faster if you use DMA.

Pricing is odd - complete MCUs with fast AD cost less than equally fast standalone ADCs.
 
jonr: IIRC another issue with blazing fast parallel-output ADCs was that their digital outputs used LVDS signalling. Oh yeah, and pricing! I suspect the ones I saw may have been designed for use in oscilloscopes and such.
 
Here's something I should have done at the very beginning - testing the throughput of the Teensy's built-in ADCs (assuming 4 ADC channels)
Full code below

Code:
// Speed test for internal ADCs - if fast enough, we don't need external ADS8331

// Pin assignment
const int AIN0 = 14;
const int AIN1 = 15;
const int AIN2 = 16;
const int AIN3 = 17;
const int SCOPE_SYNC_pin = 2;

// Variables
uint16_t ch0_raw, ch1_raw, ch2_raw, ch3_raw;

void setup() {
  // put your setup code here, to run once:
  pinMode(SCOPE_SYNC_pin, OUTPUT);
}

FASTRUN void loop() { 

  digitalWriteFast(SCOPE_SYNC_pin, HIGH);
  ch0_raw = analogRead(AIN0);
  ch1_raw = analogRead(AIN1);
  ch2_raw = analogRead(AIN2);
  ch3_raw = analogRead(AIN3);
  digitalWriteFast(SCOPE_SYNC_pin, LOW);

}

I'm probing pin 2 (SCOPE_SYNC_pin).
It's jittery. The delay between consecutive rising edges (i.e. time it takes to do 4 A/D conversions) jumps between 40us and about 37us.
Slower than I expected - I guess there are probably ways to speed this up.
Anyone know of further speed-up options worth looking into?
I've got it down to 8us for 4 channels, using the external ADC - so anything less than a 5x speed improvement won't do me any good, though it may still help others reading here of course.
 
Surround it with a while(1) loop, rather than allowing loop() to return. Between each run of loop(), some other code is run to do things like serialEvent() calls. That stuff is probably responsible for the occasional 37 us issue you're seeing.

Of course, if you never return from loop() and never use delay() or blocking I/O functions, things like serialEvent will not work. Hardly seems like an issue here, but I'm mentioning it for the sake of others who find this info later by searching.
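A minimal sketch of this suggestion, applied to the speed-test program from the previous post. The `#ifndef ARDUINO` section is not part of the suggestion; it only provides do-nothing stand-ins (an assumption made for this example) so the file also compiles on a PC.

```cpp
#ifndef ARDUINO
// Host-only stand-ins so the sketch compiles off-target. On a Teensy the
// real definitions come from the core and this section is skipped.
#include <cstdint>
#define FASTRUN
#define HIGH 1
#define LOW 0
#define OUTPUT 1
inline void pinMode(int, int) {}
inline void digitalWriteFast(int, int) {}
inline int analogRead(int pin) { return pin; }
#endif

// Pin assignment
const int AIN0 = 14;
const int AIN1 = 15;
const int AIN2 = 16;
const int AIN3 = 17;
const int SCOPE_SYNC_pin = 2;

uint16_t ch0_raw, ch1_raw, ch2_raw, ch3_raw;

void setup() {
  pinMode(SCOPE_SYNC_pin, OUTPUT);
}

FASTRUN void loop() {
  // Never return to the core's main loop, so nothing else runs between passes.
  while (1) {
    digitalWriteFast(SCOPE_SYNC_pin, HIGH);
    ch0_raw = analogRead(AIN0);
    ch1_raw = analogRead(AIN1);
    ch2_raw = analogRead(AIN2);
    ch3_raw = analogRead(AIN3);
    digitalWriteFast(SCOPE_SYNC_pin, LOW);
  }
}
```

As noted above, anything that relies on loop() returning, such as serialEvent(), stops working with this structure.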
 