DMA SPI on T3.2

Status
Not open for further replies.

gfvalvo

Well-known member
Hi all. I decided to try driving an APA102 LED strip using DMA-based SPI on a T3.2. I’m using the standard SPI library that comes with Teensyduino. My code loads up the DMA buffer, calls ‘SPI.beginTransaction()’, and then calls this overload of ‘SPI.transfer()’:

Code:
bool transfer(const void *txBuffer, void *rxBuffer, size_t count,  EventResponderRef  event_responder);

The SPI parameters are set up for 24 MHz clock and things work pretty much as expected. The SPI.transfer() call returns very quickly and DMA autonomously blasts the data out to the LED strip while my code can do other stuff at the same time. Pretty cool.

Only one issue -- my timing measurements showed that the DMA transfer was taking longer than expected given the SPI clock rate and number of bytes transferred. So, I put the SPI Clock signal on a scope. This indeed showed the SPI Clock running at the full 24 MHz, but it was bursty. There are always 8 clock pulses at 24 MHz (a single byte being transferred), but each group is separated by a time gap. The gaps vary some, but they’re all on the order of ~200ns each.

So, finally, the question -- before I dig into this further, is this characteristic just the way it is with DMA SPI on a T3.2? Or, perhaps, is the DMA or SPI peripheral not being set up optimally by the standard SPI library? Or could the DMA be forced into wait states by other memory bus activity? That seems unlikely to me given how regular the gaps are.

Thanks in advance.
 
With SPI, there is typically a gap between byte or word transfers. So yes you would see a gap after each 8 bit entry is output.

The gap can be shorter or longer. In particular, the PUSHR register on the T3.x has a bit (SPI_PUSHR_CONT) which, when set, indicates that we are going to continue outputting data, so the gap is shorter. If this bit is not set, the gap will be longer...

Why (you can probably skip this): the CONT bit says that another byte will follow, so CS is left asserted. Without this bit set, the hardware logically de-asserts CS (potentially more than one CS line, depending on other bits), so the timing has to leave a delay before CS changes and another delay after it changes... Longer gap...

I believe the SPI DMA code is set up to use CONT for all of the transfers. I know I checked it when implementing. That is, just before we begin the outputs, we output the first byte with:
port().PUSHR = dma_first_byte | SPI_PUSHR_CTAS(0) | SPI_PUSHR_CONT;
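
To make the role of the command bits concrete, here is a minimal desktop-side sketch of how such a PUSHR word is put together and read back. The names kPushrCont, pushrCtas, makePushrWord, hasCont and ctasOf are hypothetical; the bit positions (CONT in bit 31, CTAS in bits 30-28) are assumed to mirror the SPI_PUSHR_CONT and SPI_PUSHR_CTAS macros and should be checked against the reference manual.

```cpp
#include <cstdint>

// Assumed PUSHR layout: bit 31 = CONT, bits 30..28 = CTAS, low 16 bits = data.
// These constants are redefined locally so the sketch builds off-target.
constexpr uint32_t kPushrCont = 0x80000000u;  // keep CS asserted after this frame

// Select which CTARn register describes the frame (hypothetical helper).
constexpr uint32_t pushrCtas(uint32_t n) { return (n & 7u) << 28; }

// Combine one data byte with the command bits into a 32-bit PUSHR value.
constexpr uint32_t makePushrWord(uint8_t data, uint32_t ctar, bool cont) {
  return static_cast<uint32_t>(data) | pushrCtas(ctar) | (cont ? kPushrCont : 0u);
}

// Read the command bits back out of a 32-bit PUSHR value.
constexpr bool hasCont(uint32_t word) { return (word & kPushrCont) != 0; }
constexpr uint32_t ctasOf(uint32_t word) { return (word >> 28) & 7u; }
```

A word built with makePushrWord(b, 0, true) carries CONT and selects CTAR0, which is what the line above writes for the first byte.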

Could go into more details...
More details are in the T3.2 reference manual, in the SPI chapter. Take a look at SPIx_CTARn.
Many of these settings are applied when you call beginTransaction... Actually, most of the time they are computed at compile time, when you use constants.
But if you take a look at the class SPISettings (about line 339...) you will see what values are set depending on CPU F_BUS speed and desired SPI clock...
 
Thanks @KurtE. For my own edification, I'll dig into the SPI and DMA sections of the chip manual again. It's been a while. As far as you know, would a ~200ns gap be in the ballpark for 24MHz clock rate? Any chance it can be reduced?
 
Here is a sample program I ran back then... Slightly modified to your setting...
Code:
//==============================================================
// SPI Master quick test - DMA version

#include <SPI.h>

EventResponder event;
uint8_t  xxx[16];
uint8_t  yyy[16];

void asyncEventResponder(EventResponderRef event_responder) {
  digitalWriteFast(10, HIGH);
  SPI.endTransaction();
  Serial.print("YYY: ");
  for (uint8_t i = 0; i < sizeof(yyy); i++) Serial.printf("%02x ", yyy[i]);
  Serial.println();
}
void setup() {
  while (!Serial && (millis() < 2000)) ;
  Serial.println("Test SPI DMA master");
  //SPI.setMISO(8);
  //SPI.setMOSI(7);
  //SPI.setSCK(14);
  pinMode(10, OUTPUT);
  digitalWriteFast(10, HIGH);
  SPI.begin();
  event.attachImmediate(&asyncEventResponder);
  for (uint8_t i = 0; i < sizeof(xxx); i++) {
    xxx[i] = i;
  }
}

void loop() {
  SPI.beginTransaction(SPISettings(24000000, MSBFIRST, SPI_MODE0));
  digitalWriteFast(10, LOW);
  SPI.transfer(xxx, yyy, sizeof(xxx), event);
  Serial.printf("CTAR0: %x\n", KINETISK_SPI0.CTAR0);
  delay(1000);
}

Here is Logic Analyzer showing output:
screenshot.jpg

OK, now the output for CTAR0 shows: CTAR0: b8000000
Which translates to:
DBR = 1 (Double Baud Rate)
FMSZ = 7, i.e. 8 bits per frame

Which translates to a 50/50 clock duty cycle.
Also PCSSCK = 0, i.e. a prescaler value of 1 for the PCS to SCK delay (the delay between assertion of PCS and the first edge of SCK),
and PASC = 0, so the time between the last edge of SCK and the negation of PCS again uses a prescaler value of 1...
PDT = 0 ...

Again, I am not seeing any way to shorten this, as all of these values are 0...

Verified calculation
FBUS: 48000000

So when calculating CTAR0, the test if (clock >= F_BUS / 2), i.e. 24000000 >= (48000000 / 2), should be true, so:
c = SPI_BR_SPPR(0) | SPI_BR_SPR(0);
 
Hi @KurtE. So, upon further examination of the datasheet, things don't seem to be adding up. More specifically, they're not adding up to the amount of inter-frame gap shown in both of our measurements.

Figure 45-74 of the "K20 Sub-Family Reference Manual" shows that with CONT = 1, the gap should be TASC + TCSC.

From the CTAR0 value reported, we can compute:
TASC = PASC x ASC = 1 x 2 = 2 cycles of Fsys
TCSC = PCSSCK x CSSCK = 1 x 2 = 2 cycles of Fsys

So, that's a total gap of 4 cycles of Fsys.

But the 208 ns gap shown in both of our measurements represents 10 cycles of Fsys (assuming Fsys is 48 MHz on a 96 MHz T3.2).

So, where are the extra 6 cycles of gap delay coming from?
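
The arithmetic above can be checked with a small sketch. It assumes the CTAR field positions as I recall them from the reference manual (FMSZ in bits 30-27, PCSSCK in 23-22, PASC in 21-20, CSSCK in 15-12, ASC in 11-8), that the prescaler fields encode 1, 3, 5, 7 and that the scaler fields encode 2^(n+1). The names CtarDelays, decodeCtar and cyclesToNs are hypothetical.

```cpp
#include <cstdint>

// Delays recovered from a CTAR value, in cycles of the SPI module clock.
struct CtarDelays {
  unsigned frameBits;   // FMSZ + 1
  unsigned tCscCycles;  // PCS assertion to first SCK edge
  unsigned tAscCycles;  // last SCK edge to PCS negation
};

// Assumed encodings: prescaler fields map 0..3 to 1, 3, 5, 7;
// scaler fields map n to 2^(n+1).
inline unsigned prescaler(uint32_t field) { return 2u * field + 1u; }
inline unsigned scaler(uint32_t field) { return 1u << (field + 1u); }

inline CtarDelays decodeCtar(uint32_t ctar) {
  CtarDelays d;
  d.frameBits = ((ctar >> 27) & 0xFu) + 1u;
  d.tCscCycles = prescaler((ctar >> 22) & 0x3u) * scaler((ctar >> 12) & 0xFu);
  d.tAscCycles = prescaler((ctar >> 20) & 0x3u) * scaler((ctar >> 8) & 0xFu);
  return d;
}

// Convert a cycle count to nanoseconds at the given clock frequency.
inline double cyclesToNs(unsigned cycles, double clockHz) {
  return 1e9 * cycles / clockHz;
}
```

For CTAR0 = 0xB8000000 at 48 MHz this gives 2 + 2 = 4 cycles, about 83 ns, whereas the measured 208 ns corresponds to 10 cycles.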

Thanks again for your time.
 
So, the plot thickens. I switched over to the non-DMA overload of ‘transfer()’:
Code:
void transfer(const void * buf, void * retbuf, size_t count);
After this, two very interesting things happened:

First, the inter-frame gap dropped to the expected value of ~83 ns (4 clock cycles at 48 MHz).

Second, the transfers were done in 16-bit mode. This, of course, cut the number of inter-frame gaps in half -- further improving the total transfer time. An aside --- Since I was sending an odd number of bytes, there was actually one 8-bit transfer at the start. All remaining transfers were 16-bit.

My two conclusions from this exercise are:

* The DMA technique adds a large (and highly variable) delay to the inter-frame gap.

* When you really want to blast out the SPI data quickly (ex. for driving APA102 LED strips), use 16-bit transfers.

The second conclusion was obvious once I thought about it. The first conclusion is disappointing. Guess I’ll go reread the DMA section and see if there’s anything that can be done about it.
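
A rough model of the two cases, as a sketch only: it counts the clocked bits plus one fixed gap between consecutive frames, and ignores start-up time and the gap variation seen with DMA. The function names transferTimeNs and packedFrames are hypothetical.

```cpp
#include <cstddef>

// Time on the wire in nanoseconds: all clocked bits plus one gap between
// each pair of consecutive frames.
inline double transferTimeNs(std::size_t totalBits, std::size_t frames,
                             double sckHz, double gapNs) {
  const double clockedNs = 1e9 * static_cast<double>(totalBits) / sckHz;
  const double gaps = frames > 0 ? static_cast<double>(frames - 1) : 0.0;
  return clockedNs + gaps * gapNs;
}

// Frames needed when bytes are packed two per 16-bit frame, with a single
// 8-bit frame absorbing an odd byte.
inline std::size_t packedFrames(std::size_t bytes) {
  return bytes / 2 + bytes % 2;
}
```

For 151 bytes at 24 MHz, transferTimeNs(151 * 8, 151, 24e6, 208.0) comes to roughly 82 µs, while transferTimeNs(151 * 8, packedFrames(151), 24e6, 83.0) comes to roughly 57 µs.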
 
Yep. In the non-DMA version, we put in code to pack and unpack two 8-bit values into a 16-bit value so as to speed things up. Note: I originally had some 16-bit versions of these APIs as well.

When you say variable delay times, I would expect that for the most part they should be the same... Except maybe from the first to the second byte, as I spoon-feed the first byte outside of DMA and then turn on DMA. Why? The upper word of PUSHR needs to be set up with information like which CTAR to use, baud, CONT... Also, if you try to transfer more than 32767 bytes, it will limit each transfer to 32767, and when it receives an interrupt saying that those bytes have been transferred, it will start up the next transfer... So there could/should be a larger gap there...
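
The chunking described above can be pictured with a short sketch. This is not the library's code; splitTransfer is a hypothetical helper that only shows how a long transfer breaks into pieces of at most 32767 bytes, each of which would be followed by an interrupt and a restart.

```cpp
#include <cstddef>
#include <vector>

// Split a transfer of `count` bytes into chunk sizes no larger than maxChunk.
// Returns an empty list if count or maxChunk is zero.
inline std::vector<std::size_t> splitTransfer(std::size_t count,
                                              std::size_t maxChunk = 32767) {
  std::vector<std::size_t> chunks;
  if (maxChunk == 0) return chunks;
  while (count > 0) {
    const std::size_t n = count < maxChunk ? count : maxChunk;
    chunks.push_back(n);
    count -= n;
  }
  return chunks;
}
```

A 151-byte transfer yields a single chunk, so the restart gap never occurs; a 70000-byte transfer would yield three.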

Again, not sure why there would be more of a gap for DMA... Unless maybe DMA is not actually filling the FIFO queue, but waiting for the queue to be empty before queuing the next entry? I don't think I did anything to disable the FIFO, but...

The 16-bit version I mentioned was: void transfer16(const void * buf, void * retbuf, size_t count);
And likewise a DMA version... But it was decided not to include them...
 
I haven’t carefully quantified it yet, but the gap does vary when using DMA. Seemed to range from high 100’s to low 200’s (ns). I was only sending 151 bytes, so 32767 byte threshold did not come into play.

I’ll look into the DMA code more. Might end up customizing an APA102-only version. Such applications don’t have any other devices on the bus (they don’t have Slave Select). Thus, no SPI Transaction begin/end required. Just grab the bus and keep it. Also, perhaps 16-bit transactions only -- I can pad an extra byte on the end frame if required for even byte count. And, hopefully, get rid of the DMA-induced extra delay.
 
Let me know if you find out anything interesting... Like ways to update the library to speed it up
 
Well, I think I have an explanation for the noted behavior - at least a “Handwaving” one. It appears to me that the upper 16 bits of the PUSHR register (the command word) are not “sticky”. By this I mean that if you first do a 32-bit write to it (programmatically) such as:
Code:
port().PUSHR = dma_first_byte | SPI_PUSHR_CTAS(0) | SPI_PUSHR_CONT;
then the command word will only be valid for that SPI frame. If you subsequently do an 8-bit or 16-bit write (such as the DMA operation will do), then instead of staying at its current value, the command word will revert to 0x0000. Thus, the SPI_PUSHR_CONT bit will become unset.

I first noticed this when I tried to use CTAR1 which was set for 16-bit SPI frames:
Code:
port().PUSHR = dma_first_byte | SPI_PUSHR_CTAS(1) | SPI_PUSHR_CONT;
The first frame was indeed 16 bits, but subsequent DMA-triggered frames were only 8 bits. This told me that the control word’s SPI_PUSHR_CTAS field switched back to CTAR0.

I’ve verified this behavior with both DMA and programmatic transfers.

I’ve scoured the datasheet’s SPI chapter looking for any setting that might make the command word “sticky”. No luck. My next step will be to write some DMA code that does 32-bit transfers to PUSHR so that every write includes the control word. This will, of course, require the buffer to be twice as big as it needs to be. But, I’m curious what will happen.
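
As a sketch of what that expanded buffer might look like, the helper below packs byte pairs into 16-bit frames and puts the same command word in the upper half of every 32-bit entry. The name expandToPushrWords is hypothetical, std::vector is used only to keep the example short on a desktop compiler, and padding an odd tail with 0x00 is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand a byte buffer into 32-bit PUSHR values: two data bytes in the low
// 16 bits (first byte in the high position, assuming MSB-first frames) and
// the command word repeated in the upper 16 bits of every entry.
inline std::vector<uint32_t> expandToPushrWords(const uint8_t* src,
                                                std::size_t count,
                                                uint32_t command) {
  std::vector<uint32_t> out;
  out.reserve((count + 1) / 2);
  for (std::size_t i = 0; i < count; i += 2) {
    const uint32_t hi = src[i];
    const uint32_t lo = (i + 1 < count) ? src[i + 1] : 0u;  // pad an odd tail
    out.push_back((command & 0xFFFF0000u) | (hi << 8) | lo);
  }
  return out;
}
```

With 16-bit frames the expanded buffer occupies twice the memory of the source bytes, matching the estimate above.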
 
Yep - That is why you will notice that in some cases like in the ILI9341_t3 library, when we are doing multiple 16 bit writes, we muck with the CTAR0 to turn it into 16 bits... So that we can make it work correctly...
And to make life more interesting, on the T3.6 it works properly, and we can actually preserve the upper...

I think I tried the 32 bit writes at some earlier date... Could be wrong... I know I did not try it for the ILI9341_t3n code base, as there is not enough memory to hold even the smaller version of the buffer, let alone double the size.

Although maybe it could be done like what I have hacked up for the ILI9488, where when we do DMA, I have two secondary buffers into which I expand the data (in this case from 8-bit palette indexes) into 32-bit color values, and then DMA those values. I have the two buffers linked: when I get an interrupt saying the transfer of a buffer has completed, I fill it with the next part...
 
I’ll probably try to link 2 DMA channels to do the SPI TX part.

The first channel will trigger from the SPI TX FIFO Not Full (TFFF) and will do a 16-bit transfer from the next buffer location to the low-order bytes of an intermediate 32-bit variable in a fixed location. The upper 16 bits of this variable will have a constant value representing the desired SPI command word.

The second DMA channel will trigger every time the first one completes a transfer. It will perform a 32-bit transfer from this intermediate location to PUSHR.

The end goal here is to DMA blast data to APA102 LEDs as quickly as possible while being able to set up the next round of LED data at the same time. Think PoV display.

Will report back.
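
The data flow of that plan can be modelled on a desktop as a sketch. This is not DMA code: TwoStageModel is a hypothetical stand-in in which channel1 plays the 16-bit copy into the staging word, channel2 plays the 32-bit copy to PUSHR, and the source is assumed to be already arranged as 16-bit frames.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Software model of two linked DMA channels feeding PUSHR.
struct TwoStageModel {
  uint32_t staging;             // fixed intermediate word; upper half = command
  std::vector<uint32_t> pushr;  // record of every value "written" to PUSHR

  explicit TwoStageModel(uint32_t command) : staging(command & 0xFFFF0000u) {}

  // Channel 1: 16-bit write that touches only the low half of the staging word.
  void channel1(uint16_t frame) { staging = (staging & 0xFFFF0000u) | frame; }

  // Channel 2: 32-bit copy of the whole staging word to PUSHR.
  void channel2() { pushr.push_back(staging); }

  // Each completed channel 1 transfer triggers one channel 2 transfer.
  void run(const uint16_t* frames, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
      channel1(frames[i]);
      channel2();
    }
  }
};
```

Because the upper half of the staging word is never overwritten, every value reaching PUSHR carries the same command bits without doubling the source buffer.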
 
I wonder whether using DMA will actually speed up APA102 LEDs. But it would be useful to find out. I recalled there was an issue with APA102 leds going too fast, and I found this article from Paul Stoffregen:
 
I'm currently driving APA102 LEDs programmatically with 24 MHz SPI (fastest possible on T3.2) -- no issues. The advantage of DMA is being able to update a buffer with the next display (ping-pong style) while the current one is being pushed out. Same concept as PJRC OctoWS2811 library.
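
The ping-pong idea can be sketched as follows. PingPong is a hypothetical class, not part of any library: one buffer is handed to the transfer while the other is filled, and the roles swap once the transfer has completed.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Two equally sized buffers: one being sent, one being drawn into.
template <std::size_t N>
class PingPong {
 public:
  // Buffer the application fills with the next frame of LED data.
  std::array<uint8_t, N>& drawBuffer() { return buffers_[1 - active_]; }

  // Buffer currently owned by the (DMA) transfer; read-only for the caller.
  const std::array<uint8_t, N>& sendBuffer() const { return buffers_[active_]; }

  // Call once the current transfer has finished, before starting the next.
  void swap() { active_ = 1 - active_; }

 private:
  std::array<std::array<uint8_t, N>, 2> buffers_{};  // zero-initialised
  std::size_t active_ = 0;
};
```

In a real sketch, swap() would be called from the transfer-complete handler, and the next transfer would then start from sendBuffer().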
 