Teensy 3.5 SPI DMA?

KurtE · Apr 26, 2017

I am about to start playing some more with my SPI update (different thread)

And want to start trying out some more DMA stuff with the 3.5. I remember having issues with SPI1 and SPI2 for DMA support.

First question: I wonder whether the docs and/or the kinetis.h file are correct. If I look at kinetis.h at about line 482:

Code:

#define DMAMUX_SOURCE_SPI1_RX		16
#define DMAMUX_SOURCE_SPI1_TX		17

But if I look at the manual, section 3.3.9.1 lists:
Source 16: SPI1 - Transmit or receive
Source 17: SPI2 - Transmit or receive

Which is different.

Has anyone had any success yet using SPI1 or SPI2 with DMA on this chip?

Thanks
Kurt

KurtE · Apr 26, 2017

So I am testing out my SPI test program that has some DMA examples installed, and found that if I used the #defines from kinetis.h for TX/RX of SPI1, the program hung.

So I tried setting both RX and TX to 16, as the manual specifies. With my new version of SPI I have the test code:

Code:

  #if defined (KINETISL) || defined(KINETISK)
  // Now try DMA...
  Serial.println("Try DMA Transfer!!!"); Serial.flush();
  _SPI.transfer(0); // See if calling with 1 byte changes things... 
  delayMicroseconds (25);
  memset(rxbuffer, 0xC5, sizeof(rxbuffer));
  start_time = micros();
  ASSERT_CS();
  our_dma_state = 1;
  _SPI.transfer( buffer, rxbuffer, sizeof(buffer), &Our_DMA_callback);
  while (our_dma_state)   ;
  Serial.println(micros() - start_time, DEC); Serial.flush();

  for (uint8_t i=0; i < sizeof(buffer); i++) {
    if (buffer[i] != rxbuffer[i]) {
      Serial.printf("Transfer mismatch(%d) %x != %x\n", i, buffer[i], rxbuffer[i]);
      Serial.printf("  %x %x %x %x %x\n", rxbuffer[i+1], rxbuffer[i+2], rxbuffer[i+3], rxbuffer[i+4], rxbuffer[i+5]);
      break;
    }
  }
  

  Serial.println("Try DMA Write!!!"); Serial.flush();
  delayMicroseconds (10);
  start_time = micros();
  ASSERT_CS();
  our_dma_state = 1;
  _SPI.transfer( buffer, NULL, sizeof(buffer), &Our_DMA_callback);
  while (our_dma_state)   ;
  Serial.println(micros() - start_time, DEC); Serial.flush();

  Serial.println("Try DMA Read!!!"); Serial.flush();
  delayMicroseconds (10);
  start_time = micros();
  ASSERT_CS();
  our_dma_state = 1;
  _SPI.transfer( NULL, rxbuffer, sizeof(buffer), &Our_DMA_callback);
  while (our_dma_state)   ;
  Serial.println(micros() - start_time, DEC); Serial.flush();

And on SPI, with MISO/MOSI jumpered, the buffers match.

On SPI1, where I have the buffer init with 0, 1, 2, 3, 4, ...

When I run it with pins 0-1 jumpered, I get:
Try DMA Transfer!!!
DMA Callback
79
Transfer mismatch(1) 1 != 0
0 0 1 1 2

So there is an issue on the receive side; I will dump out more information on it... But at a minimum it looks like I can make it work for TX...

tni · Apr 26, 2017

You are out of luck. You can use DMA for either RX or TX, not both.

Teensy 3.6 has DMA mux triggers for SPI1 RX and TX (which works) and SPI2.

KurtE · Apr 26, 2017

Thanks again,

I was sort-of remembering some of that from the beta program.

Sorry if this is an obvious question (and I will look over the documents some more), But if I am using SPI with DMA and I wish to do a
Full Transfer. I then setup the two DMAChannel objects, with:
TX channel: buffer -> PUSHR
RX Channel: POPR -> rxBuffer

With a logical Write, I am doing:
TX Channel: buffer -> Pushr
RX Channel: POPR -> <one byte memory location no increment>

With logical Read I am doing
TX Channel: <one byte memory location init to our send value> -> PUSHR
RX Channel: POPR -> rxBuffer

So far it has worked OK on SPIs that have the separate RX/TX channels... But is there a better way, to say for example in the Read
case, to use some fixed value and in logical write case to say, throw the stuff into bitbucket?

Need to relook at the flags in SR and RSER to see what happens with these different flags. I know that my DMA output for ILI9341_t3n only sets up one channel, so Transmit I am sure is valid...

As For T3.6 - I believe

Code:

16 SPI1 RX 
17 SPI1 TX
38 SPI2 RX
39 SPI2 TX

Again thanks,

Now back to playing and reading (and weeding in garden)
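
Editor's sketch: the DMAMUX source numbers quoted so far can be collected in one small table, which makes the difference between the two chips explicit. This is a host-side illustration only. The `Chip` enum, `SpiDmaSources` struct and `spiDmaSources()` helper are hypothetical names, and the numbers are copied from the posts above (manual section 3.3.9.1 for the T3.5, the list in this post for the T3.6), not read from kinetis.h.

```cpp
#include <cstdint>

// Illustrative only: values copied from the posts above, not from a header.
enum class Chip { K64_T35, K66_T36 };

struct SpiDmaSources {
    uint8_t rx;   // DMAMUX source number used for receive requests
    uint8_t tx;   // DMAMUX source number used for transmit requests
    // True when RX and TX each have their own request line.
    constexpr bool separateRxTx() const { return rx != tx; }
};

// Hypothetical helper: bus is 1 for SPI1, anything else is treated as SPI2.
constexpr SpiDmaSources spiDmaSources(Chip chip, int bus) {
    if (chip == Chip::K66_T36) {
        return bus == 1 ? SpiDmaSources{16, 17} : SpiDmaSources{38, 39};
    }
    // K64 manual, section 3.3.9.1: one shared "transmit or receive"
    // source per bus, so RX and TX cannot both be DMA-triggered.
    return bus == 1 ? SpiDmaSources{16, 16} : SpiDmaSources{17, 17};
}
```

With this table, `spiDmaSources(Chip::K64_T35, 1).separateRxTx()` is false, which is the "either RX or TX, not both" limitation tni describes.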

tni · Apr 26, 2017

KurtE said:
With a logical Write, I am doing:
TX Channel: buffer -> Pushr
RX Channel: POPR -> <one byte memory location no increment>

With logical Read I am doing
TX Channel: <one byte memory location init to our send value> -> PUSHR
RX Channel: POPR -> rxBuffer

So far it has worked OK on SPIs that have the separate RX/TX channels... But is there a better way, to say for example in the Read case, to use some fixed value

DMA needs a source location, so the 1-byte source buffer is the way to go.

and in logical write case to say, throw the stuff into bitbucket?

Just skip the RX altogether and let the RX FIFO overflow. On Kinetis K, you can easily clear it at the end (what the current code already does at the beginning of transfer()).
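
Editor's sketch: the "fixed one-byte source" and "non-incrementing bit bucket" ideas discussed above can be modelled on a host machine without any Teensy hardware. Everything here is an assumption made for illustration: `ByteDma` is a toy stand-in for a byte-wide DMA channel (it is not the DMAChannel class), and a single `shiftReg` variable plays the role of PUSHR/POPR on a looped-back bus.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of one byte-wide DMA channel: each request copies one byte,
// then advances source and destination by soff/doff (0 means "stay put").
struct ByteDma {
    const uint8_t* src; int soff;
    uint8_t* dst;       int doff;
    void request() { *dst = *src; src += soff; dst += doff; }
};

// Logical read: clock out the same fill byte `count` times and capture
// what comes back. The bus is a loopback here, so rx receives the fill value.
inline std::vector<uint8_t> loopbackRead(uint8_t fill, size_t count) {
    std::vector<uint8_t> rx(count, 0);
    uint8_t shiftReg = 0;                        // stands in for PUSHR/POPR
    ByteDma txCh{&fill, 0, &shiftReg, 0};        // fixed 1-byte source
    ByteDma rxCh{&shiftReg, 0, rx.data(), 1};    // incrementing destination
    for (size_t i = 0; i < count; ++i) { txCh.request(); rxCh.request(); }
    return rx;
}

// Logical write: received bytes all land in one non-incrementing byte.
// Returns the last byte that ended up in the bit bucket.
inline uint8_t writeToBitBucket(const std::vector<uint8_t>& tx) {
    uint8_t shiftReg = 0, bucket = 0;
    ByteDma txCh{tx.data(), 1, &shiftReg, 0};    // incrementing source
    ByteDma rxCh{&shiftReg, 0, &bucket, 0};      // fixed 1-byte destination
    for (size_t i = 0; i < tx.size(); ++i) { txCh.request(); rxCh.request(); }
    return bucket;
}
```

tni's alternative for the write case, dropping the RX channel and clearing the RX FIFO afterwards, removes `rxCh` from the second function entirely.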

KurtE · Apr 26, 2017

Thanks, that is what I was thinking as well. So will now try changing my write case to make sure it works fine.

But on the T3.5, wouldn't it make sense to say that the DMA channel is for WRITE? I am not sure how it is useful as a Read channel, at least in a generic way.

So in the Read case or Transfer case, does it make sense to use DMA at all? Assuming I wish to support it, I would assume that either TX or RX would have to be handled, probably by an ISR. So if you are using an ISR for the RX side, does it help anything to then still use DMA for the TX side? I guess the easiest way to find out is to try...

tni · Apr 26, 2017

KurtE said:
But on the T3.5, wouldn't it make sense to say that the DMA channel is for WRITE? I am not sure how it is useful as a Read channel, at least in a generic way.

So in the Read case or Transfer case, does it make sense to use DMA at all? Assuming I wish to support it, I would assume that either TX or RX would have to be handled, probably by an ISR. So if you are using an ISR for the RX side, does it help anything to then still use DMA for the TX side?

I don't think it makes sense with an ISR.

But if you are willing to go slow, as SPI master, you have timing control. So you could have the DMA triggered by a timer.

Or, you could use the SPI DMA mux trigger for RX and a linked DMA channel (triggered by minor loop completion) for TX (first RX triggers second TX). You will probably have a pause of 20-50 clock cycles between sent words.

KurtE · Apr 28, 2017

Thanks, I have been playing around with some of this. Not going as quickly as I would like as well it is spring time and I am building a new vegetable garden area.

Currently I have the code setup on T3.5 for SPI1 (and when I fix a quick and dirty setup) SPI2 to be able to do Async transfers, where, the DMA channel is used for TX and I have a simple SPI ISR that takes care of the Read side. Which appears to be working now.

I also have the code setup like you mentioned for the Write only case to only setup to use DMA on the TX side and let the RX side go into the FIFO which you then clear at the end. However with this I am running into an issue on SPI (0)

That is, I have the code set up so that the TX DMAChannel raises an interrupt and disables itself at completion.
So suppose I transfer some number of bytes (128 in my test case). When the DMA puts the last item into the PUSHR queue, the channel disables itself and calls my ISR, where I do a couple of things and call the user's function pointer, which in my case releases the CS pin. That is, the code logically wants to do something like: Assert CS, Do Transfer, Release CS... But the problem in this case is that the transfer has not actually completed yet.
So I release CS before it is done...

When you use the ISR from the RX channel, then you know that the transfer has completed. Trying to decide best way to handle this? Could go back to always using Read side... Could maybe spin in ISR waiting for TXRS to say it is done... But don't like spinning...

Back to playing and garden work!

tni · Apr 28, 2017

KurtE said:
That is, I have the code set up so that the TX DMAChannel raises an interrupt and disables itself at completion.
So suppose I transfer some number of bytes (128 in my test case). When the DMA puts the last item into the PUSHR queue, the channel disables itself and calls my ISR, where I do a couple of things and call the user's function pointer, which in my case releases the CS pin. That is, the code logically wants to do something like: Assert CS, Do Transfer, Release CS... But the problem in this case is that the transfer has not actually completed yet.
So I release CS before it is done...

When you use the ISR from the RX channel, then you know that the transfer has completed. Trying to decide best way to handle this? Could go back to always using Read side... Could maybe spin in ISR waiting for TXRS to say it is done... But don't like spinning...

Look at the EOQ flag (part of the PUSHR command word). If you set it for the last transmit word, you can get a SPI interrupt (SPIx_RSER: EOQF_RE / Finished Request Enable).

(At least, I hope it only gets triggered once the word has really been shifted out, I haven't tried it.)
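
Editor's sketch: one way to picture the EOQ suggestion is to expand a byte buffer into full 32-bit PUSHR words, with CONT on every word except the last and EOQ on the last one. This is a host-side illustration, not code from either poster. The constant names are local to the sketch; the bit positions follow the thread (EOQ is bit 27) and the usual PUSHR layout (CONT is bit 31, CTAS is bits 30..28). Whether the EOQ interrupt fires only after the word has been shifted out is exactly the open question tni raises.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Local mirrors of the PUSHR command bits (names chosen for this sketch).
constexpr uint32_t kPushrCont = 0x80000000u;   // bit 31: keep CS asserted between frames
constexpr uint32_t kPushrEoq  = 0x08000000u;   // bit 27: end of queue
constexpr uint32_t pushrCtas(uint32_t n) { return (n & 7u) << 28; }

// Expand a byte buffer into 32-bit PUSHR words. Every word selects CTAR0;
// all but the last carry CONT, and the last carries EOQ instead.
inline std::vector<uint32_t> buildPushrWords(const uint8_t* data, size_t count) {
    std::vector<uint32_t> words;
    words.reserve(count);
    for (size_t i = 0; i < count; ++i) {
        const bool last = (i + 1 == count);
        const uint32_t cmd = pushrCtas(0) | (last ? kPushrEoq : kPushrCont);
        words.push_back(cmd | data[i]);
    }
    return words;
}
```

The trade-off is memory: a fully expanded buffer costs 4 bytes of RAM per byte sent, which is why only the last word (or the first and last) is usually given a full command word.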

KurtE · Apr 28, 2017

tni said:
Look at the EOQ flag (part of the PUSHR command word). If you set it for the last transmit word, you can get a SPI interrupt (SPIx_RSER: EOQF_RE / Finished Request Enable).

Thanks,

I was thinking of that as an alternative. To do so I think I may need to break up the TX DMAChannel into a DMASettings chain.
Maybe two or three elements, something like:
<Element 1>: Maybe if need to switch to 1 byte transfers: Need to set the right CTAR... put first byte of transfer buffer into this one.
Element 2: Main output (count -1 or -2)
Element 3: Again setup for 4 byte transfer to include the EOQ flag on it.

Note: So far I have not dealt with the element 1 stuff and may also special case if CTAR1 is in use and count is even... Would be again faster. Would need to clear EOQ before the transfer starts.

Thanks Will experiment some more trying this out.

KurtE · Apr 29, 2017

I am now Trying the approach above where I break up the TX chain into three DMASettings objects.

The three element chain appears to work on T3.6 on all three SPI busses. Was having issues on 3.5 so then tried today with SPI on T3.2 and it also has issues. Still trying to debug:

I have code in place to dump the TCDs. So just before I do a transfer of 128 bytes total, where the send buffer is initialized with 0, 1, 2, 3, 4, ...
I have the chain set up like:

Code:

ADDR     TCD        SADR  ATTR  SOFF  NBYTES   SLAST     DADR    CITR DOFF DLAST   BITR   CSR
1fff9708 40009000:1fff93a0 0202 0004 00000004 fffffffc 4002c034 0001 0000 1fff9340 0001 0010 
1fff92e0 1fff9300:1fff93a0 0202 0004 00000004 fffffffc 4002c034 0001 0000 1fff9340 0001 0010 
1fff9320 1fff9340:1fff9231 0000 0001 00000001 ffffff82 4002c034 007e 0000 1fff9380 007e 0010 
1fff9360 1fff9380:1fff93a4 0202 0004 00000004 fffffffc 4002c034 0001 0000 1fff9300 0001 0018 
1fff9718 40009020:4002c038 0000 0000 00000001 00000000 1fff919c 0080 0001 ffffff80 0080 000a 

Transfer timed out!
1fff9708 40009000:1fff9236 0000 0001 00000001 ffffff82 4002c034 0079 0000 1fff9380 007e 0010 
1fff9718 40009020:4002c038 0000 0000 00000001 00000000 1fff919e 007e 0001 ffffff80 0080 000a

The first one is the DMA channel for TX, followed by the three DMASettings (the first one should be a dup of the TX channel). The last one is the RX channel.

After it appeared like the DMA failed/hung as I waited for up to 100ms, I then dumped the TX DMAChannel and the RX DMAChannel

I wonder if it does not like that the first DMA output outputs 4 bytes to PUSHR register. With the memory initialized to have the value:
_dmaFirstByte = *write_data | SPI_PUSHR_CONT | SPI_PUSHR_CTAS(0);

The 2nd Chain is setup to output one byte to PUSHR register with COUNT -2 bytes

The 3rd one again outputs 4 bytes to PUSHR with the value: _dmaLastByte = write_data[count-1] | SPI_PUSHR_CTAS(0) | SPI_PUSHR_EOQ;

The Logic Analyzer output, shows that the DMA output 3 bytes to MOSI: 0x00, 0x92, 0x01

It looked to me like the RX Channel had processed 2 bytes, as the count was 0x80 and is now 0x7e.
The TX Channel looks screwy to me. It looks like it did the Gather to copy in the 2nd DMASetting, but the SADR value looks odd... Maybe it turns out that it does not like changing transfer sizes between chained elements.

Still investigating.
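
Editor's sketch: the progress of a channel can be read straight off a TCD dump like the one above, because CITER counts down from BITER once per serviced request. The struct and function below are hypothetical helpers for doing that arithmetic on a host; the field names follow the dump's columns (NBYTES, CITR, BITR).

```cpp
#include <cstdint>

// The three TCD fields needed to work out how far a channel got.
struct TcdSnapshot {
    uint32_t nbytes;  // bytes moved per request (minor loop size)
    uint16_t citer;   // requests still to go in the current major loop
    uint16_t biter;   // requests per major loop (reload value)
};

// Bytes moved so far in the current major loop.
constexpr uint32_t bytesMoved(const TcdSnapshot& t) {
    return static_cast<uint32_t>(t.biter - t.citer) * t.nbytes;
}
```

Applied to the post-timeout dump: the TX channel (NBYTES 1, CITR 0x79, BITR 0x7e) had serviced 5 one-byte requests from the second chain element, which matches SADR moving from 1fff9231 to 1fff9236, and the RX channel (CITR 0x7e, BITR 0x80) had serviced 2, which matches DADR moving from 1fff919c to 1fff919e.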

KurtE · Apr 30, 2017

So far no luck getting the 4 byte writes to SPI.PUSHR register to work on T3.2, probably also 3.5...

I am wondering how DMA and SPI work together. That is, if you do an 8 bit write to PUSHR, it will write the 8 bit value to the queue; likewise, if you do a 16 bit write to the PUSHR register, it will add the 16 bit data to the queue.

But does DMA do a 16 bit write, or an 8 bit write to PUSHR followed by an 8 bit write to PUSHR+1... And if so how does SPI deal with it...
Maybe sounds like quick experiment...

Update:
On 16 bit write: not well... I hacked it to send 16 bit writes for the 128 byte buffer. Then, looking at my testing plus the logic analyzer, maybe I was not set up properly for SPI CTAR1 mode; will hack it again...
But with CTAR0 (probably) I get the bytes in the order
0, 1, 2, 3, 4, 6, 8, a, c, e, 10... So I got all the bytes up until the queue filled, and then it skipped every other one.

Hacked the test again: had a transfer16 just before this, so CTAR1 is in the high word...
Same behavior, where once the queue was filled it skipped every other one...

tni · Apr 30, 2017

Apparently, the K20 and K64 (or their manuals) are buggy and byte-writes to PUSHR are only semi-working. The command word doesn't get reused, a byte-write results in a command word of 0. (K66 / Teensy 3.6 works correctly and reuses the command word.)

Scatter-gather DMA completely corrupts the command-word when the TCD is reloaded, I get a PUSHR value of "0x1cff9200". This has bit 27 / EOQ set, so the transfer gets terminated.

Scatter-gather DMA does work properly, if 32-bit writes are used.
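
Editor's sketch: the corrupted value quoted above can be decoded field by field to confirm the diagnosis. This is a host-side helper with hypothetical names (`PushrFields`, `decodePushr`); the bit positions assumed are CONT at bit 31, CTAS at bits 30..28, EOQ at bit 27 and TXDATA in the low 16 bits.

```cpp
#include <cstdint>

// Decoded view of the PUSHR fields that matter for this thread.
struct PushrFields {
    bool cont;        // bit 31: continuous chip select
    unsigned ctas;    // bits 30..28: which CTAR describes the frame
    bool eoq;         // bit 27: end of queue
    uint16_t txdata;  // bits 15..0: data to shift out
};

inline PushrFields decodePushr(uint32_t v) {
    PushrFields f;
    f.cont   = (v & 0x80000000u) != 0;
    f.ctas   = (v >> 28) & 7u;
    f.eoq    = (v & 0x08000000u) != 0;
    f.txdata = static_cast<uint16_t>(v & 0xFFFFu);
    return f;
}
```

For 0x1cff9200 this gives CONT clear, CTAS 1, EOQ set and TXDATA 0x9200, so the queue is ended by a word the application never wrote.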

Some code to play around with:

Code:

#include <DMAChannel.h>
#include <SPI.h>
#include <array>

using spi_value_t = uint8_t;

auto buffer = [](){
    std::array<spi_value_t, 64> res = {};
    for(auto& elem : res) elem = SPI_PUSHR_CONT | SPI_PUSHR_CTAS(0) | 0x42u;
    return res;
}();

DMAChannel dma_tx;
DMASetting dma_s1;

void printTCD(DMABaseClass::TCD_t& tcd) {
    Serial.printf("SADDR: %x SOFF: %u ATTR: %u NBYTES: %u SLAST: %u DADDR: %x DOFF: %u CITER: %u DLASTSGA: %u CSR: %u BITER: %u\n",
        tcd.SADDR, tcd.SOFF, tcd.ATTR, tcd.NBYTES, tcd.SLAST, tcd.DADDR, tcd.DOFF, tcd.CITER, tcd.DLASTSGA, tcd.CSR, tcd.BITER);
}

void setup() {
    Serial.begin(115200);
    delay(2000);
    Serial.printf("buffer: %x - %x\n", (uint32_t) buffer.data(), (uint32_t) buffer.data() + sizeof(buffer));

    SPI.begin();
    SPI.beginTransaction( {100, MSBFIRST, SPI_MODE0 } );

    dma_s1.destination((volatile spi_value_t&) SPI0_PUSHR);
    dma_s1.sourceBuffer(buffer.data(), sizeof(buffer));
    dma_s1.replaceSettingsOnCompletion(dma_s1);

    dma_tx = dma_s1;
    dma_tx.triggerAtHardwareEvent(DMAMUX_SOURCE_SPI0_TX);

    Serial.println("TCD dma_tx");
    printTCD(*dma_tx.TCD);
    Serial.println("TCD dma_s1");
    printTCD(*dma_s1.TCD);
   
    SPI0_SR = 0xFF0F0000;
    SPI0_RSER = SPI_RSER_RFDF_RE | SPI_RSER_RFDF_DIRS | SPI_RSER_TFFF_RE | SPI_RSER_TFFF_DIRS;

    SPI0_PUSHR = SPI_PUSHR_CONT | SPI_PUSHR_CTAS(0) | 0x42u;
    Serial.printf("Initial SPI0_PUSHR: %x\n", SPI0_PUSHR);
    dma_tx.enable();
}

uint32_t dma_addr = 0;

elapsedMillis report_timer_addr;
elapsedMillis report_timer_sr;

void loop() {
    uint32_t new_dma_addr = (uint32_t) dma_tx.sourceAddress();
    //if(report_timer_addr > 100) {
        report_timer_addr = 0;
        if(dma_addr != new_dma_addr) {
            Serial.printf("DMA src: %x    dest: %x     SPI0_SR: %x     SPI0_PUSHR: %x\n",
                new_dma_addr, (uint32_t) dma_tx.destinationAddress(), SPI0_SR, SPI0_PUSHR);
            printTCD(*dma_tx.TCD);
        }
        dma_addr = new_dma_addr;
    //}
    if(report_timer_sr > 2000) {
        report_timer_sr = 0;
        Serial.printf("SPI0_SR: %u    DMA src: %x    dest: %x\n", SPI0_SR, new_dma_addr, (uint32_t) dma_tx.destinationAddress());
        printTCD(*dma_tx.TCD);
    }
}

KurtE · Apr 30, 2017

Thanks, I will play around some with this later.

I think for now I semi punted.

The main thing I was handling was the Write case where we called their call back function before the operation completed.
The earlier approach was to use both RX/TX dma channels and wait until RX got the right number of bytes...

Now I don't use the chaining like I mentioned before, but sort of a kludge. What I am doing is still only using the TX DMA,
but I ask the DMA to only output N-1 bytes, and set the interrupt-on-completion option for DMA. When I get this interrupt, I then do a PUSHR of the last byte with the EOQ option and have the SPI ISR handle this, which then calls the user's callback function.

Maybe not the cleanest way but on the other hand using the DMASetting objects uses at least 32 bytes of user memory for each Setting, so was using up that extra memory...
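
Editor's sketch: the ordering of the "N-1 bytes by DMA, last byte with EOQ" approach can be shown on a host with a fake SPI port. Nothing here is the SPI library's code. `FakeSpi` and `asyncWrite` are hypothetical, and the model is simplified in that the EOQ "interrupt" fires as soon as the word is pushed, whereas on hardware it would follow the actual shift-out.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

constexpr uint32_t kEoq = 0x08000000u;   // PUSHR end-of-queue bit (bit 27)

// Host-side stand-in for the SPI port: records every PUSHR write and
// runs the EOQ handler when a word carrying EOQ goes through.
struct FakeSpi {
    std::vector<uint32_t> wire;
    std::function<void()> onEoq;
    void pushr(uint32_t w) {
        wire.push_back(w);
        if ((w & kEoq) != 0 && onEoq) onEoq();
    }
};

// Sends count bytes; `done` runs only from the EOQ handler, i.e. after
// the final byte has been queued, never from the DMA-complete step.
inline void asyncWrite(FakeSpi& spi, const uint8_t* data, size_t count,
                       std::function<void()> done) {
    if (count == 0) { if (done) done(); return; }
    spi.onEoq = std::move(done);
    // "DMA" part: the first N-1 bytes go out as plain data.
    for (size_t i = 0; i + 1 < count; ++i) spi.pushr(data[i]);
    // "DMA complete ISR" part: push the last byte by hand, tagged with EOQ.
    spi.pushr(data[count - 1] | kEoq);
}
```

The point of the split is that the user callback (which releases CS) is tied to the SPI end-of-queue event rather than to the moment the DMA channel has merely finished feeding the FIFO.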

Now to disable a bunch of the debug outputs and verify it some more, plus then try back on 3.5 and 3.6, and then maybe make similar change for LC...

My test app currently looks like:

Code:

// Quick and dirty extract of parts of Teensyview to compare the speed of
// different SPI methods.
#include <SPI.h>
#include <DMAChannel.h>

#define DBGSerial Serial

#define BUFFER_SIZE 128
#define RXBUFFER_FILL_CHAR 0xC5
uint8_t buffer[BUFFER_SIZE];
uint8_t rxbuffer[BUFFER_SIZE+1];

volatile uint8_t our_dma_state = 0;
uint32_t clockRateSetting = 1000000;

// Only define one of these
#define _SPI  SPI
//#define _SPI  SPI1 
//#define _SPI  SPI2

#define USE_PIN_PORT
#define PIN_DC 9
#define PIN_CS 10

// BUGBUG Defined for T3.6 beta board
#define MISO2 51
#define MOSI2 52
#define SCK2  53



#ifdef USE_PIN_PORT
#define ASSERT_DC() *_dcport  &= ~_dcpinmask
#define RELEASE_DC() *_dcport  |= _dcpinmask

#define ASSERT_CS() *_csport  &= ~_cspinmask
#define RELEASE_CS() *_csport |= _cspinmask

#else
#define ASSERT_DC() digitalWrite(PIN_DC, LOW)
#define RELEASE_DC() digitalWrite(PIN_DC, HIGH)

#define ASSERT_CS() digitalWrite(PIN_CS, LOW)
#define RELEASE_CS() digitalWrite(PIN_CS, HIGH)
#endif

volatile uint8_t * _csport, * _dcport;
uint8_t _cspinmask, _dcpinmask;

#ifndef _SPI
#define _SPI SPI 
#endif

void setup()
{
  while (!DBGSerial && millis() < 3000) ;
  DBGSerial.begin(115200);
  delay(100);
  DBGSerial.println("Start setup");
  pinMode(PIN_CS, OUTPUT);
#ifdef USE_PIN_PORT
  _csport = portOutputRegister(digitalPinToPort(PIN_CS));
  _cspinmask = digitalPinToBitMask(PIN_CS);
#endif

  RELEASE_CS();
  DBGSerial.println("Before pinMode DC"); Serial.flush();
  pinMode(PIN_DC, OUTPUT);
#ifdef USE_PIN_PORT
  _dcport = portOutputRegister(digitalPinToPort(PIN_DC));
  _dcpinmask = digitalPinToBitMask(PIN_DC);
#endif
  DBGSerial.println("Before SPI Begin"); Serial.flush();

#if defined(__MK66FX1M0__)
  SPI2.setMOSI(MOSI2);
  SPI2.setMISO(MISO2);
  SPI2.setSCK(SCK2);
#endif

  _SPI.begin();
  DBGSerial.println("End Setup"); Serial.flush();


}


//===========================================================================
// Main Loop
//===========================================================================
void loop()
{
  DBGSerial.println("*** Loop called ***");
  InitTestBuffers();
  
  _SPI.beginTransaction(SPISettings(clockRateSetting, MSBFIRST, SPI_MODE0));


  //---------------------------------------------------------------------------
  // Simple Test to see if simple transfer works
  //---------------------------------------------------------------------------
  uint8_t i;

  uint32_t start_time = micros();
  ASSERT_CS();
  for (i = 0; i < sizeof(buffer); i++) {
    _SPI.transfer(buffer[i]);
  }
  RELEASE_CS();
  Serial.println(micros() - start_time, DEC); Serial.flush();
  delayMicroseconds (25);

  //---------------------------------------------------------------------------
  // Simple Test to see if simple transfer16 works
  //---------------------------------------------------------------------------
  start_time = micros();
  ASSERT_CS();
  for (i = 0; i < sizeof(buffer); i += 2) {
    _SPI.transfer16((buffer[i] << 8) | buffer[i + 1]);
  }
  RELEASE_CS();
  Serial.println(micros() - start_time, DEC); Serial.flush();

  delayMicroseconds (25);
  start_time = micros();
  ASSERT_CS();
  _SPI.transfer(buffer, NULL, sizeof(buffer));
  RELEASE_CS();
  Serial.println(micros() - start_time, DEC); Serial.flush();

  //---------------------------------------------------------------------------
  // Simple Test to see if new transfer works. 
  //---------------------------------------------------------------------------
  Serial.println("Try new Transfer!!!"); Serial.flush();
  delayMicroseconds (25);
  start_time = micros();
  ASSERT_CS();
  _SPI.transfer( buffer, rxbuffer, sizeof(buffer));
  RELEASE_CS();
  Serial.println(micros() - start_time, DEC); Serial.flush();
  CheckBuffers();


  //---------------------------------------------------------------------------
  // DMA Transfer operation  - First output one byte to make sure in single mode
  //---------------------------------------------------------------------------
  // Now try DMA...
  InitTestBuffers();
  Serial.println("Try DMA Transfer after 8 bit write"); Serial.flush();
  //_SPI.transfer16(0xffff); // See if calling with 1 byte changes things... 
  _SPI.transfer(0);
  delayMicroseconds (25);
  start_time = micros();
  ASSERT_CS();
  our_dma_state = 1;
  Serial.printf("SPI Pushr: %x\n", SPI0_PUSHR);
  _SPI.transfer( buffer, rxbuffer, sizeof(buffer), &Our_DMA_callback);
  //while (our_dma_state)   ;
  uint32_t dma_start_time = millis();
  while (our_dma_state && ((millis()-dma_start_time) < 100))  ; 
  extern void dumpDMA_TCD(DMABaseClass *dmabc);
  if (our_dma_state) {
    Serial.println("Transfer timed out!");
    Serial.printf("DMA CR: %x ES: %x ERQ: %x EEI: %x ERR: %x\n", DMA_CR, DMA_ES, DMA_ERQ, DMA_EEI, DMA_ERR);
    dumpDMA_TCD(_SPI._dmaTX);
    dumpDMA_TCD(_SPI._dmaRX);
    Serial.printf("RXB: %x %x %x %x %x\n", rxbuffer[0], rxbuffer[1], rxbuffer[2], rxbuffer[3], rxbuffer[4]);
  }
  Serial.println(micros() - start_time, DEC); Serial.flush();
  Serial.println(our_dma_state, DEC);
  CheckBuffers();

  //---------------------------------------------------------------------------
  // DMA Write operation
  //---------------------------------------------------------------------------
  Serial.println("Try DMA Write!!!"); Serial.flush();
  delayMicroseconds (10);
  start_time = micros();
  ASSERT_CS();
  our_dma_state = 1;
  _SPI.transfer( buffer, NULL, sizeof(buffer), &Our_DMA_callback);
  while (our_dma_state)   ;
  Serial.println(micros() - start_time, DEC); Serial.flush();

  //---------------------------------------------------------------------------
  // DMA Read operation
  //---------------------------------------------------------------------------
  Serial.println("Try DMA Read!!!"); Serial.flush();
  delayMicroseconds (10);
  start_time = micros();
  ASSERT_CS();
  our_dma_state = 1;
  _SPI.transfer( NULL, rxbuffer, sizeof(buffer), &Our_DMA_callback);
  while (our_dma_state)   ;
  Serial.println(micros() - start_time, DEC); Serial.flush();

  //---------------------------------------------------------------------------
  // DMA Transfer operation  - Output 16 bit to see if we recover
  //---------------------------------------------------------------------------
  // Now try DMA...
  Serial.println("Try DMA Transfer after 16 bit write!!!"); Serial.flush();
  _SPI.transfer16(0xffff); // See if a preceding 16 bit write changes things...
  delayMicroseconds (25);
  InitTestBuffers();
  start_time = micros();
  ASSERT_CS();
  our_dma_state = 1;
  _SPI.transfer( buffer, rxbuffer, sizeof(buffer), &Our_DMA_callback);
  //while (our_dma_state)   ;
  dma_start_time = millis();
  while (our_dma_state && ((millis()-dma_start_time) < 100))  ; 
  Serial.println(micros() - start_time, DEC); Serial.flush();
  Serial.println(our_dma_state, DEC);
  CheckBuffers();

  _SPI.endTransaction();


  delay(2500);
}


//=============================================================================
// DMA callback
//=============================================================================
void Our_DMA_callback() {
  Serial.println("DMA Callback");
  RELEASE_CS();
  our_dma_state = 0;
}

//=============================================================================
// Init the test buffer
//=============================================================================
void InitTestBuffers() {
  uint8_t i;

  for (i = 0; i < BUFFER_SIZE; i++) {
    buffer[i] = i;
  }

  memset(rxbuffer, RXBUFFER_FILL_CHAR, sizeof(rxbuffer));
}

//=============================================================================
// Check Buffers
//=============================================================================
bool CheckBuffers() {
  bool OK = true;
  for (uint8_t i=0; i < sizeof(buffer); i++) {
    if (buffer[i] != i) {
        Serial.printf("TX change(%d) %x\n", i, buffer[i]);
        OK = false;
    }
    if (buffer[i] != rxbuffer[i]) {
        Serial.printf("Error(%d) %x != %x\n", i, buffer[i], rxbuffer[i]);
        OK = false;
        break;
    }
  }
  if (OK && (rxbuffer[BUFFER_SIZE] != RXBUFFER_FILL_CHAR)) {
    Serial.printf("Error RX Overwrite: %x\n", rxbuffer[BUFFER_SIZE]);
    OK = false;
  }
  //Serial.printf("  %x %x %x %x %x\n", rxbuffer[i+1], rxbuffer[i+2], rxbuffer[i+3], rxbuffer[i+4], rxbuffer[i+5]);
  return OK;
}

KurtE · May 1, 2017

I am sort of going around in circles

The code to use the interrupt to then do the pushr with the EOQF worked on T3.2 and T3.5, but was running into issues with T3.6 that after I did a read operation, then did a write or transfer, it would not work... Tried a few things.

But then wondered if adding this complexity was gaining anything? So I did a quick update to my wait for completion code. That incremented a counter in the loop and when it exited, it printed the delta micros plus the loop count. Again pretty simple stuff:

Code:

void WaitUntilTransferCompletes() {
  uint32_t dma_start_time = micros();
  uint32_t loop_count = 0;
  while (our_dma_state && ((micros()-dma_start_time) < 100000))  loop_count++;
  Serial.printf("dt: %d Count: %d", (micros()-dma_start_time), loop_count); 
  extern void dumpDMA_TCD(DMABaseClass *dmabc);
  if (our_dma_state) {
    Serial.println("Transfer timed out!");
    Serial.printf("DMA CR: %x ES: %x ERQ: %x EEI: %x ERR: %x\n", DMA_CR, DMA_ES, DMA_ERQ, DMA_EEI, DMA_ERR);
    dumpDMA_TCD(_SPI._dmaTX);
    dumpDMA_TCD(_SPI._dmaRX);
    Serial.printf("RXB: %x %x %x %x %x\n", rxbuffer[0], rxbuffer[1], rxbuffer[2], rxbuffer[3], rxbuffer[4]);
  }
}

With this I would expect that if the system is doing less during the DMA SPI transfer, the loop count would go up.
But if you look at some of the outputs, you see:

Code:

*** Loop called ***
1101
1096
1095
Try new Transfer!!!
1095
Try DMA Transfer after 8 bit write
dt: 1101 Count: 48191122
Try DMA Read!!!
dt: 1101 Count: 48221122
Try DMA Write!!!
dt: 1100 Count: 47801121
Try DMA Transfer after 16 bit write!!!
dt: 2193 Count: 96142214
*** Loop called ***

You will see the loop count went down for write, as time was spent during the interrupts (2 instead of 1)

So I changed the code back to simply setup a dummy byte variable (bit_bucket), that does not increment and then do simply the transfer.
When I completed this, the counts were more or less identical to the Read/Transfer versions.

Then note that in the above, the transfer after doing a transfer16() took double the time. That is because on the T3.6 it remembered the high word of PUSHR and was doing 16 bit transfers.

So I changed the code so that, instead of having DMA do the first byte's PUSHR, just before I start up the DMA I do:

Code:

port.PUSHR = dma_first_byte | SPI_PUSHR_CTAS(0) | SPI_PUSHR_CONT;

Which puts the SPI into 8 bit mode plus it passed the CONT bit which shortens the time up between bytes of a transfer.
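
Editor's sketch: the behaviour described in this post and in tni's earlier one can be captured in a tiny host-side model of PUSHR. It is an illustration built only from what the thread reports: on the K66 a byte write reuses the command half of the last 32-bit write, while on the K20/K64 a byte write results in a command word of 0. `PushrModel` is a hypothetical class, not part of any library.

```cpp
#include <cstdint>

// Toy model of how PUSHR treats its command half on byte-wide writes.
class PushrModel {
public:
    // reusesCommand: true models the K66 behaviour reported in the thread,
    // false models the K20/K64 behaviour (byte write gives command 0).
    explicit PushrModel(bool reusesCommand) : reuses_(reusesCommand) {}

    // A full 32-bit write latches the upper 16 bits as the command.
    void write32(uint32_t w) { command_ = w >> 16; }

    // A byte write carries data only; the command is reused or zeroed.
    void write8(uint8_t /*data*/) { if (!reuses_) command_ = 0; }

    // CTAS sits in bits 30..28 of PUSHR, i.e. bits 14..12 of the command.
    unsigned ctas() const { return (command_ >> 12) & 7u; }

private:
    bool reuses_;
    uint32_t command_ = 0;
};
```

In this model, a leftover CTAS(1) command from a 16 bit write keeps byte writes framed by CTAR1 on the K66 variant until a 32-bit write with CTAS(0) primes the port, which is what the one-line PUSHR write above does.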

So with this the timings are:

Code:

*** Loop called ***
1101
1096
1095
Try new Transfer!!!
1095
Try DMA Transfer after 8 bit write
dt: 1097 Count: 48021119
Try DMA Read!!!
dt: 1096 Count: 48021119
Try DMA Write!!!
dt: 1097 Count: 47971117
Try DMA Transfer after 16 bit write!!!
dt: 1097 Count: 47961118

You will notice two things: all of the DMA transfers sped up a little, and the transfer after a 16 bit write no longer had an issue.

So will probably stick with keeping it simple.

One thing I wondered is, is there a DMA destination that the system knows as a sink or bit bucket. Did not notice anything in docs, but probably missed it.

Now back to cleaning this up... Plus again check how this impacts this on 3.5 on SPI1/2...

KurtE · May 1, 2017

I pushed up the current version with ASYNC support to my SPI fork (SPI-Multi-one-class branch).

On the T3.5 I kept the idea of using the interrupt on TX dma completion and then output the last byte with EOQ flag and then use SPI interrupt when the EOQF flag is reached. This gave me proper timing and it sped up the Write operation versus Transfer or Read.

Actually I should qualify that. All three operations took about the same time to complete; however, my loop counter waiting for completion gave quite a few more iterations in the Write case, as it only handles 2 interrupts instead of 1 per byte. It will be interesting to see how all of these interact. I may end up wanting to change the operation of the one channel depending on the operation:
a) Maybe like Write only: buffer->PUSHR
b) transfer/Read: POPR->RXBUFFER - as maybe safer for data not being lost. That is if we take too long to respond to interrupt for RX, we lose data, if we take awhile to respond to TX interrupt, probably just slows the operation down...

May revisit. But for now may be fun for people to play with.

I did a quick run to verify it is not complete toast (T3.2 SPI), T3.5/6 SPI, SPI1, SPI2, and LC SPI1...

Now I think I may play with using it.

Like maybe an update to my version of Sparkfun_Teensyview, where the main update function uses the new transfer methods (not async).

Plus add a new member to do an async update.

Then try some simple demo which has one of these displays per SPI bus and have them all update at the same time...

Should be fun

defragster · May 1, 2017

Sounds like fun. Bummer on the circular progression, but that comes with finding the idiosyncrasies of three MCUs that do not share the same docs/interface/functionality. At least you made it around and found a good general solution? It will be interesting to see whether the T_View runs differently, as it was already doing bulk writes from a buffer.

I've read along - but not actually participated . . . so an update/write can be done and then return without delay? Suppose that only comes with async method? Non-async just completes with fewer interrupts and less bus downtime? The SPI is setup for DMA xfer and queued for completion. The user buffer is tied until completion?

Speaking of queuing - can a second request be handled (for storage) before the first is complete? Thinking of what I did for Talkie - but that was just a pointer to a static set of sounds values.

KurtE · May 1, 2017

Currently it will error out if you try to do a second Async transfer while the first one is active. However you should be able to queue one up during the callback from the call... Will try that out with Teensyview, as it is a pretty simple output.

Also currently the normal synchronous spi transfer calls do not share code with the Asynchronous versions. Could do that at some point if desired. But always trade offs, like if you used DMA versions on SPI and SPI1 on TLC, I think it would use up all of your DMA channels.

Also wondering what is the best approach for using callbacks with classes that may have multiple instances.
Sometimes I wish that the callback function and setup had the ability to pass in some parameter, something like a uint32_t or void*.
So you could, if you wanted, do something like: SPI.transfer(buff, NULL, 128, myfunc, this);
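
Editor's sketch: the usual C++ pattern for the wish above is a callback that takes a `void*` context plus a static "trampoline" member that casts it back to the object. The code below is a host-side illustration under stated assumptions: `FakeAsyncSpi` and its five-argument `transfer()` are hypothetical and only mirror the call shape proposed in the post; the real SPI library's async API is not shown here.

```cpp
#include <cstddef>
#include <cstdint>

// Callback type with a user context pointer, as wished for in the post.
using TransferCallback = void (*)(void* context);

// Hypothetical stand-in for an SPI bus with an async transfer call.
struct FakeAsyncSpi {
    TransferCallback cb = nullptr;
    void* ctx = nullptr;
    bool transfer(const uint8_t*, uint8_t*, size_t,
                  TransferCallback callback, void* context) {
        cb = callback;
        ctx = context;
        return true;
    }
    // Stands in for the completion interrupt.
    void completeIsr() { if (cb) cb(ctx); }
};

// A class with several instances, one per bus.
class Display {
public:
    explicit Display(FakeAsyncSpi& spi) : spi_(spi) {}
    void updateAsync(const uint8_t* buf, size_t n) {
        busy_ = true;
        spi_.transfer(buf, nullptr, n, &Display::trampoline, this);
    }
    bool busy() const { return busy_; }

private:
    // Static member: matches the plain function pointer type, then
    // forwards to the instance that started the transfer.
    static void trampoline(void* self) { static_cast<Display*>(self)->onDone(); }
    void onDone() { busy_ = false; }

    FakeAsyncSpi& spi_;
    bool busy_ = false;
};
```

Each instance passes `this` as the context, so a completion on one bus clears only that display's busy flag.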

Now back to playing... Actually about time to feed dogs, hummingbirds...

ErikRasmussen · Nov 12, 2018

Dma Spi on Teensy 3.6

I am working on a system where two IR CCD cameras deliver data to a Teensy 3.6 running @ 180 MHz. Data transfer is in blocks of 164 bytes via SPI0 and SPI1 @ 20 MHz. I have modified the SPI transfer command so I can start reading from both SPI channels at the same time and do some other work during the data sampling.

Code example:
for (ui8 = 0; ui8 < PAC_SIZE; ui8++)
{
  SPI.transfer1(0); // start two SPI readings
  SPI1.transfer1(0);
  // ...
  pac0[ui8] = SPI.transfer2(); // wait for SPI end and read data
  pac1[ui8] = SPI1.transfer2();
}

It takes 120 us to read the two blocks of data.

In order to speed up the process and to be able to do other work during the data reading, I would like to use DMA for the SPI data transfer. I noticed the example at https://github.com/crteensy/DmaSpi. I modified example 2 to look like this:

Serial.println("Hi!");
waitForKeyPress();
t = micros();
SPI.begin();
SPI.beginTransaction(SPISettings(SPI_CLOCK, MSBFIRST, SPI_MODE3));
SPI.setMISO(PIN_MISO0);
SPI.setSCK(PIN_SCK0);
SPI1.begin();
SPI1.beginTransaction(SPISettings(SPI_CLOCK, MSBFIRST, SPI_MODE3));
SPI1.setMISO(PIN_MISO1);
SPI1.setSCK(PIN_SCK1);

DMASPI0.begin();
DMASPI1.begin();
DMASPI0.start();
DMASPI1.start();
DmaSpi::Transfer trx0(nullptr, 0, nullptr);
DmaSpi::Transfer trx1(nullptr, 0, nullptr);
while(trx0.busy() || trx1.busy());

trx0 = DmaSpi::Transfer(nullptr, PAC_SIZE, dest0, 0);
trx1 = DmaSpi::Transfer(nullptr, PAC_SIZE, dest1, 0);
DMASPI0.registerTransfer(trx0);
DMASPI1.registerTransfer(trx1);
while(trx0.busy() || trx1.busy());

DMASPI0.stop();
DMASPI1.stop();
DMASPI0.end();
DMASPI1.end();

t = micros() - t;
Serial.print(t); Serial.println(" us");

The result is 402 us!

Questions:
1) Is my DMA code correct?
2) Why is DMA so slow?

Thanks for your time! Best regards!

KurtE · Nov 12, 2018

Hopefully someone who uses that library can help. Note: the above code would probably not work at all on T3.5 which is the subject of this thread.

If it were me, I might simply try it using the newer non-blocking transfers that are supported by the main SPI library.
There are a few recent threads on this, including an example program that writes to two SPIs at the same time:
Example in: https://forum.pjrc.com/threads/49026-Teensy-SPI-question?p=189084&viewfull=1#post189084

Speed: Assuming your SPI_CLOCK is 20000000, it should run at 20 MHz, as F_BUS would be 60 MHz and a divider of 3 could be used...

As for how long this took. It may depend on lots of things, first off you are measuring the whole time, including the code that initializes the SPI pins and the like. Maybe you want to move your initial t=micros(), just before the code that actually does the transfers...
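
Editor's sketch: a quick lower bound helps judge the numbers in the question. The helpers below (hypothetical names, host-side arithmetic only) compute the SCK rate from F_BUS and a divider, and the time the bits alone need on the wire, ignoring any gaps between frames and all setup code.

```cpp
#include <cstdint>

// SCK frequency for a given bus clock and integer divider.
constexpr uint32_t sckFromBus(uint32_t fBusHz, uint32_t divider) {
    return fBusHz / divider;
}

// Time in microseconds for `bytes` 8-bit frames to cross the wire at sckHz,
// assuming back-to-back frames with no inter-frame delay.
constexpr double wireTimeMicros(uint32_t bytes, uint32_t sckHz) {
    return bytes * 8.0 * 1e6 / sckHz;
}
```

For the figures in this exchange, `sckFromBus(60000000, 3)` is 20 MHz and `wireTimeMicros(164, 20000000)` is 65.6 us per block. With both buses running in parallel, anything well above that is overhead; in the 402 us measurement, the timed region also includes begin(), pin setup and the DMASPI start/stop calls.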

Again I don't know how DMASPI library initializes/uses the SPI bus. Things like what options that are turned on during the transfer.
