T4, SPI, DMA multiple transactions, MISO and MOSI are tristate and one pin

sicco

My Teensy 4.1 project has several SPI sensor devices tied to a single Teensy SPI port. Each sensor has its output (SDO) and input (SDI) on the same pin, so it's a bidirectional SDIO line for both reading from and writing to the sensors.

Every 10 ms, a timer interrupt triggers successive readout of 4 of those sensors. A readout means CS low, send one byte, then receive 6 bytes back, and CS high again (it's a 3-axis accelerometer). That happens 4 times in a row for the 4 sensors, which each have their own CS pin but share SCK and SDIO. The SCK frequency cannot be very high (long wires).

My problem is that, because this sensor interrogation all happens in a timer interrupt service routine, I am blocking other tasks while the CPU waits for each of the 2*4 = 8 SPI transactions to finish. 10% of the time, the CPU is waiting for SPI transactions to complete.

What I want instead is just one trigger that says go and execute this series of SPI transactions as listed below, say:
1. claim SPI port, CS1 low
2. SPI write 1 command byte 'read xyz words'
3. SPI read 6 bytes
4. CS1 high, CS2 low
5. SPI write 1 command byte 'read xyz words'
6. SPI read 6 bytes
7. CS2 high, CS3 low
8. SPI write 1 command byte 'read xyz words'
9. SPI read 6 bytes
10. CS3 high, CS4 low
11. SPI write 1 command byte 'read xyz words'
12. SPI read 6 bytes
13. CS4 high, release claim on SPI port
And all of that without ever polling (in a blocking way) for peripheral (or DMA) status bits that indicate that the LPSPI port is busy clocking bits in or out.

What I want to happen under the hood is that each of these 8 SPI transactions runs in sequence, automatically, without me firing them one by one.
That is, I prepare a table with pointers to 8 buffers in RAM, indices for CS pin numbers, the number of bytes per transaction, and a read/write direction flag; then, after one trigger call, the SPI transactions (all 13 steps) just happen in a non-blocking way.
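To make the table idea concrete, here is a minimal, hardware-free sketch of what such a table could look like for the 4-sensor readout listed above. Everything in it (SpiTransaction, buildReadoutTable, the field names) is hypothetical and only illustrates the data layout; it is not taken from any existing library, and the part that actually executes the table is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry of a hypothetical transaction table (all names are illustrative).
struct SpiTransaction {
    const uint8_t *tx;  // bytes to send, or nullptr for a read
    uint8_t *rx;        // where received bytes go, or nullptr for a write
    size_t count;       // number of bytes in this transaction
    int csPin;          // chip-select pin of the addressed sensor
    bool csLowBefore;   // assert CS before this transaction starts
    bool csHighAfter;   // release CS once this transaction completes
};

// Build the 8-entry table for 4 sensors: 1 command byte out, then 6 bytes in.
// CS goes low before the write and high only after the read, so it stays
// low across the write/read pair (steps 1 to 13 in the list above).
std::vector<SpiTransaction> buildReadoutTable(const int csPins[4],
                                              const uint8_t *readCmd,
                                              uint8_t samples[4][6]) {
    assert(readCmd != nullptr);
    std::vector<SpiTransaction> table;
    for (int s = 0; s < 4; ++s) {
        table.push_back({readCmd, nullptr, 1, csPins[s], true, false});
        table.push_back({nullptr, samples[s], 6, csPins[s], false, true});
    }
    return table;
}
```

A driver would then walk this table from a completion interrupt, starting entry n+1 when entry n finishes, so that the timer ISR only has to issue the single 'go'.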

Do I need to rewrite a new low level SPI driver myself, or can I reuse something precooked from a library?
I looked in the current T4 SPI library sources. It seems to be doing DMA by default (does it indeed?), and it's interrupt/event based I think, but I cannot find anything yet on modes with bidirectional SDI/SDO on one wire, and there's no clue on how to make it execute a series of transactions to multiple devices with multiple CS pins.

Anyone any ideas, hints?
 
The only SPI code I have spent time with is: ...\hardware\teensy\avr\libraries\XPT2046_Touchscreen\XPT2046_Touchscreen.cpp

It is blocking, but it does do a stepwise SPI transfer in: void XPT2046_Touchscreen::update()

It might show the command sequence needed to complete the task.
 

Before QSPI stood for Quad SPI, it stood for Queued SPI. Most SPI peripherals, including the LPSPI in the T4.1, have 16-element command/data queues you can set up, then start, poll for "done" (or get an interrupt), extract the data, restart, etc. You can also have the queue repeat automatically. Command fields include specification of CS, with optional delays between commands that use different CS. The CS pins have to be the ones associated with the SPI module, as opposed to the way Arduino lets you assign any pin as CS. I don't think you can change clock rate between transactions, but I think you can change mode, CS polarity, and maybe other things. Since you have control of 4 CS pins, you can de-mux on your own to generate up to 16 independent CS. If you look at library source SPI.cpp, you will see some use of the command/data registers, but not in a way that lets you do what you're asking. You'll have to get into the manual and do it yourself. I've done it in the past on 683xx and Coldfire, which are not as complex, but write back if you try it and I'll help if I can.
 

Thank you Joe. Now I'm curious also about how the default SD card drivers operate. Say if I write a block of data to the Teensy41 SD card, appending something like 512 bytes blockwrite to an ExFAT SDFAT file, is that also in 'blocking' mode? Or is that task pushed to the background while other tasks (that do not yet need access to the same card) can process already? I think this relates also to the thread 'what's the fastest way to write to SD card' that's active today.
 

Yes, I just posted to that thread again with more info. I'm really only learning about SD now, so if I'm wrong about this, I hope others will jump in with more definitive answers. The comments in SdFat's TeensySdioLogger example explain that SD cards have a built-in 512-byte FIFO, so the SdFat driver does all transfers in 512-byte chunks. A 512-byte call to file.write() takes about 5 us, and I don't know whether that covers the entire transfer, or just the setup time, with the call returning while the transfer occurs in the background. The important thing to understand about SD is that after the write, the SD will sometimes do wear-leveling or something else, and the card can be unavailable for another write for ~40 ms (could be different depending on the type of card and interface). Based on what I've learned so far, to minimize blocking time on SD writes, you should use file.isBusy() as shown in TeensySdioLogger. If you do other stuff while waiting for file.isBusy() to be false, file.write(buf,512) will always return in about 5 us. That example also shows how to use SdFat's RingBuf class to buffer data for logging during the periods when file.isBusy() is true.
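As a hardware-free illustration of the pattern described above (keep buffering while the card is busy, and write one 512-byte sector only when it is not), here is a small sketch. FakeCard, ByteRing and serviceLogger are made-up names for this example and are not SdFat classes; on a real Teensy, file.isBusy() and SdFat's RingBuf from the TeensySdioLogger example would play these roles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kSector = 512;  // SD transfers happen in 512-byte chunks

// Stand-in for an SD file; `busy` plays the role of file.isBusy().
struct FakeCard {
    bool busy = false;
    size_t sectorsWritten = 0;
    void write(const uint8_t *, size_t) { ++sectorsWritten; }
};

// Fixed-size byte ring buffer holding up to 8 sectors of log data.
class ByteRing {
public:
    static constexpr size_t kCapacity = 8 * kSector;

    size_t used() const { return used_; }

    // Append n bytes; returns false (and stores nothing) if they do not fit.
    bool push(const uint8_t *data, size_t n) {
        assert(data != nullptr);
        if (n > kCapacity - used_) return false;
        for (size_t i = 0; i < n; ++i) buf_[(head_ + i) % kCapacity] = data[i];
        head_ = (head_ + n) % kCapacity;
        used_ += n;
        return true;
    }

    // Move one full sector into out; false if less than a sector is buffered.
    bool popSector(uint8_t *out) {
        if (used_ < kSector) return false;
        for (size_t i = 0; i < kSector; ++i) out[i] = buf_[(tail_ + i) % kCapacity];
        tail_ = (tail_ + kSector) % kCapacity;
        used_ -= kSector;
        return true;
    }

private:
    uint8_t buf_[kCapacity] = {};
    size_t head_ = 0, tail_ = 0, used_ = 0;
};

// Call this often from loop(): writes at most one sector and never waits.
bool serviceLogger(ByteRing &ring, FakeCard &card) {
    if (card.busy) return false;  // card not ready: keep buffering
    uint8_t sector[kSector];
    if (!ring.popSector(sector)) return false;  // not enough data yet
    card.write(sector, kSector);
    return true;
}
```

The producer (for example a sensor ISR) only ever calls push(), so it is never held up by the card; the buffer just has to be large enough to cover the longest busy period at the logging data rate.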
 

I got it working, starting from the Teensy library SPI.cpp version. First I stripped off everything that's not Teensy 4.1 related, then removed all 'is transfer ready' polling code where the code was waiting for an LPSPI status bit. I also stripped off everything related to interrupts and LPSPI, but I kept the DMA isr. As I still wanted to keep other 'conventional' SPI library use (like for the SD card, SdFat etc...), I ended up making a new class that I called T4_DMA_SPI. I made every SPI transaction DMA based, even when it's just a 1-byte transfer. I stripped off the SPI CS parts because I have >>3 SPI devices attached to the same SPI port. Plus, I cannot afford CS going high in the middle of a bidirectional transaction: on a shared MISO/MOSI pin, a transaction first writes and then reads (which is really two transactions), without CS going high in between. So this was not so easy.
My new T4_DMA_SPI class has a FIFO table with up to 128 SPI transactions that can be scheduled in advance and will be executed automatically after one 'go' command, without polling. A transaction table entry has byte pointers for source and destination buffers, the number of bytes, the CS pin number, and some extra mode control bits that specify if and when CS goes low and high, and how to toggle output/input mode for the bidirectional SDIO line (pin MOSI = pin MISO).

So now I can do the SPI IO that was eating up about 20% of the total CPU time in the background, and I can use that 20% for real computational tasks, or save battery, because polling for SPI ready 20% of the time is not smart I think. Plus, since the trigger to start the SPI IO was timer isr based, that timer isr at its priority will no longer block my other isr's at a lower priority for many milliseconds.

View attachment T4_DMA_SPI.cpp
View attachment T4_DMA_SPI.h
 
Wow, good work. So, this new class can coexist with the standard SPI library?

Yes, it co-exists with the default SPI drivers / classes.
A better version is attached, now with an example sketch that reads out, every 10 ms, seven ST MEMS 3D magnetometers, one ST 3D accelerometer and an MPS magnetic encoder IC, all on the same SPI bus, all using just one SDIO wire for MISO and MOSI (aka SDO and SDI).
View attachment Example_T4_DMA_SPI-230312a.zip
 
It wasn't perfect yet. It probably still isn't, but as it is better than last week's version, here's an update of my T4_DMA_SPI code, with example code for various ST chips and an MPS SPI chip being read out every 10 ms, in a non-blocking fashion.

It now uses pullups on MISO-function pins that might otherwise end up floating.

This version also lets you toggle the MOSI and MISO lines. You do so by giving the CS pin number a negative value. I needed that functionality after I found out the hard way that some of the ST MEMS family chips, such as the IIS2DH, do not have the option to disable I2C mode. If other users of the shared SDIO line on a shared SPI bus pull down the SDIO line while SCK is high, then these ST MEMS ICs interpret that as an I2C start condition, and thereafter the rest is unpredictable and eventually destructive for any register settings in that ST MEMS chip. The MPS MAQ473 magnetic encoder therefore cannot share the SDIO pin with the IIS2DH, because as soon as CS goes low it starts driving its MISO to either zero or one depending on what it is reading and outputting. The only workaround then is to find another pin for SDI(O). That other pin can now be the SDO line.
 

Attachments

  • Example_T4_DMA_SPI-230317a.zip

@sicco I'd like to implement DMA-based SPI transfers on a Teensy 4.
I have a 27-byte payload to send and a 27-byte payload to receive back, and I'd like it to run continuously.

Ideally something similar to the async Tx/Rx on the STM32 platform:
Code:
/**
  * @brief  Transmit and Receive an amount of data in non-blocking mode with DMA.
  * @param  hspi: pointer to a SPI_HandleTypeDef structure that contains
  *               the configuration information for SPI module.
  * @param  pTxData: pointer to transmission data buffer
  * @param  pRxData: pointer to reception data buffer
  * @note   When the CRC feature is enabled the pRxData Length must be Size + 1
  * @param  Size: amount of data to be sent
  * @retval HAL status
  */
HAL_StatusTypeDef HAL_SPI_TransmitReceive_DMA(SPI_HandleTypeDef *hspi, uint8_t *pTxData, uint8_t *pRxData, uint16_t Size)

so I'd just call HAL_SPI_TransmitReceive_DMA(&SPI, Tbuffer, Rbuffer, 27) when I want to kick it off - no need to stop it at the moment.

Do you think this can be done with your library?
 
Probably yes.
Is your Teensy the SPI master or the slave? I did DMA-based code examples for both. This thread is about the master, I think.
Is the 27-byte payload to be transferred as 27 bytes out first and then 27 bytes in, or in and out at the same time, byte by byte? Both would be possible, but will need proper coding.
If master: what triggers the SPI data exchange? Timer? External interrupt? Loop()?
 
Sorry, I have not been following this... Have you simply tried using the SPI library?

Code:
bool transfer(const void *txBuffer, void *rxBuffer, size_t count,  EventResponderRef  event_responder);

You could have your EventResponder callback simply keep a state; after each transfer completes, it starts the next one, changing the state of the CS pins first if necessary.

You could have the last one automatically loop back to the first. You could add a bool somewhere that would cancel...
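A rough, hardware-free sketch of that state-keeping idea follows. TransferChain and its members are hypothetical names for illustration only; the startStep function stands in for "set the CS pins, then start the async SPI.transfer(...) with the EventResponder", and onStepComplete() is what the EventResponder callback would call.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Minimal chain runner: each completion event starts the next step.
class TransferChain {
public:
    TransferChain(size_t stepCount, std::function<void(size_t)> startStep)
        : stepCount_(stepCount), startStep_(std::move(startStep)) {
        assert(startStep_ != nullptr);
    }

    // Kick off the chain at step 0; ignored while a chain is already running.
    void start() {
        if (running_ || stepCount_ == 0) return;
        running_ = true;
        cancelRequested_ = false;
        current_ = 0;
        startStep_(current_);
    }

    // Ask the chain to stop once the transfer currently in flight is done.
    void cancel() { cancelRequested_ = true; }

    // Call this from the transfer-complete callback.
    void onStepComplete() {
        if (!running_) return;
        if (cancelRequested_) {
            running_ = false;
            return;
        }
        current_ = (current_ + 1) % stepCount_;  // last step loops back to first
        startStep_(current_);
    }

    bool running() const { return running_; }

private:
    size_t stepCount_;
    std::function<void(size_t)> startStep_;
    size_t current_ = 0;
    bool running_ = false;
    bool cancelRequested_ = false;
};
```

On real hardware the flags would be touched from both interrupt and main-loop context, so they would need to be volatile (or otherwise protected); that is left out here to keep the sketch short.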
 
Either would work, but ideally byte in, byte out.
One T4 is the master, the other T4 is a slave.
If I could get DMA transfers working on the slave side as well it would be ideal.

Just as the master reads from the buffer and receives to a buffer, the slave would do exactly the same


I had actually posted my question before I’d decided to go through the SPI.h source code.
I did look into the ILI9488_t3 code, as I know there is use of DMA there, but I saw manual setup of DMA channels and had initially figured that the SPI library did not support async out of the box

This is actually a very simple and effective approach. But can I not use the hardware CS pin and have it toggle automatically when I start a transfer? Or do I still need to set it high/low manually?
I've not played much with SPI, which is why I am asking.
 
But what will trigger the SPI transaction in the master? A timer?
 
If you look in the SPI.cpp code you will see, in the IMXRT section (around line 1828 in the version I am looking at):
Code:
bool SPIClass::initDMAChannels() {
    // Allocate our channels.
    _dmaTX = new DMAChannel();
    if (_dmaTX == nullptr) {
        return false;
    }

    _dmaRX = new DMAChannel();
    if (_dmaRX == nullptr) {
        delete _dmaTX; // release it
        _dmaTX = nullptr;
        return false;
    }

    // Let's setup the RX chain
    _dmaRX->disable();
    _dmaRX->source((volatile uint8_t&)port().RDR);
    _dmaRX->disableOnCompletion();
    _dmaRX->triggerAtHardwareEvent(hardware().rx_dma_channel);
    _dmaRX->attachInterrupt(hardware().dma_rxisr);
    _dmaRX->interruptAtCompletion();

    // We may be using settings chain here so lets set it up.
    // Now lets setup TX chain.  Note if trigger TX is not set
    // we need to have the RX do it for us.
    _dmaTX->disable();
    _dmaTX->destination((volatile uint8_t&)port().TDR);
    _dmaTX->disableOnCompletion();

    if (hardware().tx_dma_channel) {
        _dmaTX->triggerAtHardwareEvent(hardware().tx_dma_channel);
    } else {
//        Serial.printf("SPI InitDMA tx triger by RX: %x\n", (uint32_t)_dmaRX);
        _dmaTX->triggerAtTransfersOf(*_dmaRX);
    }


    _dma_state = DMAState::idle;  // Should be first thing set!
    return true;
}
This is a one-shot. But you can probably just call it again in the callback function, for continuous operation.

The callback itself actually sometimes has to go again anyway: I did not create a DMA chain, so there is a max transfer size of something like 32k... The code detects that and relaunches if you asked for > ...
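The relaunch logic can be pictured with a small stand-alone sketch. This is not the code from SPI.cpp; ChunkedTransfer, nextChunk and advance are illustrative names, and the limit of 32767 is an assumption based on the "like 32k" figure above, so check the value your library version actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Assumed per-launch DMA limit (the "like 32k" mentioned above).
constexpr size_t kMaxDmaCount = 32767;

// Bookkeeping for one large transfer that is split into several DMA launches.
struct ChunkedTransfer {
    size_t remaining;  // bytes still to transfer
    size_t offset;     // index of the next byte to transfer
};

// Size of the next DMA launch, or 0 when the whole buffer is done.
size_t nextChunk(const ChunkedTransfer &t) {
    return std::min(t.remaining, kMaxDmaCount);
}

// Call from the completion handler: account for the finished chunk and
// report whether another launch is needed.
bool advance(ChunkedTransfer &t, size_t finished) {
    assert(finished <= t.remaining);
    t.offset += finished;
    t.remaining -= finished;
    return t.remaining > 0;
}
```

For a 70000-byte request this gives three launches; the completion handler keeps relaunching at the new offset until advance() returns false, and only then signals the caller's event.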
 
Okay, I got a base program to run continuously on the master side
Code:
#include <SPI.h>

// SPI configuration
#define SPI_CLOCK 1000000 // 1 MHz
#define SPI_CS_PIN 10     // Chip Select Pin

EventResponder spiEventResponder; // EventResponder for async transfer

uint8_t txBuffer[27] = {0};       // Data to send
uint8_t rxBuffer[27] = {0};       // Buffer for received data

volatile bool transferInProgress = false; // Tracks transfer status

void setup() {
    Serial.begin(115200);

    // Initialize SPI
    SPI.begin();
    SPI.beginTransaction(SPISettings(SPI_CLOCK, MSBFIRST, SPI_MODE0));

    pinMode(SPI_CS_PIN, OUTPUT);
    digitalWrite(SPI_CS_PIN, HIGH); // Deselect device

    // Configure EventResponder callback
    spiEventResponder.attachImmediate(spiCompleteCallback);

    // Fill txBuffer with initial data
    for (uint8_t i = 0; i < sizeof(txBuffer); i++) {
        txBuffer[i] = i;
    }

    // Start the first SPI transfer
    startSPITransfer();
}

void loop() {
    // Main loop can handle other tasks
    delay(100); // Simulate other processing
}

// Start a new SPI transfer
void startSPITransfer() {
    if (!transferInProgress) {
        transferInProgress = true;

        // Select the device
        digitalWrite(SPI_CS_PIN, LOW);

        // Start async SPI transfer
        bool success = SPI.transfer(txBuffer, rxBuffer, sizeof(txBuffer), spiEventResponder);
        if (!success) {
            Serial.println("Failed to start SPI transfer!");
            transferInProgress = false;
            digitalWrite(SPI_CS_PIN, HIGH); // Deselect device on failure
        }
    }
}

// SPI transfer complete callback
void spiCompleteCallback(EventResponder &event) {
    // Deselect the device
    digitalWrite(SPI_CS_PIN, HIGH);

    // Print received data for debugging
    Serial.println("SPI transfer completed!");
    for (size_t i = 0; i < sizeof(rxBuffer); i++) {
        Serial.print("Received byte ");
        Serial.print(i);
        Serial.print(": 0x");
        Serial.println(rxBuffer[i], HEX);
    }

    // Modify txBuffer if necessary
    for (uint8_t i = 0; i < sizeof(txBuffer); i++) {
        txBuffer[i]++; // Example: Increment data for the next transfer
    }

    // Reset the transferInProgress flag
    transferInProgress = false;

    // Start the next transfer
    startSPITransfer();
}

[Image attachment: 1736855116824.png]


Now comes the question if async transfers can be supported on the slave end @sicco ?
 
I'm not quite sure that your above code is non-blocking. Looks like it waits with de-asserting SPI /CS pin until all 27 bytes have been read and written. So during that time, other code will not run?

The idea behind the DMA SPI example that I shared is that such blocking does not happen. The /CS de-assertion is triggered by an interrupt that says that all reading and writing (and possibly toggling tri-state SDI/SDO pin modes when using 3 wire SPI), has been completed.
But if blocking is not a big issue for your application, then no problem. But also no need really to go DMA mode then in your master...


For a Teensy SPI slave, I'd say DMA is a must have. Because the slave will be clueless when exactly it will get bombarded by the master.

https://forum.pjrc.com/index.php?threads/spi-slave-for-t4-dma-spi-slave.73418/ has an example on how to do SPI DMA slaves in T4.
 
Other code should run just fine. His function void spiCompleteCallback(EventResponder &event) gets called when the SPI transfer completes, from the SPI transfer-complete interrupt... Now, if you only have one device on the SPI bus, you might get away with just leaving the CS pin asserted.

Note: if you are doing it immediate (attachImmediate), you may want to limit what you do in the interrupt; doing too many Serial writes, for example, can block...
 
Looking at the SPI.cpp source code, in the async transfer method, it seems that it will only return false if it's unable to allocate a DMA channel, or if the channel is currently active
Code:
if (_dma_state == DMAState::notAllocated) {
    if (!initDMAChannels())
        return false;
}

if (_dma_state == DMAState::active)
    return false; // already active

So I do believe that it will not wait for the transfer to complete to return a state

I'll try out your DMA Slave code on my 2nd Teensy to see how they work together.


I drew out the overall system I am building
[Image attachment: 1736881599287.png]


My master Teensy is the DevBoard V5 that will be reading WAV audio from an SD card.
It will be displaying waveform data and other parameters on its LCD driven by eLCDIF and streaming the audio out to a DAC

My slave Teensy is used to do several things: a QuadEncoder algorithm to calculate jog wheel speed, jog wheel touch sensor detection, SPI communications to the control panel to listen for button presses, ADC for a pitch slider, and LCD control for an ILI9488 that shows the jog wheel position. This Teensy is basically gathering data from its various sensors and packing it up to be sent to the master, while the master is actually sending commands to light up indicators on the control panel.

I've decided to split this out into two Teensies to avoid stressing out the master.
 