DMASPI library needs some (probably breaking) changes to really support multiple SPIs

Status
Not open for further replies.
My guess is yes. Look at the example code, at the part that talks about testing with dummy data 0xff...
 
Sorry if this is a stupid question, but can I use this to read a stream of information from a SPI Flash IC in DMA mode? All I see is DMA transmission, not receiving.

Every SPI transfer both sends data to the slave and receives data from it. DMASPI lets you choose whether to send dummy data to the slave instead of a full buffer, and whether to discard received data instead of storing it in a buffer. So you can
  • send from a buffer and receive to a buffer
  • send dummy data and receive to a buffer
  • send from a buffer and discard received data
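
As a rough illustration of those three modes, here is a plain C++ model that runs on a PC. It is only a sketch: `exchange` and the pretend slave (which answers every byte with its bitwise inverse) are made up for this example and are not part of DMASPI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of one full-duplex SPI exchange (not DMASPI code).
// The pretend slave answers every byte with its bitwise inverse.
//   src  == nullptr -> clock out `fill` (dummy data) for every byte
//   dest == nullptr -> discard the bytes coming back from the slave
// Returns the number of received bytes that were actually stored.
std::size_t exchange(const std::uint8_t* src, std::size_t count,
                     std::uint8_t* dest, std::uint8_t fill = 0xFF) {
    std::size_t stored = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint8_t out = src ? src[i] : fill;             // byte shifted out
        const std::uint8_t in = static_cast<std::uint8_t>(~out);  // byte shifted in
        if (dest) {
            dest[i] = in;
            ++stored;
        }
    }
    return stored;
}
```

Reading from a flash chip corresponds to the second mode: pass no source buffer, so dummy bytes are clocked out, and pass a destination buffer for the incoming data.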
 
Thank you Christoph!

I figured out how to do what I was trying to do, but the result was disappointingly slower than a SPI.transfer(buffer,size) operation. The Teensy is awesome in so many ways, but for SPI the DUE seems to be superior. I will not be pursuing this particular project on the Teensy :(

Regards,

Graham
 

Would you mind elaborating on the statement "DUE is superior" a little bit more?
I have no DUE, so I could simply believe you, but I would appreciate some words on why the DUE is superior. Is it faster hardware, the way the software is written, the API, or something else?
 
@WMXZ

The basis for my statement was that in this particular instance, with current software implementations, the DUE is significantly faster at reading data from a SPI Flash IC than the T3.6 appears to be. KurtE has patched the core so that SPI.transfer(buffer,size) is now significantly faster: reading 116736 bytes with SPI.transfer(buffer,116736) takes 40861uS.

Using the following code snippet and a DMASIZE of 2048

Code:
  DMASPI0.begin();
  DMASPI0.start();
  digitalWrite(23, LOW);
  DmaSpi::Transfer trx(nullptr, 0, nullptr);
  src[0] = 3;
  src[1] = 0;
  src[2] = 0;
  src[3] = 0;
  unsigned long mt = micros();
  trx = DmaSpi::Transfer(src, 4, nullptr);
  DMASPI0.registerTransfer(trx);
  while (trx.busy())
  {
  }
  clrDest((uint8_t*)dest);
  for (int n = 0; n < 57; n++) {
    trx = DmaSpi::Transfer(nullptr, DMASIZE, dest);
    DMASPI0.registerTransfer(trx);
    while (trx.busy())
    {
    }
  }
  unsigned long mt2 = micros();
  dumpBuffer(dest, "dest: ");
  Serial.println("Time taken to read 116736 bytes =" + String(mt2 - mt) + "uS");
  DMASPI0.end();
  digitalWrite(23, HIGH);
  SPI.end();

the time taken is 257033uS.

Now for the DUE: 2 x 'SPI.transfer(buffer,58368);' (116736 bytes in total) takes 66777uS. However, using Bill Greiman's DMA code and the following snippet
Code:
  spiBegin();
  spiInit(2);
  unsigned long mt = micros();
  digitalWrite(primary_Flash_CS, LOW);

  spiSend(0x03);
  spiSend(0x0);
  spiSend(0x0);
  spiSend(0x0);
  for (int z = 0; z < 57; z++)
    spiRec(buffer , 2048);
  digitalWrite(primary_Flash_CS, HIGH);
  unsigned long mt2 = micros();
  Serialport.println("Time taken=" + String(mt2 - mt) + "uS");


this is 22411uS.

That was the justification for my statement.

Regards,

Graham
 
The limiting factor is the SPI clock. The DUE's SPI clock can run up to 42 MHz. The Teensy's F_BUS of 60 MHz yields a maximum SPI clock of 30 MHz. One could overclock the Teensy bus, though the datasheet says the maximum SPI clock is 30 MHz. At those high rates, one needs to worry about wire lengths and such. Some anecdotal SPI timings:
https://github.com/manitou48/DUEZoo/blob/master/SPIperf.txt
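
To put the clock limit in numbers, here is a small back-of-the-envelope calculation (a sketch only; `minTransferMicros` is a made-up helper). It assumes 8 clock cycles per byte and ignores chip select handling, gaps between bytes, and software overhead, so real transfers can only be slower.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Lower bound on transfer time: every byte needs 8 SPI clock cycles.
// Ignores chip-select handling, inter-byte gaps and software overhead.
double minTransferMicros(std::uint64_t bytes, double spiClockHz) {
    assert(spiClockHz > 0.0);
    return static_cast<double>(bytes) * 8.0 / spiClockHz * 1e6;
}
```

For the 116736 bytes discussed above this gives roughly 31130 us at 30 MHz and roughly 22235 us at 42 MHz. If the DUE DMA run was indeed clocked at 42 MHz, the reported 22411uS sits very close to that bound, while the 40861uS reported for the patched SPI.transfer() would be above the 30 MHz bound, assuming the Teensy ran its SPI at 30 MHz in that test.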
 
Just to clarify: DMASPI cannot give you a faster SPI, it can only free the CPU during long transfers (like more than 100 bytes) when the traditional approach would block for a long time.
 
Yes, you can overclock both the CPU and the bus ("F_BUS"). Did you try F_BUS overclocking, manitou? It speeds up SPI by up to a factor of 2.
 
Hey there,

I am trying to use this library with high-speed SPI-based ADCs and DACs and I am having a couple of issues.
First, I'd like to know how to modify the SPI module to always transfer 16-bit words instead of 8-bit words.
But more importantly, I'd like to know if there's a way to call select() and deselect() on the ChipSelect class at every transferred word, instead of just at the beginning and at the end of the whole buffer.
I need this because my converters need a precise chip select pin sequence that has to be repeated at every word. I'm currently using the library to transfer only 2 bytes at a time, but this makes it very inefficient; in fact, the normal SPI library works better when used this way.

Thanks for your feedback guys.
 
Short version: If you need to select and deselect between each and every word, the DMASPI library can't help you as is.

Long story: I developed the library (and the way it handles chip select) to overcome a silicon bug in the old Teensy 3.0, where it was not possible to write to the upper half of the SPI control register (which controls the hardware chip select lines) using DMA, at least not in a way that allows running large transfers. But I needed large SPI transactions with DMA. And that's why the library exists in its current form, more or less.

This already points towards a possible solution: If your hardware doesn't have the bug described above - and to be honest I didn't pay attention to this in the more recent chips - you can use the hardware chip select lines with DMA. This means quite some work, because you have to roll your own DMA library for your ADCs, but what you want to do sounds feasible by the limited description you gave. Maybe it would even work with the old hardware if you just need a number of consecutive one-word transfers to/from your ADCs. However, this will only work if you have connected their CS lines to the Teensy's hardware chip select pins.

Please give us more detail about the CS sequence you mentioned - what needs to be asserted when?

I'm absolutely not surprised that for this application the classic SPI library is faster than DMASPI, since all the transfer and chip select handling creates some overhead.

Regards

Christoph
 
The ADC has a CONVST pin, which I suppose stands for Conversion Start; this pin has to be held high for 1300 ns, then held low while 16 clock pulses are sent and 16 bits are received.

The DAC has a Chip Select pin (active low). It has to be held high for at least 30 ns, then low until data is sent.

Right now I'm pulling CONVST and CS high at the same time, then I wait 1300 ns with nops, then pull CONVST and CS low, then use SPI.transfer16() to transfer data from both devices at the same time.

I read on the datasheet something about CS programmable delay times, but I haven't read anything about single-word transfers using DMA.
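
For the nop-based wait, the number of CPU cycles needed can be estimated as below. This is a sketch: `cyclesForNanoseconds` is a made-up helper, and the 180 MHz CPU clock is an assumption (adjust it to the actual F_CPU). A nop is also not strictly guaranteed to take exactly one cycle, so treat the result as an estimate and check it on a scope.

```cpp
#include <cassert>
#include <cstdint>

// Number of CPU cycles needed to cover at least `ns` nanoseconds,
// rounded up so the delay is never shorter than requested.
constexpr std::uint32_t cyclesForNanoseconds(std::uint32_t ns, std::uint32_t cpuHz) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(ns) * cpuHz + 999999999ULL) / 1000000000ULL);
}

// 1300 ns CONVST high time at an assumed 180 MHz CPU clock.
static_assert(cyclesForNanoseconds(1300, 180000000) == 234, "1300 ns at 180 MHz");
// 30 ns minimum DAC chip select high time at the same clock.
static_assert(cyclesForNanoseconds(30, 180000000) == 6, "30 ns at 180 MHz");
```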
 
I can confirm the hardware chip select lines work properly on the new Teensy 3.6. I configured the module's chip select delay after transfer in the CTAR register and I now have the required CS timings for my application.
 
Hey there,

Quick update: I also got the DMASPI library to work with 16-bit transfers. To do so I changed all the "uint8_t"s to "uint16_t" and then put the line "SPI0_PUSHR = SPI_PUSHR_CTAS(1) | SPI_PUSHR_PCS(3);" in the select() method of my custom ChipSelect class, just after the call to SPI.beginTransaction(settings);
That line sets the SPI command for the following transfers to use CTAR1 which should be already set to handle 16 bit transfers in the SPI library; the SPI_PUSHR_PCS() part is needed for my application because it tells the SPI module which hardware chip select line to toggle for each transfer.
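
To show what that command word looks like, here is a PC-side sketch that rebuilds the value. `pushrCtas`, `pushrPcs` and `pushrCommand` are local stand-ins written for this example, not the real SPI_PUSHR_CTAS / SPI_PUSHR_PCS macros, and the field positions (CTAS in bits 30..28, PCS in bits 21..16) are an assumption based on the Kinetis reference manual; please verify them against the manual for your chip.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-ins for the PUSHR field helpers so this compiles on a PC.
// Assumed field positions: CTAS in bits 30..28, PCS in bits 21..16.
constexpr std::uint32_t pushrCtas(std::uint32_t n) { return (n & 0x7u) << 28; }
constexpr std::uint32_t pushrPcs(std::uint32_t mask) { return (mask & 0x3Fu) << 16; }

// Command half of a PUSHR write: which CTAR to use and which PCS lines to assert.
constexpr std::uint32_t pushrCommand(std::uint32_t ctar, std::uint32_t pcsMask) {
    return pushrCtas(ctar) | pushrPcs(pcsMask);
}

// The combination from the post above: CTAR1, PCS value 3.
static_assert(pushrCommand(1, 3) == 0x10030000u, "CTAS(1) | PCS(3)");
```

Note that, as far as I can tell, the PCS field is a bit mask rather than a line number, so a value of 3 would assert PCS0 and PCS1 together, which fits an application that drives two devices at the same time.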

Thanks @christoph for the work on the DMASPI lib and for the insights.
 
Currently, transferring with DMASPI followed by a while(trx.busy()) {} takes longer in my OLED screen write tests than plain SPI transactions.

2000micros vs 760micros

any ideas why?

Code:
#ifdef DMA_PAGE_TRANSFER
  trx = DmaSpi::Transfer(buffer, BUFFER__SIZE, nullptr);
  DMASPI0.registerTransfer(trx); 
  while(trx.busy()) {}
#else
   for (uint16_t i = 0; i<SSD1306_BUFFERSIZE; i++) {fastSPIwrite(buffer[i]);}
#endif
 
Is it possibly taking time waiting on potential input from the SPI? If so, is there a way to only send and not worry about any receiving?

Alternatively, could this have anything to do with 8-bit transfers vs. the 32-bit DMA hardware?
 
I tried the aforementioned technique of moving the while before my transfer, and things fly off the rails because of the other screen commands used to set up the screen for receiving data. If I move it before those commands, it seems as though the screen never gets any data, but the timer showing my main loop time shows the same 2000us as when the screen writes were working.

Code:
void MacroMachines_SSD1306::display(void) {

#ifdef DMA_PAGE_TRANSFER
  while(trx.busy()) {}
#endif

  ssd1306_command(SSD1306_COLUMNADDR);
  ssd1306_command(0);  // Column start address (0 = reset)
  ssd1306_command(SSD1306_LCDWIDTH-1); // Column end address (127 = reset)
  ssd1306_command(SSD1306_PAGEADDR);
  ssd1306_command(0); // Page start address (0 = reset)
  ssd1306_command(7); // Page end address
  
    *csport |= cspinmask;
    *dcport |= dcpinmask;
    *csport &= ~cspinmask;

#ifdef DMA_PAGE_TRANSFER
  trx = DmaSpi::Transfer(buffer, BUFFER__SIZE, nullptr);
  DMASPI0.registerTransfer(trx); 
#else
   for (uint16_t i = 0; i<SSD1306_BUFFERSIZE; i++) {fastSPIwrite(buffer[i]);}
#endif

    *csport |= cspinmask;
}

Ideally I would like to get the write time down to what I was getting from plain SPI transfers. DMA should then be a substantial performance gain overall, because SPI would no longer hold everything up for the 1024 bytes of a screen write.
 
You're not using DMASPI as intended. Some details of your code are "backwards" or circumvent the concepts behind DMASPI.

First of all, I'll try to answer the questions from your previous posts:
Currently, transferring with DMASPI followed by a while(trx.busy()) {} takes longer in my OLED screen write tests than plain SPI transactions.

2000micros vs 760micros

any ideas why?
I don't know how large the buffer for that test was, but if it was large enough for DMASPI to compensate for its additional overhead, chances are that you didn't give DMASPI the right SPI settings through a ChipSelect object. The ChipSelect object sets up the SPI for the actual transaction, using the classic SPI transaction methods. DMASPI builds a "transfer" on top of that. If there's no ChipSelect object to use, DMASPI will use default settings, which might simply be slower than what you use in your code. Without seeing all your test code, it's hard to tell what's going on.

is it possibly taking time waiting on potential input from the SPI? If so, is there a way to only send and not worry about any receiving?

alternately, could this have anything to do with the 8bit transfers vs 32 bit DMA hardware?
If you don't pass a return buffer to your transfer object, any incoming data is discarded. This is exactly what you do in your last post, so you're good to go in that respect. By its very nature, SPI cannot "wait" for incoming data; it just clocks in whatever is present on the DIN pin.

The DMA engine can handle 8-bit transfers just fine, this has been demonstrated before and it also works for DMASPI.

Now let's get to the code snippet in your last post. I'll start with some concepts behind DMASPI.
  • When a Transfer is registered with the DMASPI engine, it is appended to a queue of pending transfers.
  • When the DMASPI engine is not "stopped" (see below), it will start a pending transfer right after it has been registered (if the queue was empty before) or when the previous transfer has finished.
  • A transfer object can be equipped with a ChipSelect object. The ChipSelect object is responsible for selecting a slave and setting up the right SPI settings. Some basic ChipSelect classes are included with DMASPI and the most common one, "ActiveLowChipSelect", is demonstrated in the example code. Apply correct SPISettings there.
  • While a transfer is being processed, its source and destination buffers must remain in scope. Otherwise, wrong data might be sent to the slave, or (probably worse) existing data in the master's RAM might be overwritten.
  • The DMASPI engine can be stopped and restarted to have classic, blocking SPI operations between DMA transfers. You need this to properly send commands to your display. However, when stop() is called, this will just set a stop request flag and the engine will finish any transfer that is currently in progress, but it will not start any new pending transfer. This is also demonstrated in the example.
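
The queueing and stop rules in the list above can be sketched as a small model that runs on a PC. This is not DMASPI code: `Transfer`, `EngineModel` and all of their members are made up for illustration, and `onTransferDone()` stands in for the "DMA finished" interrupt.

```cpp
#include <cassert>
#include <deque>

// Hypothetical host-side model of the queueing rules above (not DMASPI code).
struct Transfer {
    bool pending = false;     // registered, waiting in the queue
    bool inProgress = false;  // currently being clocked out
    bool busy() const { return pending || inProgress; }
};

class EngineModel {
public:
    // Append to the queue; start right away if the engine is idle and running.
    void registerTransfer(Transfer& t) {
        t.pending = true;
        queue_.push_back(&t);
        startNextIfPossible();
    }
    // Only a request: the transfer in progress is allowed to finish.
    void stop() { stopRequested_ = true; }
    void start() { stopRequested_ = false; startNextIfPossible(); }
    // True once a stop was requested AND nothing is in progress any more.
    bool stopped() const { return stopRequested_ && current_ == nullptr; }
    // Stands in for the "DMA finished" interrupt.
    void onTransferDone() {
        if (current_ == nullptr) return;
        current_->inProgress = false;
        current_ = nullptr;
        startNextIfPossible();
    }
private:
    void startNextIfPossible() {
        if (stopRequested_ || current_ != nullptr || queue_.empty()) return;
        current_ = queue_.front();
        queue_.pop_front();
        current_->pending = false;
        current_->inProgress = true;
    }
    std::deque<Transfer*> queue_;
    Transfer* current_ = nullptr;
    bool stopRequested_ = false;
};
```

The point of the model is the gap between calling stop() and stopped() becoming true: blocking SPI commands are only safe in that second state.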

Your code starts by waiting for the current transfer to finish. This will probably waste CPU time. Simply add a method that checks for trx.busy() == false to your display class and start drawing only if that is the case. If the transfer is not busy, it's safe to send the next frame. Try to think in states, not in procedures, when writing asynchronous code. You can only get less wasted CPU time if you don't wait for the transfer to finish. Instead, check if it is finished and do something useful until it is. Check every 20 ms for a maximum of 50 fps. A fast CPU with little to draw can easily do 100 fps or more if not throttled, and the display will look like garbage while it's simply trying to keep up with your data.
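
A minimal sketch of that "check, don't wait" idea, written as plain C++ so it can be tried on a PC. `FrameThrottle` and `poll` are made-up names and not part of DMASPI; the 20 ms interval is the one suggested above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical non-blocking frame throttle (not part of DMASPI).
// Call poll() from the main loop; it never waits.
class FrameThrottle {
public:
    explicit FrameThrottle(std::uint32_t intervalMs) : intervalMs_(intervalMs) {}

    // Returns true when a new frame should be started now:
    // the previous transfer is no longer busy AND the interval has elapsed.
    bool poll(std::uint32_t nowMs, bool transferBusy) {
        if (transferBusy) return false;
        // Unsigned subtraction keeps working when the millisecond counter wraps.
        if (started_ && static_cast<std::uint32_t>(nowMs - lastStartMs_) < intervalMs_) {
            return false;
        }
        lastStartMs_ = nowMs;
        started_ = true;
        return true;
    }
private:
    std::uint32_t intervalMs_;
    std::uint32_t lastStartMs_ = 0;
    bool started_ = false;
};
```

In a sketch this would be called once per loop iteration with millis() and trx.busy() as arguments, and the display update would only be started when it returns true; everything else in the loop keeps running in the meantime.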

On top of that, you use the SPI for sending commands to the display immediately after you found out that the transfer is not busy any more. This is probably ok for applications where there is only one module that uses the SPI, but in more complex situations the DMASPI engine might already be using the SPI for its next pending transfer. So you need to pause the DMASPI engine using stop(), and wait until it has actually stopped!

Move your chip select handling to a chip select object. This will make sure that the chip select line is only asserted just before the transfer actually starts, and that it is only deasserted when the transfer has finished. Your code, as it is now, deasserts CS immediately after registering the transfer, and - depending on the SPI speed - no or just one byte has been shifted out by that time. The rest of the data will go nowhere because the slave is not expecting any further data!
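
The ordering problem can be made visible with a tiny model that just logs events. Everything here is hypothetical (`LoggingChipSelect` and `AsyncTransferModel` are not DMASPI classes); `finish()` stands in for the "transfer complete" interrupt.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a chip select object (not the DMASPI class).
struct LoggingChipSelect {
    std::vector<std::string>& log;
    void select()   { log.push_back("CS low"); }
    void deselect() { log.push_back("CS high"); }
};

// Models an asynchronous transfer: start() returns immediately,
// finish() stands in for the "transfer complete" interrupt.
struct AsyncTransferModel {
    LoggingChipSelect& cs;
    std::vector<std::string>& log;
    void start()  { cs.select(); log.push_back("transfer started"); }
    void finish() { log.push_back("transfer finished"); cs.deselect(); }
};
```

Because the transfer owns the chip select, the caller can continue right after start() and CS still goes high only after the last byte. Toggling CS by hand right after registering the transfer would put "CS high" directly after "transfer started" in such a log.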

I'll not write a complete solution for your problem now because I think that others can benefit more from these explanations than from a specific solution to a specific problem. I'm well aware that SPI with DMA is a complex matter, and the library is not less complex. However, it provides a lot of the tools necessary to get going even in complex environments. I hope that this will get you a few steps further. When you provide an attempt at implementing the above hints, I'm happy to help further.
 
RE: the 2000+us with DMA vs 760us using regular SPI / activelow chip select.
I'll try out the recommended chip select object right now and see where I get. It makes sense that it would help to have the CS and SPISettings wrapped in a DMASPI abstraction, but I couldn't figure out what the difference would be from reading the implementation.

I understand the benefits of DMA are mainly to allow the processor to continue doing other things while the DMA pushes out the 1024 byte screen buffer. I wanted to get DMA working first before moving on to making use of the freed cycles. Unfortunately I am also using the single SPI for DAC writes, which makes timing a bit more critical.

Is it possible to have the DAC ISR suspend the DMA transfer for a 4 channel * 16 bit DAC write (8 bytes)? I don't currently think that's possible but you have far more knowledge of SPI&DMA. From what I gather stop() might be a partial solution?

I do know from another project I studied that uses similar hardware that there is a net gain to be had if I get the timing right. At the moment I am thinking if I could make the draw method only refresh changed regions, which are often very small areas, I should be able to write the OLED changes quickly with DMA while preparing the next frame.

I will see if I can cook my test down further to just the essentials.
 
Oh nice! I added an ActiveLowChipSelect object with my appropriate 32 MHz SPI settings, along with a couple of other tweaks that streamline the commands prior to the DMA transfer, and I am getting a 400us loop time. That's almost 2x faster than my non-DMA writes. I'm still using the while(busy) to hold things after the transfer, but this shows the potential speed gain I was hoping for, which is exciting.

Should I also add my other commands into the chip select object?

I also noticed a mention of pause/continue in the example file. Does this potentially mean my ISR for the DAC writes on the same SPI line could pause the DMA, flip the screen CS high, transfer the DAC words, and resume the screen DMA?
 
Should I also add my other commands into the chip select object?
The display SPI commands with different pin states (DC pin)? You can do that. However, keep in mind that the DMASPI transfer might be started in an ISR context, so this might delay execution of other time critical parts of the code. Only you can judge if this is ok. Another approach is to stuff the command part into its own DMASPI transfer with another custom chip select object that takes care of CS and DC, and making sure that the command transfer is always registered with the DMASPI engine before the dataframe is registered (this is easily done).

I also noticed a mention of pause/continue in the example file. Does this potentially mean my ISR for the DAC writes on the same SPI line could pause the DMA, flip the screen CS high, transfer the DAC words, and resume the screen DMA?

Stopping the DMASPI engine using stop() will first wait for a running low-level DMA transfer to finish - it's more like a "request to stop when things are done", in order not to confuse slaves with an unfinished transfer. This is usually safer than simply aborting in the middle of a command, for example. A data frame for a display might be less critical. Aborting the low-level DMA transfer at some random point is not supported by the library yet, but in theory it's possible - I haven't tried, though. If you dig into the datasheet (hint: for example teensy 3.2 chip datasheet -> chapter 21: "DMA Controller" -> DMA_CR register -> CX bit), come up with an abort() command for DMASPI and file a pull request I'll definitely consider it. The hard part is to cancel only the transfer that is supposed to be canceled, and none other.

So some more questions:
  • How time critical is your DAC code?
  • Do you need to write to the DAC at regular intervals or is it a more or less random process? If it's regular intervals, wait for the DAC stuff to finish and then check if you need to update the display. This would relax things a lot.
  • Is there enough time for the display update between any two DAC writes?
  • Can you move DAC and display to different SPIs if this doesn't work out as expected?
  • Which teensy variant are we talking about?

Regards

Christoph
 
If you dig into the datasheet (hint: for example teensy 3.2 chip datasheet -> chapter 21: "DMA Controller" -> DMA_CR register -> CX bit), come up with an abort() command for DMASPI and file a pull request I'll definitely consider it. The hard part is to cancel only the transfer that is supposed to be canceled, and none other.


I will look into that if the need arises. Thank you for the hint :p


How time critical is your DAC code?

It's not what I would call highly critical. Before, it was plenty fine for its intended purposes without DMA, but with the scope zoomed in there were little moments where you could see the DAC holding for a couple of ms while the screen writes were happening. It sounded good as an audio oscillator, and the quality improved dramatically when I disabled the screen refresh, but disabling the screen refresh took a fair bit away from the dynamic visual feedback of the screen. I am going to try pulling my DMA tests back into the main code now and see if my various ideas for reducing the screen write time, together with DMA, remove the little inconsistencies in the writes.


Do you need to write to the DAC at regular intervals or is it a more or less random process? If it's regular intervals, wait for the DAC stuff to finish and then check if you need to update the display. This would relax things a lot.

It would be ideal to use my ISR-based DAC writes to get consistent timing and smooth output. I had tried out a few ways of managing timing aside from the ISR. Checking elapsedMillis to see if it was time to draw another frame, and letting the DAC / ADC run as fast as possible, worked fairly well.


Is there enough time for the display update between any two DAC writes?

I believe it will be if I succeed in some of my optimizations. Mainly the thing that is continually moving is an indicator circle moving along a waveform, and if I were to have it double buffered, and only drew where it needed to be redrawn, that would often be only 1/64th of the screen, which from my tests of drawing the full screen using DMA, should be well within the time between DAC writes. Also I am going to try out the ADC library implementation of ring buffer/DMA reads which would potentially alleviate any concerns about getting time critical events from the ADC inputs.


Can you move DAC and display to different SPIs if this doesn't work out as expected?

Unfortunately the Teensy 3.1 / Freescale chip only has the one, and the hardware is done for this. It's not a critical thing, more of a huge bonus. There is another similar project using the same chip, screen, and DAC scenario which appears to work very well using DMA & SPI.

Which teensy variant are we talking about?

The 3.1 was used in the development of this project, but I plan to use the 3.6 on the next one. The most exciting thing about the 3.6 for me is the FPU. There are multiple SPIs on there, which would be a huge plus; possibly better yet, there are enough pins to do parallel screen writes. Do you know if it would be possible to use DMA on an 8-pin parallel bus on the 3.6? I believe I have spoken with some STM32 users who did this, and I had been considering switching to STM32 for my next projects until the 3.6 was announced. The only thing still drawing my interest to STM32 is the insane smartphone-style screen dev board. But the Nextion could be a decent middle ground.
 
I have one critical question, I have most of what I need set up to do partial screen writes in my DMASPI Display method now, but I realized the Transfer didn't have a way to offset the position into the buffer for the partial write. What would be the best way for me to go about this? I will do my best to parse your code and figure it out, but I figured there might be some DMA or pointer tricks outside my knowledge that would be needed/optimal.
 
So as I understand it, your DAC writes have to be done at regular intervals?

Regarding those partial screen writes: DMASPI will only be able to digest plain arrays. So if you have a full-screen buffer with a number of full lines (say 64x128 pixels) and only want to write a window (say 16x16 pixels) from that buffer, your array is actually a list of 16-pixel lines and not a single array of consecutive pixels. However, once you know your drawing window, you can prepare such an array. Does this answer your question about "offsetting the position into the buffer for the partial write"? If it's simpler than that and you really just need an offset into the buffer, you can specify that when you create your transfer object. It's certainly also possible to write a method that sets the source address just like the constructor does. Such a method should then take the transfer's state into account and only update the source field when it is not pending or in progress.

Regards

Christoph
 
Hello guys, I just gave it a try and found that
modifying DmaSpi.h, line 494, changes the default SPI settings,

like below:
m_Spi.beginTransaction(SPISettings(16000000 , MSBFIRST, SPI_MODE0));

and that works with SPI1 on the Teensy LC.
I use SPI1 without CS, or I should say the slave is constantly selected. Not sure if it helps anyone?
 
So as I understand it, your DAC writes have to be done at regular intervals?

The DAC writes do not have to be done at regular intervals, but it does make things sound a bit better if used as a direct audio source. I have 2 implementations of my DAC writing function that I flip back and forth between in different scenarios. One is an ISR driven by IntervalTimer, and the other is in the main loop, happening whenever it can. This will likely have to change a bit now to accommodate the timing flexibility of DMA, and I plan to try this right now. It may even be better to use DmaSpi for the DAC as well now, to keep things from colliding. I figure if I have a small 4channel*2word DAC command buffer sent via DmaSpi it could potentially be the fastest option.

Are you aware of any potential conflicts using the ADC library DMA functionality?

Regarding those partial screen writes: DMASPI will only be able to digest plain arrays. So if you have a full-screen buffer with a number of full lines (say 64x128 pixels) and only want to write a window (say 16x16 pixels) from that buffer, your array is actually a list of 16-pixel lines and not a single array of consecutive pixels. However, once you know your drawing window, you can prepare such an array. Does this answer your question about "offsetting the position into the buffer for the partial write"? If it's simpler than that and you really just need an offset into the buffer, you can specify that when you create your transfer object. It's certainly also possible to write a method that sets the source address just like the constructor does. Such a method should then take the transfer's state into account and only update the source field when it is not pending or in progress.

That is what I was thinking would be the case. So a potential solution could be a separate function that pushes out a 16x16 buffer for section writes, that is copied from the main 128x64 buffer using memcpy? (I have a sketch of a DMA based memcpy as well that could be useful, or if you have any recommendations). Alternately I could have the indicator dot render to a separate buffer with variables for the region of the OLED's ram to update and sort of layer things, with that smaller 16x16 buffer getting pushed out every frame and when a more substantial screen change is made it could run a full screen update function.
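
A rough sketch of the memcpy variant, as plain C++ that can be tested on a PC. `copyWindow` is a made-up helper, and the layout is an assumption: a row-major byte buffer with `stride` bytes per row, with coordinates and sizes given in bytes, not pixels. For an SSD1306-style 1024-byte buffer that would mean 128 bytes per page and 8 pages, so a page-aligned 16x16 pixel window becomes 16 columns by 2 pages, i.e. 32 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy a w x h window (in bytes) out of a row-major buffer with `stride`
// bytes per row into a contiguous array that a single transfer can send.
// Returns the number of bytes written to `dst` (w * h).
std::size_t copyWindow(const std::uint8_t* src, std::size_t stride,
                       std::size_t x, std::size_t y,
                       std::size_t w, std::size_t h, std::uint8_t* dst) {
    assert(x + w <= stride);  // window must fit inside one row
    for (std::size_t row = 0; row < h; ++row) {
        std::memcpy(dst + row * w, src + (y + row) * stride + x, w);
    }
    return w * h;
}
```

The contiguous window can then be handed to a transfer like any other plain array. The display's column and page address commands would have to be set to the same window first, and the window buffer must stay valid until the transfer has finished.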
 