Teensy 3.1 DMA with SPI ... small bug?? Not quite working :-(

hoek67 · Jun 11, 2017

Hi, I thought I had DMA/SPI on the Teensy 3.1 working, but I noticed some graphics glitches that go away if I use any of my other SPI options, including the Teensy FIFO without DMA.

After sending several pages of bitmap data to an SSD1306 OLED I noticed the last few (3-4) bytes are usually garbled or missing.

If I send X bytes one at a time it works but sending via a buffer seems to have a problem.

I basically adapted some of this from a previous post :-

Code:

#include <SPI.h>        // needed for SPI.begin()
#include "DMAChannel.h"

DMAChannel dmachannel;

uint8_t data[256];

void setup() {
  // Setup the SPI clocks and pin configurations
  SPI.begin();

  // Setup SPI for DMA transfer
  SPI0_SR = 0xFF0F0000;
  SPI0_RSER = 0x00;
  SPI0_RSER = SPI_RSER_TFFF_RE | SPI_RSER_TFFF_DIRS; // Make sure SPI triggers a DMA transfer after each transmit
  
  dmachannel.sourceBuffer(data, 256); // The data we wish to transmit and its length
  dmachannel.destination((volatile uint8_t&)SPI0_PUSHR); // Move data into the SPI FIFO register
  dmachannel.triggerAtHardwareEvent(DMAMUX_SOURCE_SPI0_TX); // Only transfer data once the previous byte has been transmitted (this is to ensure all bytes are sent)
  dmachannel.disableOnCompletion(); // Stop after transmitting all 256 bytes

  dmachannel.enable(); // Begin transmit
}

void loop() {
  // put your main code here, to run repeatedly:

}

Notice after the enable it just runs off and starts the transfer but it doesn't check if it's complete.

I have a function that checks... that seems to almost work. I suspect now it waits for the DMA to finish but it may still have 1 - 4 bytes in the SPI FIFO. Seems very coincidental the number of bytes "missing" is anywhere 0 - 4 bytes by the looks.

Code:

__INLINE__ void cDMA_spi_send_do_wait_buffer()
	{
		while (!DMAtx.complete()) { } // wait for dma
		
		// now need to wait for last bytes of SPI to finish
	
		//delayMicroseconds(5);
				
		SPI0_RSER = 0;
		SPI0_SR = 0xFF0F0000;
	}

The function that's mostly working is below (note that a lot of the setup code usually lives outside the main loop, but was brought in while trying to resolve the problem) :-

Code:

__INLINE__ void cDMA_spi_send(void *buf, uint32_t n) // this allows for buffers larger than what DMA usually can handle by breaking them down
	{
		uint8_t *buff = (uint8_t *)buf;
		
		while (n)
		{
			uint32_t bytes = (n <= 1024) ? n : 1024;  // need find the actual max... when have the rest working

			DMAtx.destination((volatile uint8_t &)SPI0_PUSHR);
			DMAtx.triggerAtHardwareEvent(DMAMUX_SOURCE_SPI0_TX);
			DMAtx.disableOnCompletion();

			SPI0_SR = 0xFF0F0000;
			SPI0_RSER = SPI_RSER_RFDF_RE | SPI_RSER_RFDF_DIRS | SPI_RSER_TFFF_RE | SPI_RSER_TFFF_DIRS;

			DMAtx.sourceBuffer((volatile const uint8_t *)buff, bytes);
		   	DMAtx.enable();	 // go!

			n -= bytes;
			buff += bytes;
			
			cDMA_spi_send_do_wait_buffer();				
		}
	}

** I put a small delay before disabling CS and it "worked" as it gave it enough time to complete the last few bytes. Just need a way to determine this in code.

KurtE · Jun 11, 2017

Yes the DMA completion on the TX will happen when the last byte is transferred into the SPI queue. As the queue is 4 elements deep, that is the issue you are running into.
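
To make the "DMA done, FIFO not empty" state visible in code, here is a minimal sketch that decodes the FIFO counters from a status word such as SPI0_SR. The bit positions (TXCTR in bits 15:12, RXCTR in bits 7:4) are my reading of the K20 reference manual, so treat them as assumptions and check them against kinetis.h. The names txFifoCount, rxFifoCount and waitTxFifoEmpty are hypothetical helpers, not part of any library. The register read is passed in as a callable so the logic can be tried off-target.

```cpp
#include <cassert>
#include <cstdint>

// Assumed SPIx_SR layout: TXCTR in bits 15:12, RXCTR in bits 7:4.
constexpr uint32_t kSrTxCtrShift = 12;
constexpr uint32_t kSrRxCtrShift = 4;
constexpr uint32_t kSrCtrMask = 0xF;

// Frames still queued in the TX FIFO for a given status word.
constexpr uint32_t txFifoCount(uint32_t sr) {
  return (sr >> kSrTxCtrShift) & kSrCtrMask;
}

// Frames waiting in the RX FIFO for a given status word.
constexpr uint32_t rxFifoCount(uint32_t sr) {
  return (sr >> kSrRxCtrShift) & kSrCtrMask;
}

// Hypothetical helper: poll read_sr() until the TX FIFO reports empty,
// giving up after max_polls reads. Returns true if the FIFO drained.
// On hardware read_sr would be something like [] { return SPI0_SR; }.
template <typename ReadSr>
bool waitTxFifoEmpty(ReadSr read_sr, uint32_t max_polls) {
  assert(max_polls > 0);
  for (uint32_t i = 0; i < max_polls; ++i) {
    if (txFifoCount(read_sr()) == 0) return true;
  }
  return false;
}
```

Note that an empty TX FIFO only means the last frame has moved to the shifter. It may still be clocking out, so CS should not be released on this condition alone.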

I ran into similar issues when working on the SPI library to support async transfers. One of my test cases was converting both the Adafruit and the Sparkfun libraries for the SSD1306 to use my async support... Earlier I had the code directly in that library...

I have had two different ways to fix this.

The current way in my SPI library applies when I am doing what I would call a write-only transfer, i.e. I don't care what comes back, which is your case. Also note the solution may be different depending on processor (mainly T3.x versus TLC, and T3.5 SPI1/2 are special cases)...
My version of the SPI library code is up at: https://github.com/KurtE/SPI/blob/SPI-Multi-one-class/src/SPIKinetis.cpp#L539
The link highlights the start of some of the async code.

a) Current solution: set up two DMAChannels, the one you have for TX and another for RX. The RX source is SPI0_POPR and the destination is a one-byte dummy variable. You use destination instead of destinationBuffer, and you may have to set the count manually... Note there is a bug in the current transferCount function: if you tell it to send 6 bytes, followed by 512 bytes, and then 6 bytes again, it will screw up... Details are in the pull request, and there is also a workaround function in my source file...
So I set the DMArx to disable on completion and interrupt on completion... It is in this interrupt that I change the CS pin, which will happen after the proper number of bytes have been sent and the responses processed...

b) The other solution was to stick with only the TX channel, but have it send a DMA count of count - 1 and set it to interrupt at completion. Then in this ISR I PUSHR the last byte with the EOQ flag, and I set up an SPI ISR that is triggered on the EOQ status, which then disables the CS. This was my first solution, as it worked on the T3.5, which only has one DMA source for SPI1 or SPI2... but I did a different solution later for that.
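
For option (b), the last byte goes into PUSHR by hand with the end-of-queue bit set. A minimal sketch of building that word follows. It is not the code from my library. The bit values (EOQ as bit 27 of PUSHR, CTAS in bits 30:28, EOQF as bit 28 of SR) are what I believe kinetis.h defines as SPI_PUSHR_EOQ, SPI_PUSHR_CTAS and SPI_SR_EOQF, so verify them there. lastFramePushr and endOfQueueReached are hypothetical names, and no PCS bits are set on the assumption that CS is driven as a plain GPIO.

```cpp
#include <cassert>
#include <cstdint>

// Assumed register bits (compare with SPI_PUSHR_EOQ / SPI_SR_EOQF in kinetis.h).
constexpr uint32_t kPushrEoq = 0x08000000u;   // mark this frame as end of queue
constexpr uint32_t kPushrCtasShift = 28;      // selects which CTAR describes the frame
constexpr uint32_t kSrEoqf = 0x10000000u;     // status flag raised for the EOQ frame

// Build the command word for the final data byte of a transfer.
// ctas picks the transfer-attribute register (0..7).
inline uint32_t lastFramePushr(uint8_t data, uint8_t ctas) {
  assert(ctas < 8);
  return kPushrEoq | (static_cast<uint32_t>(ctas) << kPushrCtasShift) | data;
}

// True once the status word shows the end-of-queue flag.
constexpr bool endOfQueueReached(uint32_t sr) {
  return (sr & kSrEoqf) != 0;
}
```

On hardware the DMA completion ISR would do something like SPI0_PUSHR = lastFramePushr(lastByte, 0), and the SPI ISR (or a polling loop) would release CS once endOfQueueReached(SPI0_SR) is true.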

Hope that helps

hoek67 · Jun 11, 2017

Ahhh thanks for the quick info... have to go for a few hours but will have a look when I get back.

Such a bummer so to speak to be so near.... 5% of time to get 95% of the code... and extra 95% of time to get last 5% syndrome strikes again I guess. Thing is... got a good enough look to see it's worth doing.... such a huge speed increase.

hoek67 · Jun 12, 2017

OK... went for a low-tech solution.

Basically for the transfers I added a small delay to wait after the last DMA transfer happens. I could get fancy and try to calculate a delay given the start and end times and the number of bytes, but we're really talking about a very small delay. As an experiment, 4 µs is used after a block transfer regardless of size, as I can't determine whether 0, 1, 2, 3 or 4 bytes are in the FIFO, so I have to allow for the worst case every time.
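
For reference, the worst-case wait can be estimated from the clock rather than guessed. The sketch below is only arithmetic: frames times bits per frame divided by the SPI clock, rounded up to whole microseconds. drainMicros is a hypothetical helper, the clock rates in the usage note are example values rather than measurements, and it ignores any inter-frame delays configured in the CTAR.

```cpp
#include <cassert>
#include <cstdint>

// Microseconds needed to clock out `frames` frames of `bitsPerFrame` bits
// at `sckHz`, rounded up. Ignores any configured inter-frame delays.
inline uint32_t drainMicros(uint32_t frames, uint32_t bitsPerFrame, uint32_t sckHz) {
  assert(sckHz > 0);
  const uint64_t bitMicros =
      static_cast<uint64_t>(frames) * bitsPerFrame * 1000000ull;
  return static_cast<uint32_t>((bitMicros + sckHz - 1) / sckHz);
}
```

For example, drainMicros(4, 8, 8000000) gives 4, so a full 4-entry FIFO of bytes at an 8 MHz clock takes about 4 µs. Passing 5 frames instead allows for one more byte sitting in the shifter.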

The thing is... even with these small delays the DMA transfers are very fast.

Code:

__INLINE__ void cDMA_spi_send_do_wait_buffer()
	{
		while (!DMAtx.complete()) { } // wait for dma
	
		// now need to wait for last bytes of SPI to finish... even though DMA has finished sending the SPI FIFO has 1+ to send

		delayMicroseconds(__spi_delay); // bit of a hack for now... perm fix as one comes along
								
		SPI0_RSER = 0;
		SPI0_SR = 0xFF0F0000;
	}

When I get things integrated a bit better I hope to share what I have.

KurtE · Jun 12, 2017

Sounds good. The real question to me, is are you doing this to improve the speed of the output and/or to do this asynchronous, such that you can do other things while it is updating.

As I mentioned earlier, with my async SPI code and the test version of the Sparkfun_teensyview library (SSD1306 display), I have test code set up where I install one display on each of the SPI buses of the board and run test code which does screen updates on the displays. The T3.2 has only one SPI bus, but I also tested with two on the TLC and three on the T3.5/3.6.

Now if you simply want faster updates. You can also achieve this by simply keeping the SPI FIFO queue full, in much the same way as the ILI9341_t3 library does.

I have a version that does this: https://github.com/KurtE/SparkFun_TeensyView_Arduino_Library/tree/SPI-Only-Version/src
This is the Sparkfun library...

And you can actually do reasonably well on current versions of the code if you simply convert from using SPI.transfer(x) to using SPI.transfer(buf, cnt)... I wish the proposed SPI.transfer(buf, retbuf, cnt) had been approved for the mainline SPI library, as the current transfer(buf, cnt) destroys the passed-in buf by overwriting the data with the data returned from the transfers.

So I had a version of the update screen function that looped: copy the data into a temp buffer, write out the temp buffer. Yes, it left gaps between groups, but it still gained a lot...
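
The copy-then-send loop can be sketched roughly as below. This is an illustration rather than the code from my library. sendPreserving is a hypothetical name, the 32-byte temp buffer is an arbitrary choice, and the in-place transfer is passed in as a callable so the loop does not depend on a particular SPI API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Send `n` bytes from `src` through an in-place transfer function without
// letting that function overwrite the caller's buffer.
// `transfer(buf, count)` is allowed to clobber buf, as transfer(buf, cnt) does.
template <typename TransferInPlace>
void sendPreserving(const uint8_t* src, size_t n, TransferInPlace transfer) {
  assert(src != nullptr || n == 0);
  uint8_t temp[32];  // scratch copy that the transfer may overwrite
  while (n > 0) {
    const size_t chunk = n < sizeof(temp) ? n : sizeof(temp);
    std::memcpy(temp, src, chunk);
    transfer(temp, chunk);
    src += chunk;
    n -= chunk;
  }
}
```

On a Teensy the callable would be something like [](uint8_t* b, size_t c) { SPI.transfer(b, c); }. The gap between groups is the time spent in the memcpy for each chunk.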

As for how much you can send in one DMAChannel transfer... In most cases 32767 units (in your case bytes, but they could be words)... unless you have DMA channel linking, in which case the max is 511.
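
A sketch of how a long buffer could be split under those two limits is below. splitTransfer is a hypothetical helper that uses only the 32767 and 511 figures above and returns the chunk sizes without touching any hardware.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kMaxCountNoLink = 32767;  // limit without channel linking
constexpr uint32_t kMaxCountLinked = 511;    // limit with channel linking

// Return the per-transfer counts needed to move `total` units.
inline std::vector<uint32_t> splitTransfer(uint32_t total, bool channelLinking) {
  const uint32_t limit = channelLinking ? kMaxCountLinked : kMaxCountNoLink;
  std::vector<uint32_t> chunks;
  while (total > 0) {
    const uint32_t n = total < limit ? total : limit;
    assert(n > 0 && n <= limit);
    chunks.push_back(n);
    total -= n;
  }
  return chunks;
}
```

With linking, a 512-byte block comes out as 511 + 1, so it would need two transfers. Without linking it fits in one.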

hoek67 · Jun 12, 2017

I'm mainly making the changes for speed but also so I can utilize pixel processing instead of spinning waiting and doing nothing.

511 max... that probably explains why the SSD1351, where I try to pump 1k, 2k, 4k, 32k in one hit, looks like a warzone.

The Due seems to handle 3k chunks.

I do have a non-DMA version that is working and just keeps the FIFO buffers fed. There is a speedup and it does have the ability to weave some cpu cycles between sends but DMA is just so much faster again.

I have a SPI base class and it uses the "best" code at compile time so all the dma functionality ends up inlined and very little overhead for setup.

If I can get the buffered DMA going I should be able to get Teensy to run my "video with sound" example.

So... is 511 bytes 0-511 or 511 actual bytes? Was thinking about this as the block read for SD is always 512 so would be a bummer if I can't read a block in a single hit.

KurtE · Jun 12, 2017

Your code does not link channels, so the max transfer is 32767. I have had a version for those displays that just does two DMA operations. The first one sends 6 bytes with DC asserted; when I receive the completion ISR it deasserts DC and then sends the data, 512 or 1024 bytes depending on the display.
