Cache handling - DMA transfers

jasonconway

I have a function that interleaves two datasets into a buffer that gets transmitted via DMA.

Code:
inline void packCmplx(int32_t *pCmplx, const int32_t *pStop, const int16_t *re, const int16_t *im)
{
	do
	{
		SCB_CACHE_DCCIMVAC = (uint32_t)pCmplx;
		__asm__ volatile("dsb");         
 
		*pCmplx++ = (*re++) << 8;
		*pCmplx++ = (*im++) << 8;

		*pCmplx++ = (*re++) << 8;
		*pCmplx++ = (*im++) << 8;

		*pCmplx++ = (*re++) << 8;
		*pCmplx++ = (*im++) << 8;
		
		*pCmplx++ = (*re++) << 8;
		*pCmplx++ = (*im++) << 8;
	} while (pCmplx < pStop);
}

This seems to work correctly, but I'm not sure why, or whether it's the best approach.

Shouldn't I need an ISB instruction following DSB since the store address is being written to DCCIMVAC in-loop? Would a single call to arm_dcache_flush_delete() after interleaving yield any advantages?
 
I did a little experimentation, using a scope and digitalWriteFast to time various cache maintenance operations.
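For anyone wanting to repeat the measurement, this is roughly the pattern, as a sketch rather than my exact code. SCOPE_PIN, timeOnScope() and dummyWork() are made-up names, the pin has to be set to OUTPUT beforehand on the real board, and the #else branch is only a host stand-in so the snippet compiles off-target.

```c
#include <stdint.h>

#if defined(__IMXRT1062__)
#include <Arduino.h>            /* Teensy 4.x core: digitalWriteFast(), HIGH, LOW */
#else
/* Host stand-ins (assumption): log pin writes so the ordering can be checked. */
#define HIGH 1
#define LOW  0
static uint8_t  pin_log[16];
static unsigned pin_log_len;
static void digitalWriteFast(uint8_t pin, uint8_t val)
{
    (void)pin;
    if (pin_log_len < sizeof pin_log) pin_log[pin_log_len++] = val;
}
#endif

#define SCOPE_PIN 2             /* hypothetical pin wired to the scope probe */

/* Placeholder for the code under test (e.g. the call to packCmplx). */
static volatile uint32_t work_done;
static void dummyWork(void) { work_done++; }

/* Raise the pin, run the work, drop the pin: the pulse width seen on the
   scope is the run time of work() plus the small cost of the two pin writes. */
static void timeOnScope(void (*work)(void))
{
    digitalWriteFast(SCOPE_PIN, HIGH);
    work();
    digitalWriteFast(SCOPE_PIN, LOW);
}
```

Calling timeOnScope(dummyWork) from the ISR then gives one pulse per invocation that can be measured with the scope's pulse-width readout.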

The approach given in the first post, where a write to DCCIMVAC followed by a DSB is performed every 32 bytes, varies quite a bit in performance depending on the optimization level.

I varied the optimization options local to packCmplx() with the push, pop, optimize pragmas:
Code:
#pragma GCC optimize ("OX")
inline void packCmplx(int32_t *pCmplx, const int32_t *pStop, const int16_t *re, const int16_t *im)
{ ...
and am interleaving two 512-byte arrays into pCmplx (the array pointed to by SADDR).
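For scale, here is a rough count of the cache lines involved, assuming the two inputs really are 512 bytes of int16_t each (256 samples), so the packed int32_t output is 2048 bytes. cache_lines_spanned() is a hypothetical helper, not part of any core; the 32-byte line size is the Cortex-M7 D-cache line size, which also matches the eight 4-byte stores per loop iteration in the first post.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 32u    /* Cortex-M7 D-cache line size */

/* Hypothetical helper: number of cache lines a clean+invalidate over
   [addr, addr + size) has to touch. Maintenance works on whole lines,
   so a misaligned buffer spills into one extra line. */
static uint32_t cache_lines_spanned(uintptr_t addr, uint32_t size)
{
    if (size == 0)
        return 0;

    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE_BYTES - 1);
    uintptr_t first = addr & mask;               /* line holding the first byte */
    uintptr_t last  = (addr + size - 1) & mask;  /* line holding the last byte  */

    return (uint32_t)((last - first) / CACHE_LINE_BYTES) + 1;
}
```

With a 32-byte-aligned 2048-byte buffer this gives 64 lines, so the per-line version issues 64 DCCIMVAC writes, each followed by a DSB. If the buffer starts 4 bytes past a line boundary the same range touches 65 lines.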

-OX    Pulse Width (µs)
-O0    6.84
-O1    5.51
-O2    16.56
-O3    16.79

I also measured with an ISB added after the DSB:

-OX    Pulse Width (µs)
-O0    7.04
-O1    5.74
-O2    17.00
-O3    17.22

I then removed cache handling from packCmplx() and added a call to arm_dcache_flush_delete() in the ISR that calls packCmplx(), allowing it to be reduced to:
Code:
#pragma GCC optimize ("OX")
inline void packCmplx(int32_t *pCmplx, const int32_t *pStop, const int16_t *re, const int16_t *im)
{
	while (pCmplx < pStop)
	{
		*pCmplx++ = (*re++) << 8;
		*pCmplx++ = (*im++) << 8;
	}
}

This consistently gave times around 2.20 µs regardless of optimization level.
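The calling side looks something like the following. This is a sketch, not the actual ISR: txBuf, N_SAMPLES and fillTxBuffer() are names invented here, N_SAMPLES assumes 512-byte int16_t inputs, and the #else branch is just a host stand-in so it compiles off-target. On a Teensy 4.x, arm_dcache_flush_delete(addr, size) comes from the core headers.

```c
#include <stdint.h>
#include <stddef.h>

#if defined(__IMXRT1062__)
#include <Arduino.h>            /* Teensy 4.x core: arm_dcache_flush_delete() */
#else
/* Host stand-in (assumption): just records what was requested. */
static void    *last_flush_addr;
static uint32_t last_flush_size;
static void arm_dcache_flush_delete(void *addr, uint32_t size)
{
    last_flush_addr = addr;
    last_flush_size = size;
}
#endif

#define N_SAMPLES 256           /* assumption: 512-byte int16_t inputs */

/* DMA source buffer; 32-byte alignment keeps it on whole cache lines. */
static int32_t txBuf[2 * N_SAMPLES] __attribute__((aligned(32)));

/* Same interleave as above, with no cache handling inside the loop. */
static void packCmplx(int32_t *pCmplx, const int32_t *pStop,
                      const int16_t *re, const int16_t *im)
{
    while (pCmplx < pStop)
    {
        *pCmplx++ = (*re++) << 8;
        *pCmplx++ = (*im++) << 8;
    }
}

/* Hypothetical ISR body: pack the whole block first, then clean and
   invalidate the entire buffer once so the DMA reads the new data. */
static void fillTxBuffer(const int16_t *re, const int16_t *im)
{
    packCmplx(txBuf, txBuf + 2 * N_SAMPLES, re, im);
    arm_dcache_flush_delete(txBuf, sizeof(txBuf));
}
```

The point of the arrangement is that the pack loop stays a plain copy the compiler is free to optimize, and all the cache maintenance happens in one pass over the 2048-byte buffer afterwards.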



Just for fun, I wanted to see what kind of performance I'd get without doing any cache maintenance. I used SCB_DisableDCache() from core_cm7.h (requires slight modification) to turn off dcache and measured these times using the same function as above:

-OX    Pulse Width (µs)
-O0    10.69
-O1    8.75
-O2    8.95
-O3    8.96
 