Cache handling - DMA transfers


I have a function that interleaves two datasets into a buffer that gets transmitted via DMA.

inline void packCmplx(int32_t *pCmplx, const int32_t *pStop, const int16_t *re, const int16_t *im)
{
	do {
		// clean and invalidate the 32-byte cache line containing pCmplx
		SCB_CACHE_DCCIMVAC = (uint32_t)pCmplx;
		__asm__ volatile("dsb");
		// 8 stores x 4 bytes = 32 bytes written per iteration
		*pCmplx++ = (*re++) << 8;
		*pCmplx++ = (*im++) << 8;
		*pCmplx++ = (*re++) << 8;
		*pCmplx++ = (*im++) << 8;
		*pCmplx++ = (*re++) << 8;
		*pCmplx++ = (*im++) << 8;
		*pCmplx++ = (*re++) << 8;
		*pCmplx++ = (*im++) << 8;
	} while (pCmplx < pStop);
}

This seems to work correctly, but I'm not sure why, or whether it's the best approach.

Shouldn't I need an ISB instruction following DSB since the store address is being written to DCCIMVAC in-loop? Would a single call to arm_dcache_flush_delete() after interleaving yield any advantages?
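One thing that may be worth checking in the loop above: `__asm__ volatile("dsb")` with no "memory" clobber is not a compiler barrier for ordinary loads and stores, so GCC is free to schedule the buffer stores differently around it at different optimization levels. Below is a minimal sketch of a DSB helper that carries the clobber. DSB_BARRIER and pack_line are hypothetical names, the non-ARM branch exists only so the sketch builds off-target, and cleaning the line after it has been filled is an assumption that differs from the ordering in the loop above.

```c
#include <stdint.h>

/* DSB plus a "memory" clobber, so the compiler may not move ordinary
   loads and stores across it. Off-target this is a compiler-only barrier. */
#if defined(__arm__)
#define DSB_BARRIER() __asm__ volatile("dsb" ::: "memory")
#else
#define DSB_BARRIER() __asm__ volatile("" ::: "memory")
#endif

/* Hypothetical helper: fill one 32-byte line (4 complex samples), clean it
   if the Teensy register macro is available, then barrier. Returns the
   next write position. */
static inline int32_t *pack_line(int32_t *dst, const int16_t *re, const int16_t *im)
{
    int32_t *line = dst;
    for (int i = 0; i < 4; i++) {
        /* multiplying by 256 gives the same result as << 8 and is
           well defined for negative samples */
        *dst++ = (int32_t)re[i] * 256;
        *dst++ = (int32_t)im[i] * 256;
    }
#ifdef SCB_CACHE_DCCIMVAC   /* only defined if the Teensy core's imxrt.h was included first */
    DSB_BARRIER();          /* stores are issued before the clean is requested */
    SCB_CACHE_DCCIMVAC = (uint32_t)line;
#else
    (void)line;
#endif
    DSB_BARRIER();          /* nothing above may be moved below this point */
    return dst;
}
```

This does not settle whether an ISB is also needed; it only removes one source of variation between optimization levels.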
I did a little experimentation, using a scope and digitalWriteFast to come up with timings for various cache maintenance operations.

The approach given in the first post, where a write to DCCIMVAC followed by DSB is performed every 32 bytes, varies quite a bit in performance depending on the optimization level.

I varied the optimization options local to packCmplx() with the push, pop, optimize pragmas:
#pragma GCC optimize ("OX")
inline void packCmplx(int32_t *pCmplx, const int32_t *pStop, const int16_t *re, const int16_t *im)
{ ...
and am interleaving two 512-byte arrays into pCmplx (the array pointed to by SADDR).
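For a sense of how much cache maintenance that buffer implies, here is a small sketch of the arithmetic. It assumes the two 512-byte inputs hold int16_t samples and that the D-cache line is 32 bytes, as in the loop from the first post. The helper names and the example addresses in the note are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u   /* D-cache line size assumed throughout */

/* Bytes of interleaved int32_t output produced from two int16_t inputs
   of src_bytes each. */
static size_t packed_bytes(size_t src_bytes)
{
    size_t samples = src_bytes / sizeof(int16_t);   /* samples per input */
    return samples * 2u * sizeof(int32_t);          /* re + im, 4 bytes each */
}

/* Number of cache lines touched by the byte range [addr, addr + size). */
static size_t cache_lines_spanned(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    uintptr_t first = addr / CACHE_LINE_BYTES;
    uintptr_t last = (addr + size - 1u) / CACHE_LINE_BYTES;
    return (size_t)(last - first + 1u);
}
```

Under those assumptions packed_bytes(512) is 2048, which spans 64 lines when the buffer starts on a 32-byte boundary and 65 when it does not, so the per-line approach issues roughly 64 DCCIMVAC writes and barriers per call.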

-OX    Pulse Width (uS)

I also measured with an ISB added after DSB

-OX    Pulse Width (uS)

I then removed cache handling from packCmplx() and added a call to arm_dcache_flush_delete() in the ISR that calls packCmplx(), allowing it to be reduced to:
#pragma GCC optimize ("OX")
inline void packCmplx(int32_t *pCmplx, const int32_t *pStop, const int16_t *re, const int16_t *im)
{
	// plain interleave; cache maintenance is done once by the caller
	while (pCmplx < pStop) {
		*pCmplx++ = (*re++) << 8;
		*pCmplx++ = (*im++) << 8;
	}
}

This consistently gave times around 2.20 uS regardless of optimization levels.
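Roughly, the pack-then-flush arrangement looks like the sketch below. The flush is passed in as a function pointer so the sketch also builds off-target; pack_and_flush, cache_flush_fn and record_flush are hypothetical names, and record_flush is only a host-side stand-in that records its arguments. On a Teensy 4.x I would expect a small wrapper around arm_dcache_flush_delete(addr, size) to be passed instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook with the same shape as arm_dcache_flush_delete(). */
typedef void (*cache_flush_fn)(void *addr, uint32_t size);

/* Host-side stand-in: remembers what it was asked to flush. */
static void *last_flush_addr;
static uint32_t last_flush_size;
static void record_flush(void *addr, uint32_t size)
{
    last_flush_addr = addr;
    last_flush_size = size;
}

/* Interleave re/im into [dst, stop), then do one flush over the whole range. */
static void pack_and_flush(int32_t *dst, const int32_t *stop,
                           const int16_t *re, const int16_t *im,
                           cache_flush_fn flush)
{
    int32_t *p = dst;
    while (p < stop) {
        *p++ = (int32_t)(*re++) * 256;   /* same value as << 8 */
        *p++ = (int32_t)(*im++) * 256;
    }
    if (flush)
        flush(dst, (uint32_t)((uintptr_t)stop - (uintptr_t)dst));
}
```

A call would look like pack_and_flush(buf, buf + n, re, im, record_flush), where n is the number of int32_t slots in the destination buffer. The point of the design is that the inner loop carries no barriers at all, which may be why its timing stops depending on the optimization level.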

Just for fun, I wanted to see what kind of performance I'd get without doing any cache maintenance. I used SCB_DisableDCache() from core_cm7.h (requires slight modification) to turn off dcache and measured these times using the same function as above:

-OX    Pulse Width (uS)