Did a little experimentation and used a scope and digitalWriteFast to come up with timings for various cache maintenance operations.
The approach given in the first post, where a write to DCCIMVAC followed by a DSB is performed every 32 bytes, varies quite a bit in performance depending on the optimization level.
I varied the optimization options local to packCmplx() with the push, pop, optimize pragmas:
Code:
#pragma GCC optimize ("OX")
inline void packCmplx(int32_t *pCmplx, const int32_t *pStop, const int16_t *re, const int16_t *im)
{ ...
The test interleaves two 512-byte arrays into pCmplx (the array pointed to by SADDR).
-OX | Pulse Width (µs)
-O0 | 6.84
-O1 | 5.51
-O2 | 16.56
-O3 | 16.79
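For anyone who has not used the push, pop, and optimize pragmas mentioned above, here is a minimal sketch of how they can scope an optimization level to a single function. The level "O1" is just an example value, and the function body is the cache-free version shown later in this post rather than the elided one, so treat it as an illustration only. These pragmas are GCC-specific.

```cpp
#include <cstdint>

// Save the current command-line optimization options.
#pragma GCC push_options
// Override the level for everything up to the matching pop ("O1" is an example).
#pragma GCC optimize ("O1")

static inline void packCmplx(int32_t *pCmplx, const int32_t *pStop,
                             const int16_t *re, const int16_t *im)
{
    while (pCmplx < pStop)
    {
        // * 256 gives the same result as << 8 and stays well-defined
        // for negative samples.
        *pCmplx++ = (*re++) * 256;
        *pCmplx++ = (*im++) * 256;
    }
}

// Restore the options that were in effect before the push.
#pragma GCC pop_options
```

Code after the pop_options line is compiled with the project's normal settings again, which is what makes it possible to change one function without touching the rest of the sketch.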
I also measured with an ISB added after the DSB:
-OX | Pulse Width (µs)
-O0 | 7.04
-O1 | 5.74
-O2 | 17.00
-O3 | 17.22
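The per-line scheme measured in the two tables above could look roughly like the sketch below. This is my reconstruction, not the code from the first post: the names packCmplxPerLine, cleanInvalidateLine, and lineOps are made up for illustration, the buffer is assumed to be 32-byte aligned, and I am assuming the SCB_CACHE_DCCIMVAC register macro from the Teensy 4 core. The non-Teensy branch is only a counting stub so the loop structure can be checked off-target.

```cpp
#include <cstdint>
#include <cstddef>

#if defined(__IMXRT1062__)
#include <Arduino.h>
// Clean and invalidate the cache line containing p, then wait for completion.
// Assumes the SCB_CACHE_DCCIMVAC macro provided by the Teensy 4 core.
static inline void cleanInvalidateLine(const void *p)
{
    SCB_CACHE_DCCIMVAC = (uint32_t)p;
    asm volatile("dsb" ::: "memory");
}
#else
// Host stub (hypothetical): just counts how many line operations were issued.
static unsigned lineOps = 0;
static inline void cleanInvalidateLine(const void *) { ++lineOps; }
#endif

static const size_t kLineBytes = 32;  // Cortex-M7 data cache line size

// Assumes pCmplx is 32-byte aligned and the word count is even.
static void packCmplxPerLine(int32_t *pCmplx, const int32_t *pStop,
                             const int16_t *re, const int16_t *im)
{
    while (pCmplx < pStop)
    {
        int32_t *line = pCmplx;
        // Fill one line: 8 words, i.e. 4 interleaved re/im pairs.
        for (size_t i = 0; i < kLineBytes / sizeof(int32_t) && pCmplx < pStop; i += 2)
        {
            *pCmplx++ = (*re++) * 256;  // same result as << 8
            *pCmplx++ = (*im++) * 256;
        }
        cleanInvalidateLine(line);      // one maintenance op + DSB per 32 bytes
    }
}
```

If the two 512-byte inputs are int16_t arrays (256 samples each), the output is 512 words, or 2048 bytes, so this issues 64 line operations and 64 barriers per call.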
I then removed cache handling from packCmplx() and added a call to arm_dcache_flush_delete() in the ISR that calls packCmplx(), allowing it to be reduced to:
Code:
#pragma GCC optimize ("OX")
inline void packCmplx(int32_t *pCmplx, const int32_t *pStop, const int16_t *re, const int16_t *im)
{
    while (pCmplx < pStop)
    {
        *pCmplx++ = (*re++) << 8;
        *pCmplx++ = (*im++) << 8;
    }
}
This consistently gave times around 2.20 µs regardless of optimization level.
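As a rough sketch of that arrangement: pack the whole buffer first, then do a single arm_dcache_flush_delete() over it. The handler name onBlockReady, the buffer names, and the 256-sample size are assumptions of mine for illustration, not the actual ISR. The non-Teensy branch is a stub that only records the call so the flow can be checked on a PC.

```cpp
#include <cstdint>
#include <cstddef>

#if defined(__IMXRT1062__)
#include <Arduino.h>  // provides arm_dcache_flush_delete() on Teensy 4
#else
// Host stub (hypothetical): records calls instead of touching a cache.
static unsigned flushCalls = 0;
static uint32_t flushedBytes = 0;
static void arm_dcache_flush_delete(void *, uint32_t size)
{
    ++flushCalls;
    flushedBytes += size;
}
#endif

static const size_t kSamples = 256;  // assumed: 512 bytes of int16_t per input
alignas(32) static int32_t cmplxBuf[2 * kSamples];
static int16_t reBuf[kSamples];
static int16_t imBuf[kSamples];

static inline void packCmplx(int32_t *pCmplx, const int32_t *pStop,
                             const int16_t *re, const int16_t *im)
{
    while (pCmplx < pStop)
    {
        *pCmplx++ = (*re++) * 256;  // same result as << 8
        *pCmplx++ = (*im++) * 256;
    }
}

// Hypothetical ISR body: pack everything, then one cache operation at the end.
void onBlockReady()
{
    packCmplx(cmplxBuf, cmplxBuf + 2 * kSamples, reBuf, imBuf);
    arm_dcache_flush_delete(cmplxBuf, sizeof(cmplxBuf));
}
```

With the cache call moved out of the packing loop, the loop contains no barriers, which fits with the timing no longer depending on the optimization level.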
Just for fun, I wanted to see what kind of performance I'd get without doing any cache maintenance. I used SCB_DisableDCache() from core_cm7.h (requires slight modification) to turn off dcache and measured these times using the same function as above:
-OX | Pulse Width (µs)
-O0 | 10.69
-O1 | 8.75
-O2 | 8.95
-O3 | 8.96