
Thread: Cache handling - DMA transfers

  1. #1
    Junior Member
    Join Date
    Feb 2021
    Posts
    3

    Cache handling - DMA transfers

    I have a function that interleaves two datasets into a buffer that gets transmitted via DMA.

    Code:
    inline void packCmplx(int32_t *pCmplx, const int32_t *pStop, const int16_t *re, const int16_t *im)
    {
    	do
    	{
    		SCB_CACHE_DCCIMVAC = (uint32_t)pCmplx;
    		__asm__ volatile("dsb");         
     
    		*pCmplx++ = (*re++) << 8;
    		*pCmplx++ = (*im++) << 8;
    
    		*pCmplx++ = (*re++) << 8;
    		*pCmplx++ = (*im++) << 8;
    
    		*pCmplx++ = (*re++) << 8;
    		*pCmplx++ = (*im++) << 8;
    		
    		*pCmplx++ = (*re++) << 8;
    		*pCmplx++ = (*im++) << 8;
    	} while (pCmplx < pStop);
    }
    This seems to work correctly, but I'm not sure why, or whether it's the best approach.

    Shouldn't an ISB instruction follow the DSB, since the store address is written to DCCIMVAC inside the loop? Would a single call to arm_dcache_flush_delete() after interleaving offer any advantage?

  2. #2
    Junior Member
    Join Date
    Feb 2021
    Posts
    3
    Did a little experimentation and used a scope and digitalWriteFast to come up with timings for various cache maintenance operations.

    The approach given in the first post, where a write to DCCIMVAC followed by a DSB is performed every 32 bytes, varies quite a bit in performance depending on the optimization level.

    I varied the optimization level locally for packCmplx() with the GCC push_options, pop_options, and optimize pragmas:
    Code:
    #pragma GCC optimize ("OX")
    inline void packCmplx(int32_t *pCmplx, const int32_t *pStop, const int16_t *re, const int16_t *im)
    { ...
    with X set to the level under test, and am interleaving two 512-byte arrays into pCmplx (the array pointed to by SADDR).
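    For anyone repeating the measurement, here is a minimal sketch of how the push/pop/optimize pragmas can wrap a single function. It assumes GCC; the name pack_cmplx_o1 and the choice of "O1" are only illustrative, and the multiply by 256 stands in for the << 8 so that negative samples avoid a left shift of a negative value. The GCC docs describe the optimize pragma as a debugging aid, so it is probably worth checking the disassembly to confirm the level actually took effect, especially if the function gets inlined.

```c
#include <stdint.h>

#pragma GCC push_options            /* remember the command-line options */
#pragma GCC optimize ("O1")         /* example level; applies until pop_options */

/* Hypothetical name: the simple interleave loop, built at the level set above. */
static void pack_cmplx_o1(int32_t *dst, const int32_t *stop,
                          const int16_t *re, const int16_t *im)
{
    while (dst < stop) {
        *dst++ = (int32_t)(*re++) * 256;   /* same result as << 8 */
        *dst++ = (int32_t)(*im++) * 256;
    }
}

#pragma GCC pop_options             /* restore the command-line options */
```

    Functions defined after pop_options are compiled with the normal build flags again, so only the function under test changes between runs.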

    Level   Pulse width (µs)
    -O0      6.84
    -O1      5.51
    -O2     16.56
    -O3     16.79

    I also measured with an ISB added after DSB

    Level   Pulse width (µs)
    -O0      7.04
    -O1      5.74
    -O2     17.00
    -O3     17.22
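
    To put the in-loop numbers in context, here is a rough sketch of the arithmetic, assuming the two 512-byte inputs are int16_t arrays (256 samples each) and the Cortex-M7 D-cache line is 32 bytes. The helper names are made up for illustration. Under those assumptions the output is 2048 bytes, i.e. 64 cache lines if the buffer is 32-byte aligned, so the in-loop version issues 64 DCCIMVAC writes and 64 DSBs per call.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_BYTES 32u   /* Cortex-M7 data cache line size */

/* Hypothetical helper: number of cache lines touched by [addr, addr + size). */
static size_t dcache_lines_spanned(uintptr_t addr, size_t size)
{
    if (size == 0) {
        return 0;
    }
    uintptr_t mask  = ~(uintptr_t)(DCACHE_LINE_BYTES - 1u);
    uintptr_t first = addr & mask;                 /* line holding the first byte */
    uintptr_t last  = (addr + size - 1u) & mask;   /* line holding the last byte  */
    return (size_t)((last - first) / DCACHE_LINE_BYTES) + 1u;
}

/* Hypothetical helper: bytes of interleaved int32_t output produced from
   two int16_t inputs of in_bytes each. */
static size_t packed_bytes(size_t in_bytes)
{
    size_t samples = in_bytes / sizeof(int16_t);
    return 2u * samples * sizeof(int32_t);
}
```

    An unaligned buffer of the same size touches 65 lines, which is one reason to keep DMA buffers aligned to 32 bytes.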

    I then removed cache handling from packCmplx() and added a call to arm_dcache_flush_delete() in the ISR that calls packCmplx(), allowing it to be reduced to:
    Code:
    #pragma GCC optimize ("OX")
    inline void packCmplx(int32_t *pCmplx, const int32_t *pStop, const int16_t *re, const int16_t *im)
    {
    	while (pCmplx < pStop)
    	{
    		*pCmplx++ = (*re++) << 8;
    		*pCmplx++ = (*im++) << 8;
    	}
    }
    This consistently gave times of around 2.20 µs regardless of optimization level.
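
    Roughly, the calling side then looks like the sketch below: interleave first, then maintain the whole output range once, so the data is written back to RAM after the stores and before the DMA reads it. This is a host-compilable illustration, not the actual ISR. flush_range_stub and pack_and_flush are hypothetical names; the stub only counts lines, and on the Teensy the call at that spot would be arm_dcache_flush_delete() on the same address and size.

```c
#include <stddef.h>
#include <stdint.h>

static unsigned flushed_lines;   /* lines the stand-in would have maintained */

/* Hypothetical host stand-in for the range flush. On target this is where
   arm_dcache_flush_delete() would be called instead. */
static void flush_range_stub(void *addr, uint32_t size)
{
    uintptr_t line = (uintptr_t)addr & ~(uintptr_t)31u;  /* align down to 32 bytes */
    uintptr_t end  = (uintptr_t)addr + size;
    while (line < end) {
        flushed_lines++;          /* one DCCIMVAC write per line on real hardware */
        line += 32u;
    }
}

/* Interleave re/im into dst; multiply by 256 gives the same result as << 8. */
static void pack_cmplx(int32_t *dst, const int32_t *stop,
                       const int16_t *re, const int16_t *im)
{
    while (dst < stop) {
        *dst++ = (int32_t)(*re++) * 256;
        *dst++ = (int32_t)(*im++) * 256;
    }
}

/* Hypothetical wrapper: pack n_pairs samples, then flush the range once. */
static void pack_and_flush(int32_t *dst, size_t n_pairs,
                           const int16_t *re, const int16_t *im)
{
    pack_cmplx(dst, dst + 2u * n_pairs, re, im);
    flush_range_stub(dst, (uint32_t)(2u * n_pairs * sizeof(int32_t)));
}
```

    The per-line register writes are still there, but the barriers move out of the inner loop, which may be part of why the timing stops depending on the optimization level.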



    Just for fun, I wanted to see what kind of performance I'd get with no cache maintenance at all. I used SCB_DisableDCache() from core_cm7.h (requires slight modification) to turn off the D-cache and measured these times using the same function as above:

    Level   Pulse width (µs)
    -O0     10.69
    -O1      8.75
    -O2      8.95
    -O3      8.96
