Thread: Partitioned Convolution and EXTMEM performance

  1. #1

    Partitioned Convolution and EXTMEM performance

    I am trying to modify bmillier's partitioned convolution filter to use EXTMEM arrays on Teensy 4.1 for storing its large internal buffers but I am running into severe performance problems with it.

    I can move the IR buffer and maskgen arrays to EXTMEM with no significant performance impact, basically because these are only used to compute the fmask content.

    However, any attempt to use EXTMEM for the fftout and fmask buffers fails once the buffers exceed 64K: it needs more CPU time than is available.

    Is there any way to improve EXTMEM performance? Is Dual SPI/QSPI enabled by default for EXTMEM?

    Is there some way to cache the contents of these buffers in fast main memory that would reduce/eliminate the use of EXTMEM on every read/write as part of the partitioned convolution algorithm? I don't understand the algorithm well enough to see how to introduce caching.

  2. #2
    Senior Member DD4WH's Avatar
    Join Date
    Oct 2015
    Location
    Central Europe
    Posts
    676
    Hi Dirk,

    When I changed the code to work with the T4.1, I also used two PSRAM chips and checked whether that would work.

    It does not work!

    The reason is not CPU power, of which the T4 has more than enough, but the speed of the PSRAM, which is way too slow!

    If you look at the code for the partitioned convolution filtering, it has an inner loop that needs many accesses to large EXTMEM (PSRAM) variables within a very short time.

    Sorry to say, but PSRAM for partitioned convolution filtering is much too slow.
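
    To illustrate why the inner loop is so memory-hungry, here is a minimal sketch of the frequency-domain multiply-accumulate step of a uniformly partitioned convolution. It is not the code from the linked post: the buffer names fftout and fmask are borrowed from this thread, while kBins, kPartitions, Spectrum and accumulate() are hypothetical names and sizes chosen only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sizes, chosen only for illustration.
constexpr std::size_t kBins = 256;       // complex bins per partition
constexpr std::size_t kPartitions = 4;   // number of impulse-response partitions

// One spectrum stored as interleaved complex floats: re, im, re, im, ...
using Spectrum = std::vector<float>;

// For each partition p, multiply the input spectrum that is p blocks old
// by the p-th impulse-response spectrum and sum the products into acc.
// Note that every call reads ALL of fftout and ALL of fmask once.
void accumulate(const std::vector<Spectrum>& fftout,
                const std::vector<Spectrum>& fmask,
                std::size_t newest,
                Spectrum& acc) {
    acc.assign(kBins * 2, 0.0f);
    for (std::size_t p = 0; p < kPartitions; ++p) {
        // Circular index of the input block that is p blocks old.
        const Spectrum& x = fftout[(newest + kPartitions - p) % kPartitions];
        const Spectrum& h = fmask[p];
        for (std::size_t i = 0; i < kBins; ++i) {
            const float xr = x[2 * i], xi = x[2 * i + 1];
            const float hr = h[2 * i], hi = h[2 * i + 1];
            acc[2 * i]     += xr * hr - xi * hi;  // real part of x * h
            acc[2 * i + 1] += xr * hi + xi * hr;  // imaginary part of x * h
        }
    }
}
```

    Because each output block touches every partition of both buffers, there is little temporal locality for a cache to exploit once the buffers are larger than the cache.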

    Here is the code I tried in May 2020:

    https://forum.pjrc.com/threads/60886...l=1#post239417
    Last edited by DD4WH; 09-25-2020 at 11:56 AM.

  3. #3
    Quote Originally Posted by DD4WH View Post
    Hi Dirk,
    Sorry to say, but PSRAM for partitioned convolution filtering is much too slow.

    Here is the code I tried in May 2020:

    https://forum.pjrc.com/threads/60886...l=1#post239417
    That's a real shame. I was hoping somebody would find the PSRAM is only running on 1 SPI bus or something similar and we could double/quadruple the speed with Dual/Quad SPI.

    Or we could modify the algorithm so it had greater locality or only needed to touch a limited region of the two buffers in one update() pass so you could cache parts of the buffers in fast RAM.

  4. #4
    Senior Member+ mjs513's Avatar
    Join Date
    Jul 2014
    Location
    New York
    Posts
    5,640
    The PSRAM runs at a default speed of 88 MHz, which is set up in the core's startup.c. However, if you want to try it, you can change this to run at its maximum speed of 132 MHz by adding the following in setup() before accessing the PSRAM:
    Code:
        // turn on clock  (TODO: increase clock speed later, slow & cautious for first release)
        CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_OFF);
        CCM_CBCMR = (CCM_CBCMR & ~(CCM_CBCMR_FLEXSPI2_PODF_MASK | CCM_CBCMR_FLEXSPI2_CLK_SEL_MASK))
            | CCM_CBCMR_FLEXSPI2_PODF(4) | CCM_CBCMR_FLEXSPI2_CLK_SEL(2);
        CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_ON);
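
    As a rough back-of-the-envelope check of what the two clock speeds mean, the sketch below computes the peak transfer rate of a 4-bit-wide bus and compares the time needed to stream a buffer against one audio block period. The function names are hypothetical, the model ignores all command, address and wait-cycle overhead, and the block size of 128 samples at 44.1 kHz is assumed from the Teensy Audio Library defaults.

```cpp
// Peak bytes per second on a bus that moves 4 bits per clock.
constexpr double peakBytesPerSecond(double clockHz) {
    return clockHz * 4.0 / 8.0;
}

// Microseconds needed to stream `bytes` at the peak rate (no overhead modelled).
constexpr double streamMicros(double bytes, double clockHz) {
    return bytes / peakBytesPerSecond(clockHz) * 1e6;
}

// Duration of one audio block in microseconds.
constexpr double blockMicros(double samples, double sampleRateHz) {
    return samples / sampleRateHz * 1e6;
}
```

    Under these assumptions the peak rate is about 44 MB/s at 88 MHz and 66 MB/s at 132 MHz, and one block lasts about 2.9 ms. Reading two 64 KiB buffers once per block (128 KiB in total, an illustrative figure) would take roughly 3.0 ms at 88 MHz and 2.0 ms at 132 MHz, which leaves little or no headroom for the FFTs themselves.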

  5. #5
    Senior Member
    Join Date
    May 2015
    Location
    USA
    Posts
    649
    I just tried it and it worked well. Much faster and passed all tests. But probably not fast enough for Dirk.

  6. #6
    I tried it as well in both the memory test program and my partitioned convolution code.
    It's faster, but not fast enough, unfortunately.
    Are we sure it's using multiple SPI 'lanes' for each chip?

  7. #7
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    22,767
    Quote Originally Posted by dirkenstein View Post
    Are we sure it's using multiple SPI 'lanes' for each chip?
    Yes, very sure it's really using 4 bit mode.

    We're also using the PSRAM's fastest (not default) mode where the initial command is transmitted with all 4 bits.
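
    To put a number on the per-transaction overhead that remains even in this mode, here is a small sketch that counts clock cycles for one read burst when command, address and data all travel on 4 lines. The 8-bit command and 24-bit address are the usual serial-memory framing; the wait-cycle count is passed in as a parameter because the right value depends on the chip and read command, so treat any number you plug in as an assumption to check against the datasheet. All names are hypothetical.

```cpp
// Clocks spent on framing when 4 bits move per clock.
constexpr unsigned kCommandClocks = 8 / 4;   // 8-bit command
constexpr unsigned kAddressClocks = 24 / 4;  // 24-bit address

// Total clocks for one read burst of dataBytes bytes.
constexpr unsigned burstClocks(unsigned dataBytes, unsigned waitClocks) {
    return kCommandClocks + kAddressClocks + waitClocks + dataBytes * 2;
}

// Fraction of the burst's clocks that actually carry data.
constexpr double burstEfficiency(unsigned dataBytes, unsigned waitClocks) {
    return static_cast<double>(dataBytes * 2) / burstClocks(dataBytes, waitClocks);
}
```

    For example, with an assumed 6 wait clocks, a 32-byte burst (one Cortex-M7 cache line) costs 78 clocks of which 64 carry data, while a lone 4-byte access costs 22 clocks for 8 clocks of data. Short, scattered accesses therefore fall well below the peak rate.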

  8. #8
    Pity, I was hoping there was a 'magic' way to make it work.
    I really liked the idea of convolution reverb in a tiny box.
    As a cab sim it sounds really good with some quality IR wav files.
    Reorganising the algorithm for more locality and caching doesn't look feasible given that adding a simple memcpy to a buffer nearly doubles the CPU usage even without using EXTMEM.

  9. #9
    Senior Member
    Join Date
    May 2015
    Location
    USA
    Posts
    649
    You can review your sampling rate and your need for low frequency resolution to use less memory. Or go to something like a Raspberry Pi Zero.
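
    A quick way to see how sample rate and impulse-response length drive memory use is to count partitions. The sketch below assumes a partition of 128 samples, an FFT twice the partition length, and complex float32 bins; these sizes and all function names are illustrative assumptions, not values taken from the filter code.

```cpp
#include <cstddef>

// Impulse-response length in samples for a given duration and sample rate.
constexpr std::size_t irSamples(double seconds, double sampleRateHz) {
    return static_cast<std::size_t>(seconds * sampleRateHz);
}

// Number of partitions needed to cover the impulse response (rounded up).
constexpr std::size_t partitionsFor(std::size_t samples, std::size_t partitionSamples) {
    return (samples + partitionSamples - 1) / partitionSamples;
}

// Bytes for one frequency-domain buffer: per partition, an FFT of
// 2 * partitionSamples complex values, each made of two floats.
constexpr std::size_t bufferBytes(std::size_t partitions, std::size_t partitionSamples) {
    return partitions * (2 * partitionSamples) * 2 * sizeof(float);
}
```

    With these assumptions a 0.5 s impulse response at 44.1 kHz needs 173 partitions, while the same response at 22.05 kHz needs 87, so halving the sample rate roughly halves each buffer.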

  10. #10
    Junior Member
    Join Date
    Oct 2020
    Posts
    2

    dumb question if 2 spi RAM chips

    Quote Originally Posted by PaulStoffregen View Post
    Yes, very sure it's really using 4 bit mode.

    We're also using the PSRAM's fastest (not default) mode where the initial command is transmitted with all 4 bits.
    So is that 4 bits per chip? Are the two chips in parallel or serial? I assume you would write 3 bytes for an address, then what looks like 1 or 2 bytes of data, in 12 clock cycles according to the datasheet for the SPI SRAM.

    I am curious: is this mapped like some of the old-school 16-bit data bus designs, writing to both chips at the same time, or does it write to one chip or the other depending on where in the 16 MB we are writing?

    Thanks.

  11. #11
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,412
    The connection lines are common/parallel between the two chips as the processor has just those lines for the QSPI access connection exposed. Only the Chip Select is unique for the two chips.

    Those lines and processor support 'generally' allow hardware control for direct address mapping I/O for Flash read and PSRAM read/write - with helper code needed for Flash write. But the processor controls the CS pins based on address.
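
    As a mental model of chip select by address, the sketch below maps an address to a chip index. On real hardware this decode is done by the processor's memory controller, not by user code. The base address 0x70000000, the 8 MiB chip size and all names here are assumptions for illustration.

```cpp
#include <cstdint>

// Assumed memory map: external RAM window starts here, two 8 MiB chips back to back.
constexpr std::uint32_t kExtmemBase = 0x70000000u;
constexpr std::uint32_t kChipBytes  = 8u * 1024u * 1024u;
constexpr std::uint32_t kChipCount  = 2u;

// Returns 0 or 1 for the chip whose select line would be asserted,
// or -1 if the address lies outside the external RAM window.
constexpr int chipFor(std::uint32_t address) {
    return (address < kExtmemBase ||
            address - kExtmemBase >= kChipBytes * kChipCount)
               ? -1
               : static_cast<int>((address - kExtmemBase) / kChipBytes);
}
```

    In this model the two chips never respond to the same access: the lower 8 MiB goes to one chip and the upper 8 MiB to the other, so a second chip adds capacity but not bandwidth.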

  12. #12
    Junior Member
    Join Date
    Oct 2020
    Posts
    2

    makes sense

    Quote Originally Posted by defragster View Post
    The connection lines are common/parallel between the two chips as the processor has just those lines for the QSPI access connection exposed. Only the Chip Select is unique for the two chips.

    Those lines and processor support 'generally' allow hardware control for direct address mapping I/O for Flash read and PSRAM read/write - with helper code needed for Flash write. But the processor controls the CS pins based on address.
    Thank you for clarifying that. I thought that was the case. That is really helpful.

  13. #13
    Senior Member
    Join Date
    Jul 2020
    Posts
    174
    Is there a way to connect some faster external RAM that would use the DMA controller? 12 address lines would get you to 8MB, and then you could use 8 or 16 lines to transfer data.

    Alternatively, is there some other QSPI memory that can run faster?
