Partitioned Convolution and EXTMEM performance

Status
Not open for further replies.

dirkenstein

Active member
I am trying to modify bmillier's partitioned convolution filter to use EXTMEM arrays on Teensy 4.1 for storing its large internal buffers but I am running into severe performance problems with it.

I can move the IR buffer and maskgen arrays to EXTMEM with no significant performance impact, basically because these are only used to compute the fmask content.

However, any attempt to use EXTMEM for the fftout and fmask buffers fails once the buffers exceed 64K: the filter then needs more CPU time than is available.

Is there any way to improve EXTMEM performance? Is Dual SPI/QSPI enabled by default for EXTMEM?

Is there some way to cache the contents of these buffers in fast main memory that would reduce/eliminate the use of EXTMEM on every read/write as part of the partitioned convolution algorithm? I don't understand the algorithm well enough to see how to introduce caching.
 
Hi Dirk,

When I changed the code to work with the T4.1, I also fitted two PSRAM chips and checked whether that would work.

It does not work!

The reason is not CPU power, of which the T4 has more than enough, but the speed of the PSRAM, which is way too slow!

If you look at the code for the partitioned convolution filtering, its inner loop has to make many accesses to the large EXTMEM (PSRAM) variables within a very short time.
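
To make the access pattern concrete, here is a minimal sketch of the kind of multiply-accumulate loop a uniformly partitioned convolution runs for every audio block. It is not the code from the library: the function name accumulatePartitions, the 512-float partition layout (256 complex bins, interleaved re/im) and the ring-buffer indexing are assumptions for illustration. The point it shows is that each audio block reads every partition of both fftout and fmask.

```cpp
#include <cstddef>
#include <vector>

// Assumed layout: 256 complex bins per partition, stored as interleaved
// re/im floats, i.e. 512 floats per partition.
constexpr std::size_t kFloatsPerPartition = 512;

// Multiply every stored input spectrum (fftout) with the matching filter
// spectrum (fmask) and sum the products into acc (kFloatsPerPartition floats).
// Returns the number of bytes read from the two big buffers for one block.
inline std::size_t accumulatePartitions(const std::vector<float>& fftout,
                                        const std::vector<float>& fmask,
                                        std::size_t numPartitions,
                                        std::size_t newest,
                                        float* acc) {
    for (std::size_t i = 0; i < kFloatsPerPartition; ++i) acc[i] = 0.0f;

    std::size_t bytesRead = 0;
    for (std::size_t j = 0; j < numPartitions; ++j) {
        // Walk backwards through the ring of past input spectra.
        const std::size_t k = (newest + numPartitions - j) % numPartitions;
        const float* x = &fftout[k * kFloatsPerPartition];
        const float* h = &fmask[j * kFloatsPerPartition];
        for (std::size_t i = 0; i < kFloatsPerPartition; i += 2) {
            // Complex multiply-accumulate: acc += x * h
            acc[i]     += x[i] * h[i]     - x[i + 1] * h[i + 1];
            acc[i + 1] += x[i] * h[i + 1] + x[i + 1] * h[i];
        }
        bytesRead += 2 * kFloatsPerPartition * sizeof(float);
    }
    return bytesRead;
}
```

Under these assumptions there is no subset of partitions that can be skipped in a given block, so slow memory is hit on every pass.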

Sorry to say, but PSRAM for partitioned convolution filtering is much too slow.

Here is the code I tried in May 2020:

https://forum.pjrc.com/threads/60886-Teensyduino-1-52-Beta-6?p=239417&viewfull=1#post239417
 

That's a real shame. I was hoping somebody would find that the PSRAM is only running in single-lane SPI mode or something similar, so that we could double or quadruple the speed with Dual/Quad SPI.

Or we could modify the algorithm so it had greater locality, or only needed to touch a limited region of the two buffers in one update() pass, so you could cache parts of the buffers in fast RAM.
 
The PSRAM runs at a default speed of 88 MHz, which is set up in the core's startup.c. However, you can change this to run at its maximum speed of 132 MHz if you want to try it, by adding the following in setup() before accessing the PSRAM:
Code:
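	  // Intended to gate the FlexSPI2 clock off while it is reconfigured.
	  // Note: OR-ing in CCM_CCGR_OFF leaves the gate bits unchanged if that constant is 0.
	  CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_OFF);
	  // Select the clock source and divider that give the 132 MHz PSRAM clock
	  CCM_CBCMR = (CCM_CBCMR & ~(CCM_CBCMR_FLEXSPI2_PODF_MASK | CCM_CBCMR_FLEXSPI2_CLK_SEL_MASK))
		  | CCM_CBCMR_FLEXSPI2_PODF(4) | CCM_CBCMR_FLEXSPI2_CLK_SEL(2);
	  // Turn the FlexSPI2 clock (back) on
	  CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_ON);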
 
I just tried it and it worked well. Much faster and passed all tests. But probably not fast enough for Dirk.
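
As a rough plausibility check on "faster, but maybe not fast enough", the sketch below compares the theoretical peak rate of a quad-SPI bus with the data a partitioned convolution would have to stream per second. Everything here is an assumption rather than a measurement: quad mode (4 data lines), 512 floats per partition, 128-sample blocks, both fftout and fmask in EXTMEM and read in full once per block. The function names are made up for this example.

```cpp
// Theoretical peak of a quad-SPI bus in bytes per second: 4 data lines move
// 4 bits per clock, so one byte takes two clocks. Command and address
// overhead is ignored, so real throughput will be lower.
inline double quadSpiPeakBytesPerSec(double clockHz) {
    return clockHz * 4.0 / 8.0;
}

// Bytes read per second if every audio block streams all partitions of both
// buffers once. Writes of new spectra into fftout are not counted.
inline double requiredBytesPerSec(int numPartitions,
                                  int floatsPerPartition,
                                  double sampleRateHz,
                                  int blockSamples) {
    const double bytesPerBlock =
        2.0 * numPartitions * floatsPerPartition * 4.0;  // 4 bytes per float
    const double blocksPerSec = sampleRateHz / blockSamples;
    return bytesPerBlock * blocksPerSec;
}
```

For 32 partitions (64K per buffer under the assumed layout), 44.1 kHz and 128-sample blocks, requiredBytesPerSec gives about 45 MB/s, against a theoretical peak of 44 MB/s at 88 MHz and 66 MB/s at 132 MHz. The CPU data cache and bus overhead will shift the real numbers, so treat this only as an order-of-magnitude estimate.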
 
I tried it as well in both the memory test program and my partitioned convolution code.
It's faster, but not fast enough, unfortunately.
Are we sure it's using multiple SPI 'lanes' for each chip?
 
Pity, I was hoping there was a 'magic' way to make it work.
I really liked the idea of convolution reverb in a tiny box.
As a cab sim it sounds really good with some quality IR wav files.
Reorganising the algorithm for more locality and caching doesn't look feasible given that adding a simple memcpy to a buffer nearly doubles the CPU usage even without using EXTMEM.
 
You could revisit your sampling rate and how much low-frequency resolution you really need, in order to use less memory. Or go to something like a Raspberry Pi Zero.
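
To put numbers on that trade-off, here is a small sketch of how buffer size scales with impulse response length and sample rate. It assumes 128-sample partitions and 512 floats per partition in each of fftout and fmask, and that the block size stays at 128 when the sample rate changes; partitionsFor and bufferBytes are hypothetical helper names.

```cpp
#include <cstddef>

// Number of partitions needed to cover an impulse response of the given
// duration, rounding the tap count to the nearest sample and then up to a
// whole number of blocks.
inline std::size_t partitionsFor(double irSeconds,
                                 double sampleRateHz,
                                 std::size_t blockSamples = 128) {
    const std::size_t taps =
        static_cast<std::size_t>(irSeconds * sampleRateHz + 0.5);
    return (taps + blockSamples - 1) / blockSamples;
}

// Size in bytes of one of the two big buffers for that many partitions.
inline std::size_t bufferBytes(std::size_t partitions,
                               std::size_t floatsPerPartition = 512) {
    return partitions * floatsPerPartition * sizeof(float);
}
```

With these assumptions a 0.5 s impulse response needs 173 partitions (about 346 KB per buffer) at 44.1 kHz but only 87 partitions at 22.05 kHz, so halving the sample rate roughly halves both the memory and the per-block work.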
 
dumb question if 2 spi RAM chips

Yes, very sure it's really using 4 bit mode.

We're also using the PSRAM's fastest (not default) mode where the initial command is transmitted with all 4 bits.

So is that 4 bits per chip? Are the two chips in parallel or in series? I assume you would write 3 bytes for an address, followed by what looks like 1 or 2 bytes of data, in about 12 clock cycles according to the datasheet for the SPI SRAM.

I am curious: is this mapped like some of the old-school 16-bit data bus designs that write to both chips at the same time, or does it write to one chip or the other depending on where in the 16 MB we are writing?

Thanks.
 
The connection lines are shared (wired in parallel) between the two chips, as the processor exposes just that one set of lines for the QSPI connection. Only the Chip Select is unique to each chip.

Those lines, together with the processor's support, generally allow hardware-controlled, memory-mapped access for Flash reads and PSRAM reads/writes, with helper code needed for Flash writes. The processor drives the CS pins based on the address.
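
As a mental model of "CS chosen by address", the sketch below maps an address in the EXTMEM window to a chip index. This is only an illustration of what the hardware does on its own: the base address 0x70000000, the 8 MB size per chip and the name chipSelectFor are assumptions made for this example, and no such function is needed in a real sketch.

```cpp
#include <cstdint>

constexpr std::uint32_t kExtmemBase = 0x70000000u;         // assumed EXTMEM window base
constexpr std::uint32_t kChipBytes  = 8u * 1024u * 1024u;  // assumed 8 MB per PSRAM chip

// Returns 0 or 1 for the chip whose CS line would be asserted for this
// address, or -1 if the address lies outside the 16 MB window.
inline int chipSelectFor(std::uint32_t addr) {
    if (addr < kExtmemBase) return -1;
    const std::uint32_t offset = addr - kExtmemBase;
    if (offset >= 2u * kChipBytes) return -1;
    return static_cast<int>(offset / kChipBytes);
}
```

So the two chips are stacked end to end in the address space rather than interleaved: a given access goes to one chip only, and a second chip adds capacity but not bandwidth.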
 
makes sense

Thank you for clarifying that. I thought that was the case. That is really helpful.
 
Is there a way to connect some faster external RAM that would use the DMA controller? 12 address lines would get you to 8MB, and then you could use 8 or 16 lines to transfer data.

Alternatively, is there some other QSPI memory that can run faster?
 
I was able to cut the size of the fmask buffer in half - because it's symmetrical around the midpoint.

The slower RAM still didn't work with large impulse responses.
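
The symmetry mentioned above is presumably the conjugate symmetry of the spectrum of a real-valued signal: bin N-k is the complex conjugate of bin k, so only bins 0..N/2 need to be stored. The sketch below demonstrates this with a naive DFT. It is not the code used in the filter, and dftReal, packHalf and binFromHalf are hypothetical names.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cf = std::complex<float>;

// Naive O(N^2) DFT of a real signal, for demonstration only.
inline std::vector<cf> dftReal(const std::vector<float>& x) {
    const std::size_t n = x.size();
    const double twoPi = 6.283185307179586;
    std::vector<cf> X(n);
    for (std::size_t k = 0; k < n; ++k) {
        double re = 0.0, im = 0.0;
        for (std::size_t t = 0; t < n; ++t) {
            const double a =
                -twoPi * static_cast<double>(k * t) / static_cast<double>(n);
            re += static_cast<double>(x[t]) * std::cos(a);
            im += static_cast<double>(x[t]) * std::sin(a);
        }
        X[k] = cf(static_cast<float>(re), static_cast<float>(im));
    }
    return X;
}

// Keep only bins 0..N/2 (N/2 + 1 values instead of N).
inline std::vector<cf> packHalf(const std::vector<cf>& full) {
    return std::vector<cf>(full.begin(), full.begin() + full.size() / 2 + 1);
}

// Rebuild any bin k of the N-point spectrum from the stored half.
inline cf binFromHalf(const std::vector<cf>& half, std::size_t n, std::size_t k) {
    return (k <= n / 2) ? half[k] : std::conj(half[n - k]);
}
```

The cost is an extra index calculation and conjugation on every read of the upper half, which has to be weighed against the memory saved.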
 