DMAMEM : Long Delay during writing in ISR

Status
Not open for further replies.

geke

New member
The code is quite big and needs custom hardware, so I will first try to describe the problem. Hopefully somebody can point me in the right direction.

General Description:
I have an interrupt which runs every microsecond, and in fact must run every microsecond, because it reads data from a hardware ADC. In this interrupt I copy 4 bytes over to internal memory. In the main loop I copy the data from internal memory to PSRAM, to have more headroom when a network problem occurs (see below). When data is available in PSRAM, it is sent to the Ethernet stack. In internal memory and PSRAM I use a circular buffer organisation, but this should not affect the problem I am seeing.

Problem: When I use RAM1 as internal memory (allocated as a global: uint16_t buff[BUFF_CAPACITY]) everything works stably. The whole interrupt executes in ca. 300-400 ns.
But when I use RAM2 with DMAMEM (DMAMEM uint16_t buff[BUFF_CAPACITY]), copying the 4 bytes inside the interrupt sometimes takes up to 2-3 microseconds (normally around 150 ns). This happens periodically but not often: only a few times per second. Still, this is not acceptable, because in this situation the sync with the ADC is lost.
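For reference, the single-producer/single-consumer circular-buffer scheme described above can be sketched like this (a minimal host-runnable sketch; all names such as CircBuf, cb_push and cb_drain are hypothetical, not from the original code):

```cpp
#include <cstddef>
#include <cstdint>

// Minimal sketch of the two-stage buffering described above: the ISR is the
// single producer, the main loop the single consumer, so head/tail can be
// plain volatile indices without further locking.
constexpr size_t BUFF_CAPACITY = 4096;   // entries; must be even, we push 2 at a time

struct CircBuf {
    uint16_t data[BUFF_CAPACITY];
    volatile size_t head = 0;            // written only by the ISR
    volatile size_t tail = 0;            // written only by the main loop
};

// Called from the 1 MHz ISR: store one 4-byte ADC sample as two uint16_t.
// head is always even, so data[head + 1] never runs off the end.
inline bool cb_push(CircBuf &cb, uint32_t adc_data) {
    size_t next = (cb.head + 2) % BUFF_CAPACITY;
    if (next == cb.tail) return false;   // buffer full, sample dropped
    cb.data[cb.head]     = (uint16_t)(adc_data & 0xFFFF);
    cb.data[cb.head + 1] = (uint16_t)(adc_data >> 16);
    cb.head = next;
    return true;
}

// Called from the main loop: drain everything currently buffered
// (e.g. on toward PSRAM or the network stack).
size_t cb_drain(CircBuf &cb, uint16_t *out, size_t max_out) {
    size_t n = 0;
    while (cb.tail != cb.head && n < max_out) {
        out[n++] = cb.data[cb.tail];
        cb.tail = (cb.tail + 1) % BUFF_CAPACITY;
    }
    return n;
}
```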



To be honest, I am not very experienced in embedded topics. I was reading through the thread "T4.0 Memory - trying to make sense of the different regions". There was a hint that write buffering may be a problem with DMAMEM. I tested a modification to startup.c which should disable buffering, but it did not change anything.


Code:
In startup.c I changed

SCB_MPU_RBAR = 0x20200000 | REGION(i++); // RAM (AXI bus)
SCB_MPU_RASR = MEM_CACHE_WBWA | READWRITE | NOEXEC | SIZE_1M;

to

SCB_MPU_RBAR = 0x20200000 | REGION(i++); // RAM (AXI bus)
SCB_MPU_RASR = MEM_CACHE_WT | READWRITE | NOEXEC | SIZE_1M;


Can somebody with more Teensy experience give me a hint: What can be the reason for the high writing times to DMAMEM inside the interrupt? Buffering? Memory organisation? And how to avoid it?


Here is an output of Frank Boesing's memory logging functions:

Code:
Free Stack: 330980
Free Heap: 12000
FLASH:  97320  1.20% of 7936kB (8029144 Bytes free) FLASHMEM, PROGMEM
ITCM:   80200 81.58% of   96kB (  18104 Bytes free) (RAM1) FASTRUN
PSRAM: 16 MB
OCRAM:
   524288 Bytes (512 kB)
-  413824 Bytes (404 kB) DMAMEM
-   98664 Bytes (96 kB) Heap
    11800 Bytes heap free (11 kB), 512488 Bytes OCRAM in use (500 kB).
DTCM:
   425984 Bytes (416 kB)
-   94912 Bytes (92 kB) global variables
-    1332 Bytes (1 kB) max. stack so far
=========

   329740 Bytes free (322 kB), 96244 Bytes in use (93 kB).
Maximal stack usage: 1332
 
The data cache also covers the PSRAM. Perhaps it blocks when the cache flushes lines out to PSRAM.

The ADC library can do DMA on a timer : ...\hardware\teensy\avr\libraries\ADC\examples\adc_timer_dma\adc_timer_dma.ino

That would be more efficient than 1M _isr()'s/sec, and may run on an alternate bus not blocked by the PSRAM update?
 
Sort of hard to help with no real clues.

So what is using DMA? You mention an interrupt and copying 4 bytes. Where is DMA actually used? Are you using DMA to write to that memory, or to read from it?

Again, I am not sure whether changing MEM_CACHE_WBWA to MEM_CACHE_WT is going to help you or hurt you. It may force the caching system to do things then and there. Now of course, if what you always do is write to that memory and then do a DMA output of it... But again, it is hard to know what exactly is going on, or what the difference is.
 
Short on some details - is DMAMEM just used as a larger buffer than RAM1 before pushing to PSRAM? Using 1M interrupts/sec - and no DMA noted - yet.

p#2 Wondered if adding the DMA driven ADC read would bypass the bus contention?
 
@defragster - Yes, again not sure what he means when he says
because it reads data from a hardware ADC
Is this saying it is using the ADC on the T4.x, or is it some external hardware ADC that communicates over... SPI? Wire? Multiple IO pins?

Again, if you are using one or both ADC units on the T4, then you should be able to use the ADC library and our example sketch, which uses a ring buffer and reads in data using DMA with a timer that controls when the ADC does the read...
 
Not given code... I assumed it was an _isr() doing an analogRead()?

Interesting that RAM1 can feed PSRAM, but RAM2/DMAMEM to PSRAM stalls - with DMAMEM at 1/4 speed and on a different bus?
 
Here is some more information to make it clearer.
I do not use the Teensy ADC hardware. I use an external hardware ADC which is connected via multiple pins. The external ADC is clocked by a FlexPWM timer. The readout interrupt is also connected to this FlexPWM timer. In this readout interrupt the ADC result is transferred over SPI and then just copied to internal memory.

Code:
// write one 32-bit ADC sample into the circular buffer at offset data_pos
*( (uint32_t*)(&(cb->data_ptr)[data_pos]) ) = adc_data;

In this copy operation I get the blocking delay when the destination memory is allocated in RAM2/DMAMEM.

The runtime of this copy operation is approximately:
min: 150 ns, avg: 150 ns, max: 2500 ns
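For what it's worth, min/avg/max figures like these can be collected with a small helper (a hypothetical sketch, not the original measurement code; on the Teensy one would read ARM_DWT_CYCCNT around the store, here std::chrono stands in so the aggregation can be shown on a host):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Aggregates per-operation latencies into min/avg/max, like the figures above.
struct LatencyStats {
    uint64_t min_ns = UINT64_MAX, max_ns = 0, sum_ns = 0, count = 0;
    void record(uint64_t ns) {
        min_ns = std::min(min_ns, ns);
        max_ns = std::max(max_ns, ns);
        sum_ns += ns;
        ++count;
    }
    uint64_t avg_ns() const { return count ? sum_ns / count : 0; }
};

// Time one operation and record it, e.g. wrapped around the
// *( (uint32_t*)... ) = adc_data; store inside the ISR.
template <typename F>
void time_once(LatencyStats &stats, F &&op) {
    auto t0 = std::chrono::steady_clock::now();
    op();
    auto t1 = std::chrono::steady_clock::now();
    stats.record((uint64_t)std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
}
```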

The delay does not happen when:
- The destination memory is allocated in RAM1
- The subsequent buffering in PSRAM is skipped (the data is written directly from DMAMEM to the network stack)

In short: I do not use the Teensy ADC units. I also do not use async DMA operations (that would not be possible, because the data must be copied immediately).

Evidently DMAMEM access and PSRAM access somehow influence each other. Is it possible to disable the DMA controller on DMAMEM to make it behave like RAM1? The read/write time may be a little slower, but it must be constant in time.
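One experiment in that direction, assuming the macros already defined in configure_cache() in startup.c: mark the RAM2 region non-cacheable instead of write-back/write-allocate, so every access goes straight to the AXI RAM - slower, but without cache-maintenance traffic shared with PSRAM. An untested sketch, not a confirmed fix:

```cpp
// Hypothetical variant of the startup.c change quoted earlier: replace the
// cache policy for the RAM2/OCRAM region with MEM_NOCACHE so the 32KB data
// cache (shared with PSRAM) is no longer involved for this region at all.
SCB_MPU_RBAR = 0x20200000 | REGION(i++); // RAM (AXI bus)
SCB_MPU_RASR = MEM_NOCACHE | READWRITE | NOEXEC | SIZE_1M;
```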
 
DMAMEM is just the name for RAM2, AFAIK, which is ideally suited for DMA transactions. The heap is also stored there. It doesn't have any DMA activity in normal use as RAM2.

As noted in the T_4.0 memory doc on PJRC.com, it is not TCM memory running at CPU speed like the 512KB of RAM1, but runs at CPU/4 for transfers, though it is covered by the full-speed 32KB data cache that is shared with PSRAM when used.

The PSRAM is 4-bit QSPI mapped memory - the processor handles the access natively - but there are only so many unique bus paths, and using RAM2, which shares the data cache, must end up causing some change in I/O to the PSRAM.

What happens if the external ADC is read to RAM2 in the _isr() and then moved to RAM1 for transfer to PSRAM?
 
Not likely - could be tested by setting the timer interrupt priority highest and checking for a difference.

AFAIK the only regular interrupt is the timer tick at 1 kHz.

Other notes above - but wondering what happens if the _isr() just wrote the value directly to PSRAM? It would hit the cache, and ideally that would resolve in the time between interrupts - which p#1 suggests is keeping the CPU 30+% busy already. Moving the data a second time through RAM2/DMAMEM only clutters/splits the cache and adds overhead.
 
Yes, but the OP has a 400 ns interrupt that runs at 1 MHz. The probability of a TimerTick hitting that memory write during the microsecond in which the TimerTick fires is 150 ns out of every 1000 ns (1 MHz), or about one in 6 - so roughly every 6th TimerTick overlaps a memory save, and that happens about every 6 ms.

Now this probabilistic argument doesn't quite work if the TimerTicks and the ADC handling interrupts are phase-locked. But it shows that the possibility of other interrupts disturbing the memory write is not negligible.
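The arithmetic behind that estimate, spelled out (all inputs are figures quoted in the thread):

```cpp
// Back-of-the-envelope collision estimate: a 150 ns store inside a 1 MHz
// ISR, versus a 1 kHz systick landing at a random phase.
constexpr double store_ns      = 150.0;   // duration of the timed store
constexpr double isr_period_ns = 1000.0;  // 1 MHz ADC interrupt period
constexpr double tick_hz       = 1000.0;  // systick rate

// Chance that a given systick lands inside the store window of the ISR
// running in that microsecond:
constexpr double overlap_prob = store_ns / isr_period_ns;   // 0.15, ~1 in 6.7

// Expected interval between systicks that overlap a store:
constexpr double ms_between_overlaps = 1000.0 / (tick_hz * overlap_prob); // ~6.7 ms
```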
 

Yeah - but that single interrupt takes maybe 5-8 CPU cycles - plus _isr() transfer overhead:
Code:
extern "C" void systick_isr(void)
{
	systick_cycle_count = ARM_DWT_CYCCNT;
	systick_millis_count++;
}

At 600M cycles per second that isn't very long. And it happens "only a few times per second" (OP) - not noted is whether that is 5, 50 or 500 times per second - but it would add ~10 ns, not milliseconds.

A more likely guess is p#10: the QSPI overhead when the data (cache) is pushed to the 89 MHz chip, with addressing and then 4 bits at a time for some number of bytes. And the transfer to RAM2 splitting the cache means the cache needs to be flushed to CPU/4-speed memory at the same time when it gets overrun. 1 million 4-byte writes per second is 4 MB/s, and moving from RAM1 to RAM2 and then PSRAM triples that to 12 MB/s - with 2 copies flushing through the 32KB cache rather than one.
 
Ahh, and cache is supposed to improve performance; perhaps only the PSRAM should be cached in this situation.
 
Cache does help, but it seems the current code is overusing/abusing it - then it takes time later to resolve the cache to real memory; it is just a full-speed buffer. In fact, without PSRAM in place, the cache works to give false success using parts of that region - until the 32KB runs out.

The 1062 gets throttled, AFAIK, near 5-7M interrupts/sec even if they are short. As noted, the _isr() runs 1M/sec at 300+ ns, which is a minimum of 30% CPU time to start.

32KB isn't large - at 4 MB/s it needs to be flushed ~125 times per second (twice that if the CPU splits it over PSRAM and RAM2). So cutting out RAM2 removes a middleman, and the OP suggests that works; with the cache, skipping the RAM1 copy and moving directly to PSRAM should be even better - and no worse... ideally.
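The bandwidth figures used above, spelled out:

```cpp
// Cache-pressure estimate from the thread's own numbers: 1M samples/sec at
// 4 bytes each, optionally tripled by the RAM1 -> RAM2 -> PSRAM double
// buffering, pushed through a 32KB data cache.
constexpr unsigned long sample_rate  = 1000000;    // samples per second
constexpr unsigned long sample_bytes = 4;
constexpr unsigned long cache_bytes  = 32 * 1024;  // 32KB D-cache

constexpr unsigned long bytes_per_sec = sample_rate * sample_bytes;        // 4 MB/s
constexpr unsigned long one_copy_fills   = bytes_per_sec / cache_bytes;       // ~122/s
constexpr unsigned long three_copy_fills = 3 * bytes_per_sec / cache_bytes;   // ~366/s
```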
 
Also, not having seen the code or the ADC hardware in use - is it possible the Teensy is requesting the ADC data and then waiting in that 300+ ns _isr() for the data to be ready?

If possible, the _isr() could read the ADC data on entry (after the first entry - or if requested the first time in setup()), then request the ADC data on exit of the _isr() so it is ready on entry to the _isr() 1 µs later. If the ADC can function properly like that and the wait for data is long, it might return usable time from the _isr() to the loop() code.
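A minimal host-runnable sketch of that request-on-exit/read-on-entry pipeline (spi_request() and spi_read_result() are hypothetical stand-ins for the real SPI transfer; here they only log so the ordering can be checked):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real SPI readout of the external ADC.
// They log each call so the pipelined ordering is visible on a host.
static std::vector<std::string> log_;
static uint32_t fake_result = 0;

void spi_request()         { log_.push_back("request"); ++fake_result; }
uint32_t spi_read_result() { log_.push_back("read");    return fake_result; }

void setup_adc() {
    spi_request();                       // prime the pipeline once in setup()
}

// Runs every microsecond: the data read on entry was requested one period
// earlier, so the ISR never waits for the conversion/transfer itself.
uint32_t adc_isr() {
    uint32_t sample = spi_read_result(); // ready thanks to the previous request
    // ... store sample into the circular buffer here ...
    spi_request();                       // kick off the next readout on exit
    return sample;
}
```

Whether this works depends on the ADC tolerating a readout started a full period before the result is consumed; that detail is not in the thread.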

BTW: that systick code was put in a sketch, and calling it 1 million times takes: 1M Calls took 4000286 cycles.
> The compiler is inlining it (so no function call), and this won't account for the interrupt register save and call/return time:
1M Calls took 4000286 cycles
The 1062 processor can execute two instructions at once; in this case it seems to read and store the cycle counter and read and increment the running millis count in parallel.
Compiling it 'debug' removes the inlining, adds a call, and pushes it to:
1M Calls took 12000701 cycles
 
As all have mentioned the devil is in the details here.

As mentioned, you can maybe turn off the caching and maybe it will help (or not). It could be that the underlying hardware address/data buses are getting clobbered, and with DTCM memory versus DMAMEM the data moves over two different buses and so avoids some issues.

But again, I still wonder about things like: what does the ISR that reads the external ADC really do? Can you avoid all of those interrupts?

Example: if you are reading in several IO pins, how are you doing it? Are you doing something like reading GPIO6_
Then maybe you can convert over to doing GPIO reads using DMA, like I have experimented with in the OV7670 code base. I am not sure if the DMA trigger would be through the clock, or if the analog device has an IO pin saying its data is ready that would drive the reads, or...

What I am trying to say is: we may be looking at trying to resolve one particular approach (with many details missing), when the solution may be to avoid these issues by taking a different approach.
 