Numeric Instability, FPU context issue?

avsteele

Member
I've been building a digital lock-in amplifier. I am seeing very regular instability in the output of calls to e.g. arm_biquad_cascade_df2T_f32. This occurs approximately every 1 million calls (i.e. ADC samples collected). The magnitude of the instability appears to be related to the presence of seemingly unrelated floating point operations elsewhere in the code.

I have two ISRs which run when a digital line indicates a sample is ready or a SYNC signal is detected. These do not have any floating point math in them. One merely records ARM_DWT_CYCCNT, and the other collects an ADC sample by reading GPIO6_PSR and timestamps it by again recording ARM_DWT_CYCCNT.
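
Roughly, the two ISRs look like this (a simplified sketch; the variable and function names are just illustrative):

C:
volatile uint32_t sync_timestamp;
volatile uint32_t sample_bits, sample_timestamp;

void syncISR(void) {
    sync_timestamp = ARM_DWT_CYCCNT;     // only record when SYNC arrived
}

void sampleReadyISR(void) {
    sample_bits = GPIO6_PSR;             // grab the parallel ADC bits
    sample_timestamp = ARM_DWT_CYCCNT;   // timestamp the capture
}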

I do not have a very solid hypothesis yet, but thoughts:

1) From what I gather this might be a floating point context issue. I have not been able to figure out how to enable 'full FPU context saving', but other posts suggest this might not be needed / may already be the case.

2) I wanted to enable FPU 'flush to zero' but am uncertain whether the code below is correct for doing this:


C:
uint32_t fpscr;
// Read FPSCR
__asm volatile ("VMRS %0, fpscr" : "=r" (fpscr));
// Set FTZ (bit 24) and DN (bit 25)
fpscr |= (1U << 24) | (1U << 25);
// Write back to FPSCR
__asm volatile ("VMSR fpscr, %0" : : "r" (fpscr));
// Data Synchronization Barrier and Instruction Synchronization Barrier
__DSB();
__ISB();

3) No pointers are used anywhere in the code except for the arrays which are passed to the CMSIS library, so it isn't clear to me how my code could be causing a corruption, let alone one which is so regular.

Thank you very much for any thoughts
 
1) From what I gather this might be a floating point context issue. I have not been able to figure out how to enable 'full FPU context saving', but other posts suggest this might not be needed / may already be the case.
It should not be needed; the volatile FPU registers are saved automatically if any ISR accesses them.
 
I think what I am seeing is a real effect having to do with the phase of the signal, and not necessarily any floating point errors. It would still be useful to know:

1) how to set the priority of an interrupt on pins 24 and 25 (GPIO6 12 and 13).
2) whether there are any quirks in reading the system cycle counter (ARM_DWT_CYCCNT).
 
I don't have any specific advice about the floating point issues you're seeing. But I will mention that certain biquad filters are notorious for needing high precision in their coefficients. As a general rule of thumb, the narrower your pass band (relative to the Nyquist frequency), the higher the coefficient resolution you need.
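
For reference, a minimal sketch of a typical CMSIS-DSP df2T setup (the coefficient values below are placeholders; the point is to design them in double precision and convert to float32_t only at the end, and remember CMSIS expects the feedback a1/a2 coefficients negated relative to the usual transfer-function form):

C:
#include "arm_math.h"

#define NUM_STAGES 1

// Placeholder coefficients, 5 per stage in the order {b0, b1, b2, a1, a2}
static float32_t biquad_coeffs[5 * NUM_STAGES] = { 1.0f, 0.0f, 0.0f, 0.0f, 0.0f };
static float32_t biquad_state[2 * NUM_STAGES];   // df2T needs 2 state values per stage
static arm_biquad_cascade_df2T_instance_f32 biquad;

void filter_init(void) {
    arm_biquad_cascade_df2T_init_f32(&biquad, NUM_STAGES, biquad_coeffs, biquad_state);
}

void filter_block(float32_t *in, float32_t *out, uint32_t n) {
    arm_biquad_cascade_df2T_f32(&biquad, in, out, n);
}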

1) how to set the priority of an interrupt on pins 24 and 25 (GPIO6 12 and 13).

NVIC_SET_PRIORITY(irqnum, priority)

Priority is 0 to 255, where lower numbers mean higher priority.

All the fast GPIO share 1 interrupt, which is IRQ_GPIO6789.
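
So for pins 24 and 25 (both on GPIO6), a minimal sketch would be something like this (the ISR names and edge polarity are just placeholders for your two handlers):

C:
void sampleISR(void) { /* read GPIO6_PSR, record ARM_DWT_CYCCNT */ }
void syncISR(void)   { /* record ARM_DWT_CYCCNT */ }

void setup() {
  attachInterrupt(digitalPinToInterrupt(24), sampleISR, RISING);
  attachInterrupt(digitalPinToInterrupt(25), syncISR, RISING);
  // Both pins are serviced by the shared fast-GPIO vector, so one call sets
  // the priority for both. Lower number = higher priority; the hardware only
  // implements the upper 4 bits, so useful values are multiples of 16.
  NVIC_SET_PRIORITY(IRQ_GPIO6789, 16);
}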

It is possible to use the GPR registers to reassign specific I/O pins back to slow GPIO1-4, which have their own interrupts. Usually this is only done for DMA, because the fast GPIO aren't reachable by DMA, but it can also make sense in unusual cases where you want a different interrupt for certain pins (which of course must be on separate GPIO ports). Normally the slowness of GPIO1-4 means DMA is the only really compelling use case. OctoWS2811 is a library which does this, if you want an example.
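
The gist of it, if you ever want to try (bit positions are from the i.MX RT1060 reference manual, so double check them before relying on this):

C:
// Route the pads for pins 24 and 25 back to "slow" GPIO1 so they raise
// GPIO1's own interrupt instead of the shared IRQ_GPIO6789.
// IOMUXC_GPR_GPR26 selects GPIO6 (bit set) or GPIO1 (bit clear) for each of
// the 32 shared pads; bits 12 and 13 are GPIO6_IO12 / GPIO6_IO13.
IOMUXC_GPR_GPR26 &= ~((1UL << 12) | (1UL << 13));

After that the core's attachInterrupt() no longer manages those pins, so you would set up GPIO1's interrupt registers (ICR/IMR) and its NVIC vector yourself.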


2) whether there are any quirks in reading the system cycle counter (ARM_DWT_CYCCNT).

ARM_DWT_CYCCNT is inside the CPU, so you don't suffer the bus latency from DMA and other activity that you get when accessing peripherals or non-TCM memory. There is no bus or cache to make things complicated. It's among the most deterministic things you get.

Of course there are always quirks, like the LD and ST instructions requiring 2 registers, so the compiler might need extra instructions in some code, while in other, simpler code it might optimize the address setup to be outside of the code you're trying to profile. The compiler can also end up generating much slower code if there is too much "register pressure" and it needs to "spill" local variables onto the stack. Teensy's default setup always uses DTCM for the stack, but if you use something like an RTOS with stacks in normal (non-TCM) memory, that performance penalty becomes much worse.

Likewise, the CPU has a 6 stage pipeline and 2 integer execution units, which usually sustain at least 1 instruction per clock and sometimes give you 2 instructions per clock. But when a branch isn't correctly predicted, or all sorts of other edge cases come up, you can get pipeline stalls that result in certain instructions taking more than 1 clock to execute.

The CPU and compiler can also play a lot of optimization tricks, which is why the CPU barrier instructions exist (forcing the CPU to complete bus operations and cache writes) and the compiler has inline ASM syntax for "memory barriers" which prevent the optimizer from reorganizing memory accesses more than you wanted. Many times people have tried "simple" benchmarks that failed to capture the full work they meant to measure, because the CPU and compiler do so many tricky optimizations.
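
As a rough pattern for using the cycle counter with one of those compiler barriers (do_work() here is just a placeholder for whatever you're measuring):

C:
uint32_t begin, elapsed;

asm volatile("" ::: "memory");     // compiler barrier: earlier work stays before the timestamp
begin = ARM_DWT_CYCCNT;
do_work();                         // placeholder for the code being measured
asm volatile("" ::: "memory");     // compiler barrier: the work can't be moved past this read
elapsed = ARM_DWT_CYCCNT - begin;  // unsigned subtraction handles counter wraparound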

But these issues apply to anything memory mapped. It's just the nature of modern processors and compilers.
 
Thanks Paul, JMarsh.

I think I identified the issue. I believe the interrupt execution is occasionally being delayed by up to about 68 clock cycles more than usual. Basically the delay spikes to this value, then decays away as certain external signals synchronize/desynchronize.

I am speculating that the delay occurs when more background work is required before the ISR can run (maybe during certain floating point operations in the CMSIS library which require a floating point context save/restore).

The amount of delay can be greater than noted above if I use certain functions (e.g. sinf instead of arm_sin_f32).

If the context saving is 'lazy' and the amount done is only as-needed, I'm not sure there is any route to fixing this. I don't think this kills the project, but I am still interested in any workaround.
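
One thing I may try, just to make the cost constant instead of occasional, is turning off lazy stacking so the FP context is always pushed on interrupt entry. A sketch of what I have in mind (untested; the FPCCR address and bit positions are from the Cortex-M7 documentation, and it's FPU->FPCCR if you have the CMSIS core headers):

C:
#define FPCCR_REG   (*(volatile uint32_t *)0xE000EF34)  // Cortex-M FPCCR register
#define FPCCR_ASPEN (1UL << 31)  // automatic FP state preservation
#define FPCCR_LSPEN (1UL << 30)  // lazy FP state preservation

// Call once from setup(), before attaching the interrupts.
static void fpu_disable_lazy_stacking(void) {
    uint32_t fpccr = FPCCR_REG;
    fpccr &= ~FPCCR_LSPEN;   // always stack the FP context on exception entry
    fpccr |= FPCCR_ASPEN;    // keep automatic stacking enabled
    FPCCR_REG = fpccr;
    __DSB();
    __ISB();
}

Interrupt entry would then be slower every time, but at least it should be slower by the same amount every time.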
 
If you aren't sampling at a fixed rate you can expect issues with DSP on the sample stream. Whether this is what you are seeing is hard to guess from the information available. What is this "regular instability" like? Graphs?

Jitter in the sampling time can be avoided by using DMA triggered from a timer to drive the sampling, though this isn't necessarily possible depending on the hardware.

The fact that 68 cycles makes a difference suggests you are sampling at a high rate? It sounds like you have a parallel interface if you're using GPIO6_PSR, and normally a parallel interface would be clocked from a crystal-derived clock.
 