To see how fast code like this can run when generating a clock and grabbing data, use a minimalist approach.
//Minimalist pin output clock speed test for Teensy 3.6, overclocked at 240 MHz
//G. Kovacs, 7/6/19
//Use bit-banging to make a clock for an external ADC and port reads to capture output (later, need to unscramble the port pin mapping).
const int clockPin = 25;
uint32_t sample;

void setup() {
  pinMode(clockPin, OUTPUT);
  CORE_PIN25_CONFIG = PORT_PCR_MUX(1); // No slew-rate-limiting on pin 25
}

void loop() {
  for (int i = 0; i < 16384; i++) {
    digitalWriteFast(clockPin, HIGH);
    sample = GPIOB_PDIR;
    digitalWriteFast(clockPin, LOW);
    sample = GPIOC_PDIR;
  }
}
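The header comment mentions that the port pin mapping needs unscrambling later. A minimal sketch of that step follows, assuming a hypothetical table that says which port bit carries each ADC data bit. The names kAdcBits, kPortBitForAdcBit and unscramble, and the bit positions in the table, are placeholders for illustration and not the actual Teensy 3.6 wiring.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mapping: kPortBitForAdcBit[k] is the port bit that carries ADC data bit k.
// These positions are placeholders, not a real Teensy 3.6 pin assignment.
const std::size_t kAdcBits = 8;
const uint8_t kPortBitForAdcBit[kAdcBits] = {0, 1, 3, 2, 16, 17, 19, 18};

// Gather the scattered bits of one raw port read into a contiguous sample value.
uint32_t unscramble(uint32_t portWord) {
    uint32_t value = 0;
    for (std::size_t k = 0; k < kAdcBits; ++k) {
        value |= ((portWord >> kPortBitForAdcBit[k]) & 1u) << k;
    }
    return value;
}
```

With this placeholder table, a port word with only bit 3 set comes out as ADC bit 2 (value 4). Doing this after the capture, rather than inside the acquisition loop, keeps the loop itself as short as possible.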
Note: LTO = Link Time Optimization (the optimizer runs again at link time across all of the compiled files, so it can inline across files and remove unused code).
After methodically trying every compiler option, it was clear that there is a trade-off between duty cycle (important for many fast ADCs) and sample rate.
Here are two useful examples:
Compiler (default) "Faster," 240MHz overclock, Teensy 3.6.
Compiler "Faster with LTO"
The fastest with a decent (close to 50%) duty cycle? Many options gave the same result, 17.2 MSPS. These options are:
Faster
Fast
Fastest
Fastest + pure-code
Smallest Code
So... no difference.
Turning on LTO invariably leads to a duty cycle closer to 20% (not great), with the winner being "Fast with LTO," coming in at a nice, but not really usable (due to duty cycle) 26.9 MSPS...
So it is quite noteworthy that in this particular case (raw bit-banging and port read speed), the names of the compiler options do not translate literally into speed: the one labeled "Fastest" is not actually the fastest.
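To put the two headline rates in terms of CPU cycles, here is a small arithmetic sketch using only the figures quoted above (240 MHz, 17.2 MSPS, 26.9 MSPS, and a duty cycle of roughly 20%). The helper names cyclesPerSample and highTimeNs are made up for illustration.

```cpp
// CPU clock cycles available per sample at a given sample rate.
double cyclesPerSample(double cpuHz, double samplesPerSecond) {
    return cpuHz / samplesPerSecond;
}

// Clock high time in nanoseconds for a given sample rate and duty cycle (0..1).
double highTimeNs(double samplesPerSecond, double dutyCycle) {
    return 1e9 * dutyCycle / samplesPerSecond;
}
```

cyclesPerSample(240e6, 17.2e6) comes out at just under 14 cycles per sample, and cyclesPerSample(240e6, 26.9e6) at about 8.9. At 26.9 MSPS with a 20% duty cycle, highTimeNs(26.9e6, 0.2) is roughly 7.4 ns, which is the number to compare against the minimum clock-high time in the ADC datasheet.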
Caveats, of course, are that this simplistic example is not putting the samples into an array, as one would in practice. Also, one can easily write code to generate "inline" versions of the acquisition code that literally hard-code the array index instead of incrementing in a loop. In my experience, this leads to insane (many minutes) compile times with flaky (variable, sometimes jittery timing) results. I have yet to see this approach really improve things.
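For reference, the array-storing loop mentioned above could look something like the following. This is only a sketch: captureSamples is a hypothetical helper, and the clock and port accesses are passed in as callables so the same loop can be exercised off the board. On the Teensy the callables would be small lambdas wrapping digitalWriteFast(clockPin, HIGH), digitalWriteFast(clockPin, LOW) and a read of GPIOB_PDIR. It has not been timed, and the extra store will cost cycles, so the rates quoted above should not be expected to carry over.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: one clock pulse and one port read per stored sample.
// clockHigh, clockLow and readPort are supplied by the caller.
template <typename High, typename Low, typename Read>
void captureSamples(uint32_t* buffer, std::size_t count,
                    High clockHigh, Low clockLow, Read readPort) {
    for (std::size_t i = 0; i < count; ++i) {
        clockHigh();             // rising edge of the ADC clock
        buffer[i] = readPort();  // store the raw (still scrambled) port word
        clockLow();              // falling edge of the ADC clock
    }
}
```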
Normally, one would use a toggled flip-flop to generate a clean, 50% duty-cycle clock in cases where the duty cycle is not ideal, such as the fastest cases here. Of course, that cuts the sample rate in half. Without using more complicated techniques to multiply the rate back up, one is pretty much stuck with the duty cycles that are within spec for the ADC chosen.
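The flip-flop trick is easy to check with a small simulation. The sketch below models a toggle flip-flop that flips its output on every rising edge of the input, and shows that a 20% duty-cycle input becomes a 50% duty-cycle output at half the frequency. The functions makeClock, divideByTwo and dutyCycle are illustrative names, and the waveform is just a list of 0/1 levels at evenly spaced ticks.

```cpp
#include <cstddef>
#include <vector>

// Build `periods` repetitions of a clock that is high for the first
// highTicks ticks out of every periodTicks ticks.
std::vector<int> makeClock(int periods, int periodTicks, int highTicks) {
    std::vector<int> wave;
    for (int p = 0; p < periods; ++p) {
        for (int t = 0; t < periodTicks; ++t) {
            wave.push_back(t < highTicks ? 1 : 0);
        }
    }
    return wave;
}

// Toggle flip-flop: the output flips on each rising edge of the input.
std::vector<int> divideByTwo(const std::vector<int>& clockIn) {
    std::vector<int> out;
    out.reserve(clockIn.size());
    int q = 0;
    int prev = 0;
    for (int level : clockIn) {
        if (prev == 0 && level == 1) {
            q ^= 1;
        }
        prev = level;
        out.push_back(q);
    }
    return out;
}

// Fraction of ticks at which the waveform is high.
double dutyCycle(const std::vector<int>& wave) {
    if (wave.empty()) {
        return 0.0;
    }
    std::size_t high = 0;
    for (int level : wave) {
        high += (level != 0) ? 1 : 0;
    }
    return static_cast<double>(high) / static_cast<double>(wave.size());
}
```

For example, makeClock(8, 5, 1) is a 20% duty-cycle clock; passing it through divideByTwo gives an output that is high for exactly one input period and low for the next, so its duty cycle is 50% and its period is twice as long.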
If you are trying to do something like this, consider playing with the compiler optimization settings. Incidentally, FASTRUN does not always make things run faster. For example, in the winning example above (Fast with LTO), it makes no difference whether void loop() is defined with FASTRUN or not, presumably because it is already optimal.
I hope this is useful to some of you out there.