Teensy 4.0 Bitbang - FAST.

StanfordEE

I couldn't help but test the new boards when they arrived. Super simple code, just to see how fast a pin could swing. Answer: 150 MHz.

This was done without a load, without (as yet) any understanding of whether slew-rate limiting is in play (as it is implemented on the Teensy 3.6), and without the best grounding (hence the negative ground bounce). Still, freakin' cool.

Is anyone working on port mapping for parallel IO? It will be insanely fast, and there are chips that can handle it for making and sampling signals (my focus).

Teensy4BitBang.png

Just FYI, the setup. Agilent MSO-X 4104A, N2795A 1GHz probe. Hopelessly, awfully wrong grounding. This is RF. I couldn't resist though.

May do it again later with a 50 GHz scope just for fun.

IMG_1966.jpg

Thanks for reading.

Greg
 
Improved measurements...

Paul,
I don't use social media, but please feel free to post or use anything I put up here. My aim is to help the community. Thank you and your co-developers SO MUCH for the ARM-based Teensy versions. They have been complete game-changers for research, education and, as I see from the internet, animated Chewbacca costumes and such. :)

I re-did the measurements with a 100 Ohm load and used D2 as the output for closer grounding. The load limited the swing to 3V but that's ok.

Confirmed clean, jitter-free 150MHz bit banging is now possible. Woo-hoo!

With port I/O, one could move rivers of bits at speed. For now, do it bit-wise.

Much more coming as I put the Teensy 4.0 through its paces.

Thanks,
Greg

Code:
//First test of Teensy 4.0 bit-bang speed 8/11/19
//Note - not yet clear if there are slew-rate control options or not-yet-implemented speed-up techniques.
//Compiler setting: "Fastest"

void setup() {
pinMode(2,OUTPUT);
// CORE_PIN10_CONFIG = PORT_PCR_MUX(1); // no slew rate limit DOES NOT WORK on TEENSY 4

//See about 150MHz output rate.

}

void yield () {} //Get rid of the hidden function that checks for serial input and such.

FASTRUN void loop() {
noInterrupts();

while (1)
{
digitalWriteFast(2,HIGH);
digitalWriteFast(2,LOW);
}

}


Setup:

Setup_For_BitBang.jpg

Waveforms (horizontal scale 2ns/div for first, 500 ps/div for rising and falling edges):

100_Ohm_Load_Bit_Bang.png

RisingEdge.png

FallingEdge.png
 
Incidentally, adding in __asm__ __volatile__ ("nop\n\t"); for delays does not (yet) work, even though this is my main "tried and true" delay mechanism on the Teensy 3.6.

Is there something coming for precise code delays?

Simply doubling up the HIGH and LOW writes produces exactly half the frequency seen above, going from 150.0 MHz to 75.0 MHz.

void yield () {} //Get rid of the hidden function that checks for serial input and such.

FASTRUN void loop() {
noInterrupts();

while (1)
{
digitalWriteFast(2,HIGH);
__asm__ __volatile__ ("nop\n\t"); //This has no effect.
digitalWriteFast(2,HIGH);
digitalWriteFast(2,LOW);
__asm__ __volatile__ ("nop\n\t"); //This has no effect.
digitalWriteFast(2,LOW);

}
}
 
The nop may be executing in parallel? Either directly from the dual-issue execution, or indirectly while waiting for the bus clock to come around for the change?

Look into : static inline void delayNanoseconds(uint32_t nsec)

In the tight ' while(1) ' in use, yield() doesn't apply - it only gets called on return/exit from loop() before re-entry.

Seems 'port' based I/O is not supported on T4?

Early in beta there seemed to be a diff in the timing of transitions? That was based on ARM_DWT_CYCCNT - wonder what the scope shows?

Code:
while (1)
{
digitalWriteFast(2,HIGH);
digitalWriteFast(2,LOW);
}

Code:
while (1)
{
digitalWriteFast(2,LOW);
digitalWriteFast(2,HIGH);
}

Code:
while (1)
{
digitalWriteFast(2,!digitalReadFast(2) );
}
 
Here you go, scope photos in order of your examples.

First two: exactly the same, 150 MHz, symmetrical.

Last one: 23.077 MHz, symmetrical. 6.5X slowdown.

Defrag1.png

Defrag2.png

Defrag3.png
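
The scope frequencies in this thread can be turned into a rough cycle budget. This is a minimal host-side sketch (plain C++, not Teensy code) and it assumes the 600 MHz core clock quoted in the DDS sketch header further down. The function names are made up for illustration, and the arithmetic says nothing about whether the core or the GPIO bus is the limiting factor.

```cpp
#include <cassert>

// Core-clock cycles that fit into one period of the measured output.
// Both arguments are in Hz; coreHz = 600e6 is an assumption (see above).
double cyclesPerPeriod(double coreHz, double outputHz) {
    assert(coreHz > 0.0 && outputHz > 0.0);
    return coreHz / outputHz;
}

// Average cycles per pin write, given how many digitalWriteFast() calls
// the loop body makes per output period (2 for HIGH/LOW, 4 when doubled).
double cyclesPerWrite(double coreHz, double outputHz, int writesPerPeriod) {
    assert(writesPerPeriod > 0);
    return cyclesPerPeriod(coreHz, outputHz) / writesPerPeriod;
}
```

Under that assumption, 150 MHz works out to 4 cycles per period (2 per write), the doubled-up 75 MHz loop is still 2 cycles per write, and the 23.077 MHz read-then-write toggle comes out at roughly 26 cycles per period.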
 
I'm trying basic direct digital synthesis that works perfectly on the 3.6, and I'm getting a strange "wobble" in frequency. It seems random and is about 10% around the desired frequency. The basic idea is to make a look-up table of the waveform to be synthesized and "walk through" it using the top bits of a 32-bit word (the accumulator). That word is incremented repetitively by a "fraction" set by the ratio of the desired output frequency to the effective sample rate of the loop, which is determined empirically for the actual loop code. The look-up-table value is then output to a set of pins, which are connected to a D/A converter.

Any ideas? Are there background processes that might be taking variable amounts of time and not suppressed by interrupt priority (assuming that even works with 4.0 as I have used it)?

Code:
//  Port-Write DDS Experiments, Teensy 4.0, 600MHz clock
//  G. Kovacs, 8/12/19
//  Calculate an 8-bit (uint16_t, straight binary) sinewave look-up-table
//  LS bit = D0, MS-bit = D7.

#define arraySize 256  //Must be an integer power of two, and usually the same number of points as the number of DAC steps, here 256.

int numbitsLUT = int(log(arraySize)/log(2));  //Compute number of bits needed to address DDS LUT, and later, from this, how many to shift the DDS accumulator to address LUT.
const uint16_t DDSshift = 32 - numbitsLUT;    //use const for speed

uint8_t  waveform[arraySize];
uint16_t synthTemp;

int pinPoint;
uint8_t pointOut;
uint16_t pointer = 0;
uint32_t sum = 0;     //This is the DDS "accumulator" that rolls over at a frequency determined by "pointerIncrement," thus defining the output frequency.
float freqOut = 100.000E3;                //Desired output frequency
const float measuredSampleRate = 10.1E6; //Effective sample rate, determined by trial and error for a given code version.

uint32_t pointerIncrement = 0;

void setup() {   
    for (int i = 0; i < 8; i++) {pinMode(i, OUTPUT);}

    pointerIncrement = int((freqOut/measuredSampleRate)*pow(2,32)+0.5);
    
for (uint16_t i=0; i<arraySize; i++)
      {
        // Data is scaled to 0..255, unsigned, 8-bit binary for DAC.
        // This floating point mapping gives a verified good sine LUT.
        synthTemp = (32767.5+(32767.5*sin(2*3.141592654*(float(i)/(arraySize-1)))));  //Here use a sinewave, but could be anything desired.

        waveform[i] = synthTemp>>8; // shift data here for 8-bit external DAC. Adjust as necessary.        
      }
      //See: https://forum.pjrc.com/threads/27690-IntervalTimer-is-not-precise
      //Technique to reduce intervalTimer jitter.
      SCB_SHPR3 = 0x20200000;  // Systick = priority 32 (defaults to zero, or highest priority)
}

void yield () {} //Get rid of the hidden function that checks for serial input and such.

FASTRUN void loop() 
// Use FASTRUN to force code to run in RAM.
{   
  noInterrupts();
  while (1) //Loop inside void loop () avoids the overhead of the main loop.

 {
     pointer = sum >> DDSshift; //Shift to fit range waveform look-up table. Change as needed.
     pointOut = waveform[pointer];
     
     digitalWriteFast(7, (0x80 & pointOut)); //MSbit
     digitalWriteFast(6, (0x40 & pointOut));
     digitalWriteFast(5, (0x20 & pointOut));  
     digitalWriteFast(4, (0x10 & pointOut)); 
     digitalWriteFast(3, (0x08 & pointOut)); 
     digitalWriteFast(2, (0x04 & pointOut)); 
     digitalWriteFast(1, (0x02 & pointOut)); 
     digitalWriteFast(0, (0x01 & pointOut)); //LSbit   

//    Value added to "sum" determines the output frequency. Larger values added translate to higher frequencies. That's DDS!
//     sum = sum + 0x80000000;  //For sine LUT, should be Nyquist rate, or 1/2 of effective sample rate, on the MSBit, pin 29. (maybe 0x7FFFFFFF?).
     sum = sum + pointerIncrement;  //Advance the DDS accumulator by the increment computed in setup() from freqOut and measuredSampleRate.
   }
}


With a simple R2R D/A it works except for the "wobble" - the spikes are because the output is not yet lowpass filtered. On the Teensy 3.6, I can do >8 Megasamples/second at 16 bits out using port writes.

DDSWobble.png
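
The accumulator arithmetic in the DDS sketch can be checked on a desktop compiler, independent of any pin timing. The following is a minimal host-side sketch with hypothetical helper names. It only models the 32-bit phase accumulator and does not try to reproduce or explain the wobble.

```cpp
#include <cassert>
#include <cstdint>

// Tuning word for a 32-bit phase accumulator: round(fOut / fSample * 2^32).
// Restricted to fOut <= fSample / 2 so the result always fits in 32 bits.
uint32_t tuningWord(double fOutHz, double fSampleHz) {
    assert(fSampleHz > 0.0 && fOutHz >= 0.0 && fOutHz <= fSampleHz / 2.0);
    return static_cast<uint32_t>(fOutHz / fSampleHz * 4294967296.0 + 0.5);
}

// Output frequency that a given tuning word actually produces.
double actualFrequency(uint32_t word, double fSampleHz) {
    return static_cast<double>(word) * fSampleHz / 4294967296.0;
}

// Step the accumulator `samples` times and count how often it wraps.
// Each wrap is one full pass through the look-up table (one output cycle).
uint32_t countOutputCycles(uint32_t word, uint32_t samples) {
    uint32_t sum = 0;
    uint32_t wraps = 0;
    for (uint32_t n = 0; n < samples; ++n) {
        const uint32_t next = sum + word;  // unsigned add wraps modulo 2^32
        if (next < sum) {
            ++wraps;                       // overflow means the LUT index rolled over
        }
        sum = next;
    }
    return wraps;
}
```

With the values from the sketch (100 kHz out, 10.1 MHz effective sample rate), the output frequency scales linearly with the tuning word. Any error in the empirically measured sample rate therefore shifts the output by the same percentage, but as a steady offset rather than a random one.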
 
Here you go, scope photos in order of your examples.

First two: exactly the same, 150 MHz, symmetrical.

Last one: 23.077 MHz, symmetrical. 6.5X slowdown.

...

Interesting - really squares off better with more time - i.e. zoomed out of the transitions 6.5X

Should have added a 4th query to drop Read() time:
Code:
bool tgl=true;
while (1)
{
  digitalWriteFast(2,tgl );
  tgl=!tgl;
}

Wrote a sketch that does FreqCount and it shows 30 MHz, while the Write( !Read ) version gives a freq of 23 MHz - FreqCount doesn't measure well above that.

Also check out this post :: https://forum.pjrc.com/threads/54711-Teensy-4-0-First-Beta-Test?p=212280&viewfull=1#post212280

for sample:
Code:
static inline void delayCycles(uint32_t) __attribute__((always_inline, unused));
static inline void delayCycles(uint32_t cycles)
{ // MIN return in 7 cycles NEAR 20 cycles it gives wait +/- 2 cycles - with sketch timing overhead
	uint32_t begin = ARM_DWT_CYCCNT-12; // Overhead Factor for execution
	while (ARM_DWT_CYCCNT - begin < cycles) ; // wait 
}

Not sure if that has reason to go into the Teensy Core Code?
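
The comparison ARM_DWT_CYCCNT - begin < cycles relies on unsigned wrap-around, which is easy to check on a host machine. Here is a minimal sketch with hypothetical helper names. It models the counter as a plain uint32_t and takes the core clock as a parameter rather than reading any Teensy register.

```cpp
#include <cassert>
#include <cstdint>

// Cycles elapsed between two readings of a free-running 32-bit counter.
// Unsigned subtraction is modulo 2^32, so the result is still correct when
// the counter wrapped once between the two readings.
uint32_t elapsedCycles(uint32_t now, uint32_t begin) {
    return now - begin;
}

// Convert a delay in nanoseconds to core-clock cycles, rounding up so the
// wait is never shorter than requested. coreHz is a caller-supplied value.
uint32_t nsToCycles(uint32_t ns, uint32_t coreHz) {
    assert(coreHz > 0);
    const uint64_t product = static_cast<uint64_t>(ns) * coreHz;
    return static_cast<uint32_t>((product + 999999999ULL) / 1000000000ULL);
}
```

At an assumed 600 MHz, a 100 ns request becomes 60 cycles. The fixed overhead that delayCycles() subtracts (the -12 above) would still have to be tuned on real hardware.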
 
I'm a bit late to the party here but I've been messing around on other platforms with Cortex-M7 chips while attempting to create a quick PRBS generator using assembly. I've found that the "nop" instruction seems to be handled oddly. Using multiple "nop" instructions, for example, does not always result in a delay that is a multiple of the clock cycle time, even when optimisation is turned off. I'm not sure if this might be related to what you have experienced, StanfordEE.
 
Nop, nop... Who's there?

Yup. While most programmers would balk at code-dependent timing, if you use consistent overclock and compiler settings, linear strings of "nops" or loops with them are surprisingly useful. Always use FASTRUN to keep your code in RAM, typically good to use void yield() to cut down background stuff, but best and most critical is "noInterrupts()" - we get rock-solid fast pulses across many projects.

Ready to get scolded for not "posting complete code," coding like an EE, or for not futzing around with proper HTML code display in forum threads, here is a simple, crude example that you might tweak. I remember seeing someone else's "delayNanoseconds" also. The point is that if you test this and tune it, it can be very stable in terms of delay. I'm a hardware person so I'm not sure about pipelines and that other stuff. The datasheets for the chips look like Masters theses...



FASTRUN void delayNano(int numNanoseconds)
//Designed for delays of 50 ns to 10000 ns. Otherwise beware!
//Some known problems if passing variables with larger delay values (versus direct numeric input).

{
const float scaleFactor = 3.0;
const float offsetFactor = -3.0;

int iterations = int(numNanoseconds/scaleFactor + offsetFactor + 0.5); //Empirically derived time scaling.

for (int i = 0; i < iterations; i++)
{
__asm__ __volatile__ ("nop\n\t");
}
}
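
The scaling in delayNano() is pure arithmetic, so the loop count it produces can be inspected on a host compiler. This is a minimal sketch that copies the two constants from the routine above. The helper names are hypothetical, and reading the constants as "ns per iteration" and "fixed overhead" is an interpretation, not something stated in the post.

```cpp
#include <cassert>

// Loop count that delayNano() computes, using the same constants and the
// same rounding as the routine above.
int delayNanoIterations(int numNanoseconds) {
    const float scaleFactor = 3.0f;    // empirical; roughly ns per loop pass
    const float offsetFactor = -3.0f;  // empirical; roughly the fixed overhead
    return static_cast<int>(numNanoseconds / scaleFactor + offsetFactor + 0.5);
}

// Smallest request (in ns) for which the nop loop runs at least once.
int smallestEffectiveDelay() {
    int ns = 0;
    while (delayNanoIterations(ns) < 1) {
        ++ns;
    }
    return ns;
}
```

Requests that yield zero or negative counts skip the loop entirely and only cost the call overhead, which fits the warning that the routine is meant for 50 ns and up.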
 
Cool, I hadn't heard of that FASTRUN thing before. I had been using either a while(1) or an assembly jump instruction to get a quick consistent loop.

Just in case it's useful to anyone, here's a short bit-duration PRBS7 on the Teensy 4.0 (I don't have a fast enough oscilloscope to measure the bit duration but it looks to be less than 9ns/bit):

Code:
#pragma GCC push_options
#pragma GCC optimize ("-O0") // No optimisation

// <9ns bit duration PRBS7 on the teensy 4.0
// Output on pin 10
// K.Chubb 19/Apr/21

// Teensy 4.0 documentation:
// Schematic - https://www.pjrc.com/teensy/schematic.html 
// Reference manual -  https://www.pjrc.com/teensy/IMXRT1060RM_rev2.pdf pg960 for GPIO write pseudocode
// Mux Macros -  https://github.com/PaulStoffregen/cores/blob/master/teensy4/imxrt.h
// Pin macros - https://github.com/PaulStoffregen/cores/blob/master/teensy4/core_pins.h

void setup()
{
  // Select GPIO function on pin 10 (5 selects GPIO mode, pg528 of reference manual + check schematic)
  IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_00 = 5; // pin 10 on pad B0_00
  
  // Optimise for 100MHz range, 34-23 Ohm o/p resistance @ 3.3V, Fast slew rate (pg650 of reference manual) 
  IOMUXC_SW_PAD_CTL_PAD_GPIO_B0_00 |= IOMUXC_PAD_SPEED(1) | IOMUXC_PAD_DSE(7) | IOMUXC_PAD_SRE;
  
  // Make the pins on GPIO7 work at high speed (up to 150MHz, pg375 of reference manual)
  IOMUXC_GPR_GPR27 = 0xFFFFFFFF;
  
  // Make pin 10 an output (check core_pins.h)
  GPIO7_GDIR |= CORE_PIN10_BITMASK;

  // Move output register to the GPIO7_DR address. (Data register is at the start of the GPIO7 offset, pg961 of reference manual)
  __asm__("ldr r0, =0x42004000"); 

  // Apply seed to virtual LFSR register for PRBS7
  __asm__("mov r3, #0x7f  \n\t"); 

  // Disable interrupts (for consistent PRBS stream)
  noInterrupts();
}

void loop() {
  __asm__("loopy:");
  __asm__("mov.w  r9, r3, lsr #6");  // Move bit7 of LFSR register to bit1 of r9
  __asm__("mov.w  r6, r3, lsr #5");  // Move bit6 of LFSR register to bit1 of r6
  __asm__("eor.w  r8, r9, r6");      // r8 contains exclusive or of r9 and r6
  __asm__("and.w  r8, r8, #1");      // bit1 of r8 now contains just bit7^bit6 of LFSR register
  __asm__("mov.w  r3, r3, lsl #1");  // Shift LFSR register by 1 to the left (bits moving onward)
  __asm__("orr.w  r3, r3, r8");      // Move bit7^bit6 into bit1 of shifted LFSR register
  __asm__("and.w  r3, r3, #127");    // Mask LFSR register back to 7 bits (0x7F)
  __asm__("str  r9, [r0, #0]");      // Move output value to GPIO7_DR register
  
  // add nop in here for single cycle delay
  
  __asm__("b.w loopy");              // Jump back to "loopy:"
}

#pragma GCC pop_options
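
The shift-register logic of the assembly loop can be checked off-target. Here is a minimal host-side model in plain C++ with hypothetical function names. It mirrors the register operations (zero-based bit numbering, so the post's "bit7" and "bit6" are bits 6 and 5 here) and confirms the sequence length, but it says nothing about bit timing on the pin.

```cpp
#include <cassert>
#include <cstdint>

// One LFSR step, mirroring the loop: feedback = bit 6 XOR bit 5, shift left
// by one, insert the feedback at bit 0, then mask back to 7 bits.
uint8_t prbs7Step(uint8_t state) {
    const uint8_t feedback =
        static_cast<uint8_t>(((state >> 6) ^ (state >> 5)) & 1u);
    return static_cast<uint8_t>(((state << 1) | feedback) & 0x7Fu);
}

// Bit written to the pin for a given state (the loop stores r3 >> 6).
uint8_t prbs7Output(uint8_t state) {
    return static_cast<uint8_t>((state >> 6) & 1u);
}

// Number of steps until the register returns to its seed.
int prbs7Period(uint8_t seed) {
    assert(seed != 0 && seed <= 0x7F);  // the all-zero state never leaves zero
    uint8_t state = seed;
    int steps = 0;
    do {
        state = prbs7Step(state);
        ++steps;
    } while (state != seed);
    return steps;
}

// Number of 1 bits output over one full period starting from the seed.
int prbs7OnesPerPeriod(uint8_t seed) {
    const int period = prbs7Period(seed);
    uint8_t state = seed;
    int ones = 0;
    for (int i = 0; i < period; ++i) {
        ones += prbs7Output(state);
        state = prbs7Step(state);
    }
    return ones;
}
```

Starting from the 0x7F seed used in setup(), the model repeats after 127 steps with 64 ones per period, which is what a maximal-length 7-bit sequence should give.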
 
NOP means "No operation", and the CM7 does exactly that. Sometimes, it ignores a nop completely and just removes it from the pipelines (zero cycles), sometimes it takes one cycle.
Often, two NOPs take one cycle together.
This is not really predictable.
I think I read that the same thing can happen with a "pseudo-NOP" a la mov r0,r0.

Edit:
It is best not to rely on clock cycles. Due to the dual-issue feature, timing based on this is difficult to realize.
Even more so if you take into account the different buses with different clocks, which may need syncing.


Edit II: I think I've seen a switch somewhere that can disable "dual issue". I don't remember exactly..

FASTRUN makes sure the code is in the faster RAM. It is not really needed on Teensy 4.x, where this is the default.
On the other models it has a few small side effects, especially if you try to inline a FASTRUN function. The full definition is:

#define FASTRUN __attribute__ ((section(".fastrun"), noinline, noclone ))
 