
Thread: Teensy 4.0 Bitbang - FAST.

  1. #1

    Teensy 4.0 Bitbang - FAST.

    I couldn't help but test the new boards when they arrived. Super simple code, just to see how fast a pin could swing. Answer: 150 MHz.

    This was measured without a load, without (as yet) any understanding of whether slew-rate limiting applies (as it is implemented on the Teensy 3.6), and without the best grounding (hence the negative ground bounce). Still, freakin' cool.

    Is anyone working on port mapping for parallel IO? It will be insanely fast, and there are chips that can handle it for making and sampling signals (my focus).

    [Attached image: Teensy4BitBang.png]

    Just FYI, the setup. Agilent MSO-X 4104A, N2795A 1GHz probe. Hopelessly, awfully wrong grounding. This is RF. I couldn't resist though.

    May do it again later with 50 GHz scope just for fun.

    [Attached image: IMG_1966.jpg]

    Thanks for reading.

    Greg

  2. #2
    PaulStoffregen (Senior Member, joined Nov 2012, 24,076 posts)
    Would it be ok to post this on social networking? Are you on Twitter, so I can give you credit?

  3. #3

    Improved measurements...

    Paul,
    I don't use social media, but please feel free to post or use anything I put up here. My aim is to help the community. Thank you and your co-developers SO MUCH for the ARM-based Teensy versions. They have been complete game-changers for research, education and, as I see from the internet, animated Chewbacca costumes and such.

    I re-did the measurements with a 100 Ohm load and used D2 as the output for closer grounding. The load limited the swing to 3V but that's ok.

    Confirmed clean, jitter-free 150MHz bit banging is now possible. Woo-hoo!

    With port I/O, one could move rivers of bits at speed. For now, do it bit-wise.

    Much more coming as I put the Teensy 4.0 through its paces.

    Thanks,
    Greg

    Code:
    //First test of Teensy 4.0 bit-bang speed 8/11/19
    //Note - not yet clear if there are slew-rate control options or not-yet-implemented speed-up techniques.
    //Compiler setting: "Fastest"

    void setup() {
      pinMode(2, OUTPUT);
      // CORE_PIN10_CONFIG = PORT_PCR_MUX(1); // no slew rate limit DOES NOT WORK on TEENSY 4

      //See about 150MHz output rate.
    }

    void yield () {} //Get rid of the hidden function that checks for serial input and such.

    FASTRUN void loop() {
      noInterrupts();

      while (1)
      {
        digitalWriteFast(2,HIGH);
        digitalWriteFast(2,LOW);
      }
    }


    Setup:

    [Attached image: Setup_For_BitBang.jpg]

    Waveforms (horizontal scale 2ns/div for first, 500 ps/div for rising and falling edges):

    [Attached image: 100_Ohm_Load_Bit_Bang.png]

    [Attached image: RisingEdge.png]

    [Attached image: FallingEdge.png]

  4. #4
    Incidentally, adding __asm__ __volatile__ ("nop\n\t"); for delays does not (yet) work, even though this is my main "tried and true" delay mechanism on the Teensy 3.6.

    Is there something coming for precise code delays?

    Simply doubling up the HIGH and LOW writes produces exactly half the frequency seen above: 150.0 MHz drops to 75.0 MHz.

    void yield () {} //Get rid of the hidden function that checks for serial input and such.

    FASTRUN void loop() {
      noInterrupts();

      while (1)
      {
        digitalWriteFast(2,HIGH);
        __asm__ __volatile__ ("nop\n\t"); //This has no effect.
        digitalWriteFast(2,HIGH);
        digitalWriteFast(2,LOW);
        __asm__ __volatile__ ("nop\n\t"); //This has no effect.
        digitalWriteFast(2,LOW);
      }
    }

  5. #5
    defragster (Senior Member+, joined Feb 2015, 13,907 posts)
    The nop may be executing in parallel? Directly from the dual exec - or indirectly while the bus clock is coming around for the change?

    Look into : static inline void delayNanoseconds(uint32_t nsec)

    Inside the tight 'while(1)' used here, yield() doesn't apply; it is only called when loop() returns, before it is re-entered.

    Seems 'port' based I/O is not supported on T4?

    Early in the beta there seemed to be a difference in the timing of the transitions? That was measured with ARM_DWT_CYCCNT - I wonder what the scope shows?

    Code:
    while (1)
    {
      digitalWriteFast(2,HIGH);
      digitalWriteFast(2,LOW);
    }
    Code:
    while (1)
    {
      digitalWriteFast(2,LOW);
      digitalWriteFast(2,HIGH);
    }
    Code:
    while (1)
    {
      digitalWriteFast(2,!digitalReadFast(2) );
    }

  6. #6
    Here you go, scope photos in order of your examples.

    First two: exactly the same, 150 MHz, symmetrical.

    Last one: 23.077 MHz, symmetrical. 6.5X slowdown.

    [Attached image: Defrag1.png]

    [Attached image: Defrag2.png]

    [Attached image: Defrag3.png]
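
One way to read the three measurements above is in CPU cycles per output period. The sketch below is only that arithmetic. It assumes the 600 MHz core clock quoted in the DDS sketch elsewhere in this thread, and the names (kCoreClockHz, cyclesPerPeriod) are made up for illustration.

```cpp
#include <cstdint>

// Assumption: 600 MHz core clock, as stated in the DDS sketch in this thread.
constexpr double kCoreClockHz = 600.0e6;

// CPU cycles that elapse during one full output period, rounded to the
// nearest whole cycle.
constexpr uint32_t cyclesPerPeriod(double outputHz) {
    return static_cast<uint32_t>(kCoreClockHz / outputHz + 0.5);
}

// Frequencies measured on the scope in this thread.
constexpr uint32_t kPairCycles    = cyclesPerPeriod(150.0e6);   // HIGH then LOW
constexpr uint32_t kDoubledCycles = cyclesPerPeriod(75.0e6);    // each write doubled
constexpr uint32_t kToggleCycles  = cyclesPerPeriod(23.077e6);  // write(!read)
```

Under that assumption the HIGH/LOW pair works out to 4 cycles per period (2 per write), the doubled writes to 8, and the read-then-write toggle to 26 (13 per pin change). The arithmetic says nothing about where those cycles go.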

  7. #7
    Trying basic direct digital synthesis (DDS) that works perfectly on the 3.6, I'm getting a strange "wobble" in frequency. It seems random and is about 10% around the desired frequency. The basic idea is to make a look-up table of the waveform to be synthesized and "walk through" it using the top bits of a 32-bit word. That word is incremented repeatedly by a "fraction" set by the ratio of the desired output frequency to the effective sample rate of the loop, which is determined empirically for the actual loop code. The look-up-table value is written to a set of pins, which are connected to a D/A converter.

    Any ideas? Are there background processes that might be taking variable amounts of time and not suppressed by interrupt priority (assuming that even works with 4.0 as I have used it)?

    Code:
    //  Port-Write DDS Experiments, Teensy 4.0, 600MHz clock
    //  G. Kovacs, 8/12/19
    //  Calculate an 8-bit (uint16_t, straight binary) sinewave look-up-table
    //  LS bit = D0, MS-bit = D7.
    
    #define arraySize 256  //Must be an even power of two, and usually the same number of points as the number of DAC steps, here 256.
    
    int numbitsLUT = int(log(arraySize)/log(2));  //Compute number of bits needed to address DDS LUT, and later, from this, how many to shift the DDS accumulator to address LUT.
    const uint16_t DDSshift = 32 - numbitsLUT;    //use const for speed
    
    uint8_t  waveform[arraySize];
    uint16_t synthTemp;
    
    int pinPoint;
    uint8_t pointOut;
    uint16_t pointer = 0;
    uint32_t sum = 0;     //This is the DDS "accumulator" that rolls over at a frequency determined by "pointerIncrement," thus defining the output frequency.
    float freqOut = 100.000E3;                //Desired output frequency
    const float measuredSampleRate = 10.1E6; //Effective sample rate, determined by trial and error for a given code version.
    
    uint32_t pointerIncrement = 0;
    
    void setup() {   
        for (int i = 0; i < 8; i++) {pinMode(i, OUTPUT);}
    
        pointerIncrement = int((freqOut/measuredSampleRate)*pow(2,32)+0.5);
        
    for (uint16_t i=0; i<arraySize; i++)
          {
            // Data is scaled to 0..255, unsigned, 8-bit binary for DAC.
            // This floating point mapping gives a verified good sine LUT.
            synthTemp = (32767.5+(32767.5*sin(2*3.141592654*(float(i)/(arraySize-1)))));  //Here use a sinewave, but could be anything desired.
    
            waveform[i] = synthTemp>>8; // shift data here for 8-bit external DAC. Adjust as necessary.        
          }
          //See: https://forum.pjrc.com/threads/27690-IntervalTimer-is-not-precise
          //Technique to reduce intervalTimer jitter.
          SCB_SHPR3 = 0x20200000;  // Systick = priority 32 (defaults to zero, or highest priority)
    }
    
    void yield () {} //Get rid of the hidden function that checks for serial input and such.
    
    FASTRUN void loop() 
    // Use FASTRUN to force code to run in RAM.
    {   
      noInterrupts();
      while (1) //Loop inside void loop () avoids the overhead of the main loop.
    
     {
         pointer = sum >> DDSshift; //Shift to fit range waveform look-up table. Change as needed.
         pointOut = waveform[pointer];
         
         digitalWriteFast(7, (0x80 & pointOut)); //MSbit
         digitalWriteFast(6, (0x40 & pointOut));
         digitalWriteFast(5, (0x20 & pointOut));  
         digitalWriteFast(4, (0x10 & pointOut)); 
         digitalWriteFast(3, (0x08 & pointOut)); 
         digitalWriteFast(2, (0x04 & pointOut)); 
         digitalWriteFast(1, (0x02 & pointOut)); 
         digitalWriteFast(0, (0x01 & pointOut)); //LSbit   
    
    //    Value added to "sum" determines the output frequency. Larger values added translate to higher frequencies. That's DDS!
    //     sum = sum + 0x80000000;  //For sine LUT, should be Nyquist rate, or 1/2 of effective sample rate, on the MSBit, pin 29. (maybe 0x7FFFFFFF?).
         sum = sum + pointerIncrement;  //Advance the phase accumulator by the increment computed in setup().
       }
    }

    With a simple R2R D/A it works except for the "wobble" - the spikes are because the output is not yet lowpass filtered. On the Teensy 3.6, I can do >8 Megasamples/second at 16 bits out using port writes.

    [Attached image: DDSWobble.png]
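
The phase-accumulator arithmetic in the sketch above can be checked on a desktop compiler. This is a minimal sketch of the same formula, not the posted code itself: the helper names (tuningWord, actualFrequency, lutIndex) are hypothetical, and the numbers are the 100 kHz target and 10.1 MHz measured loop rate from the post.

```cpp
#include <cstdint>

// Phase increment ("tuning word") for a 32-bit DDS accumulator:
// increment = round(fOut / fSample * 2^32).
constexpr uint32_t tuningWord(double fOutHz, double fSampleHz) {
    return static_cast<uint32_t>(fOutHz / fSampleHz * 4294967296.0 + 0.5);
}

// Output frequency actually produced by a given increment.
constexpr double actualFrequency(uint32_t increment, double fSampleHz) {
    return increment * fSampleHz / 4294967296.0;
}

// The top lutBits of the accumulator select the look-up-table entry.
// lutBits must be between 1 and 32 (a 256-entry table uses 8).
constexpr uint32_t lutIndex(uint32_t accumulator, unsigned lutBits) {
    return accumulator >> (32u - lutBits);
}

// Values from the post: 100 kHz out, 10.1 MHz effective sample rate.
constexpr uint32_t kIncrement = tuningWord(100.0e3, 10.1e6);
```

The output frequency is increment × fSample / 2^32, so it scales directly with the loop's effective sample rate. If the time per loop iteration varies, the output frequency varies in the same proportion. The formula says only that much; it does not identify what causes the wobble.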

  8. #8
    defragster (Senior Member+, joined Feb 2015, 13,907 posts)
    Quote Originally Posted by StanfordEE View Post
    Here you go, scope photos in order of your examples.

    First two: exactly the same, 150 MHz, symmetrical.

    Last one: 23.077 MHz, symmetrical. 6.5X slowdown.

    ...
    Interesting - really squares off better with more time - i.e. zoomed out of the transitions 6.5X

    Should have added a 4th query to drop Read() time:
    Code:
    bool tgl=true;
    while (1)
    {
      digitalWriteFast(2,tgl );
      tgl=!tgl;
    }
    Wrote a sketch that uses FreqCount: that shows 30 MHz, and the Write( !Read ) version gives a frequency of 23 MHz - it doesn't measure well above that.

    Also check out this post :: https://forum.pjrc.com/threads/54711...l=1#post212280

    for sample:
    Code:
    static inline void delayCycles(uint32_t) __attribute__((always_inline, unused));
    static inline void delayCycles(uint32_t cycles)
    { // MIN return in 7 cycles NEAR 20 cycles it gives wait +/- 2 cycles - with sketch timing overhead
    	uint32_t begin = ARM_DWT_CYCCNT-12; // Overhead Factor for execution
    	while (ARM_DWT_CYCCNT - begin < cycles) ; // wait 
    }
    Not sure if that has reason to go into the Teensy Core Code?
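
The comparison inside delayCycles() relies on unsigned subtraction, which keeps working when the 32-bit cycle counter rolls over. The sketch below isolates that one idea so it can be checked without hardware. stillWaiting and nsToCycles are hypothetical helper names, and the 600 MHz default is an assumption taken from the DDS sketch in this thread.

```cpp
#include <cstdint>

// True while fewer than `cycles` counter ticks have elapsed since `begin`.
// Unsigned subtraction wraps modulo 2^32, so the result stays correct even
// if the free-running counter rolled over between `begin` and `now`.
constexpr bool stillWaiting(uint32_t now, uint32_t begin, uint32_t cycles) {
    return static_cast<uint32_t>(now - begin) < cycles;
}

// Convert a delay in nanoseconds to core cycles, truncating.
// Assumption: 600 MHz core clock unless another value is passed in.
constexpr uint32_t nsToCycles(uint32_t ns, uint32_t coreHz = 600000000u) {
    return static_cast<uint32_t>((static_cast<uint64_t>(ns) * coreHz) / 1000000000u);
}
```

On the Teensy, `now` would come from ARM_DWT_CYCCNT as in the post above. At 600 MHz the counter wraps roughly every 7 seconds, which is why the subtract-then-compare form matters for anything long-running.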

  9. #9
    Junior Member (joined Apr 2021, 2 posts)
    I'm a bit late to the party here but I've been messing around on other platforms with Cortex-M7 chips while attempting to create a quick PRBS generator using assembly. I've found that the "nop" instruction seems to be handled oddly. Using multiple "nop" instructions, for example, does not always result in a delay that is a multiple of the clock cycle time, even when optimisation is turned off. I'm not sure if this might be related to what you have experienced, StanfordEE.

  10. #10

    Nop, nop... Who's there?

    Quote Originally Posted by chubby23280 View Post
    I'm a bit late to the party here but I've been messing around on other platforms with Cortex-M7 chips while attempting to create a quick PRBS generator using assembly. I've found that the "nop" instruction seems to be handled oddly. Using multiple "nop" instructions, for example, does not always result in a delay that is a multiple of the clock cycle time, even when optimization is turned off. I'm not sure if this might be related to what you have experienced, StanfordEE.
    Yup. While most programmers would balk at code-dependent timing, if you use consistent overclock and compiler settings, linear strings of "nops" or loops with them are surprisingly useful. Always use FASTRUN to keep your code in RAM, typically good to use void yield() to cut down background stuff, but best and most critical is "noInterrupts()" - we get rock-solid fast pulses across many projects.

    Ready to get scolded for not "posting complete code," coding like an EE, or for not futzing around with proper HTML code display in forum threads, here is a simple, crude example that you might tweak. I remember seeing someone else's "delayNanoseconds" also. The point is that if you test this and tune it, it can be very stable in terms of delay. I'm a hardware person so I'm not sure about pipelines and that other stuff. The datasheets for the chips look like Masters theses...



    FASTRUN void delayNano(int numNanoseconds)
    //Designed for delays of 50 ns to 10000 ns. Otherwise beware!
    //Some known problems if passing variables with larger delay values (versus direct numeric input).
    {
      const float scaleFactor = 3.0;
      const float offsetFactor = -3.0;

      int iterations = int(numNanoseconds/scaleFactor + offsetFactor + 0.5); //Empirically derived time scaling.

      for (int i = 0; i < iterations; i++)
      {
        __asm__ __volatile__ ("nop\n\t");
      }
    }
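
To see what the empirical scaling in delayNano() does to the loop count, the iteration arithmetic can be pulled out and evaluated on its own. This is a sketch of that arithmetic only. nopIterations is a hypothetical name, the scale and offset are the values from the post, and the clamp to zero is an addition for illustration.

```cpp
// Loop count used by the posted delayNano(): ns / 3.0 - 3.0, rounded by
// adding 0.5 and truncating. The clamp to zero is not in the original; there
// a non-positive count simply means the for-loop body never runs.
constexpr int nopIterations(int numNanoseconds) {
    const float scaleFactor = 3.0f;    // empirical value from the post
    const float offsetFactor = -3.0f;  // empirical value from the post
    const int n = static_cast<int>(numNanoseconds / scaleFactor + offsetFactor + 0.5f);
    return n > 0 ? n : 0;
}
```

With these constants a 50 ns request gives 14 iterations and a 10000 ns request gives 3330. Requests below about 11 ns give no iterations at all, so only the fixed call overhead remains, which fits the stated 50 ns lower limit.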

  11. #11
    Junior Member (joined Apr 2021, 2 posts)
    Cool, I hadn't heard of that FASTRUN thing before. I had been using either a while(1) or an assembly jump instruction to get a quick consistent loop.

    Just in case it's useful to anyone, here's a short bit-duration PRBS7 on the Teensy 4.0 (I don't have a fast enough oscilloscope to measure the bit duration but it looks to be less than 9ns/bit):

    Code:
    #pragma GCC push_options
    #pragma GCC optimize ("-O0") // No optimisation
    
    // <9ns bit duration PRBS7 on the teensy 4.0
    // Output on pin 10
    // K.Chubb 19/Apr/21
    
    // Teensy 4.0 documentation:
    // Schematic - https://www.pjrc.com/teensy/schematic.html 
    // Reference manual -  https://www.pjrc.com/teensy/IMXRT1060RM_rev2.pdf pg960 for GPIO write pseudocode
    // Mux Macros -  https://github.com/PaulStoffregen/cores/blob/master/teensy4/imxrt.h
    // Pin macros - https://github.com/PaulStoffregen/cores/blob/master/teensy4/core_pins.h
    
    void setup()
    {
      // Select GPIO function on pin 10 (5 selects GPIO mode, pg528 of reference manual + check schematic)
      IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_00 = 5; // pin 10 on pad B0_00
      
      // Optimise for 100MHz range, 34-23 Ohm o/p resistance @ 3.3V, Fast slew rate (pg650 of reference manual) 
      IOMUXC_SW_PAD_CTL_PAD_GPIO_B0_00 |= IOMUXC_PAD_SPEED(1) | IOMUXC_PAD_DSE(7) | IOMUXC_PAD_SRE;
      
      // Make the pins on GPIO7 work at high speed (up to 150MHz, pg375 of reference manual)
      IOMUXC_GPR_GPR27 = 0xFFFFFFFF;
      
      // Make pin 10 an output (check core_pins.h)
      GPIO7_GDIR |= CORE_PIN10_BITMASK;
    
      // Move output register to the GPIO7_DR address. (Data register is at the start of the GPIO7 offset, pg961 of reference manual)
      __asm__("ldr r0, =0x42004000"); 
    
      // Apply seed to virtual LFSR register for PRBS7
      __asm__("mov r3, #0x7f  \n\t"); 
    
      // Disable interrupts (for consistent PRBS stream)
      noInterrupts();
    }
    
    void loop() {
      __asm__("loopy:");
      __asm__("mov.w  r9, r3, lsr #6");  // Move bit 7 of the LFSR register to bit 1 of r9 (bits numbered from 1)
      __asm__("mov.w  r6, r3, lsr #5");  // Move bit 6 of the LFSR register to bit 1 of r6
      __asm__("eor.w  r8, r9, r6");      // r8 contains exclusive or of r9 and r6
      __asm__("and.w  r8, r8, #1");      // bit 1 of r8 now contains just bit7^bit6 of the LFSR register
      __asm__("mov.w  r3, r3, lsl #1");  // Shift the LFSR register left by 1 (bits moving onward)
      __asm__("orr.w  r3, r3, r8");      // Move bit7^bit6 into bit 1 of the shifted LFSR register
      __asm__("and.w  r3, r3, #127");    // Keep only the low 7 bits (mask 0x7F)
      __asm__("str  r9, [r0, #0]");      // Move output value to GPIO7_DR register
      
      // add nop in here for single cycle delay
      
      __asm__("b.w loopy");              // Jump back to "loopy:"
    }
    
    #pragma GCC pop_options
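
The shift-register logic in the assembly loop can be modelled in plain C++ to check the sequence without a scope. The sketch below is meant to mirror the same bit operations (feedback from bits 7 and 6, 7-bit mask, seed 0x7F). The function names are hypothetical and it models only the logic, not the timing.

```cpp
#include <cstdint>

// One step of the PRBS7 LFSR (x^7 + x^6 + 1), mirroring the assembly loop:
// feedback = bit 7 XOR bit 6 (numbered from 1), shifted in at the bottom,
// then the register is masked back to 7 bits.
constexpr uint32_t prbs7Step(uint32_t lfsr) {
    const uint32_t feedback = ((lfsr >> 6) ^ (lfsr >> 5)) & 1u;
    return ((lfsr << 1) | feedback) & 0x7Fu;
}

// Bit written to the pin for a given register state (the assembly stores
// lfsr >> 6, which is 0 or 1 because the register is kept to 7 bits).
constexpr uint32_t prbs7Output(uint32_t lfsr) {
    return (lfsr >> 6) & 1u;
}

// Number of steps until the register first returns to the seed.
constexpr uint32_t prbs7Period(uint32_t seed) {
    uint32_t state = prbs7Step(seed);
    uint32_t steps = 1;
    while (state != seed && steps < 1000u) {  // guard against a runaway loop
        state = prbs7Step(state);
        ++steps;
    }
    return steps;
}
```

Starting from the seed 0x7F the model repeats after 127 steps, the full length for a 7-bit register. An all-zero register would stay stuck at zero, which is why the seed must be non-zero.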

  12. #12
    Frank B (Senior Member+, Germany, joined Apr 2014, 8,280 posts)
    NOP means "No operation", and the CM7 does exactly that. Sometimes, it ignores a nop completely and just removes it from the pipelines (zero cycles), sometimes it takes one cycle.
    Often, two NOPs take one cycle together.
    This is not really predictable.
    I think I read that the same thing can happen with a "pseudo-NOP" a la mov r0,r0.

    Edit:
    It is best not to rely on clock cycles. Because of the dual-issue feature, cycle-based timing is difficult to realize, even more so once you take into account the different buses with different clocks, which may need syncing.

    Edit II: I think I've seen a switch somewhere that can disable "dual issue". I don't remember exactly...

    FASTRUN makes sure the code is in the faster RAM. It is not really needed on Teensy 4.x where this is the default.
    On the other models, it has little side effects, esp if you try to inline "FASTRUN". The full definition is :

    #define FASTRUN __attribute__ ((section(".fastrun"), noinline, noclone ))
    Last edited by Frank B; 04-26-2021 at 11:03 AM.

  13. #13
    Senior Member (joined Jul 2015, 107 posts)
    Quote Originally Posted by Frank B View Post
    Edit II: I think I've seen a switch somewhere that can disable "dual issue". I don't remember exactly...
    See thread here for an example of disabling dual issue. Be aware that code will run significantly slower if dual issue is disabled.
