
Thread: IntervalTimer on Teensy 4.0 versus 3.6

  1. #1

    IntervalTimer on Teensy 4.0 versus 3.6

    I have been working on a lot of low-jitter, fast interrupt applications (mostly precision digital synthesis) using the Teensy 3.6. It's been a great learning experience, and when I come across something that isn't easy to explain (usually needs the code gurus, as I'm mainly a hardware person), I post it here.

    This is about the wonderful IntervalTimer function (https://www.pjrc.com/teensy/td_timin...rvalTimer.html). While it does not yet list support for the 4.0, it does work, except that there is a puzzling issue.

    On the Teensy 3.6, I have implemented a full 16-bit direct digital synthesizer (using port writes, and data pre-scrambled to compensate for all of the re-mapped pins) using the 1 us minimum interrupt interval. This works wonderfully at the 256 MHz overclock setting.

    So, when the Teensy 4.0 came out, I was very curious. After looking at the schematic, it was clear there was no real way to do the fast port write stuff, but I wanted to know how fast I could get IntervalTimer to run. Oddly, for any clock setting from 600 MHz to 1.008 GHz (cooling required!), the shortest period is 1.6 us (yes, you can pass IntervalTimer a float and it ends up internally as an integer).

    Here is the (very simple) code to test this:

    Code:
    // IntervalTimer Max Frequency Test
    // Uses float arguments that are converted to allowable int multiples of F_BUS
    // G. Kovacs 1/23/20
    // 
    // Tested using Teensyduino Version 1.49
    // For Teensy 4.0, minimum IntervalTimer period = 1.6 us regardless of CPU clock, from 600 MHz to 1.008 GHz (overclock).
    // These results are independent of compiler optimization settings but are presented here for "Fastest" (not default).
    
    IntervalTimer sampleRate;
    
    const int outPin = 14; //Output pin
    const float samplePeriod = 1.6; //Interrupt period in microseconds. May be a float - is internally converted to int.
    boolean outputState = HIGH;
    
    void setup() {
      pinMode(outPin, OUTPUT);
      sampleRate.begin(ISR, samplePeriod);  // Theoretical interrupt period in us as float
    }
    
    FASTRUN void ISR() {
      digitalWrite(outPin, outputState);
      outputState = !outputState;
    }
    
    void loop() {
      while (1) {} // Do nothing...
    }

    Looking at the code, the output on pin 14 should be a squarewave at 1/2 of the interrupt rate, and it is... Here we get 315.8 kHz, for an interrupt rate of 631.6 kHz. Not bad at all.

    Here is the output on pin 14, seen using a Digilent Electronics Explorer:

    [Image: Teensy 4 Fastest IntervalTimer.jpeg]

    Here is where it gets interesting. The same code, on a Teensy 3.6 overclocked to 256 MHz, works down to 0.6 us (!!!!!), which is way faster than the Teensy 4.0. The output (shown below) is an 842 kHz squarewave, corresponding to an interrupt rate of 1.68 MHz. Now that's cooking.

    [Image: Teensy 3.6 Fastest IntervalTimer.jpeg]

    So... What is going on here? I've read all of the threads on the F_BUS stuff, but at the end of that, it seems like there is a big opportunity here. If someone could take a look at IntervalTimer to see if it can be optimized for the Teensy 4.0, I (and I suspect a lot of others) would be very, very grateful.

    Thanks for reading!

  2. #2
    Senior Member+ KurtE's Avatar
    Join Date
    Jan 2014
    Posts
    7,127
    I know I have seen this discussed before, my guess is maybe in the big T4 beta thread...

    But if you look at interval timer header file, you will see that most of the calls all boil down to calling:
    Code:
    	bool begin(void (*funct)(), unsigned int microseconds) {
    		if (microseconds == 0 || microseconds > MAX_PERIOD) return false;
    		uint32_t cycles = (24000000 / 1000000) * microseconds - 1;
    		if (cycles < 36) return false;
    		return beginCycles(funct, cycles);
    	}
    Or in your case the float version:
    Code:
    	bool begin(void (*funct)(), float microseconds) {
    		if (microseconds <= 0 || microseconds > MAX_PERIOD) return false;
    		uint32_t cycles = (float)(24000000 / 1000000) * microseconds - 0.5;
    		if (cycles < 36) return false;
    		return beginCycles(funct, cycles);
    	}
    So if you do a my_timer.begin(1);
    It will compute the number of cycles: 24*1-1 = 23.
    Then next line says < 36 and errors out...

    The floating point version is about the same. You can sort of compute the minimum value:
    36 = 24*us - 0.5, so the minimum would be 1.520833... us. So pretty close to your 1.6.
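
    For anyone who wants to check the arithmetic off-target, here is a minimal host-side sketch (plain C++, nothing Teensy-specific). It assumes the 24 MHz timer clock and the `cycles < 36` check from the header quoted above; `cyclesForPeriod` and `beginWouldSucceed` are hypothetical names for illustration, not part of IntervalTimer.

```cpp
#include <cstdint>

// Assumed from the header quoted above: 24 MHz timer clock, 36-cycle minimum.
constexpr uint32_t kTimerClockHz = 24000000;
constexpr uint32_t kMinCycles = 36;

// Mirrors the float overload: ticks per microsecond times the period,
// minus 0.5, truncated toward zero by the conversion to uint32_t.
constexpr uint32_t cyclesForPeriod(float microseconds) {
    return static_cast<uint32_t>(
        static_cast<float>(kTimerClockHz / 1000000) * microseconds - 0.5);
}

// Hypothetical helper: true when begin() would get past the cycle check.
constexpr bool beginWouldSucceed(float microseconds) {
    return microseconds > 0 && cyclesForPeriod(microseconds) >= kMinCycles;
}
```

    Under these assumptions a request of 1.0 us gives 23 cycles and is rejected, while 1.6 us gives 37 cycles and is accepted. If the timer period is LDVAL + 1 ticks (an assumption here), 37 corresponds to 38 ticks of 24 MHz, about 1.583 us or roughly 631.6 kHz, which lines up with the measurement in post #1.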

    I am not sure if there is any actual hardware limit of 36, which simply gets stored in the LDVAL register... Or if it is semi arbitrary? For example would it still work if I set the min value of 23?

    I may experiment...

  3. #3
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    990
    I'm afraid simply setting a smaller reload value won't help. Here is a link to some discussion of that issue: https://forum.pjrc.com/threads/58221...l=1#post220355 And here is a link to one of the related posts in the beta thread: https://forum.pjrc.com/threads/54711...l=1#post195467
    Here are some measurement results from the linked post in the beta thread.

    [Image: itimer2.PNG]

    If you find a way to speed things up, I'd be highly interested.

    [Attached image: itimer.PNG]

  4. #4
    Senior Member+ KurtE's Avatar
    Join Date
    Jan 2014
    Posts
    7,127
    Thanks @luni,

    As an experiment, I edited the header file to check for 23...
    Ran a simple test case:
    Code:
    #include <IntervalTimer.h>
    IntervalTimer my_timer;
    uint32_t count = 0;
    uint8_t led_state = 0;
    void isr () {
      digitalWriteFast(2, HIGH);
      count++;
      if (count == 1000000) {
        led_state = led_state? 0 : 1;
        digitalWriteFast(13, led_state);
        count = 0;
      }
      digitalWriteFast(2, LOW);
      asm("dsb");
    }
    
    void setup() {
      pinMode(13, OUTPUT);
      pinMode(2, OUTPUT);
      while (!Serial && millis() < 4000) ;
      Serial.begin(115200);
      if (!my_timer.begin(&isr, 1)) {
        Serial.println("\n*** timer begin failed ***");
        while (1) {
          digitalWrite(13, !digitalRead(13));
          delay(125);
        }
      }
    }
    void loop() {
      
    }
    And I am getting the LED turning on/off every second... But the LA is showing it reasonably...

    [Image: screenshot.jpg]

    As for the PIT timer, we do have other options. We currently have them tied to the 24 MHz OSC clock, as set up in startup.c:
    Code:
    	// PIT & GPT timers to run from 24 MHz clock (independent of CPU speed)
    	CCM_CSCMR1 = (CCM_CSCMR1 & ~CCM_CSCMR1_PERCLK_PODF(0x3F)) | CCM_CSCMR1_PERCLK_CLK_SEL;
    We do have the option of instead feeding them the same system clock that feeds the ADC and XBAR...
    Not sure how much work that would be or what other ramifications it might have.

  5. #5
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    990
    I also did some experiments.
    Changed the code in intervaltimer to ignore the limit and printout the calculated ldval. Get the following

    Code:
    LDVAL   measured period (µs)
    1       0.5
    11      0.5
    21      0.91
    47      2
    479     20

    [Image: pittest.png]

    So, the good thing is that the current code is faster than the code in the beta tests; it now saturates at 0.5 µs.
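
    To see where the saturation starts, the table can be compared against the ideal period. This is only a sketch: it assumes the PIT fires every LDVAL + 1 ticks of the 24 MHz clock, and `expectedPeriodUs` and `keepsUp` are made-up names.

```cpp
#include <cstdint>

// Assumption: one timer period is LDVAL + 1 ticks of a 24 MHz clock.
constexpr double kTimerClockMHz = 24.0;

constexpr double expectedPeriodUs(uint32_t ldval) {
    return (ldval + 1) / kTimerClockMHz;
}

// Measured values copied from the table above (periods in microseconds).
struct Sample { uint32_t ldval; double measuredUs; };
constexpr Sample kSamples[] = {
    {1, 0.5}, {11, 0.5}, {21, 0.91}, {47, 2.0}, {479, 20.0},
};

// True when the measurement is within tolUs of the ideal period, i.e. the
// interrupt is still keeping up with the timer instead of saturating.
constexpr bool keepsUp(const Sample& s, double tolUs = 0.01) {
    return (s.measuredUs - expectedPeriodUs(s.ldval)) < tolUs &&
           (expectedPeriodUs(s.ldval) - s.measuredUs) < tolUs;
}
```

    With that assumption, LDVAL = 11 is exactly 0.5 µs and still matches the measurement, while LDVAL = 1 would ideally be about 0.083 µs but measures 0.5 µs.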

    Frank did experiments with changing the clock in the beta thread. It improved the situation a bit. But be aware that if you increase the clock, the maximum delay decreases accordingly. At 150 MHz you get a maximum delay of 2^32 / 150 MHz ≈ 28.6 s.
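
    The trade-off is easy to tabulate. A small sketch, assuming a plain 32-bit counter with no prescaler (`maxPeriodSeconds` is a hypothetical helper):

```cpp
// Longest period a 32-bit counter can time, in seconds: 2^32 ticks / clock.
constexpr double maxPeriodSeconds(double clockHz) {
    return 4294967296.0 / clockHz;
}
```

    Under that assumption, 24 MHz allows about 179 s and 150 MHz about 28.6 s.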

  6. #6
    Thanks! I did look at IntervalTimer.h and - to my non-expert eyes - it is built for the Kinetis. Changing 36 -> 24 or 12 (in all instances) unfortunately does not help.

  7. #7
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    990
    No, it is done for the IMXRT. It is pretty efficient now. I don't think that you can get much more out of it without changing the clock.

    Did another experiment to quickly check the load. With 4 timers you can now go down to a 3 µs period before it overloads. This is also better than the 100 kHz of a couple of months ago...

    Code:
    IntervalTimer t1, t2, t3, t4;
    
    template<unsigned pin>
    void callback()
    {
      digitalWriteFast(pin, HIGH);
      delayNanoseconds(200);
      digitalWriteFast(pin, LOW);
    }
    
    void setup()
    {
      while(!Serial);
      Serial.println("start");
    
      pinMode(0, OUTPUT);
      pinMode(1, OUTPUT);
      pinMode(2, OUTPUT);
      pinMode(3, OUTPUT);
      pinMode(LED_BUILTIN, OUTPUT);
      
      float p = 3.0f;
    
      t1.begin(callback<0>, p);
      t2.begin(callback<1>, p);
      t3.begin(callback<2>, p);
      t4.begin(callback<3>, p);
     
    }
    
    void loop()
    {
      digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
      delay(100);
    }
    Here are the changes I made in intervaltimer.h for the experiments:

    Code:
    bool begin(void (*funct)(), float microseconds) 
    {
    	if (microseconds <= 0 || microseconds > MAX_PERIOD) return false;
    	uint32_t cycles = (float)(24000000 / 1000000) * microseconds - 0.5;
    	Serial.println(cycles);
    	//if (cycles < 36) return false;
    	return beginCycles(funct, cycles);
    }
    Last edited by luni; 01-23-2020 at 08:39 PM.

  8. #8
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    990
    @Kurt: here the link to the beta thread when Frank did experiments with the changed clock https://forum.pjrc.com/threads/54711...l=1#post196569 Might be helpful...

  9. #9
    Quote Originally Posted by luni View Post
    I also did some experiments.
    Changed the code in intervaltimer to ignore the limit and printout the calculated ldval. Get the following

    Code:
    LDVAL   measured period (µs)
    1       0.5
    11      0.5
    21      0.91
    47      2
    479     20

    So, the good thing is that the current code is faster than the code in the beta tests; it now saturates at 0.5 µs.

    Frank did experiments with changing the clock in the beta thread. It improved the situation a bit. But, be aware that if you increase the clock maximal delay will decrease accordingly. @150MHz you get a maximum delay of 1/150MHz * 2^32 = 29s.


    Thanks for sharing. I think anyone needing super long delays shouldn't mind a few nanoseconds of error by using multiple delay calls in series... Ok, maybe somebody at NIST is doing a precision version of "Blink," but probably not.

  10. #10
    Senior Member+ KurtE's Avatar
    Join Date
    Jan 2014
    Posts
    7,127
    @luni, @PaulStoffregen -

    Again this is not a big deal to me, but simply wondering if we should simply reduce this test a ways...

    That is, this magic number is simply a holdover from the T3.x code...

    Which is:
    Code:
    	bool begin(void (*funct)(), unsigned int microseconds) {
    		if (microseconds == 0 || microseconds > MAX_PERIOD) return false;
    		uint32_t cycles = (F_BUS / 1000000) * microseconds - 1;
    		if (cycles < 36) return false;
    		return beginCycles(funct, cycles);
    	}
    If we run a T3.6 at 180 MHz (default), F_BUS = 60 MHz.
    So if you pass 1 (for 1 us) into begin, we get:
    cycles = 59, and so it runs...

    Again our current stuff on T4 cycles comes out as 23 and so the magic test for 36 fails.

    The T3.6 with floating point could go to about: .61us cycle speed.

    Not sure if we should reduce to similar or not... But that would reduce magic number from
    36 to: maybe 14 or 15...

    Or could round up a ways from there to 18 or 19 which gets us a bit over
    Or could go down to 11, which gets us near .5 us... When I pass in .5 with this, the logic analyzer does show I am a bit off.
    As you can see in:

    [Image: screenshot.jpg]

    Personally I would probably set it to 11 to allow the .5 us, even though it may not be exact. But at a minimum I would suggest that we reduce it to <= 23, so that sketches that used this on T3.x can run on the T4 as well.
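
    To compare the candidate magic numbers, here is a rough sketch of the shortest period each one would accept. It assumes the float overload's formula, cycles = trunc(ticksPerUs * us - 0.5), and ignores float rounding; `minPeriodUs` is a hypothetical helper.

```cpp
#include <cstdint>

// Smallest period (in microseconds) that passes a `cycles < minCycles` check
// when cycles = trunc(ticksPerUs * us - 0.5).
constexpr double minPeriodUs(uint32_t minCycles, double ticksPerUs) {
    return (minCycles + 0.5) / ticksPerUs;
}
```

    With these assumptions, 36 at 60 ticks/us gives about 0.61 us (the T3.6 figure above), 36 at 24 ticks/us gives about 1.52 us, 23 gives about 0.98 us, and 11 gives about 0.48 us.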

    @luni - I appreciate all of the details above. And agree that if one were to go to the 150mhz version it could improve things, but also cause problems for those who want real slow timers...

    The only other option would be to allow the sketch to choose programmatically. Sort of like analogReadResolution, which changes the system setup. And calls like that can also break other code that depended on the previous setting.

  11. #11
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    990
    Changing that to 150 MHz would be great for me if it doesn't have other effects. The GPTs are also running at 24 MHz... However, you can write tighter code for them since they have a separate interrupt. I just hacked TeensyTimerTool as you suggested and can go down to 2.4 MHz with it. I'll try the QUAD / TMR now. They are running at 150 MHz. Will take a couple of minutes....

  12. #12
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    990
    Here the results for the TMR/QUAD @150MHz

    Code:
    reload   MHz      µs
    200      0.75    1.33
    100      1.5     0.67
    50       3       0.33
    25       3.86    0.26
    10       4.03    0.25
    5        4.03    0.25


    [Image: tmrtest.png]

    Saturates at about 4MHz
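
    The saturation shows up clearly when the measured rate is divided by the ideal one. A sketch only, assuming the TMR fires once every `reload` ticks of the 150 MHz clock (which matches the first three rows); the names are made up for illustration.

```cpp
#include <cstdint>

constexpr double kTmrClockMHz = 150.0;

// Ideal interrupt rate if the timer fires once every `reload` clock ticks.
constexpr double idealRateMHz(uint32_t reload) {
    return kTmrClockMHz / reload;
}

// Measured rates copied from the table above.
struct TmrSample { uint32_t reload; double measuredMHz; };
constexpr TmrSample kTmrSamples[] = {
    {200, 0.75}, {100, 1.5}, {50, 3.0}, {25, 3.86}, {10, 4.03}, {5, 4.03},
};

// Fraction of the ideal rate actually achieved (1.0 means no saturation).
constexpr double achievedFraction(const TmrSample& s) {
    return s.measuredMHz / idealRateMHz(s.reload);
}
```

    On these numbers, reload = 25 reaches about 64% of the ideal 6 MHz, and reload = 5 only about 13% of the ideal 30 MHz.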

    EDIT: BUT it completely overloads the processor. Can't even blink a LED for a reload < 25 (f>3.9MHz) and only one timer running

  13. #13
    Senior Member+ KurtE's Avatar
    Join Date
    Jan 2014
    Posts
    7,127
    Good morning all...

    Just an FYI - I created a PR: https://github.com/PaulStoffregen/cores/pull/422
    For now to simply change the magic number of 36 to 17 (could maybe be 18).

    Logic of it is: On T3.2 running at default 72mhz has a bus speed of 36mhz.
    So the min value you could pass in and run was something like: .76458...
    So going that direction: .75*48 - .5 = 35.5 (now depending on float conversion this is either 35 or 36...)

    So if our Interval timer is running at 24mhz then we need to be half that... Maybe 18, but I rounded down to 17...

    This still does not allow it to go as fast as T3.6 or even T3.2 running at 120mhz with bus speed of 60, but at least begin(1) would work.

  14. #14
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    11,963
    Quote Originally Posted by StanfordEE View Post
    Thanks! I did look at IntervalTimer.h and - to my non-expert eyes - it is built for the Kinetis. Changing 36 -> 24 or 12 (in all instances) unfortunately does not help.
    Looking at @KurtE's PR for faster timer I updated cores and played with p#4 sketch.

    StanfordEE: if this " asm("dsb"); " is not in the callback _isr() it seems to run at the indicated freq - but loop() throughput is cut in ~half. This is not in first posted code?

    Here is output first with "dsb" commented then active :
    Code:
    T:\tCode\T4\TimerIntPerf\TimerIntPerf.ino Jan 26 2020 13:04:50
    cntLoop=2784516 us=1408006
    cntLoop=2784914 us=2408007
    
    T:\tCode\T4\TimerIntPerf\TimerIntPerf.ino Jan 26 2020 13:05:04
    cntLoop=4972063 us=1399006
    cntLoop=4945857 us=2399007
    Below is modified sketch - loop count 4.9M above is at 0.75 microsecond IntervalTimer. Here is the diff between 1 us and 10 us
    Code:
    T:\tCode\T4\TimerIntPerf\TimerIntPerf.ino Jan 26 2020 13:20:53 << 1us { void Yield :: 15996826
    cntLoop=6248496 us=1402006
    cntLoop=6248440 us=2402007
    
    T:\tCode\T4\TimerIntPerf\TimerIntPerf.ino Jan 26 2020 13:21:05 << 10us { void Yield :: 24022326
    cntLoop=9939940 us=1417001
    cntLoop=9939928 us=2417001
    
    T:\tCode\T4\TimerIntPerf\TimerIntPerf.ino Jan 26 2020 13:22:15 << 100us { void Yield :: 24899825
    cntLoop=10303442 us=1408001
    cntLoop=10303427 us=2408001
    Altered IntervalTimer can init down to #define _RATE .73 or above and fails init below that. The math doesn't always work of course for triggering the LED on 1 sec which is what the loop() uses for timing:
    Code:
    #include <IntervalTimer.h>
    IntervalTimer my_timer;
    uint32_t count = 0;
    
    #define _RATE 1
    #define _COUNT (uint32_t)(1'000'000/_RATE)
    
    uint8_t led_state = 0;
    void isr () {
    	digitalWriteFast(18, HIGH);
    	count++;
    	if (count == _COUNT) {
    		led_state = led_state ? 0 : 1;
    		digitalWriteFast(13, led_state);
    		count = 0;
    	}
    	digitalWriteFast(18, LOW);
    	asm("dsb");
    }
    
    void setup() {
    	pinMode(13, OUTPUT);
    	pinMode(18, OUTPUT);
    	Serial.begin(115200);
    	while (!Serial && millis() < 4000) ;
    	Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
    	if (!my_timer.begin(&isr, _RATE)) {
    		Serial.println("\n*** timer begin failed ***");
    		while (1) {
    			digitalWrite(13, !digitalRead(13));
    			delay(125);
    		}
    	}
    }
    
    uint32_t cntLoop = 0;
    uint32_t lastLed = 0;
    void loop() {
    	uint32_t lLed = digitalReadFast(13);
    	cntLoop++;
    	if ( lastLed != lLed ) {
    		lastLed = lLed;
    		Serial.printf("cntLoop=%lu us=%lu\n", cntLoop, micros() );
    		cntLoop = 0;
    	}
    }

  15. #15
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    6,919
    The peripherals on T4 are a bit slow sometimes. This is because of the lower bus speeds and the internal syncing needed when you access a peripheral register over the bus. A thing that NXP could do better...
    Sometimes it can be improved by overclocking the bus.
    Overall performance can be improved if you edit clockspeed.c, for example (I also mentioned that in the beta thread).
    There is a line:
    Code:
    if (div_ipg > 4) div_ipg = 4;
    - if you edit this maximum div_ipg to 2, all peripherals connected to IPG will be faster.
    But use with caution, and if you encounter "unexplainable" problems, try setting it back to 4... I have not tested this much, but on T4-beta1 it seemed to be stable. You can also try 3.

    Saying this, I don't remember if the IntervalTimer connects to IPG (see ref manual), but it may help in other cases.

    I had also suggested setting up variables for all frequencies, or an API. Unfortunately, there was not much response. Even my code to implement this, at least for SPI, didn't get much love from other users. Don't know if it is still there. So we now have to live with the fact that all these configurable frequencies are hardcoded: if you change the frequency of a clock source or PLL (or switch to another clock source), you can't easily inform the peripherals about that. They just ignore it. Instead you have to patch code or do it manually in the user sketch. Now it is too late, or at least a lot more effort. A thing that NXP did better than us...
    The same situation as with the audio lib and its fixed, hardcoded 44100 Hz.
    Last edited by Frank B; 01-26-2020 at 09:57 PM.

  16. #16
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    11,963
    Quote Originally Posted by Frank B View Post
    The peripherals on T4 are a bit slow sometimes. This is because of the lower bus speeds and the internal syncing needed when you access a peripheral register over the bus.
    Sometimes it can be improved by overclocking the bus.
    Overall performance can be improved if you edit clockspeed.c, for example (I also mentioned that in the beta thread).
    There is a line:
    Code:
    if (div_ipg > 4) div_ipg = 4;
    - if you edit this maximum div_ipg to 2, all peripherals connected to IPG will be faster.
    But use with caution, and if you encounter "unexplainable" problems, try setting it back to 4... I have not tested this much, but on T4-beta1 it seemed to be stable.

    Saying this, I don't remember if the IntervalTimer connects to IPG (see ref manual), but it may help in other cases.

    I had also suggested setting up variables for all frequencies. Unfortunately, there was not much response. So we now have to live with the fact that all these frequencies are hardcoded.
    I don't recall seeing that go by Frank. This code does I/O Read/Write.

    FRANK: This takes DMAMEM access to FCPU/2 from FCPU/4! … below

    Just made that edit, and with the sketch above the loop() count per second increases by 1,333,645 at a 1 us interval:
    Code:
    cntLoop=15996812 us=1732405733
    cntLoop=15996812 us=1733405734
    
    T:\tCode\T4\TimerIntPerf\TimerIntPerf.ino Jan 26 2020 13:56:09
    cntLoop=17330457 us=1407003
    cntLoop=17330315 us=2407003
    And above with yield() at 1us goes to 6854889 from above 6248496 an increase of 606,393

    … wondering what this will break ??? …

    Have five i2c parts connected here - they all still work.

    Here DMAMEM access goes from 3.98 cycles to 2.00 CPU cycles - that includes looping overhead - Same as RAM::
    Code:
      for ( jj = 0; jj < 3; jj++ ) {
        for ( ii = 0; ii < DMA_SIZE; ii++ ) {
          pDMA[jj][ii] = ARM_DWT_CYCCNT;
        }
      }
    DMAMEM with :: if (div_ipg > 4) div_ipg = 4; ::
    Code:
        ============================  DMA TEST Single
    
    Avg CycCnt for 9000 is 3.983889
    CC@0=1 CC@2=6767 CC@6=1114 CC@7=2 CC@9=1 CC@12=1 CC@14=1114 
        ============================
    DMAMEM with :: if (div_ipg > 2) div_ipg = 2;::
    Code:
        ============================  DMA TEST Single
    
    Avg CycCnt for 9000 is 2.001889
    CC@0=1 CC@2=8996 CC@7=2 CC@9=1 
        ============================

  17. #17
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    6,919
    These numbers look impressive, but do not reflect the reality
    Real sketches just do several things with several periphals, have delays, interrupts and so on.
    The real gain is much less than your numbers show..

    Adjusting clocks just helps in special cases.
    (Or if you need more OCRAM throughput - as you showed above )

  18. #18
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    11,963
    First was just the sketch I had open - testing IntervalTimer down to 0.73 us interrupts. KurtE has the _isr toggle a pin with each interrupt - and I added a pin read in each of the 6M++ loop()'s per second. Indeed that is just a single benchmark - but it gains 8-9%.

    The other test was one I did when Paul noted DMAMEM runs at F_CPU/4. And indeed those numbers are best case for a test case most likely the dual issue CPU feature helping with read/inc/test/jump to get all that done in 2 cycles - but the same applied to the 4 cycles before.

    The reduction in DMAMEM/RAM2 access is more significant and makes using it less costly. Would make that memory more useful and less a bottleneck - like getting data out of DMA buffers for the Audio library or USB. That single read was first read hitting 4 cycles - the repeat read I did to try to get the cache involved dropped to 3 cycles - so as tested the cycles used dropped - but not as much as faster clocking. If this is actually not a bad idea for any reason and running div_ipg = 2 is a good/doable thing - it might be better to disable the RAM2 cache to avoid the invalidation issues, hassle and overhead for DMA use. And if the cache only covers the FLASH that would make that 32KB go farther for data reads.

  19. #19
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    6,919
    We'd have to check if it is indeed stable to use /2.
    We should look at the clock-distribution diagram (ref manual) to see which peripherals are affected (and hope that it is correct).

  20. #20
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    11,963
    Quote Originally Posted by Frank B View Post
    We'd have to check if it is indeed stable to use /2.
    We should look at the clock-distribution diagram (ref manual) to see which peripherals are affected (and hope that it is correct).
    Indeed: Paul set it that way for some good reason. Maybe it is spec and like doubling F_BUS on T_3.x is OC, though there might be something specific that breaks I didn't test.

  21. #21
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    6,919
    The div 4 is kind of default and mentioned in the manual, too.
    I don't know why.
    Maybe this bus got used in other CPUs and just wasn't tested for higher speeds? NXP internal Bureaucracy?
    Failures with higher speeds?

  22. #22
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    22,279
    NXP's specs say IPG clock is 150 MHz max. Like F_BUS on Teensy 3.x, most of the peripherals can probably overclock quite a lot. But faster than 150 MHz is risking trouble since it's over the spec.

    According to the documentation I've seen, OCRAM is always accessed at CPU/4 speed. IPG is for peripherals. It isn't supposed to be used for OCRAM.

    I'm curious to see how these measurements are being made. Maybe there's a mistake in NXP's documentation (or my reading & understanding of it). Or maybe the test is sensitive to IPG clock in some way?

  23. #23
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    11,963
    Using the value of 2 as above code below should show the numbers above. I hacked out the other part that tested RAM and the Free ITCM area above allocated code.

    PROBLEM: Oops. In running this reduced code, one whole section of output is skipped with 'div_ipg = 2'.
    The prior code had more noise and I was only looking at the printed summary shown below. But this showDmaDiff( DMA_SIZE/10, 1 ); should be dumping 1/10th of the array to USB, and it never does with a divisor of 2 but does with 4. FAIL! Only the diff summary prints; that uses the same function with param2=0.
    Adding delayMicroseconds(200) twice in the inner loop for each line did not help. I did run a lines-per-second test and that ran fine to the PC as well.


    Code:
    #define DMA_SIZE 32000
    DMAMEM uint32_t pDMA[3][DMA_SIZE];
    Is allocated - here 3 sets of 128KB - and then in a loop filled with ARM_DWT_CYCCNT values. Then the diff between those assignments indicates the cycles between value storage. Maybe the cache writes in the background faster than the code runs.

    With: if (div_ipg > 2) div_ipg = 2;
    Code:
        ============================  DMA TEST Single
    
    pDMA[0]@202156e8 diff=384
    Avg CycCnt for 32000 is 2.012469
    CC@0=1 CC@2=31995 CC@7=2 CC@9=1 CC@29=1 
        ============================
    
    Avg CycCnt for 32000 is 2.000000
    CC@0=2 CC@2=63994 CC@7=2 CC@9=1 CC@29=1 
        ============================
    
    Avg CycCnt for 32000 is 2.000000
    CC@0=3 CC@2=95993 CC@7=2 CC@9=1 CC@29=1 
        ============================
    With: if (div_ipg > 4) div_ipg = 4;
    Code:
        ============================  DMA TEST Single
    
    pDMA[0]@2020ac6c diff=503
    Avg CycCnt for 32000 is 4.005844
    CC@0=1 CC@2=24688 CC@5=1 CC@6=3327 CC@7=2 CC@9=1 CC@10=653 CC@12=1 CC@14=2020 CC@18=1305 CC@29=1 
        ============================
    
    Avg CycCnt for 32000 is 3.999656
    CC@0=2 CC@2=49687 CC@3=1 CC@5=1 CC@6=6327 CC@7=2 CC@9=1 CC@10=1653 CC@12=1 CC@14=3020 CC@18=3304 CC@29=1 
        ============================
    
    Avg CycCnt for 32000 is 3.999750
    CC@0=3 CC@2=74686 CC@3=1 CC@5=1 CC@6=9328 CC@7=2 CC@9=1 CC@10=2653 CC@12=1 CC@14=4020 CC@18=5303 CC@29=1 
        ============================
    ** The CC is short for CycleCount. Diffs between consecutive elements are made and those are the tallies of CC's in those diffs. There are 30 bins - so anything 29 and over goes there.

    Code:
    // ____RAM2 Speed test_____________________________________________________________________
    #define DMA_SIZE 32000
    DMAMEM uint32_t pDMA[3][DMA_SIZE];
    // ____RAM2 Speed test_____________________________________________________________________
    
    
    void setup()  {
      while (!Serial);  // Wait for Arduino Serial Monitor to open
      Serial.println("\n\n++++++++++++++++++++++");
    }
    
    #define NUM_SHOW sizeofFreeITCM
    void loop() {
      uint32_t ii, jj;
    
      Serial.println("\n    ============================  DMA TEST Single");
      for ( jj = 0; jj < 3; jj++ ) {
        for ( ii = 0; ii < DMA_SIZE; ii++ ) {
          pDMA[jj][ii] = ARM_DWT_CYCCNT;
        }
      }
      showDmaDiff( DMA_SIZE/10, 1 );
      Serial.println("\n    ============================  DMA TEST Single");
      showDmaDiff( DMA_SIZE, 0 );
      while (1) delay(1);
    }
    
    
    void showDmaDiff( uint32_t nCnt, uint32_t view ) {
    #define MAX_LOG 30
      uint32_t ii, jj;
      uint32_t cLog[MAX_LOG];
      for ( ii = 0; ii < MAX_LOG; ii++ ) cLog[ii] = 0;
      for ( jj = 0; jj < 3; jj++ ) {
        uint32_t ccL = 0, kk = 0, av = 0, tc = 0;
        for ( ii = 0; ii < nCnt; ii++ ) {
          kk++;
          if ( ii > 0 && ( pDMA[jj][ii] - pDMA[jj][ii - 1] ) != ccL ) {
            if ( view ) Serial.printf( "\npDMA[%u]@%x \t", jj, &pDMA[jj][ii] );
            ccL = pDMA[jj][ii] - pDMA[jj][ii - 1];
            if ( view ) Serial.printf( "after run of %4u Diff CycCnt is %3u \twith %u!=%u", kk, ccL, pDMA[jj][ii - 1], pDMA[jj][ii] );
            kk = 0;
          }
          tc++;
          av += ccL;
          if ( ccL < MAX_LOG )
            cLog[ccL]++;
          else {
            cLog[MAX_LOG - 1]++;
            Serial.printf( "\npDMA[%u]@%x diff=%u", jj, &pDMA[jj][ii], ccL );
          }
        }
        av += ccL;
        ccL = pDMA[jj][ii - 1] - pDMA[jj][ii - 2];
        if ( view) Serial.printf( "\nDONE after CNT of %4u Diff CycCnt is %3u \twith %u!=%u", kk, ccL, pDMA[jj][ii - 2], pDMA[jj][ii - 1] );
        Serial.printf( "\nAvg CycCnt for %u is %f\n", tc, (float)av / tc );
        for ( ii = 0; ii < MAX_LOG; ii++ ) {
          if (cLog[ii]) Serial.printf( "CC@%u=%u ", ii, cLog[ii] );
        }
        Serial.println("\n    ============================");
      }
    }

  24. #24
    Senior Member+ KurtE's Avatar
    Join Date
    Jan 2014
    Posts
    7,127
    Back to Interval Timer - The simple delta I did was simply to change the magic numbers that limit how low the value can go, as to allow it to go low enough to hopefully allow the users to get about the same Low value (high frequency) that they can get with a T3.2 at default CPU speed...

    I figured for this round do a really safe update as Arduino 1.8.11 will probably release in the next few days, and then hopefully Paul can get out a version of Teensyduino that works with the new Arduino very quickly, which would imply minimal changes...

    After this, I personally think we should enhance IntervalTimer to allow us to go faster. Maybe always or maybe only when we need it... My first attempt (assuming someone else does not update it before), might be something like:
    Default to peripheral speed - And maybe only if the sketch tries to get an interval that is so slow that it would overflow 32 bit value, then see if it makes sense to change it to 24mhz OSC value.
    The interesting problem is that this impacts both PIT and GPT. So question would be detecting these conditions...

    At a minimum I think the IntervalTimer code should look at the configuration of the CSCMR1 register to know what speed it is running at, and then use that, instead of being hard-wired for 24 MHz... @Paul - If you want I can try making another delta/PR that does this, so that in the meantime a sketch could set up faster speeds in its setup code.
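
    Roughly what I have in mind, as a host-side sketch rather than a patch. The bit layout (PERCLK_PODF in bits 0..5, PERCLK_CLK_SEL in bit 6, divider = PODF + 1) is my reading of the macros used in startup.c in post #4 and should be checked against the reference manual; `perclkHz` and `cyclesFor` are hypothetical names.

```cpp
#include <cstdint>

// Assumed bit layout of CCM_CSCMR1: PERCLK_PODF in bits 0..5, CLK_SEL in bit 6.
constexpr uint32_t kPerclkPodfMask = 0x3F;
constexpr uint32_t kPerclkClkSel = 1u << 6;

// Hypothetical helper: timer clock in Hz for a given CSCMR1 value.
// Assumes CLK_SEL = 1 selects the 24 MHz oscillator, CLK_SEL = 0 selects the
// IPG clock, and the divider is PODF + 1.
constexpr uint32_t perclkHz(uint32_t cscmr1, uint32_t ipgHz) {
    return ((cscmr1 & kPerclkClkSel) ? 24000000u : ipgHz) /
           ((cscmr1 & kPerclkPodfMask) + 1);
}

// Cycle count for a requested period, using the decoded clock instead of a
// hard-wired 24 MHz.
constexpr uint32_t cyclesFor(float microseconds, uint32_t clockHz) {
    return static_cast<uint32_t>(clockHz / 1000000.0f * microseconds - 0.5f);
}
```

    On the real chip the first argument would be the CCM_CSCMR1 register and the second the current IPG frequency; here they are plain parameters so the arithmetic can be checked on a PC.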

    The question is what about other code. For example, does (or will) TeensyTimerTool, when it uses GPT, look at the CSCMR1 register to see and/or modify the settings, or does it simply assume 24 MHz?

    Now as to current conversation going on. From what @Paul mentioned, it seems safer to stay with 150mhz as the default. But again wonder if we should allow for 200mhz or 300mhz in same way we allow for faster F_BUS for T3.x...

  25. #25
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    990
    I see two problems.

    1. For the T4 all 4 PIT timers run on one module (the T3.x had one module for each timer). That means, if you change clock or whatever, this affects all other interval timers as well. I can imagine that some external libraries innocently using a IntervalTimer will have a hard time coping with that.
    2. Making the timer faster is one thing, but you should really look at the load this generates. Currently the T4 gets already overloaded by 4 PITs just flipping pins at about 300kHz. So, making it possibly faster without solving the load issue might generate more issues than it solves. Further up in this thread I showed measurements with the TMRs since they run on 150MHz. One TMR overloads the T4 (can't even blink the LED) at about 4MHz, didn't do experiments with more timers running in parallel so far.


    TeensyTimerTool (or will it) when it uses GPT look at CSCMR1 register to see and or modify the settings or does it simply assume 24mhz.
    Currently it assumes 24 MHz, but that can be improved of course. Generally I try to allow users to do those settings on the timer module they grabbed (WIP). This, however, only works if the settings are local to the timer module (like prescaler etc.). I don't know if one can set the clock on a dedicated timer module without affecting other peripherals.
