Thread: Maximum GPIO toggling, (Teensy 3.2)

  1. #1
    Senior Member
    Join Date
    Nov 2015
    Location
    Wales
    Posts
    579

    Maximum GPIO toggling, (Teensy 3.2)

    Hey guys,

    I'm looking to use a Teensy 3.2 to write 16 bits of parallel data out at between 10MHz and 40MHz.
    To test whether this is possible, I've written what I think is the fastest possible way to toggle a pin, but the best I can achieve is 10.3MHz.

    Code:
    void setup() {
    
    	// configure the 8 output pins of Port D
    	pinMode(2, OUTPUT);		// #0
    	pinMode(14, OUTPUT);	        // #1
    	pinMode(7, OUTPUT);		// #2
    	pinMode(8, OUTPUT);		// #3
    	pinMode(6, OUTPUT);		// #4
    	pinMode(20, OUTPUT);	        // #5
    	pinMode(21, OUTPUT);	        // #6
    	pinMode(5, OUTPUT);		// #7
    	GPIOD_PDOR = 0x00000000;
    	
    	pinMode(pclkInterruptPin, INPUT);	// pclkInterruptPin is declared elsewhere in the sketch (not shown)
    }
    
    void loop() {
    	
    	noInterrupts();
    	clocking();
    	
    }
    
    FASTRUN static void clocking(void) {
    	begin:
    		asm (
    		
    		"mvns %0, %0\n\t"	// Invert Port D
    		
    		: "+r" (GPIOD_PDOR)
    		:
    	);
    	goto begin;
    	
    }
    This is compiled with -DF_CPU=144000000 and -O3 to produce:
    Code:
    1fff8760 <_ZL8clockingv>:
    1fff8760:	4a02      	ldr	r2, [pc, #8]	; (1fff876c <_ZL8clockingv+0xc>)
    1fff8762:	6813      	ldr	r3, [r2, #0]
    1fff8764:	43db      	mvns	r3, r3
    1fff8766:	6013      	str	r3, [r2, #0]
    1fff8768:	e7fb      	b.n	1fff8762 <_ZL8clockingv+0x2>
    [Attached image: 20151106-0001_01.gif]

    Am I going about this the wrong way? Or can this be sped up?
    Last edited by Xenoamor; 11-06-2015 at 02:48 PM.

  2. #2
    Senior Member+ Theremingenieur's Avatar
    Join Date
    Feb 2014
    Location
    Colmar, France
    Posts
    2,551
    Is the repeated calling of noInterrupts() on each run of loop() really necessary?
    Can't you "prepare" the data to output beforehand and then use DMA to write the port register rapidly?
    In the end, though, I fear that outputting parallel data at 40 MHz is somewhat beyond the Teensy 3.x's capabilities. There were special SCSI controller chips for that purpose when I was younger... ;-)

  3. #3
    Senior Member
    Join Date
    Nov 2015
    Location
    Wales
    Posts
    579
    Quote Originally Posted by Theremingenieur View Post
    Is the repeated calling of noInterrupts() on each run of loop() really necessary?
    Can't you "prepare" the data to output beforehand and then use DMA to write the port register rapidly?
    In the end, though, I fear that outputting parallel data at 40 MHz is somewhat beyond the Teensy 3.x's capabilities. There were special SCSI controller chips for that purpose when I was younger... ;-)
    Thanks for the speedy response Theremingenieur!
    The function 'clocking' actually is an endless loop so noInterrupts() only gets called the once. I will be using DMA for the real data transfer but for this I am just XORing Port D with 0xFFFFFFFF. This is only one ASM instruction so should be pretty fast for a test.

    Looking in more detail, 144MHz / 14 cycles = 10.29MHz, so this seems to be the magic number behind the frequency I'm seeing. It seems odd that the 5 ASM instructions take 14 cycles though. I'd confirm it, but I'm finding it hard to find detailed documentation for the Cortex-M4 assembly instructions.
    Last edited by Xenoamor; 11-07-2015 at 01:24 AM.
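
The arithmetic above can be checked with a few lines of C++. This is a minimal host-side sketch, not Teensy code; the function names and the cycle counts passed in are illustrative assumptions. It rests on one observation: each pass through the loop inverts the port once, and a full output period needs two inversions. If that reading is right, a measured 10.29MHz would correspond to roughly 7 cycles per pass (14 cycles per period) rather than 14 cycles per pass.

```cpp
#include <cassert>
#include <cstdint>

// Frequency of the square wave produced by a loop that inverts a pin once
// per pass. Two passes (high, then low) make up one full output period.
// cycles_per_pass is an assumed figure, not a measured one.
double square_wave_hz(double f_cpu_hz, std::uint32_t cycles_per_pass) {
    assert(cycles_per_pass > 0);
    return f_cpu_hz / (2.0 * cycles_per_pass);
}

// Inverse calculation: cycles per pass implied by a measured output frequency.
double implied_cycles_per_pass(double f_cpu_hz, double measured_hz) {
    assert(measured_hz > 0.0);
    return f_cpu_hz / (2.0 * measured_hz);
}
```

With F_CPU at 144MHz, square_wave_hz(144e6, 7) comes out at about 10.29MHz, and implied_cycles_per_pass(144e6, 10.29e6) at about 7.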

  4. #4
    Senior Member+ Theremingenieur's Avatar
    Join Date
    Feb 2014
    Location
    Colmar, France
    Posts
    2,551
    Sorry, I didn't see the function's own loop. It seems my eyes have been filtering out "goto" since the ancient BASIC days ;-)
    The Cortex M4 reference is here. (From it you can see that ldr usually takes 3 cycles.)
    Just for my scientific curiosity, since one never knows what the Arduino build process does before handing the code to the compiler: does it make a difference if you put everything in setup() and leave loop() empty?
    Last edited by Theremingenieur; 11-06-2015 at 05:21 PM.

  5. #5
    Seems like it could be branching to 1fff8764 instead of 1fff8762. Did you see what the compiler produces on its own if you just flip a local and set GPIOD_PDOR?

  6. #6
    Senior Member
    Join Date
    Nov 2015
    Location
    Wales
    Posts
    579
    This:
    Code:
    FASTRUN static void clocking(void) {
    	begin:
    	GPIOD_PDOR ^= 0xFFFFFFFF;
    	goto begin;
    }
    Generates this:
    Code:
    1fff8760 <_ZL8clockingv>:
    1fff8760:	4b02      	ldr	r3, [pc, #8]	; (1fff876c <_ZL8clockingv+0xc>)
    1fff8762:	681a      	ldr	r2, [r3, #0]
    1fff8764:	43d2      	mvns	r2, r2
    1fff8766:	601a      	str	r2, [r3, #0]
    1fff8768:	e7fb      	b.n	1fff8762 <_ZL8clockingv+0x2>
    1fff876a:	bf00      	nop
    1fff876c:	400ff0c0 	andmi	pc, pc, r0, asr #1
    Which is exactly the same as my own assembly set.


    Why can't I just do this? It seems to me like it would be significantly faster.
    Code:
    1fff8764:	ea6f 0303 	mvn.w	r3, r3
    1fff8768:	f7ff bffc 	b.w	1fff8764 <begin>
    The program just hangs and the port never gets toggled. I've also tried subtracting from the PC to skip backwards but the result is the same

  7. #7
    Quote Originally Posted by Xenoamor View Post
    This:
    Code:
    FASTRUN static void clocking(void) {
    	begin:
    	GPIOD_PDOR ^= 0xFFFFFFFF;
    	goto begin;
    }
    What about:

    Code:
    FASTRUN static void clocking(void) {
    	uint32_t toggle = GPIOD_PDOR;
    	begin:
    	toggle ^= 0xFFFFFFFF;
    	GPIOD_PDOR = toggle;
    	goto begin;
    }
    You shouldn't have to read Port D because you already know what's there, but the register is volatile, so the compiler can't optimize the read away for you.
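
The idea can be tried on a desktop compiler. The sketch below is an illustration under stated assumptions: `fake_pdor` is a hypothetical volatile variable standing in for GPIOD_PDOR, and the loops are bounded (unlike the endless loop in the thread) so that the example terminates. Because the register is volatile, the first version must load it on every pass; the second reads it once and keeps only the store inside the loop.

```cpp
#include <cstdint>

// Hypothetical stand-in for GPIOD_PDOR so the sketch runs on a host machine.
static volatile std::uint32_t fake_pdor = 0;

// Read-modify-write on the volatile register: one load and one store per pass.
std::uint32_t toggle_with_readback(std::uint32_t passes) {
    for (std::uint32_t i = 0; i < passes; ++i) {
        fake_pdor = fake_pdor ^ 0xFFFFFFFFu;
    }
    return fake_pdor;
}

// Shadow copy in a local: the register is read once, then only written.
std::uint32_t toggle_with_shadow(std::uint32_t passes) {
    std::uint32_t shadow = fake_pdor;   // single read before the loop
    for (std::uint32_t i = 0; i < passes; ++i) {
        shadow ^= 0xFFFFFFFFu;          // can stay in a CPU register
        fake_pdor = shadow;             // the store is what changes the pins
    }
    return fake_pdor;
}
```

The shadow copy is only valid while nothing else writes to the port, which is why it suits a loop that runs with interrupts disabled.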

  8. #8
    Senior Member
    Join Date
    Nov 2015
    Location
    Wales
    Posts
    579
    Quote Originally Posted by ecurtz View Post
    What about:

    Code:
    FASTRUN static void clocking(void) {
            uint32_t toggle = GPIOD_PDOR;
    	begin:
            toggle ^= 0xFFFFFFFF;
    	GPIOD_PDOR = toggle;
    	goto begin;
    }
    You shouldn't have to read Port D because you already know what's there, but the register is volatile, so the compiler can't optimize the read away for you.
    Thanks for taking the time ecurtz.

    Doing so compiles to:
    Code:
    1fff8760 <_ZL8clockingv>:
    1fff8760:	4b02      	ldr	r3, [pc, #8]	; (1fff876c <_ZL8clockingv+0xc>)
    1fff8762:	681b      	ldr	r3, [r3, #0]
    1fff8764:	4a01      	ldr	r2, [pc, #4]	; (1fff876c <_ZL8clockingv+0xc>)
    1fff8766:	43db      	mvns	r3, r3
    1fff8768:	6013      	str	r3, [r2, #0]
    1fff876a:	e7fc      	b.n	1fff8766 <_ZL8clockingv+0x6>
    1fff876c:	400ff0c0 	andmi	pc, pc, r0, asr #1
    What you've posted has cut the LDR instruction out of the loop. Testing will have to wait until Monday, but thank you. It should speed things up by at least two cycles!

    EDIT:
    I'll have a look into DMA triggered from an interrupt, but I'd be surprised if it's any faster. I'm not too sure what its overheads are, but I may be able to stagger the 4 channels to get a fast write/read speed.
    Last edited by Xenoamor; 11-06-2015 at 06:36 PM.

  9. #9
    Senior Member+ Theremingenieur's Avatar
    Join Date
    Feb 2014
    Location
    Colmar, France
    Posts
    2,551
    You could double the output speed by using 16 parallel bits and then multiplexing these externally into 2x8... just thinking aloud... :-)

  10. #10
    Senior Member
    Join Date
    Nov 2015
    Location
    Wales
    Posts
    579
    It's probably best to describe what I'm doing here:

    I'm looking to use a MAX9205 LVDS serialiser paired with a MAX9206 deserialiser. On the sending side I'll actually be using something quite beefy; I'm thinking of a quad-core Cortex-M4 sort of jobby whose role will be to process and stream video.

    The deserialisers will be on low-cost LED panels. To keep the cost down I'd like to use an MK10/20 for this, so ideally I'm looking to read and process data at 16-45MHz. If I can get the 4 DMA channels working well this should be feasible. I actually have a while to process the data once it's received, but it has to be read and stored at those speeds.

  11. #11
    Senior Member
    Join Date
    Nov 2015
    Location
    Wales
    Posts
    579
    Quote Originally Posted by Theremingenieur View Post
    Sorry, I didn't see the function's own loop. It seems my eyes have been filtering out "goto" since the ancient BASIC days ;-)
    The Cortex M4 reference is here. (From it you can see that ldr usually takes 3 cycles.)
    Just for my scientific curiosity, since one never knows what the Arduino build process does before handing the code to the compiler: does it make a difference if you put everything in setup() and leave loop() empty?
    Sorry, I seem to have missed this for some reason...
    But thanks, I've bookmarked that reference and printed out the instruction list. I have the feeling I'm going to be using this a lot from here on out.

    The arduino main function is as follows:
    Code:
    #include <Arduino.h>
    
    int main(void)
    {
        init();
    #if defined(USBCON)
        USBDevice.attach();
    #endif
        setup();
        for (;;) {
            loop();
            if (serialEventRun) serialEventRun();
        }
        return 0;
    }
    As my loop is never-ending, it can be placed in either setup() or loop() and it will compile exactly the same. The snag is that if you let loop() complete, you also run the added serialEventRun() call, which can cost precious clock cycles. Obviously this is needed for serial/USB communication though, so it's a trade-off as usual.

  12. #12
    Senior Member
    Join Date
    Nov 2015
    Location
    Wales
    Posts
    579
    Quote Originally Posted by Xenoamor View Post
    Doing so compiles to:
    Code:
    1fff8760 <_ZL8clockingv>:
    1fff8760:	4b02      	ldr	r3, [pc, #8]	; (1fff876c <_ZL8clockingv+0xc>)
    1fff8762:	681b      	ldr	r3, [r3, #0]
    1fff8764:	4a01      	ldr	r2, [pc, #4]	; (1fff876c <_ZL8clockingv+0xc>)
    1fff8766:	43db      	mvns	r3, r3
    1fff8768:	6013      	str	r3, [r2, #0]
    1fff876a:	e7fc      	b.n	1fff8766 <_ZL8clockingv+0x6>
    1fff876c:	400ff0c0 	andmi	pc, pc, r0, asr #1
    This yielded 18MHz. I'm now moving over to staggering DMA channels to speed this up.
    Cheers for the pointers guys

  13. #13
    Senior Member
    Join Date
    Sep 2015
    Location
    Taiwan, Asai. (Traditional Chinese)
    Posts
    159
    Reading through the posts, I learned about digitalWriteFast and so on, thanks to Paul.

    I would also like to know: is there a PORT register reference for the Teensy 3.1, covering registers like GPIO_PDOR?
    According to the datasheet pin map of the MK20DX256, the pin names go up to things like PTA12, PTA12, PTB19.
    Does that mean there are more than 8 pins in a port?

  14. #14
    Senior Member mortonkopf's Avatar
    Join Date
    Apr 2013
    Location
    London, uk
    Posts
    911
    Have a read of this post regarding port pin assignment; it probably contains what you are after: https://forum.pjrc.com/threads/17532...PIO_PDIR-_PDOR

  15. #15
    Senior Member
    Join Date
    Sep 2015
    Location
    Taiwan, Asai. (Traditional Chinese)
    Posts
    159
    @mortonkopf: Thanks, great! It's exactly what I wanted, but 32 bits at once is a little annoying.

  16. #16
    Senior Member
    Join Date
    Nov 2015
    Location
    Wales
    Posts
    579
    Quote Originally Posted by Po Ting View Post
    but 32 bit at once is a little annoying
    It's a 32bit processor... :L

    You can use a union to make life easier for yourself if you only want to address specific parts of a port's register.
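
A union along these lines might look like the sketch below. It is an illustration only: the type name `PortWord`, its members and the helper `with_low_byte` are hypothetical, the byte order assumes a little-endian target (which the Cortex-M4 on the Teensy is), and reading a union member other than the one last written relies on behaviour that GCC supports as an extension rather than on standard C++.

```cpp
#include <cstdint>

// Hypothetical overlay of a 32-bit port value with narrower views.
union PortWord {
    std::uint32_t word;      // all 32 bits at once
    std::uint16_t half[2];   // half[0] = bits 0..15 on little-endian targets
    std::uint8_t  byte[4];   // byte[0] = bits 0..7 on little-endian targets
};
static_assert(sizeof(PortWord) == 4, "overlay must be exactly one register wide");

// Replace only the low 8 bits of a port value and return the new word.
std::uint32_t with_low_byte(std::uint32_t current, std::uint8_t low) {
    PortWord p;
    p.word = current;    // start from the existing 32-bit value
    p.byte[0] = low;     // overwrite bits 0..7 only
    return p.word;
}
```

On the hardware the result would still be written back as a whole word, for example `GPIOD_PDOR = with_low_byte(GPIOD_PDOR, data);`, which is itself a read-modify-write of the register.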

  17. #17
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    21,298
    When changing only some of the bits within a register, don't forget to consider interrupts. If your code reads, modifies and writes the register, and so can an interrupt, you need to protect the read-modify-write sequence.
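
One way to protect such a sequence is to disable interrupts around it. The sketch below is hedged accordingly: `noInterrupts()` and `interrupts()` are standard Arduino calls, but here they are replaced by hypothetical host-side stubs, and `fake_port` stands in for GPIOD_PDOR, so that the example can run on a desktop compiler. The helper name `write_masked` is also an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-ins; on a Teensy these would be the real
// Arduino noInterrupts()/interrupts() and the GPIOD_PDOR register.
static volatile std::uint32_t fake_port = 0;
static int irq_lock_depth = 0;
static void noInterrupts() { ++irq_lock_depth; }
static void interrupts()   { --irq_lock_depth; }

// Change only the bits selected by mask, leaving the others untouched.
// The read, modify and write happen with interrupts disabled so that an
// interrupt handler cannot alter the register in between.
void write_masked(std::uint32_t mask, std::uint32_t value) {
    assert((value & ~mask) == 0 && "value must fit inside mask");
    noInterrupts();
    std::uint32_t reg = fake_port;   // read
    reg = (reg & ~mask) | value;     // modify
    fake_port = reg;                 // write
    interrupts();
}
```

Note that calling interrupts() unconditionally re-enables interrupts even if they were already disabled before the call. On the Kinetis parts the set, clear and toggle registers (GPIOD_PSOR, GPIOD_PCOR, GPIOD_PTOR) change selected bits with a single write, which avoids the read-modify-write when every affected bit moves in the same direction.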
