Need a little help with assembler

Status
Not open for further replies.

LouieG

Member
Hi All,

I'm new to Teensy/Arduino and definitely new to ARM assembly.
I have managed to create a program that reads digital pin 16 on a Teensy 3.1 n number of times.
I'm using the Arduino IDE with all the Teensy stuff installed.
My code runs fine, but I would like to know if it can be made faster, specifically, this loop:

Code:
     606:	f44f 717a 	mov.w	r1, #1000	; 0x3e8
     60a:	4b06      	ldr	r3, [pc, #24]	; (624 <loop1+0x16>) //load pin16 address
     60c:	4c06      	ldr	r4, [pc, #24]	; (628 <loop1+0x1a>) //load address of data array

0000060e <loop1>:
     60e:	681a      	ldr	r2, [r3, #0]  // read pin
     610:	f804 2f01 	strb.w	r2, [r4, #1]!  //store data and index the pointer
     614:	3901      	subs	r1, #1                  //dec
     616:	d1fa      	bne.n	60e <loop1>

From my calculations, it's running in 7 clock cycles (per loop). Does anyone see how this could be reduced?
Even 1 clock tick would be great. Can the indexing mode of the store instruction somehow also be used as the loop counter? In other words, is there a mode of the strb.w instruction that would allow a conditional branch once the index got to 1000? (Thereby getting rid of the need for the subs r1, #1 instruction)

Here's the whole enchilada:


Code:
#define HWSERIAL Serial1
int sensorPin = 16;    // select the input pin 
int sensorValue = 0;  // variable to store the value 
int rcount=0;
const int tmax=2000;
unsigned long mstart;
unsigned long mstop;
byte data[5000];
int batchcount=1;
int dmax=10;
int dcnt=0;
int i;

void setup() {
  HWSERIAL.begin(115200);
  mstart=micros();
  pinMode(sensorPin,INPUT);
}

void loop() {
  ReadOnTrigger1();
}
void ReadOnTrigger1(){
  dcnt=dmax;
  do {
    if (digitalReadFast(sensorPin)>0) { //count down while the pin reads high; trigger after dmax consecutive high reads
      dcnt--;  
    }
    else {
      dcnt=dmax; //pin read low, reset count
    }
  }   
  while (dcnt>0);
  FastRead();
  SendData();
}

void SendData(){
  HWSERIAL.print("Batch:");
  HWSERIAL.println(batchcount);
  HWSERIAL.print("Start Clock:");
  HWSERIAL.println(mstart);
  HWSERIAL.print(" End Clock:");
  HWSERIAL.println(mstop);

  for (int i=1;i<=tmax;i++){
    HWSERIAL.print(i);
    HWSERIAL.print(",     ");
    HWSERIAL.println(data[i]);
  }
  batchcount++;
  HWSERIAL.println("ENDDATA");
  HWSERIAL.flush();
  delay(250);
}

void FastRead(){
  data[1]=digitalReadFast(16);
  mstart=micros();
  asm volatile(
  "mov       r1, 1000\n\t"
  "ldr	r3, [pc, #24]\n\t"	//address of pin 16
  "ldr	r4, [pc, #24]\n\t"	//address of data[]
  "loop1:"
    "ldr	r2, [r3, #0]\n\t" //read pin 16
    "strb	r2, [r4, #1]!\n\t"  //store in data[]
    "subs r1,1 \n\t"                //dec
    "bne loop1  \n\t"
    :::
  "r1","r2","r3","r4"
    );
  mstop=micros();
}

Thanks for any help

Louie G
 
With the register starting at 1,000, you then do a decrement and a branch-if-not-zero back to loop1.

I got 100 micros from your loop ( 97 to 101)

Doing digitalReadFast(16); in for loop was 181 and a read and write was 214 (with no array assign)

void FastRead2() {
  mstart = micros();
  for ( int ii = 0; ii < 1000; ii ++ )
    data[1] = digitalReadFast(16);
  // digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN) );
  mstop = micros();
}
 
* Except looking at the instruction set for my first time it seems there is no valid encoding for Dec_CBNZ ?
 
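You could unroll the loop. In the most extreme case, you could simply replicate those 2 load and store instructions 1000 times. I believe they encode to 16 bits each, so that's only 32000 bits (4000 bytes) of flash. Going by the listing above, where the pre-indexed strb.w takes 32 bits, it would be closer to 6000 bytes.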
 
You can save one instruction by unrolling one time.
Since the load and store do not touch the flags, you can do the subtraction earlier
You might need to include a nop, I do not know the timing of the bne in this case, assuming it takes two cycles
Your loop probably takes 7 cycles
Code:
   loop1:
0,1:  ldr	r2, [r3, #0]
2,3:  strb.w	r2, [r4, #1]!
4:    subs	r1, #1
5,6:  bne	loop1
By unrolling one time, you can get down to 6 cycles
Code:
   loop1:
0,1:  ldr	r2, [r3, #0]
2,3:  strb	r2, [r4, #1]!
4:    subs	r1, #2
5:    nop
6,7:  ldr	r2, [r3, #0]
8,9:  strb	r2, [r4, #1]!
10,11:bne	loop1
If you do not care about proper timing (remember to also disable interrupts and avoid DMA while the loop is busy), you can remove the nop and get to 5.5 cycles. By unrolling more, you get closer and closer to 4.
Proper timing can only be achieved with 6 cycles, or a fully unrolled loop, i.e. 1000 times unrolled.
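
For readers who find the assembly hard to follow, the two-samples-per-pass loop above can be modelled in portable C++. This is only a sketch of the control flow, not a cycle-accurate replacement: the function name is made up for illustration, `src` stands in for the pin input register, and it assumes the count is even and non-zero, just as the assembly does.

```cpp
#include <cstdint>

// Portable model of the loop unrolled once. `dst` points one byte BEFORE the
// first slot to fill, mirroring the pre-indexed store `strb r2, [r4, #1]!`.
// Assumption: `count` is even and non-zero.
inline std::uint8_t *sampleUnrolledByTwo(const volatile std::uint32_t *src,
                                         std::uint8_t *dst,
                                         std::uint32_t count) {
    do {
        *++dst = static_cast<std::uint8_t>(*src);  // ldr + strb (first copy)
        count -= 2;                                // subs r1, #2
        *++dst = static_cast<std::uint8_t>(*src);  // ldr + strb (second copy)
    } while (count != 0);                          // bne loop1
    return dst;                                    // last byte written
}
```

The decrement sits between the two copies only to mirror the assembly, where it can go anywhere before the branch because ldr and strb leave the flags alone.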
 
Hi Paul.
Great little boards BTW.
I thought of unrolling the loop, but I wanted to be able to change n (easily) if needed.
But (said in a Homer Simpson voice) overrrclooockinnnng yoouu saaayyyy? It's already going at 96, it can go faster?
 
KPC, I was trying to figure what you meant by unrolling it once. Then it was like Duh! Clever! I will try that when I get home.
Combining that with over clocking . . . bwah hah haaaaa
 
You can make it flexible below the maximum unrolling value. If you unroll 1024 times, and then jump to the 1024 - nth set of instructions.
Edit: use defines/macros for unrolling, don't just copy 1024 times, it will be a maintenance nightmare

Code:
.rept 1024
        ldr     r2, [r3, #0]
        strb    r2, [r4, #1]!
.endr
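
The "jump to the 1024 - nth set of instructions" idea can be sketched in portable C++ with a switch that falls through. This is an illustration only: the helper name is hypothetical and the block is unrolled 8 times instead of 1024 to keep it short. In the assembly version the entry address would have to be computed from the size of each ldr/strb pair, which looks like 6 bytes going by the disassembly earlier in the thread.

```cpp
#include <cstdint>

// Hypothetical helper: takes n samples (1..8) by entering an 8-way unrolled
// sequence part-way through. `dst` points one byte before the first slot,
// matching the pre-indexed store used in the thread.
inline std::uint8_t *sampleUpTo8(const volatile std::uint32_t *src,
                                 std::uint8_t *dst, unsigned n) {
    switch (n) {
        case 8: *++dst = static_cast<std::uint8_t>(*src);  // fall through
        case 7: *++dst = static_cast<std::uint8_t>(*src);  // fall through
        case 6: *++dst = static_cast<std::uint8_t>(*src);  // fall through
        case 5: *++dst = static_cast<std::uint8_t>(*src);  // fall through
        case 4: *++dst = static_cast<std::uint8_t>(*src);  // fall through
        case 3: *++dst = static_cast<std::uint8_t>(*src);  // fall through
        case 2: *++dst = static_cast<std::uint8_t>(*src);  // fall through
        case 1: *++dst = static_cast<std::uint8_t>(*src);
                break;
        default:
                break;  // 0 or out of range: take no samples
    }
    return dst;  // last byte written, or the original pointer if none
}
```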
 
You can make it flexible below the maximum unrolling value. If you unroll 1024 times, and then jump to the 1024 - nth instruction.

Other routines that unroll , say by 8, would test if number of loops is multiple of 8. If not, they would execute the remainder first, divide the counter by 8 and then enter the unrolled loop.
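
A rough sketch of that remainder-first pattern in portable C++, with a made-up function name and the same "pointer starts one byte before the first slot" convention as the assembly in this thread. Timing between the remainder loop and the unrolled body will not be uniform, so treat it as a model of the bookkeeping only.

```cpp
#include <cstdint>

// Takes n samples: first the n % 8 leftovers one at a time, then n / 8
// passes through a body unrolled eight times.
inline std::uint8_t *sampleUnrolledBy8(const volatile std::uint32_t *src,
                                       std::uint8_t *dst, std::uint32_t n) {
    for (std::uint32_t r = n % 8; r != 0; --r) {
        *++dst = static_cast<std::uint8_t>(*src);  // remainder samples
    }
    for (std::uint32_t blocks = n / 8; blocks != 0; --blocks) {
        *++dst = static_cast<std::uint8_t>(*src);  // eight copies per pass
        *++dst = static_cast<std::uint8_t>(*src);
        *++dst = static_cast<std::uint8_t>(*src);
        *++dst = static_cast<std::uint8_t>(*src);
        *++dst = static_cast<std::uint8_t>(*src);
        *++dst = static_cast<std::uint8_t>(*src);
        *++dst = static_cast<std::uint8_t>(*src);
        *++dst = static_cast<std::uint8_t>(*src);
    }
    return dst;  // last byte written, or the original pointer if n == 0
}
```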
 
Bear in mind, things are never easy when it comes to performance. If you unroll too many times, you may blow out the i-cache, since code that used to fit in your i-cache no longer does. In addition, it may cause the compiler to use so many registers, you have register spilling going on. On some systems, jump tables are fairly slow compared to if/then/else statements due to branch prediction. If you have if/then/else, you want to order the tests, so the most frequently used case is first, and depending on the machine, you want the most frequent case to fall through or be a separate jump. On microprocessors without floating point hardware (i.e. teensys), avoid doing calculations in floating point.
 
I was thinking, could DMA help? Never actually played with DMA on the teensy, but from other experience I know that DMA could save a read/write cycle.
Or does DMA on the Cortex-M4 have a fixed auto-increment?
 
Do I understand your plan correctly? You want to sample pin data as fast as possible and store it in an array[1000]? I'd say DMA can do that with a continuous trigger.
 
christoph, Yes that is the plan.

So far my unmodified code does 1000 samples at a rate of 13.7M samples/sec.
I was hoping to get to 16M, which I think I can do with kpc's idea of a two-for-the-price-of-one loop unwrapping.
Paul also suggested fully unwrapping the loop, which kpc made clear that a macro could be immensely helpful with, so I might just try that too.
And then Paul showed me some hidden goodies in the "overclocking" department. (the words of Scotty come to mind: "I'm givin' 'er all she's got!")
Side-question: Hey Paul, should I cool this when I punch it up to 168? Heat sink? Ice Cube? Dangle out window in frozen north?

Now as far as DMA? I've never done that in my life; I knew something like that might be possible, but how to do it and whether it would shave cycles, I surely don't know.
Will a continuous trigger/DMA be able to do a rate of 16M samples/sec? (boils down to 6 clock ticks @96MHz)
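
The arithmetic behind those figures is just clock frequency divided by cycles per sample. A small sketch, with helper names invented for illustration:

```cpp
// Samples per second for a loop that spends `cyclesPerSample` CPU cycles
// on each sample.
constexpr double samplesPerSecond(double cpuHz, double cyclesPerSample) {
    return cpuHz / cyclesPerSample;
}

// Cycle budget per sample needed to reach `targetRate` samples per second.
constexpr double cyclesNeeded(double cpuHz, double targetRate) {
    return cpuHz / targetRate;
}
```

With a 96 MHz clock, 7 cycles per sample gives roughly 13.7M samples/sec and 6 cycles gives 16M, which matches the numbers quoted above.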

Thanks again everyone

Louie G
 
You might also want to make your assembly a bit more symbolic. This allows more flexibility for the compiler to place registers, instead of declaring all registers clobbered. The address inputs especially are better defined that way. I suspect that the digitalReadFast is just there to get the [pc,#24].
Code:
register uint32_t cnt = 1000, sample;
register uint8_t *dst = data;
asm volatile(
"loop1:"
    "ldr     %[sample], [%[src]]\n\t"
    "strb    %[sample], [%[dst], #1]!\n\t"
    "subs    %[cnt],1 \n\t"                
    "bne     loop1\n\t"
    : [cnt] "+r" (cnt), [dst] "+r" (dst), [sample] "=r" (sample)
    : [src] "r" (&CORE_PIN16_PINREG)
);

Edit: check this link for inline assembly http://www.ethernut.de/en/documents/arm-inline-asm.html
 
kpc, Heh, heh. Good call! Yup, that's why it's there. I was going to explain that, but I didn't want to get off track.
I originally programmed it with named parameters but I ran into problems.
I have used the "e" constraint with an Uno to send a pointer of an array. I tried that in this code and it gave me an "impossible constraint" error.
I couldn't figure it out, so I just stuck the
"data[1]=digitalReadFast(16);"
line in there and it gave me the address of the array AND the port. Then I hardcoded them in. Kind of a hassle, but it worked.
Soooo, is the parameter/constraint syntax different between Arduino and Teensy?
Just answered my own question: http://www.nongnu.org/avr-libc/user-manual/inline_asm.html

It's taking a while to sink in, but ARM and AVR are different. The light is getting brighter.

I will definitely use your code (and the link) to understand the syntax, I was going to eventually ask about sending a pointer so thanks again!

Louie G
 
Unlike other architectures, the ARM architecture has no special registers for pointers etc. Any register can be used for everything, so although you can use other constraints, in general I just say it is a register and I don't care whether it is a pointer or whatever.
 
I had a look at the datasheet (chapter 21.4.4, DMA performance) and it seems that the desired 16 M requests per second is probably more than the DMA can provide. I'm not sure, though, so I'll open a separate thread to discuss that (here it is: https://forum.pjrc.com/threads/28174-DMA-performance).

Nonetheless:

DMA is really not that hard, and I think you should give it a try just for the experience. On the Teensy 3.1, 4 of the available 16 DMA channels provide a periodic trigger. One of the periodic interrupt timers (PITs) can be used to trigger the DMA channel, so the first thing to do would be to configure a PIT to generate a request at the desired frequency. You can also trigger the channel continuously (= as fast as possible). Have a look at the interface in DMAChannel.h
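
As a rough idea of what "configure a PIT to generate a request at the desired frequency" involves numerically, here is a sketch of the reload value calculation. It assumes the PIT counts bus clocks and fires once every (load value + 1) ticks, and that the bus runs at 48 MHz when the CPU runs at 96 MHz; both should be checked against the reference manual and the core headers. The function name is made up for illustration.

```cpp
#include <cstdint>

// Reload value for a periodic timer that fires every (value + 1) bus clocks.
// Assumption: triggerHz divides busHz evenly and is not larger than busHz.
constexpr std::uint32_t pitLoadValue(std::uint32_t busHz,
                                     std::uint32_t triggerHz) {
    return busHz / triggerHz - 1;
}
```

For a 1 MHz trigger on a 48 MHz bus this gives 47. A 16 MHz trigger would give 2, meaning a request every 3 bus clocks. Whether the DMA engine can keep up at that rate is the open question raised above.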

Regards

Christoph
 
The DMA channels have many cycles of latency, to copy the transfer descriptor registers into the DMA engine and do other setup stuff, before the actual bus cycles happen. This is well documented in the reference manual.

If you suffer that latency on every transfer, you'll never manage to achieve anything close to 16 million transfers per second.

Such speed (or faster) might be possible if done entirely with the "minor loop", where the transfer stays active in the DMA engine and runs as rapidly as possible. I'll be curious to hear how it goes, if you try.
 
Getting theeerrrree . . .

So, I tried a few things last nite.

1) I did the two-for-one loop. It worked great! Thanks kpc.
It brought me right to 16M samples per second which wound up being perfect.
I'm using this (right now) to "see" what my Uno is putting out on a digital pin.
I've got an assembly routine running on the Uno that just toggles pin 8. And I'm getting a 4MHz square wave(AFAICT it's square, probly not quite, I don't have a scope).
The output of the Teensy matches that perfectly, and whereas the Uno is 16MHz and the Teensy is 96MHz with a 6 cycle loop (errr 12/2 actually), the two are in near perfect lock step. (It seems to drift juuust a little every 10 to 15 times through). But I'm as happy as a pig in a poke, whatever that is.

2) I overclocked it . . . y'know in Star Trek where they push it and push it and then push it juuuuust a little too far? Yeah. So 120? Cool. 144? Yes! 168? Can I get a 168? Nahh. She gave it all she's got. But so cool to get it going that fast.

3) I unrolled the loop with the .rept macro. Niiiiiiice. Gave me a "truncated" something or other error. (Not home, cannot reproduce it here). But then I tested it by unrolling only 500 reps. That worked very well. But it was no longer in lock step with the uno, so the traces on screen kept changing in width.
(BTW, the Teensy is hooked up to my pc via serial and sends the data there, where I have a VB program displaying the "square wave" onscreen)

4) I changed the register references to named variables/symbols/parameters/arguments (don't know what to call them). It created an interesting problem:
Code:
000004dc <_Z8FastReadv>:
     4dc:	f44f 737a 	mov.w	r3, #1000	; 0x3e8
     4e0:	4a03      	ldr	r2, [pc, #12]	; (4f0 <loop1+0xc>)
     4e2:	4904      	ldr	r1, [pc, #16]	; (4f4 <loop1+0x10>)

000004e4 <loop1>:
     4e4:	6809      	ldr	r1, [r1, #0]  ;         <----- r1 clobbers r1  ?!?!?!?!?!?
     4e6:	f802 1f01 	strb.w	r1, [r2, #1]!
     4ea:	3b01      	subs	r3, #1
     4ec:	d1fa      	bne.n	4e4 <loop1>
     4ee:	4770      	bx	lr
     4f0:	1fff8820 	.word	0x1fff8820
     4f4:	400ff050 	.word	0x400ff050
It looks to me like it loads r1 with the data found at [r1], which then clobbers the address that was in r1, and then all goes bad.
I recreated the code below at work (on my break :)) and got the above assembly code:
Code:
byte data[2000];
void setup(){
Serial.begin(9600);

}
void loop(){
  
FastRead();

}
void FastRead()
{
register uint32_t cnt = 1000, sample;
register uint8_t *dst = data;
asm volatile(
"loop1:"
    "ldr     %[sample], [%[src]]\n\t"
    "strb    %[sample], [%[dst], #1]!\n\t"
    "subs    %[cnt],1 \n\t"                
    "bne     loop1\n\t"
    : [cnt] "+r" (cnt), [dst] "+r" (dst), [sample] "=r" (sample)
    : [src] "r" (&CORE_PIN16_PINREG)
);

}

So I dunno what to do about that. I tried all the different constraints (I think I did anyway). It kept throwing 'Impossible Constraint' errors.

Lastly, about the DMA stuff . . . I appreciate your confidence in me being able to comprehend this stuff.
I perused the DMAChannel.h stuff and it will be some time before I can comprehend C to that level.
So resist the urge to get up in the middle of the night to check my progress. ;)


So again, any help is welcome, and thanks for all the ideas so far.

Louie G
 
Can the Teensy's clock speed be changed in code?

Yes, it can.

In fact, there's several ways to do this.

The simplest way involves writing to the SIM_CLKDIV1 register, to change the clock dividers.

The chip has a pretty incredible number of clock features. You'll have to read the reference manual. There's a 4 MHz and 32 kHz internal oscillator, in addition to the crystal, so you have 3 fundamental clocks. There's a PLL and a FLL which can increase those, with various options to divide and multiply. Then after those are the simple dividers (controlled by SIM_CLKDIV1).
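
To give a feel for what a SIM_CLKDIV1 value looks like, here is a host-side sketch that packs three divider settings into one 32-bit word. The field positions (core divider in bits 31-28, bus divider in bits 27-24, flash divider in bits 19-16, each stored as divider minus one) are my assumption from the Kinetis documentation and must be checked against the reference manual for the exact chip. The function name is invented for illustration.

```cpp
#include <cstdint>

// Packs divide-by values (1..16) into a SIM_CLKDIV1-style word.
// Assumed layout: OUTDIV1 = bits 31-28, OUTDIV2 = bits 27-24,
// OUTDIV4 = bits 19-16, each holding (divider - 1).
constexpr std::uint32_t clkdiv1Value(std::uint32_t coreDiv,
                                     std::uint32_t busDiv,
                                     std::uint32_t flashDiv) {
    return (((coreDiv - 1u) & 0xFu) << 28) |
           (((busDiv - 1u) & 0xFu) << 24) |
           (((flashDiv - 1u) & 0xFu) << 16);
}
```

Dividing a 96 MHz source by 1, 2 and 4 (core, bus, flash) would come out as 0x01030000 under these assumptions. The reference manual lists maximum frequencies for each clock, so not every combination is safe to write.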

Configuring the clocks is tricky business. There are many different modes and a lot of complex requirements for limiting which modes can switch to which other modes (without crashing).

You might look at the code in mk20dx128.c, to see how it configures the clocks. Many different ways are in that file, depending on F_CPU set by the Tools > CPU Speed menu.

You might also look at Duff's Snooze library.

Did I mention it's complex?!
 
@LouieG, about the assembly: strange about the r1 register. I don't understand what is happening either. The safest way is probably to make every register read/writeable. I will check some more resources, since I also want to know whether it is a bug or my limited understanding of inline assembly.
Code:
register uint32_t *src = (uint32_t*) &CORE_PIN16_PINREG;
asm volatile(
"loop1:"
    "ldr     %[sample], [%[src]]\n\t"
    "strb    %[sample], [%[dst], #1]!\n\t"
    "subs    %[cnt],#1 \n\t"                
    "bne     loop1\n\t"
    : [cnt] "+r" (cnt), [dst] "+r" (dst), [sample] "=r" (sample), [src] "+r" (src) 
);

Compiles into
Code:
   8:   6808            ldr     r0, [r1, #0]
   a:   f802 0f01       strb.w  r0, [r2, #1]!
   e:   3b01            subs    r3, #1
  10:   d1fa            bne.n   8 <loop1>

Edit: Also asked the question on stackoverflow, maybe someone answers there
http://stackoverflow.com/questions/29243231/arm-cortex-m4-teensy-3-1-inline-assembly-constraints
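
One likely explanation, offered as an assumption and not something confirmed in this thread: GCC is allowed to put a plain "=r" output in the same register as an input, because it assumes all inputs are read before any output is written. In a loop that is not true, since src is read again after sample has been written. Marking the output as earlyclobber ("=&r") tells the compiler to keep it apart from the inputs. The sketch below also uses a numeric local label and adds "cc" and "memory" clobbers, which are my additions. The non-ARM branch is only a stand-in so the function can be exercised on a PC.

```cpp
#include <cstdint>

// Stores cnt samples from *src into dst[1] .. dst[cnt]; cnt must be non-zero.
inline void fastRead(const volatile std::uint32_t *src, std::uint8_t *dst,
                     std::uint32_t cnt) {
#if defined(__arm__)
    std::uint32_t sample;
    asm volatile(
        "1:\n\t"
        "ldr   %[sample], [%[src]]\n\t"         // read the pin register
        "strb  %[sample], [%[dst], #1]!\n\t"    // store low byte, advance dst
        "subs  %[cnt], %[cnt], #1\n\t"          // decrement the counter
        "bne   1b\n\t"                          // loop until it reaches zero
        : [cnt] "+r"(cnt), [dst] "+r"(dst), [sample] "=&r"(sample)
        : [src] "r"(src)
        : "cc", "memory");
#else
    // Portable stand-in with the same effect, used when not building for ARM.
    do {
        *++dst = static_cast<std::uint8_t>(*src);
    } while (--cnt != 0);
#endif
}
```

On the Teensy this would be called as fastRead(&CORE_PIN16_PINREG, data, 1000). Making src a "+r" operand, as in the code above, should have a similar effect because a read/write operand cannot share a register with another output.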
 
@kpc

Yep, that did it!

Will try it on actual Teensy much later tonight.

Buy yerself a beer and pretend I paid for it.

LG
 