A little competition :-)

Status
Not open for further replies.

Frank B

Senior Member
A little competition:

You can win nothing but glory, honor, and your mention in the source code.

Code:
/* From coder.h, ORIGINAL:
 do y <<= n, clipping to range [-2^30, 2^30 - 1] (i.e. output has one guard bit) 
*/
#define CLIP_2N_SHIFT(y, n) {                   \
        int sign = (y) >> 31;                   \
        if (sign != (y) >> (30 - (n)))  {       \
            (y) = sign ^ (0x3fffffff);          \
        } else {                                \
            (y) = (y) << (n);                   \
        }                                       \
    }

Wanted is a terribly fast(er) implementation of this macro:

The rules:
a) Understand the macro.
b) Assembler is expressly permitted, but not a must. If c-code is used: Use GCC 4.8.4
c) Prove that your solution is faster. And correct.
d) There may be no quicker solution. That is bad luck.
e) It is fun to you.

Explanation: This macro is used in the Helix codecs (compressed audio). Not very often. But I do not like it. :)

Edit : it's for Teensy 3. y is signed int (32 Bit), n too, but only positive values for n. We use GCC 4.8.4.
n is NOT compile-time constant.
 
Last edited:
Be sure to label your code as for ARM only (int is 16-its on AVR platforms), or perhaps change the int's to int32_t or long. Obviously if people write in ASM, it will tie you to the ARM 32-bit architecture.

Also, on some platforms, signed shift right might not sign extend, but those systems may be fairly rare these days. I wrote a C front end for a machine (long since RIP) that did not do arithmetic shifts, and the ISO C standard explicitly says that it is implementation defined. Most implementations support arithmetic shifts, but you are not guaranteed about this.
 
Ok, it's for Teensy 3. y is signed int (32 Bit), n too, but only positive values for n. We use GCC 4.8.4.
 
Last edited:
Well, I just tried ARM's SSAT instruction, and I can confirm it does not work. :(

ARM's documentation isn't clear (at least to me) about how it saturates. But a little testing shows the shift is done first without saturation, and then the result is saturated.
 
Yes, i tried SSAT too, with no luck. OK: i'm willing to accept SSAT - if, at the end of the day - it sounds (the sound-output .. uh..my english is not good enough) )equal or better. Maybe this is the case ?
There is an other Macro "CLIPTOSHORT", and i used SSAT there. It did not make any difference.

Edit: i'm NOT a thumb-assembler-expert.
 
Last edited:
It's midnight here.. i have to go to work tomorrow...so please understand, that i can't answer more questions now :)
 
Not, that I can contribute to this optimization, but I find interesting, what the complier does
Code:
#define CLIP_2N_SHIFT(y, n) {   \
    int sign = (y)>>31;         \
    if(sign != (y)>>(30-(n))){  \
        (y)=sign ^ 0x3fffffff;  \
    } else {                    \
        (y)=(y)<<(n);           \
    }                           \
}

int F1(int y, int n)
{ 	CLIP_2N_SHIFT(y,n);
	return y;
}

becomes (in principle)
Code:
  15              		.text
  16              		.align	2
  17              		.global	F1
  18              		.thumb
  19              		.thumb_func
  20              		.type	F1, %function
  21              	F1:
  22              		@ args = 0, pretend = 0, frame = 0
  23              		@ frame_needed = 0, uses_anonymous_args = 0
  24              		@ link register save eliminated.
  25 0000 C1F11E02 		rsb	r2, r1, #30
  26 0004 40FA02F2 		asr	r2, r0, r2
  27 0008 C317     		asrs	r3, r0, #31
  28 000a 9A42     		cmp	r2, r3
  29 000c 1ABF     		itte	ne
  30 000e 83F04040 		eorne	r0, r3, #-1073741824
  31 0012 C043     		mvnne	r0, r0
  32 0014 8840     		lsleq	r0, r0, r1
  33 0016 7047     		bx	lr
  34              		.size	F1, .-F1
  35              		.ident	"GCC: (GNU Tools for ARM Embedded Processors) 4.8.4 20140725 (release) [ARM/embedded-4_8-br

It is always instructive to see what assembler will do with your code.
 
It seems that the following macro is about 6% faster
Code:
#define CLIP_2N_SHIFT2(y, n) {            \
	(y) = (y)<<(n);	                  \
	if((y)>  1<<30 -1) (y) =  1<<30-1;\
	if((y)< -1<<30)    (y) = -1<<30;  \
}

Also, the numbers of instructions are fewer.
 
It seems that the following macro is about 6% faster
Code:
#define CLIP_2N_SHIFT2(y, n) {            \
	(y) = (y)<<(n);	                  \
	if((y)>  1<<30 -1) (y) =  1<<30-1;\
	if((y)< -1<<30)    (y) = -1<<30;  \
}

Also, the numbers of instructions are fewer.

Further proof that simple code also is better: -)
Code:
	lsls	r0, r0, r1
	cmp	r0, #536870912
	bgt	.L7
	cmp	r0, #-1073741824
	it	lt
	movlt	r0, #-1073741824
	bx	lr
.L7:
	mov	r0, #536870912
	bx	lr

Btw, GCC 4.9 produces the same output.
 
You forgot that the original code correctly saturates when the left-shift overflows. It's the same problem that Paul describes above with __SSAT(y<<n, 31). Bits are lost during the shift before saturation takes place.

Here is an alternate method:

Code:
int CLIP_2N_SHIFT(int x, int n)
{ 
	int y = __SSAT(x << n, 31);
	if (x != y >> n)
		y = 0x3fffffff ^ (x >> 31);
	return y;
}

that can be implemented in two less instructions than the original, if the constant is hoisted outside the loop:

Code:
// outside the loop
        MVN      r12,#0xc0000000

// r2 = CLIP_2N_SHIFT(r0, r1)
        LSL      r2,r0,r1
        SSAT     r2,#31,r2
        ASR      r1,r2,r1
        CMP      r1,r0
        IT       NE
        EORNE    r2,r12,r0,ASR #31
 
Thanks sublinear,
i don't think that the constant can moved outside.
But one cycle less is one cycle less ( Again , it 's not worth it to really optimize it, but it's a brain teaser :- )

There is a similar macro, which is used more often, without shift:
Code:
//clip to [-2^n, 2^n-1], valid range of n = [1, 30]
#define CLIP_2N(y, n) { \
	int sign = (y) >> 31;  \
	if (sign != (y) >> (n))  { \
		(y) = sign ^ ((1 << (n)) - 1); \
	} \
}
 
But one cycle less is one cycle less ( Again , it 's not worth it to really optimize it, but it's a brain teaser :- )

Nevertheless, it was a nice exercise and I refreshed my (limited) Assembler skills.
Also concerning optimization, here a bit, there a bit, could result in something useful.

BTW: Anyone knows how to program in C, or tell the compiler the following:
Code:
        IT       NE
        EORNE    r2,r12,r0,ASR #31
 
Nevertheless, it was a nice exercise and I refreshed my (limited) Assembler skills.
Also concerning optimization, here a bit, there a bit, could result in something useful.
I agree.

My Assembler skills are 20..25 years old (z80, 6502)..since then..nothing. The only ARM on the market was the ACORN., i think....

Code:
        IT       NE
        EORNE    r2,r12,r0,ASR #31

AFAIK: IT is "IF-THEN", make the next instruction(s) conditional.It is not a "real" instruction and does not need space.
EOR NE is then the conditional instruction, only executed when NE flag is set ( i can be wrong )

So.. i think there is no direct translation to c.
EOR is ^ in C

In pseudocode: if last operation was (edit: NOT) "eqal" then r2 = r12 ^ (r0 << 31) (i hope this is correct ?!)
 
Last edited:
@Frank
Is the following not what you wanted (here only for bit 9 and with positive and negative numbers), or do I miss something?

Code:
int nn;
void setup()
{ 
}

void loop()
{
  int ii,jj,nn;
  char text[80];
  int yy1,yy2;

  while(1)
  {  delay(1000);
    nn=2;
    for(ii=0; ii<100;ii+=5)
    {
      asm volatile("ssat %0, %1, %2" : "=r" (yy1) : "I" (9), "r" (-ii<<nn));
      asm volatile("ssat %0, %1, %2" : "=r" (yy2) : "I" (9), "r" (ii<<nn));
      sprintf(text,"%d %d %x %d %x",ii,yy1,yy1,yy2,yy2);
      Serial.println(text);
    }
  }
}

which results to
Code:
0 0 0 0 0
5 -20 ffffffec 20 14
10 -40 ffffffd8 40 28
15 -60 ffffffc4 60 3c
20 -80 ffffffb0 80 50
25 -100 ffffff9c 100 64
30 -120 ffffff88 120 78
35 -140 ffffff74 140 8c
40 -160 ffffff60 160 a0
45 -180 ffffff4c 180 b4
50 -200 ffffff38 200 c8
55 -220 ffffff24 220 dc
60 -240 ffffff10 240 f0
65 -256 ffffff00 255 ff
70 -256 ffffff00 255 ff
75 -256 ffffff00 255 ff
80 -256 ffffff00 255 ff
85 -256 ffffff00 255 ff
90 -256 ffffff00 255 ff
95 -256 ffffff00 255 ff

It seems to me, that there is no need to check on negative numbers, or am I wrong?

The assembler has only two lines (apart from moving data to and from registers)
Code:
  60 0024 04FA02F2 		lsl	r2, r4, r2
  61              	@ 110 "src/main.c" 1
  62 0028 02F30802 		ssat r2, #9, r2
  63              	@ 0 "" 2
  64              		.thumb
 
Code:
int nn;
void setup()
{ 
  delay(1000);
}

/* From coder.h, ORIGINAL:
 do y <<= n, clipping to range [-2^30, 2^30 - 1] (i.e. output has one guard bit) 
*/
#define CLIP_2N_SHIFT(y, n) {                   \
        int sign = (y) >> 31;                   \
        if (sign != (y) >> (30 - (n)))  {       \
            (y) = sign ^ (0x3fffffff);          \
        } else {                                \
            (y) = (y) << (n);                   \
        }                                       \
    }

void loop()
{
  int ii,nn;
  int yy1,yy2;

  while(1)
  {  
    Serial.println();
    delay(1000);
    nn=0;
    for(ii=((1<<30)-5); ii<((1<<30)+5);ii++)
    {
      asm volatile("ssat %0, %1, %2" : "=r" (yy1) : "I" (30), "r" (-ii<<nn));
      asm volatile("ssat %0, %1, %2" : "=r" (yy2) : "I" (30), "r" (ii<<nn));
      int z1 = ii;
      CLIP_2N_SHIFT(z1,nn);
      int z2 = -ii;
      CLIP_2N_SHIFT(z2,nn);
      Serial.printf("%x: %d %x %d %x   !=    %x %x\r\n",ii,yy1,yy1,yy2,yy2, z1, z2);
 
    }
  }
}
hmm...
 
Code:
hmm...[/QUOTE]

but replace by 30 by 31 (one bit above the maximum number, in my previous case 9 for -256 to 255 (256 =2^8, right?)
as in
[CODE]
int nn;
void setup()
{ 
  delay(1000);
}

/* From coder.h, ORIGINAL:
 do y <<= n, clipping to range [-2^30, 2^30 - 1] (i.e. output has one guard bit) 
*/
#define CLIP_2N_SHIFT(y, n) {                   \
        int sign = (y) >> 31;                   \
        if (sign != (y) >> (30 - (n)))  {       \
            (y) = sign ^ (0x3fffffff);          \
        } else {                                \
            (y) = (y) << (n);                   \
        }                                       \
    }

void loop()
{
  int ii,nn;
  int yy1,yy2;

  while(1)
  {  
    Serial.println();
    delay(1000);
    nn=0;
    for(ii=((1<<30)-5); ii<((1<<30)+5);ii++)
    {
      asm volatile("ssat %0, %1, %2" : "=r" (yy1) : "I" (31), "r" (-ii<<nn));
      asm volatile("ssat %0, %1, %2" : "=r" (yy2) : "I" (31), "r" (ii<<nn));
      int z1 = -ii;
      CLIP_2N_SHIFT(z1,nn);
      int z2 = ii;
      CLIP_2N_SHIFT(z2,nn);
      Serial.printf("%x: %x %x   ==    %x %x\r\n",ii,yy1,yy2, z1, z2);
 
    }
  }
}

na also
 
but replace by 30 by 31 (one bit above the maximum number, in my previous case 9 for -256 to 255 (256 =2^8, right?)
as in
Code:
int nn;
void setup()
{ 
  delay(1000);
}

/* From coder.h, ORIGINAL:
 do y <<= n, clipping to range [-2^30, 2^30 - 1] (i.e. output has one guard bit) 
*/
#define CLIP_2N_SHIFT(y, n) {                   \
        int sign = (y) >> 31;                   \
        if (sign != (y) >> (30 - (n)))  {       \
            (y) = sign ^ (0x3fffffff);          \
        } else {                                \
            (y) = (y) << (n);                   \
        }                                       \
    }

void loop()
{
  int ii,nn;
  int yy1,yy2;

  while(1)
  {  
    Serial.println();
    delay(1000);
    nn=0;
    for(ii=((1<<30)-5); ii<((1<<30)+5);ii++)
    {
      asm volatile("ssat %0, %1, %2" : "=r" (yy1) : "I" (31), "r" (-ii<<nn));
      asm volatile("ssat %0, %1, %2" : "=r" (yy2) : "I" (31), "r" (ii<<nn));
      int z1 = -ii;
      CLIP_2N_SHIFT(z1,nn);
      int z2 = ii;
      CLIP_2N_SHIFT(z2,nn);
      Serial.printf("%x: %x %x   ==    %x %x\r\n",ii,yy1,yy2, z1, z2);
 
    }
  }
}

na also

Ok.. hast du nn=1 versucht ?
 
Does this need to support all 31 possible values of "n"? Or are only certain, compile-time constant values used?

If only 2 or 3 values are ever used for n, and they're constants, maybe each case could be optimized separately?
 
Status
Not open for further replies.
Back
Top