trying to port some assembly.

Status
Not open for further replies.

neutron7

Well-known member
I have been working on this thing: there is a Teensy 3.6, some multiplexers and output op amps on the back, and not much else except power and a supervisor.
21122358_1615308041877833_1484440016237727740_o.jpg


One of the things it does is generate waveforms from sine waves, using this code (taken from http://www.coranac.com/2009/07/sines/):

Code:
static inline int32_t SIN3(int32_t x)
{
  // S(x) = x * ( (3<<p) - (x*x>>r) ) >> s
  // n : Q-pos for quarter circle             13
  // A : Q-pos for output                     12
  // p : Q-pos for parentheses intermediate   15
  // r = 2n-p                                 11
  // s = n+p+1-A                              17

  static const int qN = 13, qA = 12, qP = 15, qR = 2 * qN - qP, qS = qN + qP + 1 - qA;

  x = x << (30 - qN);     // shift to full s32 range (Q13->Q30)
  if ( (x ^ (x << 1)) < 0) // test for quadrant 1 or 2
    x = (1 << 31) - x;
  x = x >> (30 - qN);
  return x * ( (3 << qP) - (x * x >> qR) ) >> qS;
}

It works very well, with no need for interpolation or anything, but the author mentions that it can run even faster using the ARM assembly code he provides (as fast as a wavetable lookup with interpolation). That would be great, because the more oscillators, the better!
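
For reference, the scaling follows from the comments in the code: a quarter circle is 2^13, so a full circle is 2^15 angle units, and the output is Q12, so 4096 means 1.0. Below is a minimal sketch of the same arithmetic with the wrap-around done in unsigned math, so it does not rely on shifting into the sign bit. The name sin3_q12 is made up for illustration, and the code assumes the compiler converts to int32_t with two's-complement wrap and shifts negative values arithmetically, as GCC does.

```c
#include <stdint.h>

/* Third-order sine approximation, same arithmetic as SIN3.
   x      : angle, 2^15 units per full circle (quarter circle = 2^13)
   return : sine in Q12 (4096 == 1.0)
   Hypothetical name; assumes arithmetic right shift of negative values. */
static inline int32_t sin3_q12(int32_t x)
{
  uint32_t u = (uint32_t)x << 17;        /* quadrant bits move to bits 31 and 30 */

  if ((u ^ (u << 1)) & 0x80000000u)      /* bits 31 and 30 differ: quadrant 1 or 2 */
    u = 0x80000000u - u;                 /* mirror the angle around the peak */

  int32_t y = (int32_t)u >> 17;          /* back to Q13, sign-extended */

  /* y * (3/2 - y^2/2) with the Q-format shifts from the original */
  return (y * ((3 << 15) - ((y * y) >> 11))) >> 17;
}
```

With these units, sin3_q12(8192) is the positive peak and sin3_q12(24576) the negative one. At 45 degrees (x = 4096) the cubic gives 2816, against an exact value of about 2896.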

I have no idea how the assembler code works. I have looked at some examples from the PJRC audio library, but to be honest I have little idea what I am doing :)

Code:
static inline int32_t iSIN3(int32_t x)
{  
  int32_t out;
  //register int32_t _r0 asm("r0");
  asm volatile (
    "mov     r0, r0, lsl #(30-13) \n"
    "teq     r0, r0, lsl #1       \n"
    "rsbmi   r0, r0, #1<<31       \n"
    "mov     r0, r0, asr #(30-13) \n"
    "mul     r1, r0, r0           \n"
    "mov     r1, r1, asr #11      \n"
    "rsb     r1, r1, #3<<15       \n"
    "mul     r0, r1, r0           \n"
    "mov     r0, r0, asr #17      \n"
    "bx      lr                   \n"
    : "=r0" (out)
    : "r0" (x)
);
return out;
}

Here is what I have so far. I get "matching constraint not valid in output operand" when I try to compile it.
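
For what it's worth, GCC does not read "=r0" as "register r0". It reads it as the constraint letter "r" followed by "0", and a digit is a matching constraint ("same place as operand 0"), which is only allowed on inputs. That is where the error message comes from. Registers are normally left to the compiler and referred to as %0, %1 and so on in the template. Here is a minimal sketch of the operand syntax, using only the first shift of the routine. The name shl17 is made up, and the asm path is only compiled when __thumb2__ is defined (as it should be for a Cortex-M4 build), with plain C otherwise.

```c
#include <stdint.h>

/* Hypothetical helper: shift left by 17, to show %0/%1 operand binding. */
static inline int32_t shl17(int32_t x)
{
#if defined(__thumb2__)
  int32_t out;
  __asm__ (
    "lsl %0, %1, #17"   /* %0 = out, %1 = x; the compiler picks the registers */
    : "=r" (out)        /* output: any general-purpose register */
    : "r" (x)           /* input: any general-purpose register */
  );
  return out;
#else
  /* Portable fallback so the sketch also builds on non-ARM hosts. */
  return (int32_t)((uint32_t)x << 17);
#endif
}
```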
 
That appears to be using the ARM instruction set and will have to be converted to the Cortex M4 (thumb mode) set.

At the very least you will have to insert "it mi \n" before the rsbmi instruction.
 
Thank you, now it compiles, but unfortunately the Teensy stops when the function is called.

Code:
static inline int32_t iSIN3(int32_t x)
{
  int32_t out;
  asm volatile (
    "mov     %0, %0, lsl #(30-13) \n\t"
    "teq     %0, %0, lsl #1       \n\t"
    "it mi                        \n\t"
    "rsbmi   %0, %0, #1<<31       \n\t"
    "mov     %0, %0, asr #(30-13) \n\t"
    "mul     %1, %0, %0           \n\t"
    "mov     %1, %1, asr #11      \n\t"
    "rsb     %1, %1, #3<<15       \n\t"
    "mul     %0, %1, %0           \n\t"
    "mov     %0, %0, asr #17      \n\t"
    "bx      lr                   \n"
    : "=r" (out)
    : "r" (x)
  );
  return out;
}
 
I would look at the assembly generated for the C version of the function and take it from there.

The "bx lr" instruction branches to whatever address is in lr, so execution never reaches your return statement:
Code:
return out
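
Putting the points from this thread together (Thumb-2 needs the "it mi", and "bx lr" has no place inside inline asm), a working version might look like the sketch below. It makes two further changes, both assumptions about what else was going wrong. First, x is bound as a read-write operand ("+r"), so the asm starts from the input rather than from the uninitialised out. Second, the scratch register is a separate early-clobber output ("=&r"), so no input is overwritten. The name iSIN3_fixed is made up. The asm path is only built when __thumb2__ is defined and has not been run on hardware here. The C fallback is included only so the block compiles elsewhere.

```c
#include <stdint.h>

/* Hypothetical corrected port; input 2^15 units per circle, output Q12. */
static inline int32_t iSIN3_fixed(int32_t x)
{
#if defined(__thumb2__)
  int32_t tmp;                          /* scratch register, chosen by the compiler */
  __asm__ (
    "lsls   %0, %0, #17          \n\t"  /* move quadrant bits to bits 31/30 */
    "teq    %0, %0, lsl #1       \n\t"  /* N flag set in quadrant 1 or 2 */
    "it     mi                   \n\t"  /* Thumb-2: conditional needs an IT block */
    "rsbmi  %0, %0, #0x80000000  \n\t"  /* mirror the angle */
    "asrs   %0, %0, #17          \n\t"  /* back to Q13 */
    "mul    %1, %0, %0           \n\t"  /* tmp = x*x */
    "asrs   %1, %1, #11          \n\t"
    "rsb    %1, %1, #0x18000     \n\t"  /* tmp = (3<<15) - tmp */
    "mul    %0, %1, %0           \n\t"
    "asrs   %0, %0, #17          \n\t"  /* no bx lr: fall through to the C return */
    : "+r" (x), "=&r" (tmp)
    :
    : "cc"                              /* condition flags are modified */
  );
  return x;
#else
  /* Portable fallback with the same arithmetic (assumes arithmetic >>). */
  uint32_t u = (uint32_t)x << 17;
  if ((u ^ (u << 1)) & 0x80000000u)
    u = 0x80000000u - u;
  int32_t y = (int32_t)u >> 17;
  return (y * ((3 << 15) - ((y * y) >> 11))) >> 17;
#endif
}
```

The flag-setting shifts (lsls, asrs) are used because they have 16-bit encodings when low registers are allocated.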
 
I took a look at the code generated by the C version using objdump -S and I don't see that the assembly language version is an improvement:

Code:
 648:   0443            lsls    r3, r0, #17
 64a:   ea93 4080       eors.w  r0, r3, r0, lsl #18
 64e:   bf48            it      mi
 650:   f1c3 4300       rsbmi   r3, r3, #2147483648     ; 0x80000000
 654:   145b            asrs    r3, r3, #17
 656:   fb03 f203       mul.w   r2, r3, r3
 65a:   12d2            asrs    r2, r2, #11
 65c:   f5c2 32c0       rsb     r2, r2, #98304  ; 0x18000
 660:   fb03 f002       mul.w   r0, r3, r2
 664:   1440            asrs    r0, r0, #17
 666:   4770            bx      lr

(I compiled using "-O2")

Every mov instruction in the hand-written version carries a shift, whereas the compiler emits explicit shift instructions (lsls, asrs) instead. The Cortex-M4 documentation calls this the preferred syntax, but it does not mention that the dedicated shift instructions have a shorter encoding. As a result, the compiler-generated code is four 16-bit words shorter.

So in effect the compiler generates the same code with a shorter encoding. I bet it runs faster too.
 
Thank you. I tested the timing of the "almost working" version (which outputs half a sine) and it did not run any faster. It was worth a try, though!
 