Disassembly / Other IDEs


Hi All,

I'm proposing to use a Teensy 3.6 for a university project looking at architecture and compiler optimisation of audio processing.
I intend to write my own demo code for simple audio processing for comparison. In order to achieve these aims, I would like to have a look at the disassembly. Is there some way of doing this using Teensyduino?

Regards,
Cosford.
 
Somewhere in the Teensyduino tree you should find the cross-toolchain files (cross compiler and binutils), which include gcc/g++, ld, as and strip, but also objdump. objdump can show the assembly of the resulting ELF binary using -d or -D, possibly in combination with -S to interleave the original source code. Note that -S only works if the ELF file was built with debug information and has not been stripped.
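For example (a minimal sketch: the toolchain path varies by Teensyduino version, and firmware.elf stands in for your actual build output):

    arm-none-eabi-objdump -d -S firmware.elf > firmware.lst

The resulting .lst file contains the disassembly with source lines interleaved wherever debug information is available.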
 
I'm hoping that I may be able to view the type of disassembly that breaks down the higher-level source lines and shows how they were implemented. Is that the -S option you're referring to?
 
In a way, yes. With optimizations enabled, multiple source lines may get merged, which at times makes it difficult to say where one line of the original source ends and the next starts. To know what is really going on, you have to go through the assembly yourself. That also implies knowing the (E)ABI, so you can track function/method parameters, local variables and things like that.

Debugging through the code step by step is usually another option, but Teensy is not made with JTAG in mind, nor have I seen anything gdb-related. If it is just about algorithms, you might try compiling the whole thing for a Cortex-M4 based SoC/CPU that QEMU supports, where you have more options in the debugging department: there are various GUIs supporting gdb, some of which include an assembly-level view, too.
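A rough sketch of that route (the machine name is just one example of a Cortex-M4 target that recent QEMU versions support, and firmware.elf stands in for your build output):

    qemu-system-arm -machine netduinoplus2 -kernel firmware.elf -S -s
    arm-none-eabi-gdb firmware.elf -ex "target remote :1234"

-S freezes the CPU at startup and -s opens a gdb server on TCP port 1234, so you can single-step from reset.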
 
I'm hoping that I may be able to view the type of disassembly that breaks down the higher-level source lines and shows how they were implemented. Is that the -S option you're referring to?

You would need to look at the GCC compiler sources for that. Unless you are looking for something simple, it can take months (or more) to become familiar enough with the GCC internals to trace a particular optimization. Generally, you probably want to treat the compiler as a black box, and concentrate more on the audio library (unless you want to learn the compiler internals in general).
 
Great, thanks for the help all. It sounds like objdump will probably do what I have in mind.
This project isn't intended as an in-depth analysis of GCC or optimisation; rather, as part of an overall audio processing project, I hope to demonstrate how performance varies (naturally) across different processor architectures, and being able to show how the resulting assembly code differs between those architectures helps in demonstrating that point.

Thanks all.
 
Note, particularly for audio, you may need to look at extended builtin functions on a per-architecture basis. For example, the ARM Cortex-M4F has a bunch of SIMD (single instruction, multiple data) instructions that can be used in audio processing. Sometimes the compiler can generate these automatically if you use the -O3 optimization level and just the right types, but more generally you have to write explicit calls in your code:
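A minimal sketch of what such an explicit call can look like (the wrapper name is hypothetical; the audio library uses the same pattern of inline functions wrapping inline assembly):

    #include <stdint.h>

    // Dual 16 bit multiply-accumulate: SMLAD multiplies the bottom halves
    // and the top halves of a and b, then adds both products to sum, all
    // in a single instruction.
    static inline int32_t multiply_accumulate_dual_16(int32_t sum, uint32_t a, uint32_t b)
    {
        int32_t out;
        asm volatile("smlad %0, %1, %2, %3" : "=r" (out) : "r" (a), "r" (b), "r" (sum));
        return out;
    }

With each uint32_t holding two packed 16 bit samples, a dot product loop needs only one of these calls per pair of samples.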
 
Hi Michael; yes, it's these types of instructions that I intended to talk about. I did assume that the compiler would generally pick those up automatically, so thanks for that. I'll give it a shot and see what comes of it.
 
Yes it's these types of instructions that I intended to talk about; I did assume that it would generally pick those up automatically ...

No, the compiler pretty much never uses the special DSP extension instructions.

If you look at the audio library code, they're defined in utility/dspinst.h as inline functions wrapping inline assembly. Look around the rest of the source and you'll find those function names used in many locations.

You'll see names like "16b" and "16t", meaning the bottom or top 16 bits of a 32 bit variable are used as a signed integer. Despite all the talk of SIMD, the most common speedup these instructions offer isn't the actual math, it's allowing 16 bit signed integers to be packed into 32 bit data. This doubles the bandwidth of reading and writing samples from RAM, and it effectively doubles the amount of data you can keep in the CPU's registers. Well, only if all the stuff you want to do with the 16 bit numbers can be done with these instructions. As soon as you need to mask off 16 bits or shuffle samples around to get them into their own registers, you've lost the speedup.
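To make the "16b"/"16t" naming concrete, here's a trivial sketch (the function name is just an example):

    #include <stdint.h>

    // Unpack one 32 bit word holding two signed 16 bit samples.
    static inline void unpack_16b_16t(uint32_t pair, int16_t *b16, int16_t *t16)
    {
        *b16 = (int16_t)(pair & 0xFFFF);  // "16b": bottom 16 bits
        *t16 = (int16_t)(pair >> 16);     // "16t": top 16 bits
    }

Every explicit unpack like this costs extra instructions, which is exactly the overhead the packed DSP instructions let you avoid.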

There's also a really useful 16x32 multiply which produces a 48 bit result, and then discards the low 16 bits (with or without roundoff). That turns out to be really handy, as you can see from its usage in many places.
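A sketch of that multiply in the same inline-assembly style (the function name is hypothetical; the non-rounding form maps to the SMULWB instruction):

    #include <stdint.h>

    // 32 x 16 multiply: the low 16 bits of the 48 bit product are
    // discarded, leaving a 32 bit result in a single register.
    static inline int32_t multiply_32x16b(int32_t a, uint32_t b)
    {
        int32_t out;
        asm volatile("smulwb %0, %1, %2" : "=r" (out) : "r" (a), "r" (b));
        return out;
    }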

Much of the optimization effort revolves around planning how much fits into the CPU's registers. Cortex-M4 also has a memory optimization, where the first read/write takes 2 cycles, but subsequent ones happen with only 1 cycle. So if you're able to dedicate 4 of the 13 available ARM registers to audio samples, you can bring in 8 samples with only 5 clock cycles. It all becomes a game of doing the most you can with the registers, without forcing the compiler to spill anything onto the stack. That 16x32 multiply is particularly handy, since its result takes only a single output register and doesn't temporarily waste another register just to handle the intermediate 48 bit result and discard the low 16 bits.
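A sketch of that load pattern (the buffer layout is hypothetical; with optimization on, the four consecutive word loads typically compile to a single LDM):

    #include <stdint.h>

    // Pull 8 packed 16 bit samples into 4 registers: 2 cycles for the
    // first word, 1 cycle for each of the remaining three, 5 in total.
    static inline uint32_t fetch_8_samples(const uint32_t *p)
    {
        uint32_t s01 = p[0];
        uint32_t s23 = p[1];
        uint32_t s45 = p[2];
        uint32_t s67 = p[3];
        // Placeholder use: real code would feed these into the DSP
        // instructions rather than XOR them together.
        return s01 ^ s23 ^ s45 ^ s67;
    }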

This sort of strategic planning of the algorithm and data processing around the register set, and around special instructions that don't conform to the C language's semantics of 8, 16, 32 and 64 bit variables, is far beyond how the compiler can optimize ordinary code.
 