Disassembly / Other IDEs


Hi All,

I'm proposing to use a Teensy 3.6 for a university project looking at architecture and compiler optimisation of audio processing.
I intend to write my own demo code for simple audio processing for comparison. In order to achieve these aims, I would like to have a look at the disassembly. Is there some way of doing this using Teensyduino?

Regards,
Cosford.
 
Somewhere in the Teensyduino tree you should find the cross-toolchain files (cross compiler and binutils), which include gcc/g++, ld, as and strip, but also objdump. objdump can show the assembly of the resulting ELF binary using -d or -D, possibly in combination with -S to interleave the original source code. Note that -S only works if the ELF file was built with debug information and has not been stripped.
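For example (a minimal sketch: the toolchain path varies by Teensyduino version, and firmware.elf stands in for your actual build output):

    arm-none-eabi-objdump -d -S firmware.elf > firmware.lst

The resulting .lst file contains the disassembly with source lines interleaved wherever debug information is available.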
 
I'm hoping that I may be able to view the type of disassembly that breaks down the higher-level source lines and shows how they were implemented. Is that the -S option you're referring to?
 
In a way, yes. With optimizations enabled, multiple source lines may get merged, which at times makes it difficult to say where one line of the original source ends and the next starts. To know what is really going on, you have to go through the assembly yourself. That also implies knowing the (E)ABI, so you can track function/method parameters, local variables and things like that.

Debugging through the code step by step is usually another option, but Teensy is not made with JTAG in mind, nor have I seen anything gdb-related. If it is just about algorithms, you might try compiling the whole thing for a Cortex-M4 based SoC/CPU that QEMU supports, where you have more options in the debugging department: there are various GUIs supporting gdb, some of which include an assembly-level view, too.
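A rough sketch of that route (the machine name is just one example of a Cortex-M4 target that recent QEMU versions support, and firmware.elf stands in for your build output):

    qemu-system-arm -machine netduinoplus2 -kernel firmware.elf -S -s
    arm-none-eabi-gdb firmware.elf -ex "target remote :1234"

-S freezes the CPU at startup and -s opens a gdb server on TCP port 1234, so you can single-step from reset.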
 
I'm hoping that I may be able to view the type of disassembly that breaks down the higher-level source lines and shows how they were implemented. Is that the -S option you're referring to?

You would need to look at the GCC compiler sources for that. Unless you are looking for something simple, it can take months (or more) to become familiar enough with the GCC internals to trace a particular optimization. Generally, you probably want to treat the compiler as a black box, and concentrate more on the audio library (unless you want to learn the compiler internals in general).
 
Great, thanks for the help all. It sounds like objdump will probably do what I have in mind.
This project isn't intended as an in-depth analysis of GCC or optimisation; rather, as part of an overall audio processing project, I hope to demonstrate how performance varies (naturally) across different processor architectures, and being able to show how the resulting assembly code differs between those architectures helps in demonstrating that point.

Thanks all.
 
Note, particularly for audio, you may need to look at extended builtin functions on a per-architecture basis. For example, the ARM Cortex-M4F has a bunch of SIMD (single instruction, multiple data) instructions that can be used in audio processing. Sometimes the compiler can generate these automatically if you use the -O3 optimization level and just the right types, but more generally you have to write explicit calls in your code:
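A minimal sketch of what such an explicit call can look like (the wrapper name is hypothetical; the audio library uses the same pattern of inline functions wrapping inline assembly):

    #include <stdint.h>

    // Dual 16 bit multiply-accumulate: SMLAD multiplies the bottom halves
    // and the top halves of a and b, then adds both products to sum, all
    // in a single instruction.
    static inline int32_t multiply_accumulate_dual_16(int32_t sum, uint32_t a, uint32_t b)
    {
        int32_t out;
        asm volatile("smlad %0, %1, %2, %3" : "=r" (out) : "r" (a), "r" (b), "r" (sum));
        return out;
    }

With each uint32_t holding two packed 16 bit samples, a dot product loop needs only one of these calls per pair of samples.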
 
Hi Michael; yes, it's these types of instructions that I intended to talk about. I did assume that the compiler would generally pick those up automatically, so thanks for that. I'll give it a shot and see what comes of it.
 
Yes it's these types of instructions that I intended to talk about; I did assume that it would generally pick those up automatically ...

No, the compiler pretty much never uses the special DSP extension instructions.

If you look at the audio library code, they're defined in utility/dspinst.h as inline functions wrapping inline assembly. Look around the rest of the source and you'll find those function names used in many locations.

You'll see names like "16b" and "16t", meaning the bottom or top 16 bits of a 32 bit variable are used as a signed integer. Despite all the talk of SIMD, the most common speedup these instructions offer isn't the actual math, it's allowing 16 bit signed integers to be packed into 32 bit data. This doubles the bandwidth of reading and writing samples from RAM, and it effectively doubles the amount of data you can keep in the CPU's registers. Well, only if all the stuff you want to do with the 16 bit numbers can be done with these instructions. As soon as you need to mask off 16 bits or shuffle samples around to get them into their own registers, you've lost the speedup.
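To make the "16b"/"16t" naming concrete, here's a trivial sketch (the function name is just an example):

    #include <stdint.h>

    // Unpack one 32 bit word holding two signed 16 bit samples.
    static inline void unpack_16b_16t(uint32_t pair, int16_t *b16, int16_t *t16)
    {
        *b16 = (int16_t)(pair & 0xFFFF);  // "16b": bottom 16 bits
        *t16 = (int16_t)(pair >> 16);     // "16t": top 16 bits
    }

Every explicit unpack like this costs extra instructions, which is exactly the overhead the packed DSP instructions let you avoid.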

There's also a really useful 16x32 multiply which produces a 48 bit result, and then discards the low 16 bits (with or without roundoff). That turns out to be really handy, as you can see from its usage in many places.
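A sketch of that multiply in the same inline-assembly style (the function name is hypothetical; the non-rounding form maps to the SMULWB instruction):

    #include <stdint.h>

    // 32 x 16 multiply: the low 16 bits of the 48 bit product are
    // discarded, leaving a 32 bit result in a single register.
    static inline int32_t multiply_32x16b(int32_t a, uint32_t b)
    {
        int32_t out;
        asm volatile("smulwb %0, %1, %2" : "=r" (out) : "r" (a), "r" (b));
        return out;
    }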

Much of the optimization effort revolves around planning how much fits into the CPU's registers. Cortex-M4 also has a memory optimization, where the first read/write takes 2 cycles, but subsequent ones happen with only 1 cycle. So if you're able to dedicate 4 of the 13 available ARM registers to audio samples, you can bring in 8 samples with only 5 clock cycles. It all becomes a game of doing the most you can with the registers, without forcing the compiler to spill anything onto the stack. That 16x32 multiply is particularly handy, since its result takes only a single output register and doesn't temporarily waste another register just to handle the intermediate 48 bit result and discard the low 16 bits.
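A sketch of that load pattern (the buffer layout is hypothetical; with optimization on, the four consecutive word loads typically compile to a single LDM):

    #include <stdint.h>

    // Pull 8 packed 16 bit samples into 4 registers: 2 cycles for the
    // first word, 1 cycle for each of the remaining three, 5 in total.
    static inline uint32_t fetch_8_samples(const uint32_t *p)
    {
        uint32_t s01 = p[0];
        uint32_t s23 = p[1];
        uint32_t s45 = p[2];
        uint32_t s67 = p[3];
        // Placeholder use: real code would feed these into the DSP
        // instructions rather than XOR them together.
        return s01 ^ s23 ^ s45 ^ s67;
    }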

This sort of strategic planning of the algorithm and data processing around the register set, and around special instructions that don't conform to the C language's semantics of 8, 16, 32 and 64 bit variables, is far beyond how the compiler can optimize ordinary code.
 