Sometimes I wonder if that compiler has specific optimizations crafted for the coremark code.
I don't know about the ARM compiler, but it is certainly part of the compiler game. And it has been alleged that various compilers work better for building benchmarks than real code.
Generally I do try to put in general optimizations, but sometimes you do put in something that mainly helps one particular benchmark.
Of course with benchmarks, it helps if the benchmark actually is close to what users run. Of the 22 benchmarks in Spec 2017 speed (and 30 in Spec 2006 CPU), I tend to think that two match typical user code. Perlbench tends to match interpreter workloads, where execution might be dominated by a switch statement, but you can get caught by a function that needs to save a lot of registers in the prologue and restore them in the epilogue. Gcc also tends to match normal code, because it has one of the flattest profiles around (i.e., there isn't a single function the benchmark spends 50% of its time in, which would make it easier to optimize for; the hottest function is something like 1% -- it is just a lot of data and branches).
In the distant past I was working at a now-dead minicomputer company, Data General. This was about the time when I was moving over from the AOS/VS C compiler, which I had written the front end for, to the 88000 GCC compiler for the AViiON systems. Now, DG's penchant was to name their MV-Eclipse computers with a number that indicated relative performance. The first machine was the MV/8000 (the machine in Tracy Kidder's Soul of a New Machine). The next generation had an MV/4000 and an MV/10000, with the MV/4000 generally being 1/2 the speed and 1/2 the cost of the MV/8000. The third generation had a mid-life kicker for the MV/8000 called the MV/7800, which was supposed to be similar in performance to the MV/8000 at a lower cost. That was the theory.
Now, inside, the machine had a really good FP unit roughly equal to the MV/8000's, but the integer unit was closer to the MV/4000's. Unfortunately, the customers who bought the MV/7800 mostly used a weird Cobol variant (icobol, if memory serves). This language only used integer instructions. So, needless to say, users were rather angry that they had bought this MV/7800 and it didn't run any faster than the less expensive MV/4000.
In the post mortem after the machine was delivered, the performance group looked at the benchmarks to see what went wrong. The benchmark they used at the time was Whetstone, which is primarily a floating point benchmark, originally written in Algol 60 and then converted to Fortran. So using Whetstone pretty much showed the speed of the FP unit. The performance group looked around at the other benchmarks of the day and settled on Dhrystone, which favored integer work and more closely matched what their customers were running. Dhrystone, BTW, was originally written in Ada and then converted to C, and yes, the name was chosen as a pun on Whetstone.
So, I'm cranking away on making the GCC 88000 compiler work, and we get this call from on high to measure the performance of GCC. Now you have to understand, GCC wasn't the original compiler DG used for the 88000; Green Hills was. However, Green Hills had some wacky license restrictions that upper management felt they could not sign, so they went in search of other compilers to use. I was off in my little cube with a Sun workstation, and I wanted to make Emacs run faster. So I found the initial GCC, and it indeed did make Emacs run faster (mostly because Emacs and GCC were written by the same person, and he put special optimizations for Emacs into the compiler). But as I was looking at GCC, I saw a half-done port of GCC to the 88000.
I brought this to my manager's attention, and he ran it up the flag pole, eventually getting legal to sign off on using GCC (due to the GPL). The theory was GCC was going to be the free compiler, but if you wanted a better compiler, you could pay for Green Hills.
But people did want to validate using GCC even if GH was supposed to be the better compiler. As I said, Dhrystone was the measure of performance at DG in those days. And true enough, GH was faster than GCC. I dug into the benchmark, and discovered that the core of Dhrystone is a strcpy of a 31-byte constant string literal into an array. Now in the original Ada, this copy was pretty fast because Ada has real strings, and the compiler knows the length. A naive C strcpy just does load byte, test if it is 0, stop if so, store byte, rinse, lather, repeat. The GH compiler saw that the strcpy was of a constant string literal, and changed it into a memcpy, which it could then inline. Initially GCC was just calling the strcpy function.
I figured two can play that game, and added similar support to GCC. In my code, I carefully made sure the cutoff before falling back to calling the function was 32 bytes, so that the benchmark's copy would generate inline code. When I coded this up and ran it, GCC was much faster than GH. Since then, I have never really trusted benchmarks, though I seem to spend a lot of time actually looking at benchmark results (Spec 2017 being the current target).