OK, there's a bug in the code: you should assign the result of std::max() back to max, since std::max() only returns the larger value and does not modify its arguments.
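A minimal sketch of that fix, assuming the code searches a small fixed-size buffer for its largest value (the names `max_of` and `samples` are hypothetical, not taken from your code):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical helper: returns the largest element of a 4-element buffer.
std::uint16_t max_of(const std::array<std::uint16_t, 4>& samples) {
    std::uint16_t max = samples[0];
    // std::max returns the larger of its two arguments and leaves both
    // untouched, so the result has to be stored back into max.
    max = std::max(max, samples[1]);
    max = std::max(max, samples[2]);
    max = std::max(max, samples[3]);
    return max;
}
```

Writing `std::max(max, samples[1]);` on its own compiles fine but throws the result away, which is why this kind of bug is easy to miss.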
Depending on the architecture, loop unrolling may actually hurt performance: the larger code can cause more instruction cache misses, and some architectures have branch prediction that makes the loop branch practically free. I'm not sure how this applies to your target MCU; I'm just speaking from a CPU developer background.
If you have a compile-time constant loop count (as you do, since you unroll manually by simply copying the code rather than using something like Duff's device), it can be beneficial to leave the loop unrolling to the compiler, which can unroll to whatever degree suits the given target best. Anyway, it's probably best to check the generated assembly to see whether any further optimizations are possible, but it looks good to me.
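As a rough sketch of what leaving the unrolling to the compiler could look like, assuming the element count is known at compile time (`kCount` and `max_of_loop` are hypothetical names, and the value 8 is only an example):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Assumed compile-time constant element count (example value).
constexpr std::size_t kCount = 8;

// Hypothetical helper: plain loop over a fixed-size array.
std::uint16_t max_of_loop(const std::uint16_t (&values)[kCount]) {
    std::uint16_t max = values[0];
    // The trip count is a constant, so the optimizer is free to unroll
    // this fully, partially, or not at all, depending on the target.
    for (std::size_t i = 1; i < kCount; ++i) {
        max = std::max(max, values[i]);
    }
    return max;
}
```

With GCC or Clang, compiling both versions with something like `-O2 -S` gives you the assembly to compare against the hand-unrolled code.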