* On this, for those interested in why Ascon is slower when the papers say it should be faster:

The issue lies in the state and one specific operation. Both ciphers are `XOR` / `ROTATE` based, so one would expect the code efficiency to be the same. However, ChaCha uses an internal state based on 32-bit words, while Ascon uses an internal state based on 64-bit words. Why does this matter? Instruction support on Cortex-M. The `ROTATE` operation is 1 instruction on 32-bit words but 5 instructions on 64-bit words.
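To make the difference concrete, here is a minimal sketch in C of what a 64-bit rotate has to look like when only 32-bit registers are available. The names `rotr32`, `rotr64_halves`, `rotr64_ref` and the `u64_halves` struct are made up for illustration and are not taken from any Ascon or ChaCha implementation; the exact instruction count depends on the compiler and the core.

```c
#include <stdint.h>

/* 32-bit rotate right. Compilers typically recognise this pattern and
 * emit a single rotate instruction on Cortex-M. */
static inline uint32_t rotr32(uint32_t x, unsigned r) {
    r &= 31u;
    return (x >> r) | (x << ((32u - r) & 31u));
}

/* Hypothetical helper type: a 64-bit word held as two 32-bit halves,
 * which is how it lives in registers on a 32-bit core. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_halves;

/* 64-bit rotate right built only from 32-bit operations.
 * Each output half mixes bits from both input halves, so it needs
 * several shifts and ORs instead of one rotate. */
static inline u64_halves rotr64_halves(u64_halves x, unsigned r) {
    r &= 63u;
    if (r >= 32u) {
        /* Rotating by 32 is just swapping the halves. */
        uint32_t tmp = x.lo;
        x.lo = x.hi;
        x.hi = tmp;
        r -= 32u;
    }
    if (r == 0u) {
        return x; /* avoid an undefined shift by 32 below */
    }
    u64_halves out;
    out.lo = (x.lo >> r) | (x.hi << (32u - r));
    out.hi = (x.hi >> r) | (x.lo << (32u - r));
    return out;
}

/* Reference 64-bit rotate right on a native uint64_t, for comparison. */
static inline uint64_t rotr64_ref(uint64_t x, unsigned r) {
    r &= 63u;
    return (x >> r) | (x << ((64u - r) & 63u));
}
```

With a constant rotation amount, as in Ascon's linear layer, the branches fold away at compile time, but the two-shifts-plus-OR per half remains.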

And that's basically it: Cortex-M MCUs don't have a rotate instruction for double words.
It was an interesting dig!

Edit: Sorry wrote the results for Cortex-M0...