I just clocked the inertial frontend. It came in at around 1200 clock cycles for the statistics calculation, 200 clock cycles per IMU for the calibration compensation (static bias, misalignment, and scale error compensation), and around 400 clock cycles for the (floating-point) online bias estimation. The bias estimation is probably dominated by the division in the gain calculation; unfortunately, the microcontroller has no hardware floating-point division.
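To illustrate where a single division can end up in a gain calculation, here is a minimal sketch of a scalar Kalman-style bias update. This is an assumption for illustration only, not the frontend's actual estimator: the type `bias_est_t`, the function `bias_est_update`, and the noise parameters `q` and `r` are all hypothetical.

```c
#include <assert.h>

/* Hypothetical scalar bias estimator state (illustration only). */
typedef struct {
    float b; /* current bias estimate */
    float p; /* variance of the bias estimate */
    float q; /* process noise added per step (assumed) */
    float r; /* measurement noise variance (assumed) */
} bias_est_t;

/* One update with a new bias observation z. */
static void bias_est_update(bias_est_t *e, float z)
{
    float s = e->p + e->r;           /* innovation variance */
    assert(s > 0.0f);                /* guard the division below */
    float k = e->p / s;              /* gain: the only division in the update */
    e->b += k * (z - e->b);          /* correct the estimate */
    e->p = (1.0f - k) * e->p + e->q; /* propagate the variance */
}
```

Everything except the gain is a handful of multiplies and adds, so on a core without a hardware floating-point divide the software division routine could plausibly account for a large share of the cycle count.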
The 1200 clock cycles for the statistics calculation can be compared with the roughly 5500 floating-point arithmetic operations that would be required to calculate just the longer 256-sample test statistics the naive way. That naive cost would also scale with the number of samples. I have not clocked the old functions, but counting memory accesses and other auxiliary operations, I would guess the old (naive) implementation takes at least about 10 times as many clock cycles. I had previously done some back-of-the-envelope calculations of the cost reduction from the recursive integer statistics calculation, but this is the first time I have seen it for real.
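The idea behind a recursive integer statistics calculation can be sketched as follows: keep a running sum and sum of squares over a sliding window, and on each new sample subtract the sample that drops out and add the one that comes in. The cost per sample is then constant, independent of the window length. This is a minimal single-channel sketch under my own assumptions (16-bit samples, a power-of-two window, hypothetical names `win_stats_t`, `win_stats_update`, and so on), not the actual frontend code.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW 256u /* window length in samples; assumed a power of two */

typedef struct {
    int16_t  buf[WINDOW]; /* circular buffer holding the last WINDOW samples */
    uint32_t head;        /* next write position (oldest sample once full) */
    uint32_t count;       /* number of valid samples, saturates at WINDOW */
    int32_t  sum;         /* running sum of the samples in the window */
    int64_t  sum_sq;      /* running sum of squares of the samples */
} win_stats_t;

static void win_stats_init(win_stats_t *s)
{
    assert((WINDOW & (WINDOW - 1u)) == 0u); /* index wrap below relies on this */
    *s = (win_stats_t){0};
}

/* Constant-time update: remove the oldest sample, add the newest. */
static void win_stats_update(win_stats_t *s, int16_t x)
{
    if (s->count == WINDOW) {
        int16_t old = s->buf[s->head];
        s->sum    -= old;
        s->sum_sq -= (int64_t)old * old;
    } else {
        s->count++;
    }
    s->buf[s->head] = x;
    s->sum    += x;
    s->sum_sq += (int64_t)x * x;
    s->head = (s->head + 1u) & (WINDOW - 1u);
}

/* Returns N^2 times the (population) variance, using integers only:
 * N * sum(x^2) - (sum(x))^2, where N is the current sample count. */
static int64_t win_stats_scaled_var(const win_stats_t *s)
{
    return (int64_t)s->count * s->sum_sq - (int64_t)s->sum * s->sum;
}
```

Because everything stays in integer arithmetic, the running sums do not accumulate rounding error, and the scaling by N squared can be folded into whatever threshold the test statistic is compared against, so no division is needed.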