Improve complex lerp performance (#84844)
The complex lerp kernel uses `std::abs(z) < 0.5` which involves
computing a sqrt. Instead compare the square against 0.25 has much
lower latency and so performs much better overall.
In a simple timeit benchmark I see more than 10x speedup on CPU for a 4096
element complex lerp, from 84 us to 6.7 us.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84844
Approved by: https://github.com/ngimel