Rework PowKernel.cu (#62260)
Summary:
PowKernel.cu is the single slowest file to compile in all of PyTorch, taking
7 m 34 s on my machine. After investigating, I discovered that the case with
complex inputs and a CPU scalar as the first argument takes more than half
that time on its own.
Noting that [`thrust::pow`] for complex is just `exp(log(base) * exponent)`,
we can improve this kernel by precomputing `log(base)` on the CPU and computing
only the `exp` on CUDA. This is faster in both runtime and compile time.
For 1 million elements, master takes 61.6 us vs 56.9 us with this PR.
I also noticed that the constant-exponent case was implemented twice, once in
`gpu_kernel_with_scalars` and again in `pow_tensor_scalar_kernel`. Further, the
`Pow.cpp` code detects CPU-scalar exponents and redispatches to the `tensor_scalar`
overload, making the `gpu_kernel_with_scalars` version dead code. Now `tensor_tensor`
runs unconditionally and calls into `tensor_scalar` when appropriate.
With these changes, PowKernel.cu takes just 2 m 30 s to compile.
[`thrust::pow`]: https://github.com/NVIDIA/thrust/blob/368266e80e69d86d4b53f50cd02afb56a619eee2/thrust/detail/complex/cpow.h#L33
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62260
Reviewed By: ejguan
Differential Revision: D29938789
Pulled By: ngimel
fbshipit-source-id: 7ab7d81ececc92a9e6e62e60b0a4f2e6e3146df8