171e13fd - Rework PowKernel.cu (#62260)

Summary: PowKernel.cu is the single slowest file to compile in all of PyTorch, taking 7 m 34 s on my machine. After investigating, I discovered that the case with complex inputs and a CPU scalar for the first argument takes more than half that time on its own. Noting that [`thrust::pow`] for complex numbers is just `exp(log(base) * exponent)`, we can improve this kernel by precomputing `log(base)` on the CPU and computing only the `exp` on CUDA. This is faster in both runtime and compile time: for 1 million elements, master takes 61.6 us vs 56.9 us with this PR.

I also noticed that the constant-exponent case is implemented twice: once in `gpu_kernel_with_scalars` and again in `pow_tensor_scalar_kernel`. Further, the `Pow.cpp` code detects CPU-scalar exponents and redispatches to the `tensor_scalar` overload, making the `gpu_kernel_with_scalars` version dead code. Now we instead unconditionally run `tensor_tensor`, which calls into `tensor_scalar` when appropriate.

With these changes, PowKernel.cu takes just 2 m 30 s to compile.

[`thrust::pow`]: https://github.com/NVIDIA/thrust/blob/368266e80e69d86d4b53f50cd02afb56a619eee2/thrust/detail/complex/cpow.h#L33

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62260

Reviewed By: ejguan

Differential Revision: D29938789

Pulled By: ngimel

fbshipit-source-id: 7ab7d81ececc92a9e6e62e60b0a4f2e6e3146df8