Use fastAtomicAdd in GPU upsampling trilinear (#48675)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44206
This PR applies to trilinear upsampling the same change that https://github.com/pytorch/pytorch/pull/21879 made for bilinear upsampling.
Total time, in seconds, for the script provided in https://github.com/pytorch/pytorch/issues/44206 on my 2070 Super GPU:
| | non-amp | amp |
|---|---|---|
| before PR | 2.88 | 9.6 |
| after PR | 1.5 | 1.6 |
Kernel time after the PR:
| | time | kernel |
| --- | --- | --- |
| non-amp | 0.37 ms | `void at::native::(anonymous namespace)::upsample_trilinear3d_backward_out_frame<float, float>(unsigned long, int, int, int, int, int, int, float, float, float, bool, float*, float const*) ` |
| amp | 0.61 ms | `void at::native::(anonymous namespace)::upsample_trilinear3d_backward_out_frame<c10::Half, float>(unsigned long, int, int, int, int, int, int, float, float, float, bool, c10::Half*, c10::Half const*)` |
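For readers unfamiliar with the technique, here is a minimal sketch of the general idea behind a packed half-precision atomic add: when the target element has a neighbor inside the buffer, one `__half2` atomic updates the aligned pair and adds zero to the neighbor. This is an illustration under stated assumptions, not the code from this PR: `packed_half_atomic_add` and `scatter_ones` are hypothetical names, and the scalar fallback assumes a GPU of compute capability 7.0 or newer.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_fp16.h>

// Hypothetical helper (not PyTorch's fastAtomicAdd): adds `value` to
// base[index], using one packed __half2 atomic when the element has a
// neighbor inside the buffer that shares its aligned 4-byte word.
__device__ void packed_half_atomic_add(__half* base, size_t index,
                                       size_t numel, float value) {
  __half* target = base + index;
  bool is_low_half =
      reinterpret_cast<std::uintptr_t>(target) % sizeof(__half2) == 0;
  if (is_low_half && index + 1 < numel) {
    // Target is the low half of an aligned pair: add (value, 0).
    atomicAdd(reinterpret_cast<__half2*>(target),
              __floats2half2_rn(value, 0.0f));
  } else if (!is_low_half && index > 0) {
    // Target is the high half: address the pair one element earlier, add (0, value).
    atomicAdd(reinterpret_cast<__half2*>(target - 1),
              __floats2half2_rn(0.0f, value));
  } else {
    // No in-bounds neighbor: fall back to the scalar half atomic (sm_70+).
    atomicAdd(target, __float2half(value));
  }
}

// Every thread adds 1.0 to bin (threadIdx.x % numel), so threads collide heavily.
__global__ void scatter_ones(__half* bins, size_t numel) {
  packed_half_atomic_add(bins, threadIdx.x % numel, numel, 1.0f);
}

int main() {
  const size_t numel = 7;  // odd, so the last bin takes the scalar fallback
  __half* d_bins = nullptr;
  cudaMalloc(&d_bins, numel * sizeof(__half));
  cudaMemset(d_bins, 0, numel * sizeof(__half));  // all-zero bits are 0.0 in half

  scatter_ones<<<1, 224>>>(d_bins, numel);  // 224 threads / 7 bins = 32 adds per bin

  __half h_bins[numel];
  cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
  for (size_t i = 0; i < numel; ++i) {
    printf("bin %zu = %.1f\n", i, __half2float(h_bins[i]));
  }
  cudaFree(d_bins);
  return 0;
}
```

Build with something like `nvcc -arch=sm_70 sketch.cu` (the file name is arbitrary). Each of the 7 bins receives 32 additions of 1.0, which half precision represents exactly, so the printed values give a quick check that no update was lost.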
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48675
Reviewed By: bdhirsh
Differential Revision: D25284853
Pulled By: ngimel
fbshipit-source-id: 30f0d92e73050edd36013ce528d2e131effa3542