`F.avg_pool3d` CUDA backward: gpuAtomicAddNoReturn -> fastAtomicAdd (#63387)
Summary:
Rel: https://github.com/pytorch/pytorch/issues/62695
In the following two tables, I set `kernel_size` to 3 and `stride` to 2.
In the benchmark, input tensors have shape (N, C, n_features, n_features, n_features), and times are reported in seconds.
Tested on an RTX 3080 with CUDA 11.4 Update 1.
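For reference, numbers like the ones below can be collected with a script along these lines. This is a hedged reconstruction of the setup described above, not the exact harness used for the tables; in particular, the use of `torch.utils.benchmark` and the `time_backward` helper are assumptions.

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

def time_backward(N, C, n_features, dtype):
    # Input shape (N, C, n_features, n_features, n_features), as in the tables above.
    x = torch.randn(N, C, n_features, n_features, n_features,
                    device="cuda", dtype=dtype, requires_grad=True)
    out = F.avg_pool3d(x, kernel_size=3, stride=2)
    grad = torch.ones_like(out)
    # Time only the backward pass (the kernel this PR touches).
    t = benchmark.Timer(
        stmt="out.backward(grad, retain_graph=True)",
        globals={"out": out, "grad": grad},
    )
    return t.blocked_autorange().median  # seconds

for n_features in (8, 32, 128):
    for dtype in (torch.float16, torch.float32):
        print(32, 3, n_features, dtype, time_backward(32, 3, n_features, dtype))
```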
## This PR
| N | C | n_features | dtype | time |
|----:|----:|-------------:|:--------------|------------:|
| 32 | 3 | 8 | torch.float16 | 7.46846e-05 |
| 32 | 3 | 8 | torch.float32 | 8.18968e-05 |
| 32 | 3 | 32 | torch.float16 | 0.000156748 |
| 32 | 3 | 32 | torch.float32 | 0.000165236 |
| 32 | 3 | 128 | torch.float16 | 0.00549854 |
| 32 | 3 | 128 | torch.float32 | 0.008926 |
## master (6acd87f)
| N | C | n_features | dtype | time |
|----:|----:|-------------:|:--------------|------------:|
| 32 | 3 | 8 | torch.float16 | 7.60436e-05 |
| 32 | 3 | 8 | torch.float32 | 7.55072e-05 |
| 32 | 3 | 32 | torch.float16 | 0.000189292 |
| 32 | 3 | 32 | torch.float32 | 0.000168645 |
| 32 | 3 | 128 | torch.float16 | 0.00699538 |
| 32 | 3 | 128 | torch.float32 | 0.00890226 |
master's time divided by this PR's time is as follows:
| N | C | n_features | master / PR |
|---:|---:|---------------:|----------------:|
| 32 | 3 | 8 | 1.018 |
| 32 | 3 | 32 | 1.208 |
| 32 | 3 | 128 | 1.272 |
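The dtype isn't listed in this table, but the ratios match the `torch.float16` rows of the two tables above, e.g.:

```python
# Ratios of the torch.float16 timings from the two tables above.
master_fp16 = {8: 7.60436e-05, 32: 0.000189292, 128: 0.00699538}
pr_fp16 = {8: 7.46846e-05, 32: 0.000156748, 128: 0.00549854}
for n in (8, 32, 128):
    print(n, round(master_fp16[n] / pr_fp16[n], 3))  # 1.018, 1.208, 1.272
```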
cc: xwang233 ptrblck ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63387
Reviewed By: mruberry
Differential Revision: D30381434
Pulled By: ngimel
fbshipit-source-id: 3b97aee4b0d457a0277a0d31ac56d4151134c099