`F.avg_pool3d` CUDA backward: gpuAtomicAddNoReturn -> fastAtomicAdd (#63387)
Summary:
Rel: https://github.com/pytorch/pytorch/issues/62695
In the following two tables, I set `kernel_size` to 3 and `stride` to 2.
In the benchmark, input tensors have shape (N, C, n_features, n_features, n_features), and times are reported in seconds.
Tested on an RTX 3080 with CUDA 11.4 Update 1.
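For reference, numbers like the ones below can be collected with a script along these lines. This is a hedged reconstruction of the setup described above, not the exact harness used for the tables; in particular, the use of `torch.utils.benchmark` and the `time_backward` helper are assumptions.

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

def time_backward(N, C, n_features, dtype):
    # Input shape (N, C, n_features, n_features, n_features), as in the tables above.
    x = torch.randn(N, C, n_features, n_features, n_features,
                    device="cuda", dtype=dtype, requires_grad=True)
    out = F.avg_pool3d(x, kernel_size=3, stride=2)
    grad = torch.ones_like(out)
    # Time only the backward pass (the kernel this PR touches).
    t = benchmark.Timer(
        stmt="out.backward(grad, retain_graph=True)",
        globals={"out": out, "grad": grad},
    )
    return t.blocked_autorange().median  # seconds

for n_features in (8, 32, 128):
    for dtype in (torch.float16, torch.float32):
        print(32, 3, n_features, dtype, time_backward(32, 3, n_features, dtype))
```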
## This PR
| N | C | n_features | dtype | time |
|----:|----:|-------------:|:--------------|------------:|
| 32 | 3 | 8 | torch.float16 | 7.46846e-05 |
| 32 | 3 | 8 | torch.float32 | 8.18968e-05 |
| 32 | 3 | 32 | torch.float16 | 0.000156748 |
| 32 | 3 | 32 | torch.float32 | 0.000165236 |
| 32 | 3 | 128 | torch.float16 | 0.00549854 |
| 32 | 3 | 128 | torch.float32 | 0.008926 |
## master (6acd87f)
| N | C | n_features | dtype | time |
|----:|----:|-------------:|:--------------|------------:|
| 32 | 3 | 8 | torch.float16 | 7.60436e-05 |
| 32 | 3 | 8 | torch.float32 | 7.55072e-05 |
| 32 | 3 | 32 | torch.float16 | 0.000189292 |
| 32 | 3 | 32 | torch.float32 | 0.000168645 |
| 32 | 3 | 128 | torch.float16 | 0.00699538 |
| 32 | 3 | 128 | torch.float32 | 0.00890226 |
master's time divided by this PR's time is as follows:
| N | C | n_features | master / PR |
|---:|---:|---------------:|----------------:|
| 32 | 3 | 8 | 1.018 |
| 32 | 3 | 32 | 1.208 |
| 32 | 3 | 128 | 1.272 |
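The dtype isn't listed in this table, but the ratios match the `torch.float16` rows of the two tables above, e.g.:

```python
# Ratios of the torch.float16 timings from the two tables above.
master_fp16 = {8: 7.60436e-05, 32: 0.000189292, 128: 0.00699538}
pr_fp16 = {8: 7.46846e-05, 32: 0.000156748, 128: 0.00549854}
for n in (8, 32, 128):
    print(n, round(master_fp16[n] / pr_fp16[n], 3))  # 1.018, 1.208, 1.272
```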
cc: xwang233 ptrblck ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63387
Reviewed By: mruberry
Differential Revision: D30381434
Pulled By: ngimel
fbshipit-source-id: 3b97aee4b0d457a0277a0d31ac56d4151134c099