For int64_t atomicAdd, use the available compiler builtin on ROCm. (#24854)
Summary:
Replace the explicit compare-and-swap (CAS) loop with the compiler
builtin, which should perform better under contention. Since this is a
ROCm-only feature, the HIP layer provides no helper function for it.
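
As a rough illustration of the change described above, the sketch below
contrasts an explicit CAS loop with a single fetch-add builtin. It is a
host-side C++ analogue, not the device code touched by this commit: the
function names are hypothetical, the GCC/Clang `__atomic_*` builtins stand in
for whatever the ROCm compiler exposes, and the relaxed memory order is an
assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical name: the explicit CAS-loop style of atomic add.
// Under contention each failed exchange forces another trip around the loop.
inline void atomic_add_cas_loop(int64_t* address, int64_t val) {
  assert(address != nullptr);
  int64_t expected = __atomic_load_n(address, __ATOMIC_RELAXED);
  // On failure the builtin stores the current value into `expected`,
  // so the next attempt recomputes `expected + val` from fresh data.
  while (!__atomic_compare_exchange_n(address, &expected, expected + val,
                                      /*weak=*/true,
                                      __ATOMIC_RELAXED, __ATOMIC_RELAXED)) {
  }
}

// Hypothetical name: the builtin style of atomic add.
// One read-modify-write operation with no retry loop in the source.
inline void atomic_add_builtin(int64_t* address, int64_t val) {
  assert(address != nullptr);
  __atomic_fetch_add(address, val, __ATOMIC_RELAXED);
}
```

Both functions leave the same value in memory. The difference is that the
builtin lets the compiler emit one atomic add instead of a retry loop.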
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24854
Differential Revision: D16902292
Pulled By: ezyang
fbshipit-source-id: df192063c749f2b39f8fc304888fb0ae1070f20e