use explicitly non-returning GPU atomics (#60607)
Summary:
Enables an important performance optimization for ROCm: atomic operations whose result is unused now go through explicitly non-returning variants. See the discussion in https://github.com/pytorch/pytorch/issues/41028.
CC jithunnair-amd sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60607
Reviewed By: jbschlosser
Differential Revision: D29409894
Pulled By: ngimel
fbshipit-source-id: effca258a0f37eaefa35674a7fd19459ca7dc95b