[CUDA] Add fastAtomicAdd to scatter_add [v2]
Reland of https://github.com/pytorch/pytorch/pull/75140
Closes https://github.com/pytorch/pytorch/issues/74487
Closes https://github.com/pytorch/pytorch/issues/75434
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75545
Approved by: https://github.com/ngimel