Use `atomicAdd` for `bfloat16` in Ampere and above (#84981)
WIP to fix the extremely slow `scatter_add` for `bfloat16` compared with fp16. The current changes improve performance, but bfloat16 still appears to lag behind the fp16 equivalent.
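For context, a minimal sketch of the dispatch this change relies on, assuming the CUDA 11 `cuda_bf16.h` intrinsics; the helper name `bf16_atomic_add` is hypothetical and this is not the exact PyTorch kernel code. On sm_80 (Ampere) and newer, CUDA exposes a native `atomicAdd` overload for `__nv_bfloat16`, while older architectures must emulate the atomic with a compare-and-swap loop on the 32-bit word containing the 16-bit value:

```cuda
#include <cuda_bf16.h>
#include <cstdint>

// Hypothetical helper, not PyTorch's exact code: use the native bfloat16
// atomicAdd on sm_80+, otherwise emulate it with a CAS loop.
__device__ inline void bf16_atomic_add(__nv_bfloat16* address, __nv_bfloat16 val) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  atomicAdd(address, val);  // native hardware atomic on Ampere and newer
#else
  // Locate the aligned 32-bit word holding this 16-bit element and
  // whether the element sits in its high or low half.
  uintptr_t addr = reinterpret_cast<uintptr_t>(address);
  unsigned int* base = reinterpret_cast<unsigned int*>(addr & ~uintptr_t(2));
  const bool high = addr & 2;

  unsigned int old = *base, assumed;
  do {
    assumed = old;
    unsigned short raw = high ? (assumed >> 16) : (assumed & 0xffffu);
    // Do the add in fp32, then round back to bfloat16.
    float sum = __bfloat162float(__ushort_as_bfloat16(raw)) + __bfloat162float(val);
    unsigned short out = __bfloat16_as_ushort(__float2bfloat16(sum));
    unsigned int updated = high
        ? (assumed & 0x0000ffffu) | (static_cast<unsigned int>(out) << 16)
        : (assumed & 0xffff0000u) | out;
    // Retry if another thread modified the word in the meantime.
    old = atomicCAS(base, assumed, updated);
  } while (assumed != old);
#endif
}
```

The CAS loop is the likely source of the slowdown: under contention (as in `scatter_add`, where many threads may target the same output element) each thread can retry repeatedly, whereas the native atomic is a single instruction.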
CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84981
Approved by: https://github.com/ngimel