fix atomic add for bf16/fp16 (#104592)
Enable atomic_add for fp16 and fix the atomic_add correctness issue for bf16/fp16.
Previously, the constructor call `bfloat16(addr->x)` would invoke
https://github.com/pytorch/pytorch/blob/main/c10/util/BFloat16.h#L99
(construct a `bfloat16` from a `float`): since `addr->x` is the raw `uint16_t`
bit pattern, it is implicitly converted to `float` and then rounded back to
`bfloat16`, corrupting the value. Instead, we actually want to invoke
https://github.com/pytorch/pytorch/blob/main/c10/util/BFloat16.h#L97
(construct a `bfloat16`/`float16` directly from `bits`).
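For reference, a minimal sketch of the corrected pattern, assuming a compare-and-swap loop over the underlying `uint16_t` storage (the helper name `atomic_add_bf16` and the `std::atomic` reinterpret cast are illustrative, not the exact inductor codegen; `x` and `from_bits()` are the actual members of `c10::BFloat16`):

```cpp
#include <atomic>
#include <cstdint>
#include <c10/util/BFloat16.h>

// Illustrative CAS loop that atomically adds a bfloat16 value.
void atomic_add_bf16(c10::BFloat16* addr, c10::BFloat16 offset) {
  auto* atomic_addr = reinterpret_cast<std::atomic<uint16_t>*>(addr);
  uint16_t expected = atomic_addr->load();
  uint16_t desired;
  do {
    // Bug: c10::BFloat16(expected) would treat the integer bit pattern as a
    // float value. Fix: construct from the raw bits via from_bits().
    c10::BFloat16 old_val(expected, c10::BFloat16::from_bits());
    desired = c10::BFloat16(static_cast<float>(old_val) +
                            static_cast<float>(offset)).x;
  } while (!atomic_addr->compare_exchange_weak(expected, desired));
}
```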
Test Plan:
Remove the expected failures for `float16` in `test_torchinductor_opinfo` for the ops `scatter_reduce` (sum reduction), `scatter_add`, `index_add`, and `amax`/`amin`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104592
Approved by: https://github.com/jgong5, https://github.com/jansel