Fix FP16 fastAtomicAdd for one case where the tensor start address is not 32-bit aligned (#44642)
Summary:
For https://github.com/pytorch/pytorch/issues/44206 and https://github.com/pytorch/pytorch/issues/42218, I'd like to update the trilinear interpolation backward and `grid_sample` backward kernels to use `fastAtomicAdd`.
As a prelude, I spotted a UB risk in `fastAtomicAdd`. I think the existing code performs a misaligned `__half2` atomicAdd when `index` is odd and `tensor` is not 32-bit aligned (i.e., `index % 2 == 1` and `reinterpret_cast<std::uintptr_t>(tensor) % sizeof(__half2) != 0`). In this case the code concludes we're `!low_bit` and goes down the `!low_bit` code path, but in fact the target element is `low_bit`: its address sits on a 32-bit boundary, so the `!low_bit` path's `__half2` atomicAdd at the preceding element is misaligned. It appears the original [fastAtomicAdd PR](https://github.com/pytorch/pytorch/pull/21879#discussion_r295040377)'s discussion did not consider that case explicitly.
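To make the failure mode concrete, here is a minimal sketch of the idea behind the fix: derive the `__half2` alignment decision from the actual element address rather than from `index` parity alone. This is a simplified hypothetical illustration (the function name `fp16_fast_atomic_add_sketch` and its exact structure are mine, not the verbatim PyTorch source):

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// Sketch of an FP16 fast atomic add that packs scalar adds into aligned
// __half2 atomics. Assumes sm_70+ for the scalar __half atomicAdd fallback.
__device__ inline void fp16_fast_atomic_add_sketch(__half* tensor,
                                                   int index,
                                                   int numel,
                                                   __half value) {
  __half* target_addr = tensor + index;

  // Buggy check (pre-fix, paraphrased): infer alignment from index parity,
  //   bool low_half = (index % 2 == 0);
  // If `tensor` itself starts on a 2-byte (not 4-byte) boundary, an odd
  // `index` actually lands on a 4-byte boundary, so parity misclassifies it.

  // Fixed check: test the element address itself against __half2 alignment.
  bool low_half =
      (reinterpret_cast<std::uintptr_t>(target_addr) % sizeof(__half2) == 0);

  if (low_half && index < (numel - 1)) {
    // target_addr is 4-byte aligned: atomically add {value, 0} as a __half2
    // covering elements [index, index + 1].
    __half2 value2 = __halves2half2(value, __float2half(0.0f));
    atomicAdd(reinterpret_cast<__half2*>(target_addr), value2);
  } else if (!low_half && index > 0) {
    // target_addr - 1 is 4-byte aligned: atomically add {0, value} as a
    // __half2 covering elements [index - 1, index].
    __half2 value2 = __halves2half2(__float2half(0.0f), value);
    atomicAdd(reinterpret_cast<__half2*>(target_addr - 1), value2);
  } else {
    // Boundary element with no aligned __half2 partner in range: fall back
    // to a scalar __half atomicAdd.
    atomicAdd(target_addr, value);
  }
}
```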
I wanted to push my tentative fix for discussion ASAP. Tagging jjsjann123 and mkolod as the original authors of `fastAtomicAdd`. (I'm also curious why we need `reinterpret_cast<std::uintptr_t>(tensor...` for the address modding, but that's minor.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44642
Reviewed By: mruberry
Differential Revision: D23699820
Pulled By: ngimel
fbshipit-source-id: 0db57150715ebb45e6a1fb36897e46f00d61defd