Moves grid_sampler to autocast promote list (#58618)
Summary:
Should close https://github.com/pytorch/pytorch/issues/42218
Numerically, `grid_sampler` is fine in fp16 or fp32, but takes several inputs and expects their dtypes to match, so it belongs on the autocast promote list.
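For readers unfamiliar with the promote policy, here is a toy model of what it means for `grid_sampler`. It is only a sketch and not PyTorch's implementation: `WIDTH`, `promote_dtype`, and `autocast_promote` are hypothetical names, dtypes are plain strings, and the rule assumed is "cast all floating inputs to the widest input dtype".

```python
# Toy model of autocast's "promote" policy; NOT PyTorch's actual implementation.
# All names here (WIDTH, promote_dtype, autocast_promote) are hypothetical.
WIDTH = {"float16": 16, "float32": 32, "float64": 64}


def promote_dtype(*dtypes):
    """Return the widest floating dtype among the given dtype names."""
    return max(dtypes, key=WIDTH.__getitem__)


def autocast_promote(op_inputs):
    """Cast every (name, dtype) input to the widest dtype among them."""
    target = promote_dtype(*(dtype for _, dtype in op_inputs))
    return [(name, target) for name, _ in op_inputs]


# Mixed dtypes for `input` and `grid` are unified to float32.
mixed = autocast_promote([("input", "float16"), ("grid", "float32")])

# If both inputs are already fp16, the op stays in fp16.
both_half = autocast_promote([("input", "float16"), ("grid", "float16")])
```

Under this model the all-fp16 case still runs in fp16, so the speed of the fp16 kernels continues to matter.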
`grid_sampler` currently uses `gpuAtomicAdd`, which is notoriously slow in fp16 because it calls CUDA's `atomicAdd` `__half` overload, which internally uses a software compare-and-swap loop. To allow good performance when both inputs happen to be fp16, the PR also modifies the `grid_sampler_[2,3]d_backward_kernel`s to use `fastAtomicAdd` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58618
Reviewed By: mruberry
Differential Revision: D29257199
Pulled By: ngimel
fbshipit-source-id: 3cc7505945b480427f2fc1beb36bee80bf3853b3