Fix performance of CUDA trilinear interpolate backward (#52351)
Summary:
Close https://github.com/pytorch/pytorch/issues/51206
This PR essentially reverts the CUDA launch configuration changes made in https://github.com/pytorch/pytorch/issues/48675 and applies only a `gpuAtomicAdd` -> `fastAtomicAdd` replacement in the CUDA kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52351
Reviewed By: seemethere
Differential Revision: D26597006
Pulled By: ngimel
fbshipit-source-id: 4a34a351a75c80f714e50cf6dae2c31ddb901ffe