Change launch bounds and unroll for loop in grid sampler 2d fwd and bwd (#60405)
Summary:
Changed the launch bounds for the grid sampler 2d forward and backward kernels from 1024 to 256, and added loop unrolling, to fix register spilling into local memory.
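For illustration only, a minimal standalone sketch of the two techniques named above. It is not the grid sampler code from this PR: the kernel `weighted_taps_kernel`, the constant `kThreads`, and the fixed weights are hypothetical, chosen to show a 256-thread launch bound and a fully unrolled short loop over a small per-thread array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical constant, mirroring the 256-thread bound named in this commit.
constexpr int kThreads = 256;

// __launch_bounds__(kThreads) tells the compiler the kernel is never launched
// with more than 256 threads per block, so it may budget more registers per
// thread than it could under a 1024-thread bound.
__global__ void __launch_bounds__(kThreads)
weighted_taps_kernel(const float* in, float* out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;

  // Small per-thread array (illustrative weights, not real bicubic coefficients).
  float w[4] = {0.1f, 0.4f, 0.4f, 0.1f};
  float acc = 0.0f;

  // The trip count is a compile-time constant, so full unrolling turns every
  // w[i] into a constant-indexed access that can stay in registers.
  #pragma unroll
  for (int i = 0; i < 4; ++i) {
    int j = min(max(idx + i - 1, 0), n - 1);  // clamp neighbor index to [0, n-1]
    acc += w[i] * in[j];
  }
  out[idx] = acc;
}

int main() {
  const int n = 1 << 20;
  float* in = nullptr;
  float* out = nullptr;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int i = 0; i < n; ++i) in[i] = 1.0f;

  // The block size must not exceed the bound declared on the kernel.
  int blocks = (n + kThreads - 1) / kThreads;
  weighted_taps_kernel<<<blocks, kThreads>>>(in, out, n);
  cudaDeviceSynchronize();

  printf("out[0] = %f\n", out[0]);
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

Compiling with `nvcc -Xptxas -v` prints per-kernel register counts and spill stores/loads, which is one way to check whether a change like this removes spilling.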
Timing data (measured on an NVIDIA Titan V):
Interpolation mode 2 (bicubic), padding mode 0 (zeros), align_corners=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60405
Reviewed By: albanD
Differential Revision: D29288075
Pulled By: ngimel
fbshipit-source-id: 5e060f0c2d1cc0a3086718e6be263413dfa29689