Change launch bounds and unroll for loops in grid sampler 2d fwd and bwd (#60405)
Summary:
Changed the launch bounds for the grid sampler 2d forward and backward kernels from 1024 to 256 threads per block, and added loop unrolling, to fix register spilling into local memory.
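For illustration, a minimal standalone sketch of the two techniques named above. This is not the actual grid sampler code: `four_tap_kernel`, `kThreadsPerBlock`, `run_four_tap_sketch`, and the coefficient values are hypothetical names and numbers chosen for the example. A lower `__launch_bounds__` value lets the compiler budget more registers per thread, and `#pragma unroll` on a fixed-trip-count loop lets a small per-thread array be indexed with compile-time constants, so it can stay in registers rather than local memory.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size mirroring the new bound of 256 described above.
constexpr int kThreadsPerBlock = 256;

// Toy 4-tap weighted sum (illustrative only, not the grid sampler kernel).
__global__ void __launch_bounds__(kThreadsPerBlock)
four_tap_kernel(const float* in, float* out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;

  // Small per-thread array; made-up weights that sum to 1.
  float coeffs[4] = {0.1f, 0.4f, 0.4f, 0.1f};
  float acc = 0.0f;

  // Fixed trip count: after unrolling, coeffs[i] uses constant indices.
  #pragma unroll
  for (int i = 0; i < 4; ++i) {
    int src = min(max(idx + i - 1, 0), n - 1);  // clamp to valid range
    acc += coeffs[i] * in[src];
  }
  out[idx] = acc;
}

// Runs the kernel and compares against a CPU reference.
bool run_four_tap_sketch() {
  const int n = 1000;
  const size_t bytes = n * sizeof(float);
  std::vector<float> h_in(n), h_out(n);
  for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 7);

  float* d_in = nullptr;
  float* d_out = nullptr;
  if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
  if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
  cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

  // The launch block size must not exceed the declared launch bound.
  int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;
  four_tap_kernel<<<blocks, kThreadsPerBlock>>>(d_in, d_out, n);

  bool ok = cudaMemcpy(h_out.data(), d_out, bytes,
                       cudaMemcpyDeviceToHost) == cudaSuccess;
  cudaFree(d_in);
  cudaFree(d_out);

  const float coeffs[4] = {0.1f, 0.4f, 0.4f, 0.1f};
  for (int i = 0; ok && i < n; ++i) {
    float ref = 0.0f;
    for (int k = 0; k < 4; ++k) {
      int src = i + k - 1;
      if (src < 0) src = 0;
      if (src > n - 1) src = n - 1;
      ref += coeffs[k] * h_in[src];
    }
    if (std::fabs(ref - h_out[i]) > 1e-5f) ok = false;
  }
  return ok;
}

int main() {
  bool ok = run_four_tap_sketch();
  std::printf("%s\n", ok ? "match" : "mismatch");
  return ok ? 0 : 1;
}
```

Compiling with `nvcc -Xptxas -v` prints per-kernel register counts and spill stores/loads, which is one way to check whether a change like this removes spilling.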
Timing data (measured on an Nvidia Titan V):
Interpolation mode 2 (bicubic), padding mode 0 (zeros), align_corners=False
![GridSampler2dTimingData](https://user-images.githubusercontent.com/22803332/122830305-01fd2d80-d29d-11eb-9cd3-7da533a03f33.PNG)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60405
Reviewed By: albanD
Differential Revision: D29288075
Pulled By: ngimel
fbshipit-source-id: 5e060f0c2d1cc0a3086718e6be263413dfa29689