Fix launch bounds for gamma_cuda_kernel (#60393)
Summary:
Lowered the launch bounds for gamma_cuda_kernel from 512 to 256 threads per block.
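For readers unfamiliar with launch bounds, here is a minimal sketch of what such a bound looks like in plain CUDA. It is not the PyTorch source: `scale_kernel`, `kThreadsPerBlock`, and the host code are hypothetical names chosen for illustration, and the sketch assumes the bound is written with the standard `__launch_bounds__` qualifier.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example, not the actual gamma_cuda_kernel.
// The bound tells the compiler the kernel is never launched with more than
// this many threads per block, so it can plan register usage accordingly.
constexpr int kThreadsPerBlock = 256;

__global__ void __launch_bounds__(kThreadsPerBlock)
scale_kernel(float* data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    data[i] *= factor;
  }
}

int main() {
  const int n = 1 << 20;
  float* d_data = nullptr;
  if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
    return 1;
  }
  cudaMemset(d_data, 0, n * sizeof(float));

  // The block size used at launch must not exceed the declared bound;
  // a larger block size makes the launch fail.
  const int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;
  scale_kernel<<<blocks, kThreadsPerBlock>>>(d_data, 2.0f, n);

  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess) {
    err = cudaDeviceSynchronize();
  }
  std::printf("launch status: %s\n", cudaGetErrorString(err));

  cudaFree(d_data);
  return err == cudaSuccess ? 0 : 1;
}
```

A smaller bound generally lets the compiler allocate more registers per thread, which can reduce spilling; whether that helps a given kernel has to be measured.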
Timing data (measured on an NVIDIA Titan V):
![GammaTimingData](https://user-images.githubusercontent.com/22803332/122821464-bc873300-d291-11eb-9be6-2fb690f0d5c7.PNG)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60393
Reviewed By: jbschlosser
Differential Revision: D29447926
Pulled By: ngimel
fbshipit-source-id: c2112f9be8ede3bb07cb72f301393f24d17e0c01