Update launch bounds for trilinear 3d (#59999)
Summary:
Updates the launch bounds for the upsample_trilinear_3d forward and backward kernels to eliminate register spilling into local memory. The forward pass runs 3-4x faster. The backward pass runtime is unchanged, probably because it is limited by a different bottleneck.
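For context, a rough sketch of why the launch bound matters for spilling: the declared maximum block size caps how many registers the compiler may give each thread, and a kernel that needs more than that cap spills to local memory. The constants below are the published limits for compute capability 7.0 (the Titan V's architecture). The helper name `register_budget` and the block sizes checked are illustrative assumptions, not values taken from this PR, and the estimate ignores register allocation granularity.

```cpp
// Back-of-the-envelope register budget per thread for a given launch bound.
// Limits are for compute capability 7.0 (Volta); other architectures differ.
constexpr int kRegistersPerSM = 65536;       // 32-bit registers available to one block
constexpr int kMaxRegistersPerThread = 255;  // architectural per-thread cap

// Hypothetical helper: approximate upper bound on registers per thread if a
// block of `max_threads_per_block` threads must be able to launch.
constexpr int register_budget(int max_threads_per_block) {
  return (kRegistersPerSM / max_threads_per_block) < kMaxRegistersPerThread
             ? (kRegistersPerSM / max_threads_per_block)
             : kMaxRegistersPerThread;
}

// Illustrative block sizes: a smaller bound leaves more registers per thread.
static_assert(register_budget(1024) == 64, "1024 threads: 64 registers each");
static_assert(register_budget(512) == 128, "512 threads: 128 registers each");
static_assert(register_budget(256) == 255, "256 threads: capped at 255");
```

Under these assumptions, halving the declared block size from 1024 to 512 doubles the per-thread register budget, which is the kind of headroom that can remove spills.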
Timing data (measured on an NVIDIA Titan V GPU):
![TrilinearTimingData](https://user-images.githubusercontent.com/22803332/121979658-72f19200-cd3f-11eb-9363-c00e2c4eea6d.PNG)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59999
Reviewed By: zou3519
Differential Revision: D29185976
Pulled By: ngimel
fbshipit-source-id: 0b2313e70e45c53938cd7262464d3aa4fab8da4a