Implement `gpu_kernel_multiple_outputs` (#37969)
Summary:
This PR introduces a variant of `gpu_kernel` for functions that return multiple values with `thrust::tuple`.
Using it, this PR simplifies `prelu_cuda_backward_share_weights_kernel`.
### Why `thrust::tuple`?
Because `std::tuple` does not support `operator=` in device code, which would complicate the implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37969
Reviewed By: paulshaoyuqiao
Differential Revision: D22868670
Pulled By: ngimel
fbshipit-source-id: eda0a29ac0347ad544b24bf60e3d809a7db1a929