Fix PyTorch separate compilation (Reland) (#35581)
Summary:
There appears to be a bug in the CUDA device linker: kernels that use `thrust::sort_by_key` cannot be linked with other kernels.
Solve the problem by splitting 5 thrust-heavy .cu files into a separate `__torch_cuda_sp` library, which is statically linked into `torch_cuda`.
For the default compilation workflow, this should not make any difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35581
Test Plan: Compile with `-DCUDA_SEPARABLE_COMPILATION=YES` and observe the library size difference: 310 MB before, 173 MB after when compiled for sm_75
Differential Revision: D20741379
Pulled By: malfet
fbshipit-source-id: e9083968324c113e44a39df0de356d79af8e7057