Fix PyTorch separate compilation (#34863)
Summary:
There appears to be a bug in the CUDA device linker: kernels that use `thrust::sort_by_key` cannot be linked with other kernels.
Work around the problem by splitting the 5 thrust-heavy .cu files into a `__torch_cuda_sp` library, which is statically linked into `torch_cuda`.
For the default compilation workflow, this should not make any difference.
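A rough sketch of the shape of the split, not the actual change in this PR. It assumes the FindCUDA module, where each `cuda_add_library` target gets its own device-link step when `CUDA_SEPARABLE_COMPILATION` is on. The project name, the file names and the `THRUST_HEAVY_SRCS` variable are hypothetical placeholders.

```cmake
cmake_minimum_required(VERSION 3.5)
project(split_sketch CXX)  # hypothetical project name

find_package(CUDA REQUIRED)

# FindCUDA variable: enables separable compilation for the targets
# declared below (normally passed as -DCUDA_SEPARABLE_COMPILATION=YES).
set(CUDA_SEPARABLE_COMPILATION ON)

# Hypothetical placeholders for the thrust-heavy sources.
set(THRUST_HEAVY_SRCS
    ThrustHeavyA.cu
    ThrustHeavyB.cu)

# The thrust-heavy kernels live in their own static library, so they are
# device-linked separately from the rest of the kernels (assumption:
# FindCUDA performs the device link per target).
cuda_add_library(__torch_cuda_sp STATIC ${THRUST_HEAVY_SRCS})

# Placeholder for the main CUDA library and its remaining kernels.
cuda_add_library(torch_cuda SHARED OtherKernels.cu)

# Statically link the split-out library into the main one.
target_link_libraries(torch_cuda __torch_cuda_sp)
```

A real build may additionally need whole-archive linking or similar to keep all symbols from the static library; that is outside this sketch.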
Test Plan: Compile with `-DCUDA_SEPARABLE_COMPILATION=YES` and observe the library size difference: 310 MB before, 173 MB after, when compiled for sm_75
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34863
Differential Revision: D20683972
Pulled By: malfet
fbshipit-source-id: bc1492aa9d1d2d21c48e8764a8a7b403feaec5da