Replace thrust with cub in randperm (#53841)
Summary:
Benchmark:
```python
%timeit torch.randperm(100000, device='cuda'); torch.cuda.synchronize()
```
thrust:
```
5.76 ms ± 42.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
cub:
```
3.02 ms ± 32.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
The synchronization in the thrust sort is removed, and the benchmark above runs roughly 1.9x faster (5.76 ms down to 3.02 ms).
Warning:
Thrust supports 64-bit indexing, but cub does not, so this is a functional regression. However, `torch.randperm(2**31, device='cuda')` fails with OOM on a 40GB A100, and `torch.randperm(2**32, device='cuda')` fails with OOM on an 80GB A100, so I think this functional regression has low impact and is acceptable.
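For a rough sense of why those sizes run out of memory, here is a back-of-the-envelope estimate. It is only a sketch: `randperm_min_bytes` is a hypothetical helper, and it assumes the default int64 output plus one same-sized int64 key buffer for a key-value sort. Temporary sort storage and any double buffering are ignored, so the real footprint is likely higher.

```python
# Rough lower bound on device memory for torch.randperm(n, device='cuda').
# Assumption (not taken from the implementation): one int64 output array
# plus one same-sized int64 key array; sort temp storage is not counted.

GIB = 1024 ** 3
INT64_BYTES = 8  # torch.randperm defaults to dtype=torch.int64


def randperm_min_bytes(n: int, buffers: int = 2) -> int:
    """Bytes needed for `buffers` int64 arrays of length n (hypothetical estimate)."""
    return n * INT64_BYTES * buffers


for exp in (31, 32):
    n = 2 ** exp
    out_gib = randperm_min_bytes(n, buffers=1) / GIB
    est_gib = randperm_min_bytes(n) / GIB
    print(f"n = 2**{exp}: output alone {out_gib:.0f} GiB, with keys at least {est_gib:.0f} GiB")
```

Under these assumptions the output alone is 16 GiB at `2**31` elements and 32 GiB at `2**32`, before counting any sort buffers.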
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53841
Reviewed By: albanD
Differential Revision: D26993453
Pulled By: ngimel
fbshipit-source-id: 39dd128559d53dbb01cab1585e5462cb5f3cceca