Use cub::BlockRadixSort to improve medium length sort performance (#79628)
In my testing, replacing the custom bitonic sort with cub's block-level
radix sort primitives improves overall sort performance by up to 3x,
depending on input length. Because radix sort is stable, stable sorts
benefit too: up to 25x speedup for small stable sorts and around 2x
speedup on the largest supported size.
In testing, the radix sort benefits a lot from having more items per
thread, so it performs less well at very small sizes. For the 32-item
sort I've therefore left the bitonic sorting algorithm in place.
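As a rough host-side illustration of that split, the selection could look
something like the sketch below. The names `SmallSortAlgo`,
`kBitonicMaxItems` and `choose_small_sort` are hypothetical and are not
taken from the PyTorch sources; only the 32-item cutoff comes from the
description above, and treating it as a "32 or fewer" threshold is an
assumption.

```cpp
#include <cstddef>

// Hypothetical names for illustration only; not the actual PyTorch code.
enum class SmallSortAlgo { Bitonic, BlockRadix };

// Assumption: slices of at most 32 items keep using the bitonic sort,
// since radix sort has too few items per thread to work with there.
constexpr std::size_t kBitonicMaxItems = 32;

// Pick the block-level algorithm for a slice of `slice_size` items.
constexpr SmallSortAlgo choose_small_sort(std::size_t slice_size) {
  return slice_size <= kBitonicMaxItems ? SmallSortAlgo::Bitonic
                                        : SmallSortAlgo::BlockRadix;
}
```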
Binary size is also reduced in this change because I moved the
`descending` branch into the kernel itself, which I found does not affect
performance. The result is a 1.9 MB decrease in `torch_cuda.so` on
my build for one CUDA architecture.
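To show why this shrinks the binary, here is a plain C++ analogy rather
than the actual kernel code. It assumes the earlier code chose between
separate instantiations per sort direction; `sort_templated` and
`sort_runtime` are hypothetical names. With a compile-time direction every
combination is compiled twice, whereas a runtime flag needs only one copy
and pays a cheap branch per comparison.

```cpp
#include <algorithm>
#include <vector>

// Compile-time direction: sort_templated<true> and sort_templated<false>
// are two separate instantiations, so the sorting code is emitted twice.
template <bool Descending>
void sort_templated(std::vector<int>& v) {
  std::sort(v.begin(), v.end(),
            [](int a, int b) { return Descending ? a > b : a < b; });
}

// Runtime direction: a single function whose comparator branches on the
// flag, so only one copy of the sorting code is emitted.
void sort_runtime(std::vector<int>& v, bool descending) {
  std::sort(v.begin(), v.end(),
            [descending](int a, int b) { return descending ? a > b : a < b; });
}
```

Both variants produce the same ordering; the difference is only in how
many copies of the code the compiler has to generate.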
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79628
Approved by: https://github.com/ngimel