Use cub::BlockRadixSort to improve medium-length sort performance

In my testing, replacing the custom bitonic sort with cub's block-level
radix sort primitives improves overall sort performance by up to 3x,
depending on input length. Radix sort is also stable, so stable sorts
benefit as well: up to 25x speedup for small stable sorts and around
2x speedup at the largest supported size.

In testing, the radix sort benefits a lot from having more items per
thread, so it breaks down at very small sizes. For the 32-item sort
I've therefore left the bitonic sorting algorithm in place.

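For readers unfamiliar with the primitive, here is a minimal standalone sketch of a block-level key/value radix sort with cub. It is not the code from this change: the kernel name `block_sort_kv`, the `float`/`int` types, and the 32-thread, 4-items-per-thread configuration are assumptions chosen only for illustration.

```cuda
#include <cub/cub.cuh>
#include <cstdio>
#include <vector>

// Hypothetical kernel: each block sorts one slice of
// BLOCK_THREADS * ITEMS_PER_THREAD key/value pairs in place.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void block_sort_kv(float* keys, int* values) {
  using BlockSort =
      cub::BlockRadixSort<float, BLOCK_THREADS, ITEMS_PER_THREAD, int>;
  __shared__ typename BlockSort::TempStorage temp_storage;

  constexpr int kSliceSize = BLOCK_THREADS * ITEMS_PER_THREAD;
  const int offset = blockIdx.x * kSliceSize;

  // Each thread owns ITEMS_PER_THREAD consecutive items (blocked layout).
  float thread_keys[ITEMS_PER_THREAD];
  int thread_values[ITEMS_PER_THREAD];
  cub::LoadDirectBlocked(threadIdx.x, keys + offset, thread_keys);
  cub::LoadDirectBlocked(threadIdx.x, values + offset, thread_values);

  // Collective ascending sort across the whole thread block.
  BlockSort(temp_storage).Sort(thread_keys, thread_values);

  cub::StoreDirectBlocked(threadIdx.x, keys + offset, thread_keys);
  cub::StoreDirectBlocked(threadIdx.x, values + offset, thread_values);
}

int main() {
  // Illustrative sizes only; not the configuration used in this change.
  constexpr int kThreads = 32;
  constexpr int kItems = 4;
  constexpr int kN = kThreads * kItems;

  // Keys contain duplicates so that stability is observable; values
  // record each element's original position.
  std::vector<float> h_keys(kN);
  std::vector<int> h_vals(kN);
  for (int i = 0; i < kN; ++i) {
    h_keys[i] = static_cast<float>((i * 7) % 16);
    h_vals[i] = i;
  }

  float* d_keys = nullptr;
  int* d_vals = nullptr;
  cudaMalloc(&d_keys, kN * sizeof(float));
  cudaMalloc(&d_vals, kN * sizeof(int));
  cudaMemcpy(d_keys, h_keys.data(), kN * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_vals, h_vals.data(), kN * sizeof(int), cudaMemcpyHostToDevice);

  block_sort_kv<kThreads, kItems><<<1, kThreads>>>(d_keys, d_vals);

  // These copies synchronize with the kernel on the default stream.
  cudaMemcpy(h_keys.data(), d_keys, kN * sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(h_vals.data(), d_vals, kN * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_keys);
  cudaFree(d_vals);

  // Keys must be ascending; equal keys must keep their original order.
  bool ok = true;
  for (int i = 1; i < kN; ++i) {
    const bool ascending = h_keys[i - 1] < h_keys[i];
    const bool stable_tie =
        h_keys[i - 1] == h_keys[i] && h_vals[i - 1] < h_vals[i];
    if (!ascending && !stable_tie) ok = false;
  }
  std::printf("%s\n", ok ? "sorted and stable" : "MISMATCH");
  return ok ? 0 : 1;
}
```

Because `ITEMS_PER_THREAD` is a template parameter, each slice size needs its own instantiation. Very small slices leave few items per thread, which is the regime where the radix sort performed poorly in my testing.
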
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79628
Approved by: https://github.com/ngimel