pytorch
67a5d0bf - Use cub::BlockRadixSort to improve medium length sort performance

Commit
2 years ago
In my testing, replacing the custom bitonic sort with cub's block level radix sort primitives improves overall sort performance by up to 3x, depending on input length. This also benefits from being a stable sort, and so we get up to 25x speedup for small stable sorts and around 2x speedup on the largest supported size.

In testing, the radix sort benefits a lot from having more items per thread and so it does break down at very small sizes. So, for the 32-item sort I've left the bitonic sorting algorithm in place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79628
Approved by: https://github.com/ngimel
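
The stability point can be illustrated with a minimal CPU sketch in C++. This is not the PyTorch kernel and it does not use cub. It is a plain least-significant-digit radix sort over (key, index) pairs, which keeps equal keys in their input order, something a bitonic sorting network does not guarantee by itself. The names `radix_sort_pairs` and `stable_argsort`, the 8-bit digit width, and the `uint32_t` key type are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not a PyTorch or cub API): stable LSD radix sort of
// (key, value) pairs by key, processing 8 bits of the key per pass.
inline void radix_sort_pairs(std::vector<std::uint32_t>& keys,
                             std::vector<std::int64_t>& values) {
  assert(keys.size() == values.size());
  const std::size_t n = keys.size();
  std::vector<std::uint32_t> keys_tmp(n);
  std::vector<std::int64_t> values_tmp(n);

  for (int shift = 0; shift < 32; shift += 8) {
    // Count how many keys fall into each of the 256 digit buckets.
    std::array<std::size_t, 257> offsets{};
    for (std::size_t i = 0; i < n; ++i) {
      ++offsets[((keys[i] >> shift) & 0xFF) + 1];
    }
    // Prefix sum: offsets[d] becomes the first output slot for digit d.
    for (std::size_t d = 0; d < 256; ++d) {
      offsets[d + 1] += offsets[d];
    }
    // Scatter in input order. Elements with the same digit keep their
    // relative order, which is what makes every pass (and the whole
    // sort) stable.
    for (std::size_t i = 0; i < n; ++i) {
      const std::size_t dst = offsets[(keys[i] >> shift) & 0xFF]++;
      keys_tmp[dst] = keys[i];
      values_tmp[dst] = values[i];
    }
    keys.swap(keys_tmp);
    values.swap(values_tmp);
  }
}

// Hypothetical helper: returns the permutation that sorts `keys` in
// ascending order, with ties kept in input order.
inline std::vector<std::int64_t> stable_argsort(std::vector<std::uint32_t> keys) {
  std::vector<std::int64_t> indices(keys.size());
  std::iota(indices.begin(), indices.end(), std::int64_t{0});
  radix_sort_pairs(keys, indices);
  return indices;
}
```

Calling `stable_argsort` on keys that contain duplicates returns the indices of equal keys in ascending order, so no separate tie-breaking pass is needed. On the GPU the commit relies on cub's block-level primitives rather than a loop like this one; the sketch only shows why a radix sort is stable by construction.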