
pytorch
8c0796e5 - Use cub::BlockRadixSort to improve medium length sort performance (#79628)


Commit

2 years ago

Use cub::BlockRadixSort to improve medium length sort performance (#79628)

In my testing, replacing the custom bitonic sort with cub's block-level radix sort primitives improves overall sort performance by up to 3x, depending on input length. The radix sort is also stable, so stable sorts benefit as well: up to 25x speedup for small stable sorts and around 2x speedup on the largest supported size.

In testing, the radix sort benefits a lot from having more items per thread, meaning it breaks down a bit at very small sizes. So, for the 32-item sort I've left the bitonic sorting algorithm in place.

Binary size is also reduced in this change, because I have moved the `descending` branch into the kernel itself, which I found not to affect performance. The result is a 1.9 MB decrease in `torch_cuda.so` on my build for one CUDA architecture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79628
Approved by: https://github.com/ngimel
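The commit message relies on two properties: radix sort is stable, and `descending` can be an ordinary runtime value rather than something fixed at compile time. The following is a rough CPU-side sketch of both ideas in plain C++. It is not the CUB kernel from this commit, and the function name `radix_sort_pairs` and the bit-flipping trick for descending order are assumptions chosen for illustration only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not from the PyTorch source): stable LSD radix sort
// of (key, original index) pairs, processing 8 bits per pass.
// `descending` is a plain runtime argument, so a single compiled routine
// serves both sort directions.
std::vector<std::pair<std::uint32_t, int>> radix_sort_pairs(
    const std::vector<std::uint32_t>& keys, bool descending) {
  using Pair = std::pair<std::uint32_t, int>;

  // Flipping every bit reverses the order of unsigned keys, so a descending
  // sort becomes an ascending sort on the flipped keys.
  const std::uint32_t flip = descending ? 0xFFFFFFFFu : 0u;

  std::vector<Pair> in(keys.size()), out(keys.size());
  for (std::size_t i = 0; i < keys.size(); ++i) {
    in[i] = {keys[i] ^ flip, static_cast<int>(i)};
  }

  for (int shift = 0; shift < 32; shift += 8) {
    // Histogram of the current 8-bit digit, stored one slot to the right.
    std::array<std::size_t, 257> offsets{};
    for (const Pair& p : in) {
      ++offsets[((p.first >> shift) & 0xFFu) + 1];
    }
    // Prefix sum: offsets[d] becomes the first output slot for digit d.
    for (std::size_t d = 1; d < offsets.size(); ++d) {
      offsets[d] += offsets[d - 1];
    }
    // Scatter in input order; this is what keeps equal keys stable.
    for (const Pair& p : in) {
      out[offsets[(p.first >> shift) & 0xFFu]++] = p;
    }
    in.swap(out);
  }

  // Restore the original key bits before returning.
  for (Pair& p : in) {
    p.first ^= flip;
  }
  return in;
}
```

Calling `radix_sort_pairs({3, 1, 3, 0, 1}, true)` returns the keys in descending order while equal keys keep their original relative order, which is the guarantee a stable sort has to provide. A GPU block-level sort distributes the same digit-by-digit passes across the threads of a block, which is consistent with the observation above that it works best with several items per thread.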

Author

peterbell10


Committer

pytorchmergebot


