pytorch
61305cd6 - Improve small sort performance on CUDA

Commit
2 years ago
Currently, `bitonicSortKVInPlace` is written to sort one array per block of threads. If the sorted dimension happens to be very small (<128 elements), this results in low thread occupancy.

Instead, this changes `bitonicSortKVInPlace` to operate with a 2D block. Sorting happens along the x dimension, and the y dimension is a fixed-size batch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79627
Approved by: https://github.com/ngimel
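To illustrate the 2D-block idea described in the commit message, here is a minimal sketch, not the actual `bitonicSortKVInPlace` code. It assumes `int` keys and values, one thread per element, a padded row length `kSortSize`, and `kBatch` rows per block; all of these names, along with `smallSortKV` and `sortSmallRows`, are hypothetical and chosen only for this example. Each row is sorted along `threadIdx.x`, and `threadIdx.y` selects which row of the batch a thread works on, so a block stays full even when rows are short.

```cuda
#include <cuda_runtime.h>
#include <climits>
#include <vector>

// Hypothetical sizes for this sketch (not taken from the PyTorch source).
constexpr int kSortSize = 32;  // padded row length, must be a power of two
constexpr int kBatch = 16;     // rows sorted by one block (y dimension)

// Sorts each row of a row-major [numRows x rowLen] key/value array in
// ascending key order. Assumes rowLen <= kSortSize and keys < INT_MAX.
__global__ void smallSortKV(int* keys, int* vals, int numRows, int rowLen) {
  __shared__ int sKeys[kBatch][kSortSize];
  __shared__ int sVals[kBatch][kSortSize];

  const int col = threadIdx.x;                        // position within a row
  const int row = blockIdx.x * kBatch + threadIdx.y;  // which row of the batch
  const bool live = row < numRows && col < rowLen;

  // Pad short rows with INT_MAX so padding sinks to the end of the row.
  sKeys[threadIdx.y][col] = live ? keys[row * rowLen + col] : INT_MAX;
  sVals[threadIdx.y][col] = live ? vals[row * rowLen + col] : -1;
  __syncthreads();

  // Standard bitonic network along x; every row runs it independently.
  for (int size = 2; size <= kSortSize; size <<= 1) {
    for (int stride = size >> 1; stride > 0; stride >>= 1) {
      const int partner = col ^ stride;
      if (partner > col) {
        const bool ascending = (col & size) == 0;
        const int a = sKeys[threadIdx.y][col];
        const int b = sKeys[threadIdx.y][partner];
        if (ascending ? (a > b) : (a < b)) {
          // Swap the keys and carry the values along with them.
          sKeys[threadIdx.y][col] = b;
          sKeys[threadIdx.y][partner] = a;
          const int v = sVals[threadIdx.y][col];
          sVals[threadIdx.y][col] = sVals[threadIdx.y][partner];
          sVals[threadIdx.y][partner] = v;
        }
      }
      __syncthreads();  // loop bounds are uniform, so all threads reach this
    }
  }

  if (live) {
    keys[row * rowLen + col] = sKeys[threadIdx.y][col];
    vals[row * rowLen + col] = sVals[threadIdx.y][col];
  }
}

// Host helper: copies data to the device, launches one 2D block per kBatch
// rows, and copies the result back. Returns false on any failure.
bool sortSmallRows(std::vector<int>& keys, std::vector<int>& vals,
                   int numRows, int rowLen) {
  if (numRows <= 0 || rowLen <= 0 || rowLen > kSortSize) return false;
  const size_t count = static_cast<size_t>(numRows) * rowLen;
  if (keys.size() != count || vals.size() != count) return false;

  const size_t bytes = count * sizeof(int);
  int* dKeys = nullptr;
  int* dVals = nullptr;
  if (cudaMalloc(&dKeys, bytes) != cudaSuccess) return false;
  if (cudaMalloc(&dVals, bytes) != cudaSuccess) { cudaFree(dKeys); return false; }

  bool ok = cudaMemcpy(dKeys, keys.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
            cudaMemcpy(dVals, vals.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
  if (ok) {
    const dim3 block(kSortSize, kBatch);  // x: sort dimension, y: batch of rows
    const dim3 grid((numRows + kBatch - 1) / kBatch);
    smallSortKV<<<grid, block>>>(dKeys, dVals, numRows, rowLen);
    ok = cudaDeviceSynchronize() == cudaSuccess &&
         cudaMemcpy(keys.data(), dKeys, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
         cudaMemcpy(vals.data(), dVals, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
  }
  cudaFree(dKeys);
  cudaFree(dVals);
  return ok;
}
```

With these assumed sizes, a block holds 32 x 16 = 512 threads regardless of how short each row is, whereas a one-row-per-block launch for a 32-element row would run only 32 threads per block. Like any bitonic network, this sketch is not a stable sort.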