pytorch
61305cd6 - Improve small sort performance on CUDA

Commit

2 years ago

Improve small sort performance on CUDA

Currently, `bitonicSortKVInPlace` is written to sort one array per block of threads. If that dimension happens to be very small (<128 elements), this results in low thread occupancy.

Instead, this changes `bitonicSortKVInPlace` to operate with a 2d block. Sorting happens along the x dimension, and the y dimension is a fixed size batch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79627
Approved by: https://github.com/ngimel
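
To illustrate the 2D-block idea described in the message, here is a minimal sketch of a batched bitonic key/value sort. It is not the `bitonicSortKVInPlace` implementation from the PR: the names `smallSortKV` and `sortSmallRows`, the constants `kSortSize` and `kBatch`, and the `int` key/value types are all assumptions made for illustration. It also assumes every key is smaller than `INT_MAX`, which is used as padding, and it omits CUDA error checking.

```cuda
#include <cassert>
#include <climits>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes for this sketch, not the values used in PyTorch.
constexpr int kSortSize = 32;  // padded row length, a power of two (x dimension)
constexpr int kBatch = 16;     // rows sorted by one block (y dimension)

// Each block sorts up to kBatch rows of key/value pairs, ascending by key.
// threadIdx.y selects the row; threadIdx.x walks the bitonic network.
__global__ void smallSortKV(int* keys, int* vals, int numRows, int rowLen) {
  __shared__ int sKeys[kBatch][kSortSize];
  __shared__ int sVals[kBatch][kSortSize];

  const int x = threadIdx.x;
  const int y = threadIdx.y;
  const int row = blockIdx.x * kBatch + y;
  const bool inRange = row < numRows && x < rowLen;

  // One element per thread; INT_MAX padding sorts to the end of the row.
  sKeys[y][x] = inRange ? keys[row * rowLen + x] : INT_MAX;
  sVals[y][x] = inRange ? vals[row * rowLen + x] : 0;
  __syncthreads();

  for (int size = 2; size <= kSortSize; size *= 2) {
    for (int stride = size / 2; stride > 0; stride /= 2) {
      const int partner = x ^ stride;
      if (partner > x) {  // the lower-indexed thread of each pair does the work
        const bool ascending = (x & size) == 0;
        const int a = sKeys[y][x];
        const int b = sKeys[y][partner];
        if (ascending ? (a > b) : (a < b)) {
          sKeys[y][x] = b;
          sKeys[y][partner] = a;
          const int v = sVals[y][x];
          sVals[y][x] = sVals[y][partner];
          sVals[y][partner] = v;
        }
      }
      __syncthreads();  // reached by every thread: the loop bounds are uniform
    }
  }

  if (inRange) {
    keys[row * rowLen + x] = sKeys[y][x];
    vals[row * rowLen + x] = sVals[y][x];
  }
}

// Host helper (hypothetical): sorts numRows rows of rowLen pairs in place.
void sortSmallRows(std::vector<int>& keys, std::vector<int>& vals,
                   int numRows, int rowLen) {
  assert(rowLen <= kSortSize);
  if (numRows == 0) return;
  const size_t bytes = keys.size() * sizeof(int);
  int* dKeys = nullptr;
  int* dVals = nullptr;
  cudaMalloc(&dKeys, bytes);
  cudaMalloc(&dVals, bytes);
  cudaMemcpy(dKeys, keys.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dVals, vals.data(), bytes, cudaMemcpyHostToDevice);

  const dim3 block(kSortSize, kBatch);  // 32 x 16 = 512 threads per block
  const int grid = (numRows + kBatch - 1) / kBatch;
  smallSortKV<<<grid, block>>>(dKeys, dVals, numRows, rowLen);

  // The blocking copies below also wait for the kernel to finish.
  cudaMemcpy(keys.data(), dKeys, bytes, cudaMemcpyDeviceToHost);
  cudaMemcpy(vals.data(), dVals, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dKeys);
  cudaFree(dVals);
}
```

With these assumed sizes, a block holds 512 threads and handles 16 rows at once, whereas a one-row-per-block layout would launch only 32 threads per block for the same row length. The batch size is fixed at compile time so that the shared-memory arrays can be statically sized.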

Author

pytorchmergebot

Committer

pytorchmergebot

Parents

9244547a
