Improve small sort performance on CUDA
Currently, `bitonicSortKVInPlace` sorts one array per thread block.
If the sorted dimension is very small (<128 elements), each block
has only a few threads, which results in low occupancy.
Instead, this changes `bitonicSortKVInPlace` to operate with a 2D
block. Sorting happens along the x dimension, and the y dimension
indexes a fixed-size batch of arrays sorted by the same block.
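A minimal sketch of the 2D-block idea, not the actual PyTorch kernel.
It sorts keys only (no values), and the names `batchedBitonicSort`,
`sortSlicesDemo`, `kSortSize`, and `kBatch`, as well as the sizes used,
are hypothetical and chosen only for illustration:

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sizes, for illustration only.
constexpr int kSortSize = 32;  // elements per slice (must be a power of two)
constexpr int kBatch = 8;      // slices handled per block, along threadIdx.y

// Each row of threads (fixed threadIdx.y) sorts one slice in shared memory.
// Assumes blockDim.x == kSortSize / 2 and blockDim.y == kBatch.
__global__ void batchedBitonicSort(float* data, int numSlices) {
  __shared__ float tile[kBatch][kSortSize];
  const int slice = blockIdx.x * kBatch + threadIdx.y;
  const bool active = slice < numSlices;  // the last block may be partly idle
  float* row = tile[threadIdx.y];

  if (active) {
    for (int i = threadIdx.x; i < kSortSize; i += blockDim.x)
      row[i] = data[slice * kSortSize + i];
  }
  __syncthreads();

  for (int size = 2; size <= kSortSize; size *= 2) {
    for (int stride = size / 2; stride > 0; stride /= 2) {
      if (active) {
        // Map the thread to the lower index of its compare-exchange pair.
        const int t = threadIdx.x;
        const int lo = 2 * t - (t & (stride - 1));
        const int hi = lo + stride;
        const bool ascending = (lo & size) == 0;
        const float a = row[lo];
        const float b = row[hi];
        if ((a > b) == ascending) {
          row[lo] = b;
          row[hi] = a;
        }
      }
      // Kept outside the `active` branch so every thread reaches the barrier.
      __syncthreads();
    }
  }

  if (active) {
    for (int i = threadIdx.x; i < kSortSize; i += blockDim.x)
      data[slice * kSortSize + i] = row[i];
  }
}

// Sorts 20 slices on the GPU and compares against std::sort on the host.
bool sortSlicesDemo() {
  const int numSlices = 20;  // not a multiple of kBatch, to exercise the tail
  std::vector<float> host(static_cast<std::size_t>(numSlices) * kSortSize);
  for (std::size_t i = 0; i < host.size(); ++i)
    host[i] = static_cast<float>((i * 7919u) % 101u);

  std::vector<float> expected = host;
  for (int s = 0; s < numSlices; ++s)
    std::sort(expected.begin() + s * kSortSize,
              expected.begin() + (s + 1) * kSortSize);

  float* dev = nullptr;
  const std::size_t bytes = host.size() * sizeof(float);
  if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
  cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

  const dim3 block(kSortSize / 2, kBatch);  // x: sort dimension, y: batch
  const dim3 grid((numSlices + kBatch - 1) / kBatch);
  batchedBitonicSort<<<grid, block>>>(dev, numSlices);

  const cudaError_t err =
      cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dev);
  return err == cudaSuccess && host == expected;
}
```

With these illustrative sizes, a block has 16 x 8 = 128 threads
instead of the 16 it would have with one slice per block. Calling
`sortSlicesDemo()` from a `main` is enough to try the sketch.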
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79627
Approved by: https://github.com/ngimel