1e2b2ee5 - sort_out_cuda: Use custom kernels to fill index tensors (#66668)

Summary: These stable sorts currently use a combination of `at::arange`, view ops, and `tensor.copy_` to fill in the initial values for the indices before calling into `CUB` to do the actual sort. This is somewhat inefficient because it requires 2 to 4 kernel launches, and the copies all use strided kernels instead of the more efficient contiguous kernels. A fairly straightforward custom kernel is more efficient in terms of both CUDA and CPU runtime.

In a simple benchmark I profiled `a.sort(stable=True, dim=1)` for different shapes and singled out the kernel invocations that initialize the index tensors (i.e. the non-`cub` kernels). Note that when the batch dim is `<128` we call `segmented_sort_pairs_by_full_sort` instead of `segmented_sort_pairs`:

| shape        | Master (us) | This PR (us) |
|--------------|:-----------:|:------------:|
| (100, 1000)  |    5.000    |     2.300    |
| (1000, 100)  |    2.070    |     1.090    |
| (100, 10000) |    87.34    |     26.47    |
| (1000, 1000) |    28.63    |     20.27    |

Of course, for sufficiently large inputs the overall runtime is dominated by the actual sort. But I have another motive: removing the operator calls from the middle of this kernel launch code. This change makes it easier to split the kernel code that needs to be compiled with `nvcc` into its own file that doesn't include `Tensor.h`, similar to what I'm doing in https://github.com/pytorch/pytorch/issues/66620.
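For illustration, here is a minimal, self-contained sketch of the kind of index-filling kernel the summary describes: a single grid-stride kernel writes 0, 1, ..., nsort - 1 into each segment of a contiguous index buffer, replacing the `at::arange` + view + `tensor.copy_` sequence. This is not the PR's actual code; the kernel and variable names (`fill_index_kernel`, `nsegments`, `nsort`) are hypothetical.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Sketch (not the PR's code): fill a contiguous [nsegments, nsort] index
// buffer so that each segment (row) holds 0, 1, ..., nsort - 1, using a
// single kernel launch.
__global__ void fill_index_kernel(int64_t* indices, int64_t nsegments,
                                  int64_t nsort) {
  const int64_t total = nsegments * nsort;
  // Grid-stride loop: any launch configuration covers the whole buffer.
  for (int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
       i < total;
       i += (int64_t)gridDim.x * blockDim.x) {
    indices[i] = i % nsort;  // offset of element i within its segment
  }
}

int main() {
  const int64_t nsegments = 4, nsort = 8;
  int64_t* d_indices;
  cudaMalloc(&d_indices, nsegments * nsort * sizeof(int64_t));

  const int threads = 256;
  const int blocks = (int)((nsegments * nsort + threads - 1) / threads);
  fill_index_kernel<<<blocks, threads>>>(d_indices, nsegments, nsort);

  int64_t host[nsegments * nsort];
  cudaMemcpy(host, d_indices, sizeof(host), cudaMemcpyDeviceToHost);
  for (int64_t s = 0; s < nsegments; ++s) {
    for (int64_t j = 0; j < nsort; ++j)
      printf("%lld ", (long long)host[s * nsort + j]);  // prints 0..7 per row
    printf("\n");
  }
  cudaFree(d_indices);
  return 0;
}
```

Because consecutive threads write consecutive elements, the fill is a single coalesced pass over contiguous memory, which avoids both the extra launches and the strided-copy kernels mentioned above.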
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66668
Reviewed By: H-Huang
Differential Revision: D31693722
Pulled By: ngimel
fbshipit-source-id: 5765926e4dbbc7a20d2940c098ed093b3de2204e