CUDA: Replace init_offsets kernel with iterators in cub-based argsort (#18930)
* CUDA: Replace `init_offsets` with iterators in argsort
This is a QOL improvement, saving us the cost of materializing the
iterator
* Remove unnecessary include from top-k.cu