speed up 1d sort (#77100)
This speeds up sort in the 1d case, where segment sorting is not required (by approximately 2x for large sizes), and slightly reduces memory usage. I'm not sure whether memory usage can be meaningfully improved further.
Slightly helps #77049
I'll update the PR to do the same for multiple segments, provided that each segment is large.
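For reviewers less familiar with the distinction, here is a rough pure-Python sketch of single-segment versus segmented sorting. It is illustrative only: `sort_1d` and `sort_segmented` are hypothetical names, and this is not the CUDA code changed in this PR.

```python
# Illustrative stand-ins only; hypothetical helpers, not the PyTorch implementation.

def sort_1d(values):
    """Single-segment case: one key-value sort over the whole input."""
    # Sort the original positions by their value (Python's sort is stable).
    order = sorted(range(len(values)), key=values.__getitem__)
    return [values[i] for i in order], order


def sort_segmented(values, segment_size):
    """Multi-segment case: each contiguous segment is sorted independently."""
    out_values, out_indices = [], []
    for start in range(0, len(values), segment_size):
        seg_values, seg_indices = sort_1d(values[start:start + segment_size])
        out_values.extend(seg_values)
        out_indices.extend(seg_indices)  # indices are relative to each segment
    return out_values, out_indices


vals, idx = sort_1d([3, 1, 2])
seg_vals, seg_idx = sort_segmented([3, 1, 2, 9, 7, 8], segment_size=3)
```

In the 1d case there is a single segment, so the per-segment bookkeeping is unnecessary and a plain key-value sort is enough.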
cc @peterbell10, I had to comment out TORCH_ASSERT_NO_OPERATORS because I was getting compilation errors otherwise. Do you know what's up?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77100
Approved by: https://github.com/zasdfgbnm, https://github.com/mruberry