Introducing (Const)StridedRandomAccessor + CompositeRandomAccessor + migrate `sort` to ATen (CPU) (#39744)
Summary:
This PR introduces a (Const)StridedRandomAccessor, a [random access iterator](https://en.cppreference.com/w/cpp/named_req/RandomAccessIterator) over a strided array, and a CompositeRandomAccessor, a random access iterator that zips two random access iterators and advances them in lockstep.
The main motivation is to be able to reuse algorithms from the STL and Thrust in the numerous dim-apply types of algorithms, and to eliminate unnecessary buffer allocations. More advanced algorithms will also become available with C++17.
Porting `sort` provides a hands-on example of how these iterators can be used.
Fixes [https://github.com/pytorch/pytorch/issues/24770](https://github.com/pytorch/pytorch/issues/24770).
Some benchmarks:
```python
import torch
from IPython import get_ipython

torch.manual_seed(13)
ipython = get_ipython()

sizes = [
    [10000, 10000],
    [1000, 1000, 100],
]

for size in sizes:
    t = torch.randn(*size)
    dims = len(size)
    print(f"Tensor of size {size}")
    for dim in range(dims):
        print(f"sort for dim={dim}")
        print("float:")
        ipython.magic("timeit t.sort(dim)")
    print()
```
#### Master
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.7 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.27 s ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tensor of size [1000, 1000, 100]
sort for dim=0
float:
7.21 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.1 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.58 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
#### This PR
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.5 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.16 s ± 28.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tensor of size [1000, 1000, 100]
sort for dim=0
float:
5.94 s ± 60.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
5.1 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.43 s ± 8.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
As the numbers show, the legacy sorting routine was already quite efficient; the speedup here most likely comes from the improved reduction over the non-sorted dimensions via TensorIterator rather than from the sort itself.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39744
Reviewed By: malfet
Differential Revision: D23796486
Pulled By: glaringlee
fbshipit-source-id: 7bddad10dfbc0a0e5cad7ced155d6c7964e8702c