pytorch
9dc2bcdc - Introducing (Const)StridedRandomAccessor + CompositeRandomAccessor + migrate `sort` to ATen (CPU) (#39744)

Summary:

This PR introduces a (Const)StridedRandomAccessor, a [random access iterator](https://en.cppreference.com/w/cpp/named_req/RandomAccessIterator) over a strided array, and a CompositeRandomAccessor, a random access iterator over two random access iterators. The main motivation is to be able to use a handful of operations from the STL and Thrust in numerous dim-apply types of algorithms and to eliminate unnecessary buffer allocations. More advanced algorithms also become available with C++17. Porting `sort` provides a hands-on example of how these iterators can be used (see the illustrative sketch after this commit message).

Fixes [https://github.com/pytorch/pytorch/issues/24770](https://github.com/pytorch/pytorch/issues/24770).

Some benchmarks:

```python
import torch
from IPython import get_ipython

torch.manual_seed(13)
ipython = get_ipython()

sizes = [
    [10000, 10000],
    [1000, 1000, 100]
]

for size in sizes:
    t = torch.randn(*size)
    dims = len(size)
    print(f"Tensor of size {size}")
    for dim in range(dims):
        print(f"sort for dim={dim}")
        print("float:")
        ipython.magic("timeit t.sort(dim)")
    print()
```

#### Master
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.7 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.27 s ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tensor of size [1000, 1000, 100]
sort for dim=0
float:
7.21 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.1 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.58 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

#### This PR
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.5 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.16 s ± 28.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tensor of size [1000, 1000, 100]
sort for dim=0
float:
5.94 s ± 60.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
5.1 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.43 s ± 8.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

As you can see, the legacy sorting routine is actually quite efficient, and the performance gain here likely comes from the improved reduction with TensorIterator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39744
Reviewed By: malfet
Differential Revision: D23796486
Pulled By: glaringlee
fbshipit-source-id: 7bddad10dfbc0a0e5cad7ced155d6c7964e8702c
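To make the idea concrete, below is a minimal, hypothetical sketch of a strided random-access iterator in the spirit of StridedRandomAccessor. The class name `StridedIterator`, the 3x4 example tensor, and the column-sort driver are illustrative assumptions, not the actual ATen code. The point it demonstrates is the one from the summary: once a pointer-plus-stride view satisfies the random access iterator requirements, STL algorithms such as `std::sort` can operate on a non-contiguous dimension in place, with no temporary contiguous buffer. In the actual PR, a CompositeRandomAccessor additionally combines two such iterators, which is what allows, for example, values and their indices to be permuted together.

```cpp
// Hypothetical, simplified sketch; the real (Const)StridedRandomAccessor in ATen
// differs in naming and detail. Shown: sorting each column of a row-major 3x4
// array in place by giving std::sort a strided random-access iterator.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <vector>

template <typename T>
class StridedIterator {
 public:
  using iterator_category = std::random_access_iterator_tag;
  using value_type        = T;
  using difference_type   = std::ptrdiff_t;
  using pointer           = T*;
  using reference         = T&;

  StridedIterator(T* ptr, difference_type stride) : ptr_(ptr), stride_(stride) {}

  reference operator*() const { return *ptr_; }
  reference operator[](difference_type n) const { return ptr_[n * stride_]; }

  StridedIterator& operator++()    { ptr_ += stride_; return *this; }
  StridedIterator  operator++(int) { auto tmp = *this; ++(*this); return tmp; }
  StridedIterator& operator--()    { ptr_ -= stride_; return *this; }
  StridedIterator  operator--(int) { auto tmp = *this; --(*this); return tmp; }
  StridedIterator& operator+=(difference_type n) { ptr_ += n * stride_; return *this; }
  StridedIterator& operator-=(difference_type n) { ptr_ -= n * stride_; return *this; }
  StridedIterator  operator+(difference_type n) const { return {ptr_ + n * stride_, stride_}; }
  StridedIterator  operator-(difference_type n) const { return {ptr_ - n * stride_, stride_}; }
  difference_type  operator-(const StridedIterator& o) const { return (ptr_ - o.ptr_) / stride_; }

  // Comparisons assume a positive stride, which is all this sketch needs.
  bool operator==(const StridedIterator& o) const { return ptr_ == o.ptr_; }
  bool operator!=(const StridedIterator& o) const { return ptr_ != o.ptr_; }
  bool operator<(const StridedIterator& o)  const { return ptr_ <  o.ptr_; }
  bool operator>(const StridedIterator& o)  const { return ptr_ >  o.ptr_; }
  bool operator<=(const StridedIterator& o) const { return ptr_ <= o.ptr_; }
  bool operator>=(const StridedIterator& o) const { return ptr_ >= o.ptr_; }

 private:
  T* ptr_;
  difference_type stride_;
};

int main() {
  // A 3x4 row-major "tensor". Sorting along dim=0 means sorting each column,
  // i.e. elements that sit 4 floats apart in memory.
  std::vector<float> data = {9, 1, 5, 3,
                             4, 8, 2, 7,
                             0, 6, 3, 5};
  const std::ptrdiff_t stride = 4;  // distance between consecutive rows
  const std::ptrdiff_t rows = 3;

  for (std::ptrdiff_t col = 0; col < 4; ++col) {
    StridedIterator<float> begin(data.data() + col, stride);
    // std::sort works directly on the strided view; no contiguous copy is made.
    std::sort(begin, begin + rows);
  }

  for (float v : data) std::cout << v << ' ';
  std::cout << '\n';  // prints: 0 1 2 3 4 6 3 5 9 8 5 7 (each column sorted)
}
```

The same dim-apply pattern is what a ported kernel can do for every 1-D slice along the sorted dimension, which is why no per-slice buffer allocation is needed once such iterators are available.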