pytorch
8a28bbee - various TensorIterator speed improvements (#58810)

various TensorIterator speed improvements (#58810)

Summary:
1) Remove pushing back to the strides vector for 1D tensors; those strides are never used in the loop anyway.
2) Avoid calling get_data_ptrs unless necessary.
3) Don't call into assert_no_partial_overlap if the TensorImpls are the same (assert_no_partial_overlap performs this comparison too, but only after a couple of nested function calls).
4) Use is_non_overlapping_and_dense instead of is_contiguous in the memory overlap check; for some reason it is faster than is_contiguous, even though I expected the two to cost the same now that is_contiguous is non-virtualized.

Altogether this brings the instruction count down from ~110K to 102735 for the following binary in-place benchmark:

```
In [2]: timer = Timer("m1.add_(b);", setup="at::Tensor m1=torch::empty({1}); at::Tensor b = torch::empty({1});", language="c++", timer=timeit.default_timer)
   ...: stats = timer.collect_callgrind(number=30, repeats=3)
   ...: print(stats[1].as_standardized().stats(inclusive=False))
```

There are similar improvements for unary in-place ops.

Update: the stride packing has been restored for now; the count is now 104295, so packing costs ~52 instructions. We should think about how to remove it safely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58810
Reviewed By: bhosmer
Differential Revision: D28664514
Pulled By: ngimel
fbshipit-source-id: 2e03cf90b37a411d9994a7607402645f1d8f3c93
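The early exit in item (3) is easy to picture outside of TensorIterator. Below is a minimal libtorch sketch, not the code that actually landed: check_partial_overlap is a hypothetical wrapper, and it assumes the public at::assert_no_partial_overlap and Tensor::unsafeGetTensorImpl entry points. The point is only that comparing the two TensorImpl pointers up front skips the nested overlap calls entirely when an op's output and input are literally the same tensor, as they are for the in-place benchmark above.

```cpp
// Minimal sketch of the early-exit idea from item (3); check_partial_overlap
// is a hypothetical helper, not the change that landed in TensorIterator.
#include <ATen/ATen.h>
#include <ATen/MemoryOverlap.h>

void check_partial_overlap(const at::Tensor& out, const at::Tensor& in) {
  // Same TensorImpl => same tensor object, so there is nothing to assert;
  // skip the nested calls that assert_no_partial_overlap would otherwise make.
  if (out.unsafeGetTensorImpl() == in.unsafeGetTensorImpl()) {
    return;
  }
  // Distinct impls: fall back to the full ATen partial-overlap assert.
  at::assert_no_partial_overlap(out, in);
}

int main() {
  at::Tensor m1 = at::empty({1});
  at::Tensor b = at::empty({1});

  check_partial_overlap(m1, m1);  // early exit: both operands share one TensorImpl
  check_partial_overlap(m1, b);   // distinct impls: the full check runs

  m1.add_(b);  // the in-place op used in the callgrind benchmark above
  return 0;
}
```

A pointer comparison is about as cheap as a check can get, while the full assert has to reason about storages and geometry, which is why hoisting the comparison out of the nested calls is visible at this instruction-count scale.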
Author: Natalia Gimelshein