Small fixes to reduce TensorIterator overhead for the common case of inputs and outputs of the same type (#27457)
Summary:
1) Short-circuits the common-type computation and type-promotion logic for the common case where the operands and the result all have the same type
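A minimal standalone sketch of the idea in item 1 (not the actual ATen code; the names `all_same_type` and `compute_common_type`, and the trimmed-down `ScalarType` enum, are hypothetical): when every operand already shares one type, the full promotion machinery can be skipped.

```cpp
#include <cassert>
#include <vector>

// Trimmed-down stand-in for ATen's scalar-type enum (hypothetical).
enum class ScalarType { Float, Double, Int };

// Returns true when all operands share one type, so promotion can be skipped.
bool all_same_type(const std::vector<ScalarType>& operand_types) {
  for (const auto& t : operand_types) {
    if (t != operand_types.front()) return false;
  }
  return true;
}

ScalarType compute_common_type(const std::vector<ScalarType>& operand_types) {
  // Fast path: all operands (and hence the result) have the same type.
  if (all_same_type(operand_types)) {
    return operand_types.front();
  }
  // The full type-promotion rules would run here; elided in this sketch.
  return ScalarType::Double;
}
```

The fast path turns an O(promotion-table-lookup) computation into a single linear scan over a handful of operand types, which is what makes it cheap enough for the common case.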
2) Improves the performance of the memory-overlap check by returning MemoryOverlap::FULL immediately when the two tensors are the same, and skips the check from TensorIterator entirely in that case
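A simplified sketch of the fast path in item 2 (hypothetical names; the real ATen check works on tensor storages and handles strided layouts): identical tensors trivially overlap fully, so the range comparison can be skipped.

```cpp
#include <cassert>
#include <cstdint>

enum class MemOverlap { NO, PARTIAL, FULL };

// A tensor reduced to its data pointer and byte extent (hypothetical stand-in).
struct TensorRef {
  const void* data;
  int64_t nbytes;
};

MemOverlap get_overlap(const TensorRef& a, const TensorRef& b) {
  // Fast path from item 2: the same tensor fully overlaps itself,
  // so return early instead of comparing memory ranges.
  if (a.data == b.data && a.nbytes == b.nbytes) {
    return MemOverlap::FULL;
  }
  const auto* a_begin = static_cast<const char*>(a.data);
  const auto* b_begin = static_cast<const char*>(b.data);
  // Disjoint byte ranges cannot overlap.
  if (a_begin + a.nbytes <= b_begin || b_begin + b.nbytes <= a_begin) {
    return MemOverlap::NO;
  }
  return MemOverlap::PARTIAL;
}

// Shared buffer used by the usage examples below.
static char kBuf[16];
```

In-place operations pass the same tensor as input and output, so this early return fires on every in-place call, which is exactly the hot path the PR targets.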
3) Changes the default inline size of DimVector from 5 to 6, so it does not need to be resized in the common case of a binary operation: the `strides` DimVector must hold at least 2*num_tensors elements, which is 6 for an operation with two inputs and one output
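The arithmetic behind item 3 can be sketched as follows (the constant and helper names are hypothetical, not the actual DimVector implementation): a small-vector type only avoids a heap allocation when the requested size fits its inline capacity.

```cpp
#include <cassert>
#include <cstddef>

// Inline capacity of the small vector, raised from 5 to 6 by this change
// (name hypothetical; DimVector is an LLVM-style SmallVector in ATen).
constexpr std::size_t kDimVectorInlineSize = 6;

// The `strides` vector needs at least 2 * num_tensors entries.
constexpr std::size_t strides_capacity_needed(std::size_t num_tensors) {
  return 2 * num_tensors;
}

// True when the strides vector fits the inline buffer, i.e. no resize
// (and no heap allocation) is needed.
constexpr bool fits_inline(std::size_t num_tensors) {
  return strides_capacity_needed(num_tensors) <= kDimVectorInlineSize;
}
```

With the old inline size of 5, a binary op (3 tensors, needing 6 entries) spilled to the heap every time; raising the capacity to 6 keeps that common case entirely on the stack.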
4) If `offset` is 0 (the common non-broadcasting case), don't fill the `strides` vector with zeros, because every element will subsequently be written.
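Item 4 can be illustrated with this sketch (function names are hypothetical, not the actual TensorIterator code): the leading `offset` entries of `strides` are zeroed only when broadcasting requires it; when `offset` is 0, the fill is skipped because the copy overwrites every element anyway.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Write `n` computed strides after `offset` zero entries (sketch).
void write_strides(int64_t* strides, std::size_t offset,
                   const int64_t* computed, std::size_t n) {
  if (offset != 0) {
    // Broadcasting case: the leading `offset` dimensions get stride 0.
    std::fill(strides, strides + offset, int64_t{0});
  }
  // offset == 0 is the common non-broadcasting case: no fill is needed,
  // since this copy writes every element of the destination.
  std::copy(computed, computed + n, strides + offset);
}

// Small helper for exercising the sketch with two computed strides {8, 4}.
std::array<int64_t, 4> demo(std::size_t offset) {
  std::array<int64_t, 4> out{};
  const int64_t computed[] = {8, 4};
  write_strides(out.data(), offset, computed, 2);
  return out;
}
```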
Combined, these changes reduce the overhead of a simple in-place operation from 1.02 us to 0.74 us.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27457
Test Plan: should be covered by existing tests
Differential Revision: D17784532
Pulled By: ngimel
fbshipit-source-id: e6a8ee58be5de14461bdbc2e2b0b6d16a96c309f
Author: Natalia Gimelshein