Make discontiguous tensors also benefit from unrolling (#34708)
Summary:
This is based on https://github.com/pytorch/pytorch/pull/33720. I didn't use a stacked diff because it is not very convenient for cherry-picking. Please review after https://github.com/pytorch/pytorch/issues/33720 is merged.
The benchmark shows up to ~10% improvement for half precision on an RTX 2080Ti:
https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-unroll-with-discontig-input.ipynb
We now have a `TrivialOffsetCalculator`, and the unroll strategy takes an input offset calculator and an output offset calculator as constructor arguments. When the tensors are known to be contiguous (for example, when the unroll strategy is used inside the vectorized kernel), the trivial offset calculator is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34708
Differential Revision: D20601566
Pulled By: ngimel
fbshipit-source-id: e20e38517efb31c8af5fc377538992a980ff4130