Reduce overheads on several CPU kernels by avoiding restrides. (#36875)
Summary:
Calling `t.as_strided(..., ...)` must create a new `TensorImpl` to back the new tensor, which takes 300-400 ns. Reduction, scatter/gather, and comparison kernels currently restride inputs and outputs in order to handle `dim` inside the function passed to TensorIterator. Because these tensors are created solely for consumption by the iterator, a full restride and metadata copy is unnecessary overhead. Moreover, these kernels already check shapes before calling `add_input` and `add_output`, so shape inference and broadcasting are also unnecessary.
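For intuition, the restride pattern looks roughly like this NumPy sketch (the helper name `max_along_dim` is invented for illustration, and NumPy's `as_strided` stands in for the ATen call). The restride is a pure metadata operation, yet it creates a fresh view object on every call even though the view is consumed immediately:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def max_along_dim(t, dim):
    """Reduce over `dim` by restriding it to the last position, so the
    inner reduction only ever has to handle the final axis."""
    perm = [d for d in range(t.ndim) if d != dim] + [dim]
    shape = tuple(t.shape[d] for d in perm)
    strides = tuple(t.strides[d] for d in perm)
    # Pure metadata operation, but it allocates new view metadata on
    # every call -- analogous to the fresh TensorImpl behind t.as_strided.
    restrided = as_strided(t, shape=shape, strides=strides)
    return restrided.max(axis=-1)
```

The result matches `t.max(axis=dim)`; the point is that the view exists only to normalize `dim` away before the inner loop runs, so its metadata is pure overhead.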
This PR adds a `TensorIterator::declare_static_shape(...)` method, which allows certain kernels to use a much more constrained and efficient shape path. This results in a 900-1200 ns speedup for `gather / scatter / scatter_add / cumsum / cumprod` and a 250-500 ns speedup for `min` and `max` reductions along a dimension.
Measurements were taken with [this Python script](https://gist.github.com/robieta/51ac5db2f9c7e812d5ff264403ce4f92), which is driven by [this bash script](https://gist.github.com/robieta/1420e917cf38885de3093f8c3a7bd437). The general procedure for mitigating environmental skew is to repeatedly switch between an environment built from master and one built from this branch while running the Python script. Within the measurement script, the following measures were used to reduce variation:
* Set the number of threads to 1.
* Aggressively and randomly interleave task measurements, to limit correlation between a task's timing and system state (when it ran, and which task preceded it).
* Apply a warmup period, dropping the first three passes through all of the tasks.
Two independent end-to-end runs are included, since there is some variation even with the above measures. Overall measurement error appears to be about ±100 ns.
The benchmark also includes several tasks which are not affected by this PR, both to check for a degradation in TensorIterator performance when static shapes are not set (which did happen in an earlier iteration of this optimization) and to estimate measurement variability, validating that the measured improvements are significant.
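The interleaving and warmup scheme described above can be sketched with stdlib Python (a simplified stand-in for the linked gist; `bench_interleaved` and its parameters are invented for illustration):

```python
import random
import statistics
import time

def bench_interleaved(tasks, passes=10, warmup=3):
    """tasks: list of (name, fn) pairs. Each pass runs every task once,
    in a fresh random order, so a task's timing is decorrelated from
    when it runs and from which task preceded it. The first `warmup`
    passes through all of the tasks are discarded."""
    samples = {name: [] for name, _ in tasks}
    for p in range(passes):
        order = list(tasks)
        random.shuffle(order)  # aggressive, random interleaving
        for name, fn in order:
            start = time.perf_counter_ns()
            fn()
            elapsed = time.perf_counter_ns() - start
            if p >= warmup:  # drop the warmup passes
                samples[name].append(elapsed)
    # Summarize each task as [25%, 50%, 75%] in ns, as in the tables below.
    return {name: statistics.quantiles(ts, n=4) for name, ts in samples.items()}
```

The real script additionally pins the thread count to 1 before timing, per the first bullet above.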
**First run:**
```
                      |   Delta (median)  |  Master (25%, 75%)   |  Branch (25%, 75%)
-------------------------------------------------------------------------------------
gather_1D             |        920        |  4,000 (-170, +230)  |  3,100 (-110, +140)
gather_dim0           |        910        |  4,100 (-170, +230)  |  3,200 (-110, +150)
gather_dim1           |      1,200        |  4,400 (-190, +240)  |  3,200 (-120, +150)
scatter_1D            |      1,100        |  2,800 (-120, +160)  |  1,700 (-64 , +81 )
scatter_dim0          |      1,000        |  2,900 (-130, +160)  |  1,900 (-72 , +95 )
scatter_dim1          |      1,200        |  3,200 (-130, +170)  |  1,900 (-67 , +87 )
scatter_add_1D        |      1,100        |  2,800 (-120, +150)  |  1,700 (-68 , +89 )
scatter_add_dim0      |      1,000        |  2,900 (-120, +150)  |  1,900 (-77 , +93 )
scatter_add_dim1      |      1,300        |  3,100 (-140, +180)  |  1,900 (-76 , +92 )
cumsum_1D             |      1,000        |  4,600 (-200, +260)  |  3,600 (-120, +170)
cumsum_dim0           |        860        |  4,500 (-190, +240)  |  3,700 (-140, +180)
cumsum_dim1           |      1,200        |  4,800 (-210, +260)  |  3,700 (-130, +180)
cumprod_1D            |      1,000        |  4,600 (-200, +270)  |  3,600 (-130, +170)
cumprod_dim0          |        910        |  4,600 (-210, +270)  |  3,700 (-130, +170)
cumprod_dim1          |      1,200        |  4,900 (-220, +290)  |  3,700 (-130, +170)
min_dim0              |        280        |  5,900 (-220, +270)  |  5,600 (-220, +260)
min_dim1              |        560        |  6,200 (-230, +310)  |  5,600 (-230, +270)
max_dim0              |        320        |  5,900 (-220, +280)  |  5,600 (-200, +250)
max_dim1              |        540        |  6,100 (-250, +310)  |  5,600 (-200, +250)
std (reference)       |         58        |  4,300 (-180, +280)  |  4,200 (-160, +200)
clamp (reference)     |         87        |  3,400 (-160, +220)  |  3,400 (-140, +170)
argmin (reference)    |        -85        |  3,900 (-170, +250)  |  4,000 (-170, +200)
sum (reference)       |        -11        |  4,200 (-180, +240)  |  4,200 (-160, +190)
x < y (reference)     |        110        |  3,700 (-170, +290)  |  3,500 (-140, +150)
max(x, y) (reference) |        170        |  3,600 (-170, +200)  |  3,400 (-140, +180)

* Times in nanoseconds.
* Deltas: positive is an improvement, negative is a regression.
```
**Second run:**
```
                      |   Delta (median)  |  Master (25%, 75%)   |  Branch (25%, 75%)
-------------------------------------------------------------------------------------
gather_1D             |        850        |  3,900 (-130, +150)  |  3,000 (-110, +130)
gather_dim0           |        860        |  4,000 (-140, +150)  |  3,200 (-110, +150)
gather_dim1           |      1,200        |  4,300 (-160, +160)  |  3,200 (-110, +150)
scatter_1D            |      1,100        |  2,700 (-98 , +110)  |  1,700 (-64 , +83 )
scatter_dim0          |        950        |  2,800 (-100, +110)  |  1,900 (-67 , +88 )
scatter_dim1          |      1,200        |  3,100 (-120, +140)  |  1,900 (-69 , +88 )
scatter_add_1D        |      1,100        |  2,700 (-92 , +110)  |  1,700 (-65 , +95 )
scatter_add_dim0      |        960        |  2,800 (-100, +100)  |  1,900 (-74 , +100)
scatter_add_dim1      |      1,200        |  3,100 (-100, +130)  |  1,900 (-72 , +100)
cumsum_1D             |        960        |  4,500 (-140, +190)  |  3,600 (-130, +170)
cumsum_dim0           |        820        |  4,500 (-140, +180)  |  3,700 (-130, +170)
cumsum_dim1           |      1,100        |  4,800 (-160, +200)  |  3,600 (-120, +170)
cumprod_1D            |        960        |  4,500 (-130, +190)  |  3,600 (-130, +180)
cumprod_dim0          |        820        |  4,500 (-150, +190)  |  3,700 (-130, +180)
cumprod_dim1          |      1,100        |  4,800 (-150, +220)  |  3,700 (-130, +180)
min_dim0              |        260        |  5,800 (-210, +250)  |  5,500 (-200, +230)
min_dim1              |        580        |  6,100 (-230, +270)  |  5,500 (-200, +220)
max_dim0              |        250        |  5,800 (-210, +230)  |  5,600 (-170, +210)
max_dim1              |        520        |  6,100 (-220, +240)  |  5,600 (-180, +210)
std (reference)       |        170        |  4,300 (-210, +220)  |  4,100 (-160, +190)
clamp (reference)     |        140        |  3,400 (-140, +170)  |  3,300 (-120, +170)
argmin (reference)    |        -51        |  3,800 (-170, +190)  |  3,900 (-140, +160)
sum (reference)       |        -58        |  4,100 (-160, +170)  |  4,200 (-170, +190)
x < y (reference)     |         64        |  3,600 (-150, +210)  |  3,500 (-140, +180)
max(x, y) (reference) |        120        |  3,500 (-130, +150)  |  3,400 (-130, +150)

* Times in nanoseconds.
* Deltas: positive is an improvement, negative is a regression.
```
CC ilia-cher VitalyFedyunin glaringlee gdankel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36875
Differential Revision: D21173011
Pulled By: robieta
fbshipit-source-id: 2067ab62f8f8d7b50e20a486a262864480699bbe