[Static Runtime] Native stack for contiguous inputs (#50863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50863
- Avoid calling unsqueeze on every input tensor by copying the input data directly into the output.
- Model benchmark shows a small improvement: -2.3% (batch size 1), -1.1% (batch size 20).
- This diff does not yet modify the torch::stack implementation, only the static_runtime path. A follow-up diff will do that.
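The direct-copy idea in the first bullet can be sketched as follows. This is a minimal illustration in plain Python, not the code from this diff: the function name `stack_contiguous` and the flat-list representation of tensors are assumptions made for the example. It assumes every input is contiguous (row-major) and has the same shape, so stacking along `dim` reduces to copying fixed-size runs into the output instead of unsqueezing each input and concatenating.

```python
from math import prod


def stack_contiguous(inputs, shape, dim):
    """Stack flat, row-major buffers that all share `shape` along `dim`.

    Hypothetical helper for illustration only; it mimics the copy pattern
    on Python lists rather than on real tensors.
    """
    n = len(inputs)
    outer = prod(shape[:dim])  # number of blocks before the new axis
    inner = prod(shape[dim:])  # contiguous run copied from each input
    out = [None] * (outer * n * inner)
    for o in range(outer):
        for i, buf in enumerate(inputs):
            dst = (o * n + i) * inner  # offset in the stacked output
            src = o * inner            # offset in the i-th input
            out[dst:dst + inner] = buf[src:src + inner]
    out_shape = list(shape[:dim]) + [n] + list(shape[dim:])
    return out, out_shape


a = [1, 2, 3, 4, 5, 6]     # shape (2, 3)
b = [7, 8, 9, 10, 11, 12]  # shape (2, 3)
stacked, stacked_shape = stack_contiguous([a, b], (2, 3), dim=1)
```

With `dim=0` the whole of each input is copied in one run, which is the cheapest case; larger `dim` values mean more, shorter copies.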
Test Plan:
# Test
```
buck test //caffe2/aten:native_test
buck run //caffe2/test:torch
```
# Op benchmark
No changes are expected here because this diff only touches the static runtime.
```
Baseline |Native |Change
6.38 |6.336 |-0.69%
6.553 |6.588 |0.53%
14.904 |14.883 |-0.14%
5.657 |5.68 |0.41%
5.612 |5.795 |3.26%
6.051 |6.058 |0.12%
4.225 |4.252 |0.64%
4.24 |4.294 |1.27%
6.28 |4.249 |-32.34%
6.267 |4.257 |-32.07%
418.932 |404.356 |-3.48%
417.694 |404.752 |-3.10%
1592.455 |1583.277 |-0.58%
2919.261 |2685.636 |-8.00%
211.458 |202.838 |-4.08%
211.518 |203.229 |-3.92%
783.953 |792.198 |1.05%
1457.823 |1348.824 |-7.48%
2032.816 |1975.961 |-2.80%
2090.662 |2000.612 |-4.31%
6487.098 |6635.41 |2.29%
11874.702 |10853.302 |-8.60%
2123.83 |2039.272 |-3.98%
2195.453 |2221.82 |1.20%
6435.978 |6593.363 |2.45%
11852.205 |10858.92 |-8.38%
2036.526 |1983.042 |-2.63%
2055.618 |2072.03 |0.80%
6417.192 |6681.064 |4.11%
12468.744 |10888.336 |-12.67%
4959.704 |4954.734 |-0.10%
5121.823 |4996.84 |-2.44%
5082.105 |5029.652 |-1.03%
5395.936 |5438.628 |0.79%
5162.756 |5114.147 |-0.94%
23798.08 |21884.065 |-8.04%
4957.921 |4972.01 |0.28%
4971.234 |4968.977 |-0.05%
5005.909 |5039.95 |0.68%
5159.614 |5180.426 |0.40%
5013.221 |5202.684 |3.78%
20238.741 |20212.581 |-0.13%
7632.439 |7610.345 |-0.29%
7589.376 |7679.148 |1.18%
7859.937 |7850.485 |-0.12%
8214.213 |8150.846 |-0.77%
11606.562 |11724.139 |1.01%
34612.919 |34817.677 |0.59%
```
# Adindexer model benchmark
```
caffe2=0 batch={1|20} profile=1 ./scripts/bwasti/static_runtime/run.sh
```
## Baseline
```
Batch 1
0.00291311 ms. 3.97139%. aten::stack (1 nodes)
Batch 20
0.00477447 ms. 0.934081%. aten::stack (1 nodes)
```
## Native stack (this change)
```
Batch 1
0.00115161 ms. 1.67388%. aten::stack (1 nodes)
Batch 20
0.00264831 ms. 0.543767%. aten::stack (1 nodes)
```
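For reference, the relative change in `aten::stack` time between the two profiles can be computed from the figures above. The helper name `relative_change_pct` is an assumption for this sketch; the formula `(new - old) / old` is consistent with the Change column in the op benchmark table (for example, 6.28 to 4.249 gives -32.34%).

```python
def relative_change_pct(baseline, new):
    """Percentage change from `baseline` to `new`; negative means faster."""
    return (new - baseline) / baseline * 100.0


# aten::stack time per iteration in ms, copied from the profiles above
batch1 = relative_change_pct(0.00291311, 0.00115161)
batch20 = relative_change_pct(0.00477447, 0.00264831)
print(f"batch 1: {batch1:.1f}%, batch 20: {batch20:.1f}%")
```

By this measure the op itself takes roughly 60% less time at batch 1 and roughly 45% less at batch 20. Its share of total model time is small, which fits the modest end-to-end improvement reported in the summary.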
Reviewed By: hlu1
Differential Revision: D25988638
fbshipit-source-id: 82ce84c88963cae40dc5819004baf03ce9093ecc
Author: Marat Subkhankulov