don't unsqueeze every stack arg if possible (#70288)
Summary:
Fixes T98738497
When possible, implement `stack` as `cat` followed by `view` instead of unsqueezing every arg. This helps perf when `stack` is called with a lot of small arguments.
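For illustration only, the idea can be sketched in Python. This is not the actual change in this PR; `stack_via_cat` is a hypothetical helper name, and the condition for when the fast path applies (stack dim strictly inside the input's existing dims) is an assumption of this sketch:

```python
import torch

def stack_via_cat(tensors, dim=0):
    # Hypothetical helper, not a PyTorch API: emulates torch.stack.
    ndim = tensors[0].dim()
    # The result has ndim + 1 dims, so negative dims wrap against that rank.
    wrapped = dim if dim >= 0 else dim + ndim + 1
    if wrapped < ndim:
        # Fast path (assumed condition): concatenate along an existing dim,
        # then split that dim into (num_tensors, original_size) with a view.
        shape = list(tensors[0].shape)
        assert all(list(t.shape) == shape for t in tensors), "shapes must match"
        shape.insert(wrapped, len(tensors))
        return torch.cat(tensors, dim=wrapped).view(shape)
    # Fallback: the new dim goes last, so unsqueeze every arg as before.
    return torch.cat([t.unsqueeze(wrapped) for t in tensors], dim=wrapped)
```

The fast path makes one `cat` call plus a metadata-only `view`, rather than creating one unsqueezed tensor per argument first.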
Benchmark:
```
import torch
from torch.utils.benchmark import Timer
inputs = [torch.randn([1, 128]) for _ in range(500)]
out = torch.empty(1,500,128)
def stack_cat(inputs):
    cat_result = torch.concat(inputs, dim=1)
    return cat_result.view([1, 500, 128])
timer_stack = Timer(stmt="torch.stack(inputs, dim=1)", globals=globals())
timer_cat = Timer(stmt="stack_cat(inputs)", globals=globals())
print("stack ", timer_stack.blocked_autorange().median)
print("cat ", timer_cat.blocked_autorange().median)
```
Before:
```
stack 0.00023390522226691247
cat 7.437262553721667e-05
```
After:
```
stack 7.397504318505526e-05
cat 7.37407322973013e-05
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70288
Reviewed By: robieta, mruberry
Differential Revision: D33289789
Pulled By: ngimel
fbshipit-source-id: b57dcb8ec66e767f552c08deeba330f31ae6c3d0