[SR] Optimize VarStack (#68750)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68750
There was some room for optimization in static runtime's `prim::VarStack`:
* Avoid refcount bumps: constructing the `std::vector<at::Tensor>` copies every input tensor handle, which can be avoided by writing a custom version of `stack_out` that takes a `std::vector<at::Tensor*>`
* Skip the memory overlap check
* Avoid device dispatcher overhead in a few places (e.g. call `at::native::unsqueeze` directly instead of `tensor.unsqueeze`)
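
As a rough illustration of the refcount point only (this is not the code from this PR), the sketch below uses `std::shared_ptr` as a stand-in for `at::Tensor`'s reference-counted handle. The names `FakeTensor`, `make_inputs`, `max_refcount_with_copies`, and `max_refcount_with_pointers` are hypothetical and exist only for this example. It assumes each input starts with exactly one owner.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for at::Tensor: a refcounted handle to a buffer.
using FakeTensor = std::shared_ptr<std::vector<float>>;

// Build n inputs, each uniquely owned (use_count() == 1).
std::vector<FakeTensor> make_inputs(std::size_t n) {
  std::vector<FakeTensor> inputs;
  inputs.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    inputs.push_back(std::make_shared<std::vector<float>>(4, 0.0f));
  }
  return inputs;
}

// Copying the handles into a new vector bumps every refcount once.
// Returns the largest use_count observed while the copies are alive.
long max_refcount_with_copies(const std::vector<FakeTensor>& inputs) {
  std::vector<FakeTensor> copies(inputs.begin(), inputs.end());
  long max_count = 0;
  for (const auto& t : copies) {
    if (t.use_count() > max_count) max_count = t.use_count();
  }
  return max_count;
}

// Collecting pointers to the existing handles leaves refcounts untouched.
// Returns the largest use_count observed through the pointer vector.
long max_refcount_with_pointers(const std::vector<FakeTensor>& inputs) {
  std::vector<const FakeTensor*> ptrs;
  ptrs.reserve(inputs.size());
  for (const auto& t : inputs) ptrs.push_back(&t);
  long max_count = 0;
  for (const FakeTensor* t : ptrs) {
    if (t->use_count() > max_count) max_count = t->use_count();
  }
  return max_count;
}
```

In this sketch the copying variant pays one atomic increment and one decrement per input, while the pointer variant pays none. The pointer vector is only valid while the original handles stay alive, which is the trade-off a pointer-taking `stack_out` would also have to respect.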
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack`
Reviewed By: swolchok
Differential Revision: D32596934
fbshipit-source-id: e8f0ccea37c48924cb4fccbfdac4e1e11da95ee0