[PyTorch] Avoid refcount bumps in addmm_out_cuda_impl (#54935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54935
Eliminates a bunch of avoidable copies of Tensor objects, each of which results in a refcount bump.
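The gist of the change is sketched below in C++ (an illustrative example with hypothetical helper names, not the actual diff from addmm_out_cuda_impl): at::Tensor is a reference-counted handle to a TensorImpl, so passing or binding it by value copies the handle and bumps the refcount, whereas binding a const reference does not.
```
// Illustrative sketch only (hypothetical helpers, not the actual PR diff).
#include <ATen/ATen.h>

// Takes the Tensor by value: one refcount bump on entry, one drop on exit.
static int64_t rows_by_value(at::Tensor t) {
  return t.size(0);
}

// Takes the Tensor by const reference: no copy, no refcount traffic.
static int64_t rows_by_ref(const at::Tensor& t) {
  return t.size(0);
}

int main() {
  at::Tensor mat = at::randn({16, 1024});
  // Both calls return the same value; only the by-reference version avoids
  // constructing and destroying a temporary Tensor handle.
  return rows_by_value(mat) == rows_by_ref(mat) ? 0 : 1;
}
```
In a hot per-call path like addmm_out_cuda_impl, these copies add up, which is what the perf numbers in the Test Plan measure.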
ghstack-source-id: 125216023
Test Plan:
Compared the percentage of self time spent in addmm_out_cuda_impl, as reported by `perf record`, while running the following sample:
```
import torch
import torch.nn as nn

# Run a half-precision Linear layer in a tight loop so addmm_out_cuda_impl
# shows up prominently in the profile.
m = nn.Linear(1024, 1024).cuda().half()
x = torch.randn(16, 1024).cuda().half()
while True: y = m(x)
```
The self time decreased from 0.74% to 0.56%.
Reviewed By: ngimel
Differential Revision: D27420388
fbshipit-source-id: d2c5e4c4899cd02c60c45735b2d72c4ed913f6e8