Eliminate device guard in generic dispatch key kernel wrappers (#55131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55131
Benchmark `zeros_out`:
```python
from torch.utils.benchmark import Timer
counts = Timer(
    stmt="""at::zeros_out(t, {1});""",
    setup="auto t = at::empty({1});",
    language="cpp",
).collect_callgrind(number=1_000)
print(counts)
```
With device guard:
```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f834f095ca0>
at::zeros_out(t, {1});
setup: auto t = at::empty({1});
                           All          Noisy symbols removed
    Instructions:      1396022                    1396022
    Baseline:                0                          0
1000 runs per measurement, 1 thread
```
Without device guard:
```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f25e48927c0>
at::zeros_out(t, {1});
setup: auto t = at::empty({1});
                           All          Noisy symbols removed
    Instructions:      1296022                    1296022
    Baseline:                0                          0
1000 runs per measurement, 1 thread
```
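Removing the device guard saves about 100,000 instructions over 1,000 runs (roughly 100 per call). That is a `7.2%` reduction in instruction count; equivalently, the guarded version executes about `7.7%` more instructions.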
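The percentage depends on which count is taken as the baseline. A minimal sketch of the arithmetic, using the two instruction counts reported above (the variable names are illustrative and not part of the PR):

```python
# Instruction counts copied from the two Callgrind reports above (1,000 runs each).
with_guard = 1_396_022
without_guard = 1_296_022
runs = 1_000

saved = with_guard - without_guard
saved_per_call = saved / runs

# Reduction relative to the old (guarded) count.
reduction_pct = 100 * saved / with_guard
# Extra work done by the guarded version relative to the new count.
overhead_pct = 100 * saved / without_guard

print(f"{saved} instructions saved ({saved_per_call:.0f} per call)")
print(f"reduction: {reduction_pct:.1f}%, guard overhead: {overhead_pct:.1f}%")
```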
ghstack-source-id: 126295368
Test Plan:
```
buck build //caffe2/aten/...
buck test mode/dev mode/no-gpu //caffe2/test:torch -- 'caffe2/test:torch - test_msnpu_error (test_torch.TestTorch)'
```
Reviewed By: ezyang
Differential Revision: D27496584
fbshipit-source-id: 97f783a809b77b28f77a93096d69b3da9ee69df7