One more small Perf Tweak to fill_ (#110294)
# Summary
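Perf win for cross-device `fill_` calls, obtained by checking which device each tensor is on. Timings below are in microseconds; each label reads `<device of the filled tensor> | <device of the fill value>`.
## Before this PR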
``` Shell
CPU | CPU: 1.3328152848407626
GPU | GPU: 6.614773320034146
CPU | GPU: 29.027153505012393
GPU | CPU: 17.22372299991548
```
## After this PR
``` Shell
CPU | CPU: 1.4241038949694484
GPU | GPU: 7.060713530518115
CPU | GPU: 15.149936103262007
GPU | CPU: 5.774620908778161
```
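For a quicker read of the two tables, the per-case ratio can be computed directly from the numbers reported above. This is only a convenience sketch: the `speedups` helper is hypothetical and not part of the PR, and the values are assumed to be microseconds per call, as the name of the benchmark helper in the repro script suggests.

```python
# Timings copied verbatim from the "Before" and "After" tables above
# (assumed to be microseconds per call).
before = {
    "CPU | CPU": 1.3328152848407626,
    "GPU | GPU": 6.614773320034146,
    "CPU | GPU": 29.027153505012393,
    "GPU | CPU": 17.22372299991548,
}
after = {
    "CPU | CPU": 1.4241038949694484,
    "GPU | GPU": 7.060713530518115,
    "CPU | GPU": 15.149936103262007,
    "GPU | CPU": 5.774620908778161,
}


def speedups(before, after):
    """Hypothetical helper: ratio before/after per case (>1 means faster after)."""
    return {case: before[case] / after[case] for case in before}


for case, ratio in speedups(before, after).items():
    print(f"{case}: {ratio:.2f}x")
```

With these numbers the cross-device cases come out at about 1.9x (CPU | GPU) and just under 3x (GPU | CPU), while both same-device cases land at about 0.94x.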
## Repro Script
``` Python
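import torch

# NOTE: benchmark_torch_function_in_microseconds is a timing helper that is not
# defined in this snippet. The GPU cases require a CUDA device.

# CPU tensor filled from a CPU scalar tensor
a = torch.tensor([0.2, 0.5], device="cpu")
amax = torch.tensor(0.5, device="cpu")
print(f"CPU | CPU: {benchmark_torch_function_in_microseconds(torch.fill_, a, amax)}")
# GPU tensor filled from a GPU scalar tensor
a = torch.tensor([0.2, 0.5], device="cuda")
amax = torch.tensor(0.5, device="cuda")
print(f"GPU | GPU: {benchmark_torch_function_in_microseconds(torch.fill_, a, amax)}")
# CPU tensor filled from a GPU scalar tensor (cross-device)
a = torch.tensor([0.2, 0.5], device="cpu")
amax = torch.tensor(0.5, device="cuda")
print(f"CPU | GPU: {benchmark_torch_function_in_microseconds(torch.fill_, a, amax)}")
# GPU tensor filled from a CPU scalar tensor (cross-device)
a = torch.tensor([0.2, 0.5], device="cuda")
amax = torch.tensor(0.5, device="cpu")
print(f"GPU | CPU: {benchmark_torch_function_in_microseconds(torch.fill_, a, amax)}")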
```
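The repro script calls `benchmark_torch_function_in_microseconds`, which is not defined in this description. A minimal sketch of such a helper, assuming it wraps `torch.utils.benchmark.Timer` (the body here is an assumption, not necessarily the definition used to produce the numbers above):

```python
import torch
import torch.utils.benchmark as benchmark


def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
    """Return the mean runtime of f(*args, **kwargs) in microseconds.

    Assumed implementation; the PR does not show the helper it used.
    """
    timer = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"f": f, "args": args, "kwargs": kwargs},
    )
    # blocked_autorange() chooses the number of runs itself and returns a
    # Measurement whose .mean is the per-call time in seconds.
    measurement = timer.blocked_autorange()
    return measurement.mean * 1e6
```

Define it before running the script. Absolute numbers will differ across machines and PyTorch builds, so only the before/after comparison on the same setup is meaningful.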
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110294
Approved by: https://github.com/mikaylagawarecki