[fix] mul chalf: cpu scalar and cuda tensor case
https://github.com/pytorch/pytorch/pull/76158 added `chalf` support for `mul` on CUDA incorrectly.
Following sample
```python
torch.mul(torch.zeros(3, device='cuda'), 2.5) # CUDA Tensor and CPU Scalar
```
fails with
```
RuntimeError: iter.device(arg).is_cuda() INTERNAL ASSERT FAILED at "../aten/src/ATen/native/cuda/JitLoops.cuh":83, please report a bug to PyTorch. argument 2: expected a CUDA device but found cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76364
Approved by: https://github.com/mruberry