CUDA `addcmul` and `addcdiv` do math in float for 16 bits I/O (#60715)
Summary:
Currently foreach `addcmul` and `addcdiv` cast scalar to float so that actual math is done in FP32 when tensor dtype is Float16/BFloat16 while regular `addcmul` and `addcdiv`, not.
### Reproducible steps to see the behavioral difference
```ipython
In [1]: import torch; torch.__version__
Out[1]: '1.9.0'
In [2]: a, b, c = torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([-1.0], device='cuda', dtype=torch.half)
In [4]: torch.addcmul(a, b, c, value=2)
Out[4]: tensor([-inf], device='cuda:0', dtype=torch.float16)
In [5]: torch._foreach_addcmul([a], [b], [c], value=2)[0]
Out[5]: tensor([-60000.], device='cuda:0', dtype=torch.float16)
```
### How foreach casts?
Foreach addcmul and addcdiv cast scalar to `opmath_t` (almost equivalent to acc_type) here: https://github.com/pytorch/pytorch/blob/42c8439b6eaccf175cceaa820452583e2459a521/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu#L30 and cast inputs and results here:
https://github.com/pytorch/pytorch/blob/42c8439b6eaccf175cceaa820452583e2459a521/aten/src/ATen/native/cuda/ForeachFunctors.cuh#L133-L135
Related to https://github.com/pytorch/pytorch/issues/58833 #60227 https://github.com/pytorch/pytorch/issues/60454
cc ptrblck mcarilli ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60715
Reviewed By: albanD
Differential Revision: D29385715
Pulled By: ngimel
fbshipit-source-id: 8bb2db19ab66fc99d686de056a6ee60f9f71d603