Use `accscalar_t` for CUDA add/sub with Tensor and Scalar (#60454)
Summary:
Follow up of https://github.com/pytorch/pytorch/issues/60227, related to https://github.com/pytorch/pytorch/issues/59907 & https://github.com/pytorch/pytorch/issues/58833
With this pull request, the CUDA kernels for `torch.add` & `torch.sub` cast the `Scalar` argument to `acc_type` when one of the two arguments is a `Scalar`, so the arithmetic runs in the accumulation type rather than `scalar_t`.
This mimics the behavior of [`torch.mul`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu#L18), `torch._foreach_(add|sub).Scalar` and `torch._foreach_(add|sub).ScalarList`.
---
**reference**
- torch.mul CUDA kernel: https://github.com/pytorch/pytorch/blob/b0c9762e2d1dfcde549344628ad6be063378ef6a/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu#L17-L25
- `torch._foreach_(add|sub).Scalar`: casts the scalar at https://github.com/pytorch/pytorch/blob/b0c9762e2d1dfcde549344628ad6be063378ef6a/aten/src/ATen/native/cuda/ForeachBinaryOpScalar.cu#L27
- `torch._foreach_(add|sub).ScalarList`: `BinaryOpScalarListFunctor` (https://github.com/pytorch/pytorch/blob/b0c9762e2d1dfcde549344628ad6be063378ef6a/aten/src/ATen/native/cuda/ForeachFunctors.cuh#L180-L182) loads `scalar_t` values and computes in `opmath_t`, which is nearly equivalent to `accscalar_t`, inside `multi_tensor_apply` (https://github.com/pytorch/pytorch/blob/b0c9762e2d1dfcde549344628ad6be063378ef6a/aten/src/ATen/native/cuda/MultiTensorApply.cuh#L60-L68). `BinaryOpScalarListFunctor` is used at https://github.com/pytorch/pytorch/blob/b0c9762e2d1dfcde549344628ad6be063378ef6a/aten/src/ATen/native/cuda/ForeachBinaryOpScalarList.cu#L24
cc ngimel ptrblck mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60454
Reviewed By: VitalyFedyunin
Differential Revision: D29345035
Pulled By: ngimel
fbshipit-source-id: 5dbafbdfe029a9544ec2e58f17d547928e017a04