fix mul/div overflow issue on CPU float16 (#98820)
Fix https://github.com/pytorch/pytorch/issues/63482 and https://github.com/pytorch/pytorch/issues/98691
The above two issues have the same root cause:
**binary_ops** will create TensorIterator with the flag `promote_inputs_to_common_dtype` on, which will convert both input tensors to the common_dtype_ (the logic is bypassed on CUDA), which might overflow on Half. If one of the inputs is a scalar with abs value larger than ~65000, it will overflow.
This patch will try to fetch the scalar value from the `original_tensor_base` which records the original scalar input value, then in the `cpu_kernel_vec` the TensorIterator is treated as an unary Op.
So previously, CPU and CUDA would have different behaviors for such scenario. This is aligned with this patch, test cases added for both CPU and CUDA device.
The following is the results:
#### before:
```
>>> torch.tensor([3388.], dtype=torch.half).div(524288.0)
tensor([0.], dtype=torch.float16)
>>> torch.tensor([0.01], dtype=torch.float16) * torch.tensor(65536, dtype=torch.float32)
tensor([inf], dtype=torch.float16)
```
#### after:
```
>>> torch.tensor([3388.], dtype=torch.half).div(524288.0)
tensor([0.0065], dtype=torch.float16)
>>> torch.tensor([0.01], dtype=torch.float16) * torch.tensor(65536, dtype=torch.float32)
tensor([655.5000], dtype=torch.float16)
```
Also need to update `RRelu` implementation, to use float to store the intermediate results, otherwise the following test case would fail:
```
. build/bin/test_api --gtest_filter=ModulesTest.RReLU
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98820
Approved by: https://github.com/jgong5, https://github.com/ngimel