torch._numpy: keep f16 CUDA tensors in f16 where possible (#107768)
Restrict the workarounds for _RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'_ to CPU tensors, and do computations on CUDA tensors directly in f16.
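A minimal sketch of the pattern, for illustration only: `matmul_keep_f16` is a hypothetical helper and not the actual torch._numpy code. It assumes both operands share a dtype and device, and upcasts f16 to f32 only when the operands live on the CPU.

```python
import torch


def matmul_keep_f16(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: upcast f16 operands to f32 only on CPU."""
    needs_upcast = a.dtype == torch.float16 and a.device.type == "cpu"
    if needs_upcast:
        # CPU kernels may lack Half support, so compute in f32 and cast back.
        return torch.matmul(a.float(), b.float()).to(torch.float16)
    # CUDA tensors (and non-f16 inputs) are computed in their own dtype.
    return torch.matmul(a, b)
```

With this shape, calling the helper on `a.to("cuda")` and `b.to("cuda")` skips the upcast, so the computation stays in f16 on the device.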
Fixes https://github.com/Quansight-Labs/numpy_pytorch_interop/issues/170
We do not systematically test CUDA tensors in torch._numpy, so I only spot-checked locally that the affected functions work with `tensor.to("cuda")`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107768
Approved by: https://github.com/lezcano