Add support for logaddexp(float16) in CUDA and implement its reference (#91869)
The reference is implemented so that it generates efficient and
numerically stable Triton code. Stability matters here because the
naive `log(exp(a) + exp(b))` overflows easily in float16's narrow range.
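
For illustration, here is a minimal sketch of the standard stable formulation (not necessarily the exact reference added in this PR, and `logaddexp_sketch` is a hypothetical name): factor out the larger operand so the remaining exponent is always non-positive and `exp` cannot overflow.

```python
import torch

def logaddexp_sketch(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # log(exp(a) + exp(b)) = max(a, b) + log1p(exp(-|a - b|))
    # The exponent -|a - b| is <= 0, so exp() never overflows,
    # even in float16.
    # Note: a complete reference would also special-case infinite
    # inputs (e.g. a == b == -inf yields nan here instead of -inf).
    m = torch.maximum(a, b)
    return m + torch.log1p(torch.exp(-torch.abs(a - b)))

x = torch.tensor([100.0, -100.0], dtype=torch.float16)
y = torch.tensor([100.0, -100.0], dtype=torch.float16)
# Finite results (~[100.7, -99.3]); the naive form would overflow to inf
# at x[0] because exp(100) far exceeds float16's max of 65504.
print(logaddexp_sketch(x, y))
```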
Fixes https://github.com/pytorch/pytorch/issues/91683
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91869
Approved by: https://github.com/ngimel