Use opmath_type for CUDA logcumsumexp (#83425)
This improves precision by reducing the number of narrowing
conversions, and also reduces compile time from 2m 30s to 1m 25s
on my machine.
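
For illustration only, here is a minimal host-side C++ sketch of the idea,
not the actual CUDA kernel. The trait `OpMath` / `opmath_t` is a hypothetical
stand-in for `at::opmath_type`, and it widens float to double only because
standard C++ has no half type. The helper names `log_add_exp` and
`logcumsumexp` are likewise assumptions of this sketch. The running value
stays in the wider type and is narrowed once per stored output, rather than
being narrowed after every intermediate operation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for at::opmath_type: maps a storage type to the
// (possibly wider) type used for intermediate math. Assumption of this
// sketch: float is widened to double, since standard C++ lacks a half type.
template <typename T> struct OpMath { using type = T; };
template <> struct OpMath<float> { using type = double; };
template <typename T> using opmath_t = typename OpMath<T>::type;

// Numerically stable log(exp(a) + exp(b)), computed in the argument type.
template <typename T>
T log_add_exp(T a, T b) {
  T hi = a > b ? a : b;
  T lo = a > b ? b : a;
  if (std::isinf(hi)) return hi;  // avoids inf - inf producing NaN
  return hi + std::log1p(std::exp(lo - hi));
}

// Sequential scan: the accumulator lives in opmath_t<T>; each output is
// narrowed back to T exactly once, when it is stored.
template <typename T>
std::vector<T> logcumsumexp(const std::vector<T>& x) {
  using acc_t = opmath_t<T>;
  std::vector<T> out(x.size());
  if (x.empty()) return out;
  acc_t acc = static_cast<acc_t>(x[0]);
  out[0] = static_cast<T>(acc);
  for (std::size_t i = 1; i < x.size(); ++i) {
    acc = log_add_exp(acc, static_cast<acc_t>(x[i]));
    out[i] = static_cast<T>(acc);
  }
  return out;
}
```

With an all-zero input of length n, the i-th output (0-based) should be
approximately log(i + 1).
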
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83425
Approved by: https://github.com/ngimel