Improve accuracy of logsoftmax computation on CUDA (#38945)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38839. Previously, if the magnitude of the input values was large, the `log(sum)` term was essentially lost to rounding when computing `max+log(sum)`. Now the result is computed as `x-max-log(sum)`, which has a better chance of preserving accuracy because `x-max` is small before `log(sum)` is subtracted.
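The effect of the reordering can be illustrated with a minimal pure-Python sketch. This is not the CUDA kernel: it uses float64 rather than the kernel's data types, and the function names `log_softmax_old` and `log_softmax_new` are hypothetical, chosen only to contrast the two orderings described above.

```python
import math

def log_softmax_old(xs):
    # Earlier ordering (illustrative): form max + log(sum) first,
    # then subtract that offset from each input.
    m = max(xs)
    log_sum = math.log(sum(math.exp(x - m) for x in xs))
    offset = m + log_sum  # log_sum can be rounded away when |m| is large
    return [x - offset for x in xs]

def log_softmax_new(xs):
    # New ordering (illustrative): compute x - max first, which is small,
    # then subtract log(sum).
    m = max(xs)
    log_sum = math.log(sum(math.exp(x - m) for x in xs))
    return [(x - m) - log_sum for x in xs]

# Two equal inputs of large magnitude; the exact answer is -log(2) for both.
xs = [1e16, 1e16]
old = log_softmax_old(xs)
new = log_softmax_new(xs)
print(old)  # the log(sum) term is lost
print(new)  # close to -log(2)
```

In this sketch the old ordering loses `log(sum)` entirely because the spacing between adjacent float64 values near `1e16` is larger than `log(2)`; in lower-precision types the same loss would presumably appear at much smaller input magnitudes.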
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38945
Differential Revision: D21712483
Pulled By: ngimel
fbshipit-source-id: c1a3599ed981ba7a7fd130cbd7040a706b7eace0