Accumulate 16-bit float sums in 32-bit accumulators (#60387)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60387
Fixes gh-59489
Using 32-bit accumulators is a win-win: precision improves, and so does
performance, since the half-precision types had to be converted to 32-bit
float and back to do the arithmetic anyway.
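A minimal sketch of the precision effect (illustrative only, not code from this PR): summing many small fp16 values. With fp16 accumulation, the running sum stalls once an addend is smaller than half an ulp of the partial sum; accumulating in fp32 avoids this.

```python
import torch

# 10,000 copies of ~0.1 in half precision; the exact sum is ~999.76.
x = torch.full((10000,), 0.1, dtype=torch.float16)

# Reference computed entirely in float32.
ref = x.float().sum()

# fp16 sum: with 32-bit accumulators, the intermediate arithmetic is
# done in float32, so only the final cast back to fp16 loses precision.
out = x.sum()

print(ref, out)
```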
Note that for multi-threaded or discontiguous sums, partial sums may be
stored in the output buffer, where they are necessarily truncated to 16 bits.
Fixing this would require a rework of TensorIterator reductions.
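A rough simulation of that remaining caveat (hypothetical sketch, not the TensorIterator code path): each chunk accumulates in fp32, but its partial result is rounded to fp16 when written to the output before the final combine.

```python
import torch

x = torch.full((10000,), 0.1, dtype=torch.float16)

# Single pass with fp32 accumulation throughout (the fixed path):
full = x.float().sum().half()

# Chunked reduction, e.g. one chunk per thread: each partial sum is
# truncated to fp16 at the hand-off, losing a little precision.
partials = [c.float().sum().half() for c in x.chunk(8)]
chunked = torch.stack(partials).float().sum().half()

print(full, chunked)
```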
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D29447187
Pulled By: ngimel
fbshipit-source-id: d0619e0ca2fe116d101460142b79ca56fd6d0840