Fix permuted sum precision issue for lower precision on CPU (#108559)
Fixes #83149
The issue stems from a limitation of `TensorIterator` reductions:
A non-permuted input tensor is coalesced down to a 2-d tensor by `TensorIterator`, whereas a permuted input may remain a >2-d operation (for example, two reduced dimensions and one non-reduced dimension).
Since the CPU reduction loop of `TensorIterator` only operates on two dimensions at a time, the intermediate sums in the >2-d case are truncated to the lower-precision dtype, which loses accuracy relative to the non-permuted sum.
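The effect of truncating intermediate sums can be illustrated with a small pure-Python model. This is only a sketch and not the actual `TensorIterator` code: it assumes half precision (via the `struct` format `e`) as a stand-in for the lower-precision dtype, and the helper names `to_half`, `sum_round_once` and `sum_round_per_row` are hypothetical.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sum_round_once(rows):
    """Accumulate in double precision and round to half once at the end."""
    return to_half(math.fsum(v for row in rows for v in row))


def sum_round_per_row(rows):
    """Round the running total to half after every row.

    This models intermediate sums being truncated to lower precision.
    """
    total = 0.0
    for row in rows:
        total = to_half(total + math.fsum(row))
    return total


# Hypothetical input: 64 x 64 copies of the half-precision value nearest 0.01.
value = to_half(0.01)
rows = [[value] * 64 for _ in range(64)]

exact = math.fsum(v for row in rows for v in row)
once = sum_round_once(rows)
per_row = sum_round_per_row(rows)

print("exact     :", exact)
print("round once:", once, "error", abs(once - exact))
print("per row   :", per_row, "error", abs(per_row - exact))
```

Rounding once at the end yields the half-precision value closest to the exact sum, so its error can never exceed that of the per-row variant. The per-row variant may pick up an additional rounding error at every step.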
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108559
Approved by: https://github.com/mingfeima, https://github.com/peterbell10