Use cascade-summation to improve nansum accuracy (#61082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61082
Fixes #59415
This implements nansum as a new `LoadPolicy` for the existing sum functions,
so nansum now uses the same, more accurate cascade-sum algorithm as sum.
I've also expanded `test_nansum` to cover the four special cases of the sum
algorithm (inner/outer reduction; vectorized or scalar).
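As a rough illustration of the idea (not the actual ATen C++ code), the sketch below shows a cascade (pairwise) sum whose element loads go through a pluggable policy. All names here (`cascade_sum`, `load_identity`, `load_nan_as_zero`, and the `block` size) are hypothetical and chosen for this example only; the real kernels are vectorized and structured differently.

```python
import math


def load_identity(x):
    """Hypothetical load policy for plain sum: use the value as-is."""
    return x


def load_nan_as_zero(x):
    """Hypothetical load policy for nansum: a NaN contributes 0 to the sum."""
    return 0.0 if math.isnan(x) else x


def cascade_sum(values, load=load_identity, block=8):
    """Pairwise (cascade) summation.

    Short runs are summed naively; longer runs are split in half and the
    two partial sums are added, which keeps rounding error growth much
    lower than a single running accumulator. The block size of 8 is an
    arbitrary choice for this sketch.
    """
    n = len(values)
    if n <= block:
        total = 0.0
        for v in values:
            total += load(v)
        return total
    mid = n // 2
    left = cascade_sum(values[:mid], load, block)
    right = cascade_sum(values[mid:], load, block)
    return left + right


def nansum(values):
    """nansum expressed as the same cascade sum with a different load policy."""
    return cascade_sum(values, load=load_nan_as_zero)
```

For example, `nansum([1.0, float("nan"), 2.0])` returns `3.0`, while `cascade_sum` on the same input with the default policy propagates the NaN. Only the load step differs between the two, which is why both can share one summation routine.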
Nansum performance comparison
-----------------------------
For float sums, contiguous reductions are as much as 10x faster and discontiguous sums are ~1.8x faster (more for small shapes due to TensorIterator overheads).
| Shape | Dim | Master Contiguous (us) | This PR Contiguous (us) | Master Discontiguous (us) | This PR Discontiguous (us) |
|-------------:|-----|:----------------------:|:-----------------------:|:-------------------------:|:--------------------------:|
| 10, 1000 | 0 | 74.9 | 2.02 | 75.6 | 6.41 |
| | 1 | 8.24 | 1.8 | 8.28 | 5.24 |
| 100, 1000 | 0 | 134 | 7.55 | 130 | 43.2 |
| | 1 | 70.5 | 7.01 | 71.5 | 40.6 |
| 1000, 1000 | 0 | 726 | 69.2 | 737 | 403 |
| | 1 | 702 | 51.0 | 709 | 404 |
| 10000, 1000 | 0 | 15,300 | 2,470 | 18,200 | 10,400 |
| | 1 | 7,200 | 1,160 | 7,470 | 4,440 |
| 100000, 1000 | 0 | 163,000 | 28,000 | 199,000 | 131,000 |
| | 1 | 70,700 | 13,500 | 75,700 | 44,200 |
Sum performance comparison
-------------------------
For float sums, performance is unchanged to within measurement precision:
| Shape | Dim | Master Contiguous (us) | This PR Contiguous (us) | Master Discontiguous (us) | This PR Discontiguous (us) |
|-------------:|-----|:----------------------:|:-----------------------:|:-------------------------:|:--------------------------:|
| 10, 1000 | 0 | 1.92 | 2.01 | 4.2 | 4.49 |
| | 1 | 1.68 | 1.68 | 2.79 | 2.75 |
| 100, 1000 | 0 | 6.52 | 7.07 | 26.9 | 27.3 |
| | 1 | 5.91 | 5.66 | 16.8 | 16.9 |
| 1000, 1000 | 0 | 55.6 | 58.6 | 256 | 254 |
| | 1 | 41.0 | 41.2 | 150 | 147 |
| 10000, 1000 | 0 | 1,370 | 1,650 | 8,070 | 8,020 |
| | 1 | 908 | 845 | 3,100 | 2,980 |
| 100000, 1000 | 0 | 24,700 | 24,700 | 90,900 | 91,000 |
| | 1 | 12,500 | 12,100 | 31,500 | 31,800 |
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D29753523
Pulled By: ngimel
fbshipit-source-id: 28095ac39e4a07ff878775c98f7a7815d9a4e457