pytorch
c44d9d9f - Use cascade-summation to improve nansum accuracy (#61082)

Commit
3 years ago
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61082

Fixes #59415

This implements nansum as a new `LoadPolicy` for the existing sum functions, so it now uses the more accurate cascade-sum algorithm.

I've also expanded `test_nansum` to cover the four special cases of the sum algorithm (inner/outer reduction; vectorized or scalar).

Nansum performance comparison
-----------------------------

For float sums, contiguous reductions are as much as 10x faster and discontiguous sums are ~1.8x faster (more for small shapes due to TensorIterator overheads).

|        Shape | Dim | Master Contiguous (us) | This PR Contiguous (us) | Master Discontiguous (us) | This PR Discontiguous (us) |
|-------------:|-----|:----------------------:|:-----------------------:|:-------------------------:|:--------------------------:|
|     10, 1000 | 0   |          74.9          |          2.02           |           75.6            |            6.41            |
|              | 1   |          8.24          |           1.8           |           8.28            |            5.24            |
|    100, 1000 | 0   |          134           |          7.55           |            130            |            43.2            |
|              | 1   |          70.5          |          7.01           |           71.5            |            40.6            |
|   1000, 1000 | 0   |          726           |          69.2           |            737            |            403             |
|              | 1   |          702           |          51.0           |            709            |            404             |
|  10000, 1000 | 0   |         15,300         |          2,470          |          18,200           |           10,400           |
|              | 1   |         7,200          |          1,160          |           7,470           |           4,440            |
| 100000, 1000 | 0   |        163,000         |         28,000          |          199,000          |          131,000           |
|              | 1   |         70,700         |         13,500          |          75,700           |           44,200           |

Sum performance comparison
--------------------------

For float sums, performance is unchanged to within measurement precision:

|        Shape | Dim | Master Contiguous (us) | This PR Contiguous (us) | Master Discontiguous (us) | This PR Discontiguous (us) |
|-------------:|-----|:----------------------:|:-----------------------:|:-------------------------:|:--------------------------:|
|     10, 1000 | 0   |          1.92          |          2.01           |            4.2            |            4.49            |
|              | 1   |          1.68          |          1.68           |           2.79            |            2.75            |
|    100, 1000 | 0   |          6.52          |          7.07           |           26.9            |            27.3            |
|              | 1   |          5.91          |          5.66           |           16.8            |            16.9            |
|   1000, 1000 | 0   |          55.6          |          58.6           |            256            |            254             |
|              | 1   |          41.0          |          41.2           |            150            |            147             |
|  10000, 1000 | 0   |         1,370          |          1,650          |           8,070           |           8,020            |
|              | 1   |          908           |           845           |           3,100           |           2,980            |
| 100000, 1000 | 0   |         24,700         |         24,700          |          90,900           |           91,000           |
|              | 1   |         12,500         |         12,100          |          31,500           |           31,800           |

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29753523

Pulled By: ngimel

fbshipit-source-id: 28095ac39e4a07ff878775c98f7a7815d9a4e457
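
For readers unfamiliar with the idea, here is a minimal pure-Python sketch of cascade (pairwise) summation with a pluggable load step, assuming that a "load policy" is simply a function applied to each element as it is read. It is illustrative only and is not the ATen implementation, which is vectorized C++. The names `load_plain`, `load_nan_as_zero`, `cascade_sum`, and the `block` parameter are hypothetical and do not appear in the PR.

```python
import math
from typing import Callable, Sequence


def load_plain(x: float) -> float:
    """Hypothetical 'plain sum' policy: use the element unchanged."""
    return x


def load_nan_as_zero(x: float) -> float:
    """Hypothetical 'nansum' policy: treat NaN as 0 when loading."""
    return 0.0 if math.isnan(x) else x


def cascade_sum(
    values: Sequence[float],
    load: Callable[[float], float] = load_plain,
    block: int = 8,
) -> float:
    """Sum `values` by recursively splitting the range in half.

    Ranges of at most `block` elements are summed sequentially; larger
    ranges are split and their partial sums added. Rounding error then
    tends to grow with log(n) rather than n.
    """

    def rec(lo: int, hi: int) -> float:
        if hi - lo <= block:
            total = 0.0
            for i in range(lo, hi):
                total += load(values[i])
            return total
        mid = (lo + hi) // 2
        return rec(lo, mid) + rec(mid, hi)

    return rec(0, len(values))


if __name__ == "__main__":
    data = [0.1] * 100_000
    exact = math.fsum(data)

    # Plain left-to-right accumulation, for comparison.
    naive = 0.0
    for x in data:
        naive += x

    print("naive error:  ", abs(naive - exact))
    print("cascade error:", abs(cascade_sum(data) - exact))
    print("nansum:", cascade_sum([1.0, float("nan"), 2.0], load_nan_as_zero))
```

Because the NaN handling lives entirely in the load step, the same summation routine serves both the plain and the NaN-skipping case, which is the design the commit message describes for the real kernels.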