Improve precision and performance for BFloat16 upsampling (#91169)
### Description
- Fix precision issue for BFloat16 upsampling: https://github.com/pytorch/pytorch/issues/89212
- Improve performance for BFloat16 upsampling.
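
One way to observe the kind of precision gap the first bullet refers to is to compare the BFloat16 gradient against a float32 reference computed from the same values. The sketch below is illustrative only and is not taken from this PR or the linked issue. The helper name `bf16_backward_error`, the default shape, and the seed are assumptions.

```python
import torch
import torch.nn.functional as F


def bf16_backward_error(mode="bilinear", shape=(3, 5, 20, 20), scale_factor=2, seed=0):
    """Max abs difference between BFloat16 and float32 upsampling gradients.

    Hypothetical helper for illustration; not part of PyTorch or this PR.
    """
    torch.manual_seed(seed)
    # align_corners is only accepted by the interpolating modes.
    kwargs = {} if mode == "nearest" else {"align_corners": False}
    # Round the input to BFloat16 first so both paths start from identical values.
    x_bf16 = torch.randn(shape).to(torch.bfloat16)
    grads = {}
    for dtype in (torch.float32, torch.bfloat16):
        x = x_bf16.clone().to(dtype).requires_grad_(True)
        y = F.interpolate(x, scale_factor=scale_factor, mode=mode, **kwargs)
        # Same upstream gradient for both dtypes, also pre-rounded to BFloat16.
        torch.manual_seed(seed + 1)
        g = torch.randn(y.shape).to(torch.bfloat16).to(dtype)
        y.backward(g)
        grads[dtype] = x.grad.float()
    return (grads[torch.bfloat16] - grads[torch.float32]).abs().max().item()


if __name__ == "__main__":
    for mode in ("nearest", "bilinear", "bicubic"):
        print(mode, bf16_backward_error(mode))
```

Some difference is expected because the BFloat16 result is itself rounded to BFloat16, so the value is only meaningful relative to the gradient magnitudes or to another build.
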
### Testing
data type: BFloat16
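
The benchmark script is not included in this description. A minimal sketch of how a backward timing like the ones reported here might be collected on CPU is given below. The function name `time_backward_ms`, the warmup and iteration counts, and the use of `align_corners=False` are assumptions, not the settings used for the tables.

```python
import time

import torch
import torch.nn.functional as F


def time_backward_ms(mode, shape, channels_last=False, scale_factor=2,
                     warmup=2, iters=5):
    """Average backward time in ms for F.interpolate on a BFloat16 CPU tensor.

    Hypothetical helper for illustration; not the script used in this PR.
    """
    x = torch.randn(shape).to(torch.bfloat16)
    if channels_last:
        # Channels-last layouts exist only for 4-D (NCHW) and 5-D (NCDHW) tensors.
        formats = {4: torch.channels_last, 5: torch.channels_last_3d}
        if x.dim() not in formats:
            raise ValueError("channels last is not defined for %d-D input" % x.dim())
        x = x.contiguous(memory_format=formats[x.dim()])
    x.requires_grad_(True)
    # align_corners is only accepted by the interpolating modes.
    kwargs = {} if mode == "nearest" else {"align_corners": False}
    total = 0.0
    for i in range(warmup + iters):
        x.grad = None
        y = F.interpolate(x, scale_factor=scale_factor, mode=mode, **kwargs)
        grad = torch.ones_like(y)
        start = time.perf_counter()
        y.backward(grad)  # only the backward pass is timed
        elapsed = time.perf_counter() - start
        if i >= warmup:
            total += elapsed
    return total / iters * 1e3


if __name__ == "__main__":
    torch.set_num_threads(1)  # single-core run; raise this for a multi-core run
    print(time_backward_ms("bilinear", [3, 5, 200, 200]))
    print(time_backward_ms("bilinear", [3, 5, 200, 200], channels_last=True))
```

The `ValueError` branch reflects that a 3-D input, as used for `linear` mode, has no channels-last layout.
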
- Single core
contiguous:
mode | scale_factor | shape | before backward / ms | after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 14.47 | 8.34
linear | 2 | [3, 200, 200] | 3.69 | 2.74
bilinear | 2 | [3, 5, 200, 200] | 87.99 | 49.05
trilinear | 2 | [3, 3, 3, 100, 100] | 171.02 | 72.53
bicubic | 2 | [3, 3, 200, 200] | 176.29 | 78
channels last:
mode | scale_factor | shape | before backward / ms | after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 17.70 | 10.30
linear | 2 | [3, 200, 200] | N/A | N/A
bilinear | 2 | [3, 5, 200, 200] | 50.90 | 18.83
trilinear | 2 | [3, 3, 3, 100, 100] | 121.56 | 42.60
bicubic | 2 | [3, 3, 200, 200] | 179.40 | 80
- 20 cores
contiguous:
mode | scale_factor | shape | before backward / ms | after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 1.17 | 1.01
linear | 2 | [3, 200, 200] | 0.41 | 0.26
bilinear | 2 | [3, 5, 200, 200] | 7.19 | 4.07
trilinear | 2 | [3, 3, 3, 100, 100] | 21.32 | 9.33
bicubic | 2 | [3, 3, 200, 200] | 178.67 | 10
channels last:
mode | scale_factor | shape | before backward / ms | after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 2.25 | 1.55
linear | 2 | [3, 200, 200] | N/A | N/A
bilinear | 2 | [3, 5, 200, 200] | 20.17 | 7.20
trilinear | 2 | [3, 3, 3, 100, 100] | 43.33 | 15.66
bicubic | 2 | [3, 3, 200, 200] | 176.76 | 10
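
The tables give absolute times only. The speedup per configuration can be derived as before / after, as in the short sketch below. The numbers are copied from the tables above. The names `RESULTS`, `speedup`, and `speedup_table` are illustrative.

```python
# Backward times in ms, copied from the tables above: (mode, before, after).
# The linear / channels-last rows are omitted because they have no measurement.
RESULTS = {
    ("1 core", "contiguous"): [
        ("nearest", 14.47, 8.34),
        ("linear", 3.69, 2.74),
        ("bilinear", 87.99, 49.05),
        ("trilinear", 171.02, 72.53),
        ("bicubic", 176.29, 78.0),
    ],
    ("1 core", "channels last"): [
        ("nearest", 17.70, 10.30),
        ("bilinear", 50.90, 18.83),
        ("trilinear", 121.56, 42.60),
        ("bicubic", 179.40, 80.0),
    ],
    ("20 cores", "contiguous"): [
        ("nearest", 1.17, 1.01),
        ("linear", 0.41, 0.26),
        ("bilinear", 7.19, 4.07),
        ("trilinear", 21.32, 9.33),
        ("bicubic", 178.67, 10.0),
    ],
    ("20 cores", "channels last"): [
        ("nearest", 2.25, 1.55),
        ("bilinear", 20.17, 7.20),
        ("trilinear", 43.33, 15.66),
        ("bicubic", 176.76, 10.0),
    ],
}


def speedup(before_ms, after_ms):
    """Ratio of the old time to the new time (values above 1 mean faster)."""
    return before_ms / after_ms


def speedup_table(results=RESULTS):
    """Map each (cores, layout) key to a {mode: speedup} dictionary."""
    return {
        key: {mode: speedup(before, after) for mode, before, after in rows}
        for key, rows in results.items()
    }


if __name__ == "__main__":
    for (cores, layout), per_mode in speedup_table().items():
        for mode, ratio in per_mode.items():
            print(f"{cores:8s} {layout:13s} {mode:9s} {ratio:5.2f}x")
```

By this arithmetic the largest gains are the 20-core bicubic rows, at roughly 17.9x (contiguous) and 17.7x (channels last). The smallest is 20-core contiguous nearest, at about 1.16x.
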
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91169
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/Skylion007