pytorch
6372f11d - RowwiseMoments: use float as acc type for bfloat16 inputs (#84405)

Commit
2 years ago
RowwiseMoments: use float as acc type for bfloat16 inputs (#84405)

To fix https://github.com/pytorch/pytorch/issues/77507

Originally `utils::RowwiseMoments<BFloat16>` would still accumulate in BFloat16, which is not only slow but also introduces additional rounding errors. This patch accumulates in float for bfloat16 inputs: each bfloat16 vec (size 16) is converted to two float vecs (size 8), which are accumulated into the m1 (mean) and m2 (rstd) vecs, all of which are float vecs.

No effect on float performance; bfloat16 performance improves:

* avx512 single socket:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.210 ms; bf16: 0.770 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.215 ms; bf16: 0.178 ms
```

* avx512 single core:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.661 ms; bf16: 12.267 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.618 ms; bf16: 2.309 ms
```

* avx2 single socket:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.540 ms; bf16: 2.030 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.527 ms; bf16: 0.458 ms
```

* avx2 single core:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.349 ms; bf16: 19.252 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.416 ms; bf16: 3.524 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84405
Approved by: https://github.com/jgong5
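To illustrate the rounding problem that accumulating in BFloat16 causes, here is a minimal pure-Python sketch. It is not the vectorized kernel from this patch: it emulates bfloat16 by rounding float32 bit patterns and uses a naive running sum rather than the kernel's moment computation. The helper names (`to_bf16`, `to_f32`, `mean_bf16_acc`, `mean_float_acc`) are hypothetical and exist only for this example.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite value to the nearest bfloat16 (round-to-nearest-even).

    bfloat16 keeps the upper 16 bits of the float32 bit pattern.
    NaN and infinity are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that truncating the low 16 bits rounds to nearest even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mean_bf16_acc(values):
    """Mean with a bfloat16 accumulator: every partial sum is rounded to bf16."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + v)
    return to_bf16(acc / len(values))


def mean_float_acc(values):
    """Mean with a float32 accumulator: bf16 inputs, float32 partial sums."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return to_f32(acc / len(values))


# 1024 bfloat16 inputs, all exactly 1.0 (same row length as the benchmark).
row = [to_bf16(1.0)] * 1024

print(mean_bf16_acc(row))   # 0.25
print(mean_float_acc(row))  # 1.0
```

With only 8 bits of significand, the bfloat16 running sum stalls at 256 because 256 + 1 rounds back to 256, so the computed mean is 0.25 instead of 1.0. The float accumulator represents every partial sum exactly in this case.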