f630294f - Optimize GELU BFloat16 Impl in CPU path (#79378)

Optimize GELU BFloat16 Impl in CPU path (#79378)

### Description
For the slow path (with non-contiguous inputs) with the `none` or `tanh` approximate, the current bfloat16 impl is not performance friendly in ATen. This PR uses float32 as an intermediate type in order to reduce the heavy cost of converting bf16 to fp32 (see the sketch below the commit message).

### Test
IceLake 2S 32C (Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz)

**single socket (32 cores):**

approximate is `none`:

| input shapes | forward (base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms) |
|--|--|--|--|--|
| [16, 32, 32] | 0.361 | 1.055 | 0.348 | 0.672 |
| [32, 32, 64] | 0.084 | 2.003 | 0.076 | 1.426 |
| [32, 64, 128] | 0.237 | 2.007 | 0.22 | 1.454 |
| [64, 128, 128] | 2.23 | 6.348 | 1.943 | 4.103 |

approximate is `tanh`:

| input shapes | forward (base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms) |
|--|--|--|--|--|
| [16, 32, 32] | 0.203 | 1.209 | 0.138 | 0.474 |
| [32, 32, 64] | 0.063 | 2.497 | 0.043 | 0.985 |
| [32, 64, 128] | 0.201 | 2.707 | 0.141 | 1.205 |
| [64, 128, 128] | 1.549 | 8.749 | 1.065 | 3.635 |

**single core:**

approximate is `none`:

| input shapes | forward (base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms) |
|--|--|--|--|--|
| [16, 32, 32] | 0.359 | 1.055 | 0.267 | 0.592 |
| [32, 32, 64] | 1.11 | 3.483 | 1.063 | 2.373 |
| [32, 64, 128] | 4.478 | 13.866 | 4.27 | 9.426 |
| [64, 128, 128] | 17.675 | 55.231 | 16.805 | 37.509 |

approximate is `tanh`:

| input shapes | forward (base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms) |
|--|--|--|--|--|
| [16, 32, 32] | 0.202 | 1.212 | 0.138 | 0.473 |
| [32, 32, 64] | 0.776 | 4.843 | 0.531 | 1.872 |
| [32, 64, 128] | 3.203 | 19.267 | 2.16 | 7.243 |
| [64, 128, 128] | 12.33 | 76.834 | 8.286 | 29.553 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79378
Approved by: https://github.com/mingfeima
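The C++ sketch below illustrates the general pattern the description refers to: compute an element-wise bfloat16 GELU with float32 intermediates, so each element pays one bf16-to-fp32 and one fp32-to-bf16 conversion instead of rounding through bf16 at every arithmetic step. This is a minimal stand-alone sketch, not the ATen kernel; the `bf16` type and the conversion helpers are simplified stand-ins for `c10::BFloat16` (truncation instead of round-to-nearest, no vectorization).

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical 16-bit storage type standing in for c10::BFloat16.
struct bf16 {
  uint16_t bits;
};

// Widen bf16 to fp32 by placing the 16 stored bits in the high half of a float.
static inline float bf16_to_fp32(bf16 x) {
  uint32_t u = static_cast<uint32_t>(x.bits) << 16;
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// Narrow fp32 to bf16 by truncation (the real conversion rounds to nearest-even).
static inline bf16 fp32_to_bf16(float f) {
  uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return bf16{static_cast<uint16_t>(u >> 16)};
}

// GELU with the `none` approximation, evaluated entirely in float32.
static inline float gelu_fp32(float x) {
  constexpr float kInvSqrt2 = 0.70710678f;  // 1 / sqrt(2)
  return x * 0.5f * (1.0f + std::erf(x * kInvSqrt2));
}

// Element-wise GELU over a bf16 buffer: one up-convert and one down-convert
// per element, with all intermediate math kept in float32.
void gelu_bf16(const bf16* in, bf16* out, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    out[i] = fp32_to_bf16(gelu_fp32(bf16_to_fp32(in[i])));
  }
}

int main() {
  bf16 in[4], out[4];
  for (int i = 0; i < 4; ++i) {
    in[i] = fp32_to_bf16(static_cast<float>(i) - 1.5f);
  }
  gelu_bf16(in, out, 4);
  for (int i = 0; i < 4; ++i) {
    std::printf("gelu(%f) = %f\n", bf16_to_fp32(in[i]), bf16_to_fp32(out[i]));
  }
  return 0;
}
```

The design point is that bf16 is a storage format rather than a compute format: keeping intermediates in fp32 avoids both the repeated conversion overhead and accumulated rounding error, and the result is narrowed back to bf16 only once per element.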