Optimize GELU BFloat16 Impl in CPU path (#79378)
### Description
For the slow path (non-contiguous inputs) with the `none` or `tanh` approximation, the current bfloat16 implementation in ATen is not performance friendly. This PR uses float32 as the intermediate type, so each element is widened from bfloat16 and narrowed back only once, reducing the heavy cost of repeated bf16/fp32 conversion.
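A minimal Python sketch of the idea (conceptual only, not the ATen C++ kernel; Python floats are float64, whereas the kernel does this math at float32):

```python
import math
import torch
import torch.nn.functional as F

# Scalar view of the change, for one bfloat16 element: widen to the wider
# type once, evaluate the whole GELU expression there, narrow back once,
# instead of paying a bf16 <-> fp32 round trip around individual operations.
def gelu_none(x: float) -> float:
    # approximate == "none": x * 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # approximate == "tanh": 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# Tensor-level equivalent of the new slow path on a non-contiguous bf16 input:
# transpose() produces a non-contiguous view, which is what hits the slow path.
x = torch.randn(32, 64, 128, dtype=torch.bfloat16).transpose(1, 2)
y = F.gelu(x.float(), approximate="tanh").to(torch.bfloat16)
```

The backward kernels benefit in the same way, which is consistent with the larger backward speedups in the tables below.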
### Test
Benchmarked on Ice Lake 2S 32C (Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz).
**single socket (32 cores):**
approximate is `none`:
| input shape | forward, base (ms) | backward, base (ms) | forward, optimized (ms) | backward, optimized (ms) |
|--|--|--|--|--|
| [16, 32, 32] | 0.361 | 1.055 | 0.348 | 0.672 |
| [32, 32, 64] | 0.084 | 2.003 | 0.076 | 1.426 |
| [32, 64, 128] | 0.237 | 2.007 | 0.22 | 1.454 |
| [64, 128, 128] | 2.23 | 6.348 | 1.943 | 4.103 |
approximate is `tanh`:
| input shape | forward, base (ms) | backward, base (ms) | forward, optimized (ms) | backward, optimized (ms) |
|--|--|--|--|--|
| [16, 32, 32] | 0.203 | 1.209 | 0.138 | 0.474 |
| [32, 32, 64] | 0.063 | 2.497 | 0.043 | 0.985 |
| [32, 64, 128] | 0.201 | 2.707 | 0.141 | 1.205 |
| [64, 128, 128] | 1.549 | 8.749 | 1.065 | 3.635 |
**single core:**
approximate is `none`:
| input shape | forward, base (ms) | backward, base (ms) | forward, optimized (ms) | backward, optimized (ms) |
|--|--|--|--|--|
| [16, 32, 32] | 0.359 | 1.055 | 0.267 | 0.592 |
| [32, 32, 64] | 1.11 | 3.483 | 1.063 | 2.373 |
| [32, 64, 128] | 4.478 | 13.866 | 4.27 | 9.426 |
| [64, 128, 128] | 17.675 | 55.231 | 16.805 | 37.509 |
approximate is `tanh`:
| input shape | forward, base (ms) | backward, base (ms) | forward, optimized (ms) | backward, optimized (ms) |
|--|--|--|--|--|
| [16, 32, 32] | 0.202 | 1.212 | 0.138 | 0.473 |
| [32, 32, 64] | 0.776 | 4.843 | 0.531 | 1.872 |
| [32, 64, 128] | 3.203 | 19.267 | 2.16 | 7.243 |
| [64, 128, 128] | 12.33 | 76.834 | 8.286 | 29.553 |
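A minimal sketch of how such timings could be reproduced (assumed harness, not necessarily the one used above):

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

torch.set_num_threads(1)  # single-core run; omit for the single-socket numbers

# transpose() yields a non-contiguous view, which exercises the slow path.
x = torch.randn(64, 128, 128, dtype=torch.bfloat16).transpose(1, 2)
x.requires_grad_(True)

# Forward timing.
fwd = benchmark.Timer(
    stmt="F.gelu(x, approximate='tanh')",
    globals={"F": F, "x": x},
)
print(fwd.timeit(1000))

# Backward timing; retain_graph=True allows repeated backward calls.
y = F.gelu(x, approximate="tanh")
grad = torch.ones_like(y)
bwd = benchmark.Timer(
    stmt="y.backward(grad, retain_graph=True)",
    globals={"y": y, "grad": grad},
)
print(bwd.timeit(1000))
```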
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79378
Approved by: https://github.com/mingfeima