Add BFloat16 support and optimization for mish, hardtanh backward, and silu on CPU (#82460)
### Description
* Add BFloat16 support for mish and hardtanh backward on CPU.
* Optimize the performance of silu for BFloat16 on CPU.
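
For reference, the three operators touched here can be written in a few lines of plain Python. This is only a sketch of the standard definitions, not the vectorized ATen kernels this PR changes. `to_bfloat16` and `bf16_apply` are hypothetical helpers that emulate bfloat16 rounding, and the "convert to float, compute, round back" pattern is an assumption about how reduced-precision CPU kernels commonly work, not a description of this PR's code.

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    Hypothetical helper for illustration; NaN and values outside the
    float32 range are not handled.
    """
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits; add a rounding bias before truncating.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    # silu(x) = x * sigmoid(x)
    return x * sigmoid(x)


def silu_backward(grad: float, x: float) -> float:
    # d/dx [x * s(x)] = s(x) * (1 + x * (1 - s(x)))
    s = sigmoid(x)
    return grad * s * (1.0 + x * (1.0 - s))


def mish(x: float) -> float:
    # mish(x) = x * tanh(softplus(x)); math.exp overflows for very large x.
    return x * math.tanh(math.log1p(math.exp(x)))


def mish_backward(grad: float, x: float) -> float:
    # d/dx = tanh(sp) + x * sigmoid(x) * (1 - tanh(sp)^2), sp = softplus(x)
    t = math.tanh(math.log1p(math.exp(x)))
    return grad * (t + x * sigmoid(x) * (1.0 - t * t))


def hardtanh_backward(grad: float, x: float,
                      min_val: float = -1.0, max_val: float = 1.0) -> float:
    # The gradient passes through only strictly inside (min_val, max_val).
    return grad if min_val < x < max_val else 0.0


def bf16_apply(fn, *args: float) -> float:
    # Assumed pattern: round inputs to bfloat16, compute in higher
    # precision, then round the result back to bfloat16.
    return to_bfloat16(fn(*(to_bfloat16(a) for a in args)))
```

For example, `bf16_apply(mish_backward, 1.0, 0.5)` gives the bfloat16-rounded gradient of mish at `x = 0.5`, which can be compared against `mish_backward(1.0, 0.5)` to see the size of the rounding error.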
### Testing
- Optimize the performance of silu for bfloat16 (timings in seconds).
single socket (28 cores):
```
before: 1x128x1024   forward 0.090 s  backward 0.218 s
        10x128x1024  forward 0.146 s  backward 0.314 s
after:  1x128x1024   forward 0.064 s  backward 0.100 s
        10x128x1024  forward 0.085 s  backward 0.133 s
```
single core:
```
before: 1x128x1024   forward 0.300 s  backward 0.606 s
        10x128x1024  forward 2.825 s  backward 5.834 s
after:  1x128x1024   forward 0.156 s  backward 0.239 s
        10x128x1024  forward 1.447 s  backward 2.165 s
```
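
The before/after silu timings above are easier to compare as speedup ratios. The short script below is a convenience sketch that only divides the reported numbers; the dictionary layout and the `speedup` helper are illustrative names, not part of this PR.

```python
# (forward_s, backward_s) timings copied from the silu bfloat16 results above.
timings = {
    ("single socket", "1x128x1024"): {"before": (0.090, 0.218), "after": (0.064, 0.100)},
    ("single socket", "10x128x1024"): {"before": (0.146, 0.314), "after": (0.085, 0.133)},
    ("single core", "1x128x1024"): {"before": (0.300, 0.606), "after": (0.156, 0.239)},
    ("single core", "10x128x1024"): {"before": (2.825, 5.834), "after": (1.447, 2.165)},
}


def speedup(before: float, after: float) -> float:
    # A ratio above 1.0 means the "after" run is faster.
    return before / after


speedups = {}
for key, runs in timings.items():
    fwd = speedup(runs["before"][0], runs["after"][0])
    bwd = speedup(runs["before"][1], runs["after"][1])
    speedups[key] = (fwd, bwd)
    config, shape = key
    print(f"{config:13s} {shape:12s} forward {fwd:.2f}x  backward {bwd:.2f}x")
```

On these numbers the forward pass is roughly 1.4x to 2.0x faster and the backward pass roughly 2.2x to 2.7x faster.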
- Add BFloat16 support for mish and hardtanh backward on CPU. The tables compare fp32 and bf16 timings in seconds; `/` marks entries that are not reported.
single socket (20 cores):
op | shape | fp32 forward (s) | fp32 backward (s) | bf16 forward (s) | bf16 backward (s)
-- | -- | -- | -- | -- | --
silu | [10, 128, 10, 10] | 4.41E-05 | 7.67E-05 | 5.32E-05 | 9.38E-05
silu | [10, 128, 80, 80] | 0.0008 | 0.001788 | 0.00067 | 0.001031
mish | [10, 128, 10, 10] | 0.000356 | 0.000427 | 0.000367 | 0.000436
mish | [10, 128, 80, 80] | 0.004527 | 0.005807 | 0.004757 | 0.005393
hardtanh | [10, 128, 10, 10] | / | 3.97E-05 | / | 4.45E-05
hardtanh | [10, 128, 80, 80] | / | 0.001748 | / | 0.000645
single core:
op | shape | fp32 forward (s) | fp32 backward (s) | bf16 forward (s) | bf16 backward (s)
-- | -- | -- | -- | -- | --
silu | [10, 128, 10, 10] | 1.17E-04 | 1.91E-04 | 1.35E-04 | 2.23E-04
silu | [10, 128, 80, 80] | 0.007434 | 0.013141 | 0.008464 | 0.013044
mish | [10, 128, 10, 10] | 0.00103 | 0.00122 | 0.00106 | 0.001227
mish | [10, 128, 80, 80] | 0.065629 | 0.078418 | 0.067779 | 0.077214
hardtanh | [10, 128, 10, 10] | / | 1.18E-04 | / | 9.30E-05
hardtanh | [10, 128, 80, 80] | / | 0.010773 | / | 0.005834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82460
Approved by: https://github.com/mingfeima, https://github.com/malfet