[CPU] Optimize softmax as flash attention v2 (#118957)
### Description
Following flash attention v2, optimize the softmax computation by moving the division by the softmax sum out of the KV inner loop: inside the loop the output accumulator is kept unnormalized and only rescaled by the running-max correction factor, and the single division by the final sum happens once after the loop. A minimal sketch of the scheme follows.
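For reference, here is a minimal Python sketch of the rescaling trick, not the actual ATen C++ kernel; the function name, tensor shapes, and the `Bkv` block size are illustrative. A v1-style loop would divide the accumulator by the running sum on every iteration; the v2 scheme below defers that division until after the loop.
```python
import torch

def flash_attn_v2_softmax(q, k, v, Bkv=128):
    # q: (M, d), k/v: (N, d); Bkv is an illustrative KV block size.
    M, d = q.shape
    scale = d ** -0.5
    acc = torch.zeros(M, v.shape[1])           # unnormalized output accumulator
    row_max = torch.full((M,), float("-inf"))  # running row max
    row_sum = torch.zeros(M)                   # running softmax denominator
    for start in range(0, k.shape[0], Bkv):
        s = (q @ k[start:start + Bkv].T) * scale      # scores for this KV block
        new_max = torch.maximum(row_max, s.max(dim=1).values)
        corr = torch.exp(row_max - new_max)           # rescale factor for old state
        p = torch.exp(s - new_max[:, None])
        row_sum = row_sum * corr + p.sum(dim=1)
        # v1 would also divide `acc` by `row_sum` here; v2 keeps the
        # accumulator unnormalized and only applies the max correction.
        acc = acc * corr[:, None] + p @ v[start:start + Bkv]
        row_max = new_max
    return acc / row_sum[:, None]                     # single division after the loop

# Sanity check against the unblocked reference computation.
q, k, v = (torch.randn(64, 32) for _ in range(3))
ref = torch.softmax((q @ k.T) * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(flash_attn_v2_softmax(q, k, v, Bkv=64), ref, atol=1e-5)
```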
### Performance
Stable Diffusion V2.1 on Intel Granite Rapids (GNR)
| Version | Kernel time (s) | Kernel time reduction |
|---------|-----------------|-----------------------|
| BF16 Before | 28.67 | |
| BF16 After | 23.55 | 17.86% |
| FP32 Before | 54.20 | |
| FP32 After | 49.47 | 8.73% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118957
Approved by: https://github.com/jgong5, https://github.com/drisspg