Implement FlashAttention for CPU (#20805)
### Description
Implement [FlashAttention](https://arxiv.org/pdf/2205.14135) and
[FlashAttention-2](https://arxiv.org/pdf/2307.08691) for
MultiHeadAttention on CPU.
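
Below is a minimal single-head sketch of the tiled online-softmax recurrence the two papers describe, to illustrate what the kernel computes; the function name, signature, tile size, and memory layout here are illustrative assumptions, not the actual onnxruntime code.

```cpp
// Sketch of the FlashAttention-2 forward pass for one head:
// iterate over key/value tiles, maintaining a running max and
// running softmax denominator per query row (online softmax).
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// q: [seq_q, d], k/v: [seq_kv, d], out: [seq_q, d], row-major.
void FlashAttentionForward(const float* q, const float* k, const float* v,
                           float* out, int seq_q, int seq_kv, int d,
                           int kv_block = 64) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  std::vector<float> scores(kv_block);
  std::vector<float> acc(d);

  for (int i = 0; i < seq_q; ++i) {  // one query row at a time
    const float* qi = q + i * d;
    float m = -std::numeric_limits<float>::infinity();  // running max
    float l = 0.0f;                                     // running denominator
    std::fill(acc.begin(), acc.end(), 0.0f);            // unnormalized output

    for (int j0 = 0; j0 < seq_kv; j0 += kv_block) {  // KV tiles
      const int jb = std::min(kv_block, seq_kv - j0);

      // Scores for this tile: s_j = scale * <q_i, k_j>.
      float tile_max = -std::numeric_limits<float>::infinity();
      for (int j = 0; j < jb; ++j) {
        const float* kj = k + (j0 + j) * d;
        float s = 0.0f;
        for (int t = 0; t < d; ++t) s += qi[t] * kj[t];
        scores[j] = s * scale;
        tile_max = std::max(tile_max, scores[j]);
      }

      // Online softmax update: rescale previous state by
      // exp(m_old - m_new), then fold in the new tile.
      const float m_new = std::max(m, tile_max);
      const float correction = std::exp(m - m_new);
      l *= correction;
      for (int t = 0; t < d; ++t) acc[t] *= correction;

      for (int j = 0; j < jb; ++j) {
        const float p = std::exp(scores[j] - m_new);
        l += p;
        const float* vj = v + (j0 + j) * d;
        for (int t = 0; t < d; ++t) acc[t] += p * vj[t];
      }
      m = m_new;
    }

    // FlashAttention-2 defers the softmax division to a single
    // normalization per query row.
    for (int t = 0; t < d; ++t) out[i * d + t] = acc[t] / l;
  }
}
```

Processing keys/values in tiles keeps the working set in cache and avoids materializing the full seq_q x seq_kv score matrix; deferring the denominator division to one step per row is the FlashAttention-2 refinement over the original algorithm.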
### Motivation and Context
Accelerate the execution of MultiHeadAttention on CPU.
Current performance: 10 ms with FlashAttention vs. 16 ms with the existing
com.microsoft.MultiHeadAttention kernel on my Linux machine, and 10 ms vs.
38 ms on my Windows machine. Further optimization may still be needed.
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Qingnan Duan <qiduan@microsoft.com>