Optimize FlashAttention for M4 Max (20x speedup) (#27780)
MultiHeadAttention
Before: 58.3s
After: 2.89s
Speedup: 20x
### Description
### Motivation and Context
Tested with `vision_encoder.onnx` from
https://huggingface.co/onnx-community/LightOnOCR-2-1B-ONNX