optimize threading of mha (#20088)
### Description
The cost computation in ComputeVxAttentionScore is wrong. It should be
sequence_length * v_head_size * total_sequence_length instead of
sequence_length * v_head_size * sequence_length.
This PR also fine-tunes the cost computation.
On my local box with an i9 CPU, performance is the same as the unfused version,
but it is much faster on an Azure VM with 16 threads.
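
To illustrate why the third factor matters, here is a minimal sketch of the two formulas. The helper names `OldVxCost` and `FixedVxCost` are hypothetical and are not the onnxruntime source. The sketch assumes the attention probabilities have shape [sequence_length, total_sequence_length] and V has shape [total_sequence_length, v_head_size], so one head's product needs sequence_length * v_head_size * total_sequence_length multiply-adds.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only (not the onnxruntime code).
// They estimate the multiply-add count for one head of probs x V.

// Old formula from the description: uses sequence_length twice, so it
// ignores the length of the attended (past + current) sequence.
inline double OldVxCost(std::size_t sequence_length,
                        std::size_t v_head_size) {
  return static_cast<double>(sequence_length) *
         static_cast<double>(v_head_size) *
         static_cast<double>(sequence_length);
}

// Corrected formula: the inner (reduction) dimension of the matrix
// product is total_sequence_length.
inline double FixedVxCost(std::size_t sequence_length,
                          std::size_t v_head_size,
                          std::size_t total_sequence_length) {
  return static_cast<double>(sequence_length) *
         static_cast<double>(v_head_size) *
         static_cast<double>(total_sequence_length);
}
```

With illustrative values sequence_length = 1, v_head_size = 64, and total_sequence_length = 1024, the old formula gives 64 while the corrected one gives 65536. The two agree only when total_sequence_length equals sequence_length.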
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/19924