optimize threading of mha (#20088)

Commit

1 year ago

optimize threading of mha (#20088)

### Description

The cost computation of `ComputeVxAttentionScore` is wrong. It should be `sequence_length * v_head_size * total_sequence_length` instead of `sequence_length * v_head_size * sequence_length`. The PR also fine-tunes the cost computation.

On my local box with an i9 CPU, performance is the same as the unfused version, but it is much faster on an Azure VM with 16 threads.

### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/19924
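To illustrate why the two cost formulas diverge, here is a minimal Python sketch. It is not the onnxruntime code: the function names `vx_cost_before` and `vx_cost_after` are hypothetical, and the example assumes the cost is an estimate of the multiply-accumulate work for one (batch, head) unit, where an attention-probability matrix of shape `[sequence_length, total_sequence_length]` is multiplied by a V matrix of shape `[total_sequence_length, v_head_size]`. The shapes used below are made-up illustrative values.

```python
def vx_cost_before(sequence_length: int, v_head_size: int) -> int:
    """Hypothetical stand-in for the old estimate described in the commit."""
    return sequence_length * v_head_size * sequence_length


def vx_cost_after(sequence_length: int, v_head_size: int,
                  total_sequence_length: int) -> int:
    """Hypothetical stand-in for the corrected estimate.

    Each of the sequence_length * v_head_size output elements is a dot
    product over total_sequence_length terms (assumption stated above).
    """
    return sequence_length * v_head_size * total_sequence_length


# Illustrative shapes only: one new token attending over 1024 cached positions.
seq_len, v_head, total_len = 1, 64, 1024

old_cost = vx_cost_before(seq_len, v_head)
new_cost = vx_cost_after(seq_len, v_head, total_len)
print(old_cost, new_cost, new_cost // old_cost)

# With no past state (total_sequence_length == sequence_length) the two agree.
same_cost = vx_cost_after(128, v_head, 128) == vx_cost_before(128, v_head)
print(same_cost)
```

Under these assumptions the two formulas agree only when there is no past key/value state. Whenever `total_sequence_length` exceeds `sequence_length`, the old formula underestimates the work by a factor of `total_sequence_length / sequence_length`.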

References

#20088 - optimize threading of mha

Author

yufenglee

Parents

9d06e1bf

onnxruntime 91654988 - optimize threading of mha (#20088)
