onnxruntime
e525ea22 - [CPU] Optimize GQA attention bias application for FP16 (#25871)

### Description

When the attention bias input is used with the GQA op in FP16, platforms without native FP16 math support must cast the bias values to FP32, which requires a temporary buffer to hold the converted values. The problem was that this temporary buffer was allocated and deallocated inside a loop, once for every token processed. The implementation has been refactored so the allocation takes place only once. Phi model throughput increased by 15%.
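The fix is an instance of hoisting an allocation out of a hot loop. Below is a minimal sketch of the before/after pattern; the function names, the scalar `HalfToFloat` helper, and the row-per-token layout are illustrative assumptions for this example, not onnxruntime's actual GQA kernel code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Scalar IEEE 754 binary16 -> binary32 conversion, standing in for
// whatever vectorized cast the real kernel uses.
static float HalfToFloat(uint16_t h) {
  uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  uint32_t exp = (h >> 10) & 0x1Fu;
  uint32_t frac = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0x1Fu) {
    bits = sign | 0x7F800000u | (frac << 13);            // Inf / NaN
  } else if (exp != 0) {
    bits = sign | ((exp + 112u) << 23) | (frac << 13);   // normal number
  } else if (frac != 0) {
    uint32_t e = 113u;                                   // subnormal: renormalize
    while ((frac & 0x400u) == 0) { frac <<= 1; --e; }
    bits = sign | (e << 23) | ((frac & 0x3FFu) << 13);
  } else {
    bits = sign;                                         // signed zero
  }
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Before: the FP32 scratch buffer is allocated and freed once per token.
void AddBiasPerTokenAlloc(const uint16_t* bias, float* scores,
                          size_t num_tokens, size_t row_len) {
  for (size_t t = 0; t < num_tokens; ++t) {
    std::vector<float> row_fp32(row_len);                // alloc inside the loop
    for (size_t i = 0; i < row_len; ++i)
      row_fp32[i] = HalfToFloat(bias[t * row_len + i]);
    for (size_t i = 0; i < row_len; ++i)
      scores[t * row_len + i] += row_fp32[i];
  }                                                      // freed every iteration
}

// After: the scratch buffer is allocated once and reused for every token.
void AddBiasHoistedAlloc(const uint16_t* bias, float* scores,
                         size_t num_tokens, size_t row_len) {
  std::vector<float> row_fp32(row_len);                  // single allocation
  for (size_t t = 0; t < num_tokens; ++t) {
    for (size_t i = 0; i < row_len; ++i)
      row_fp32[i] = HalfToFloat(bias[t * row_len + i]);
    for (size_t i = 0; i < row_len; ++i)
      scores[t * row_len + i] += row_fp32[i];
  }
}
```

The hoisted version does the same work per element; it only moves the buffer's lifetime outside the loop, so the allocator is hit once instead of once per token, which is where the throughput gain comes from.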