onnxruntime
e525ea22 - [CPU] Optimize GQA attention bias application for FP16 (#25871)

### Description

When the attention bias input is used with the GQA op in FP16, platforms without native FP16 math support must cast the bias values to FP32, which requires a temporary buffer to hold the converted values. The problem was that this temporary buffer was allocated and deallocated inside a loop, once for every token processed. The implementation has been refactored so the allocation takes place only once. Phi model throughput increased by 15%.
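The fix is an instance of hoisting an allocation out of a hot loop. Below is a minimal sketch of the before/after pattern; the function names, the scalar `HalfToFloat` helper, and the row-per-token layout are illustrative assumptions for this example, not onnxruntime's actual GQA kernel code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Scalar IEEE 754 binary16 -> binary32 conversion, standing in for
// whatever vectorized cast the real kernel uses.
static float HalfToFloat(uint16_t h) {
  uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  uint32_t exp = (h >> 10) & 0x1Fu;
  uint32_t frac = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0x1Fu) {
    bits = sign | 0x7F800000u | (frac << 13);            // Inf / NaN
  } else if (exp != 0) {
    bits = sign | ((exp + 112u) << 23) | (frac << 13);   // normal number
  } else if (frac != 0) {
    uint32_t e = 113u;                                   // subnormal: renormalize
    while ((frac & 0x400u) == 0) { frac <<= 1; --e; }
    bits = sign | (e << 23) | ((frac & 0x3FFu) << 13);
  } else {
    bits = sign;                                         // signed zero
  }
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Before: the FP32 scratch buffer is allocated and freed once per token.
void AddBiasPerTokenAlloc(const uint16_t* bias, float* scores,
                          size_t num_tokens, size_t row_len) {
  for (size_t t = 0; t < num_tokens; ++t) {
    std::vector<float> row_fp32(row_len);                // alloc inside the loop
    for (size_t i = 0; i < row_len; ++i)
      row_fp32[i] = HalfToFloat(bias[t * row_len + i]);
    for (size_t i = 0; i < row_len; ++i)
      scores[t * row_len + i] += row_fp32[i];
  }                                                      // freed every iteration
}

// After: the scratch buffer is allocated once and reused for every token.
void AddBiasHoistedAlloc(const uint16_t* bias, float* scores,
                         size_t num_tokens, size_t row_len) {
  std::vector<float> row_fp32(row_len);                  // single allocation
  for (size_t t = 0; t < num_tokens; ++t) {
    for (size_t i = 0; i < row_len; ++i)
      row_fp32[i] = HalfToFloat(bias[t * row_len + i]);
    for (size_t i = 0; i < row_len; ++i)
      scores[t * row_len + i] += row_fp32[i];
  }
}
```

The hoisted version does the same work per element; it only moves the buffer's lifetime outside the loop, so the allocator is hit once instead of once per token, which is where the throughput gain comes from.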