onnxruntime
b52e8bf7 - [oneDNN ep] QAttention BF16 and GPU support added (#13793)

Commit
3 years ago
### Description
QAttention performance improvement when the hardware supports amx and avx-bf16 execution.

### Motivation and Context
- Streamlined the code to switch dynamically between BF16 and FP32 execution, depending on what the hardware supports.
- Split the QKV memory into three separate memories for Q, K, and V. This allows QAttention to run on GPU and take advantage of parallel processing.
- This change has shown a significant performance gain for the QAttention operator on hardware such as Sapphire Rapids, which supports amx and avx-bf16.
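
The QKV split described above can be illustrated with a small sketch. This is not the code from the commit, which operates on oneDNN memory objects inside the execution provider. It assumes a packed layout in which each row holds the Q, K, and V values back to back (row length `3 * hidden_size`), and the helper name `split_qkv` is hypothetical.

```python
def split_qkv(packed, hidden_size):
    """Split packed QKV rows into separate Q, K, and V row lists.

    Assumption: each row is laid out as [Q values | K values | V values],
    with hidden_size values per part. The function name and the layout
    are illustrative, not taken from the onnxruntime source.
    """
    q, k, v = [], [], []
    for row in packed:
        if len(row) != 3 * hidden_size:
            raise ValueError("each row must contain 3 * hidden_size values")
        q.append(row[:hidden_size])                 # first third: Q
        k.append(row[hidden_size:2 * hidden_size])  # middle third: K
        v.append(row[2 * hidden_size:])             # last third: V
    return q, k, v


# Two rows (for example, two tokens) with hidden_size = 2.
packed = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
]
q, k, v = split_qkv(packed, hidden_size=2)
```

Once Q, K, and V live in separate buffers, each one can be handed to later steps independently, which is the property the description relies on for GPU execution.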