onnxruntime
b52e8bf7 - [oneDNN ep] QAttention BF16 and GPU support added (#13793)

Commit

3 years ago

[oneDNN ep] QAttention BF16 and GPU support added (#13793)

### Description

QAttention performance improvement when hardware supports amx and avx-bf16 execution.

### Motivation and Context

- Streamlined the code to dynamically switch between BF16 and FP32 execution as and when supported by hardware
- Split QKV memory into three different memories for Q, K, and V. This helps to run QAttention on GPU and take advantage of parallel processing.
- This change has shown a significant amount of performance gain for QAttention operator on hardware like Sapphire Rapids which supports amx and avx-bf16.
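The QKV split mentioned in the commit message can be pictured with a small sketch. This is not the oneDNN execution provider code; it is a plain-Python illustration that assumes the packed buffer stores each token's row as `[Q | K | V]`, with each part `hidden_size` values wide. The function name `split_qkv` and the sample data are hypothetical.

```python
def split_qkv(packed, hidden_size):
    """Split rows laid out as [Q | K | V] into three separate row lists.

    The [Q | K | V] row layout is an assumption made for this sketch.
    """
    q, k, v = [], [], []
    for row in packed:
        if len(row) != 3 * hidden_size:
            raise ValueError("each row must hold exactly 3 * hidden_size values")
        q.append(row[:hidden_size])                  # first third: Q
        k.append(row[hidden_size:2 * hidden_size])   # middle third: K
        v.append(row[2 * hidden_size:])              # last third: V
    return q, k, v


# Two tokens with hidden_size = 2: each row is [q0, q1, k0, k1, v0, v1].
packed = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
]
q, k, v = split_qkv(packed, hidden_size=2)
```

Once Q, K, and V live in separate buffers, each can be passed to later steps as its own contiguous tensor, which is the property the commit message ties to running on GPU.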

References

#13793 - [oneDNN ep] QAttention BF16 and GPU support added

Author

sunnyshu-intel

Parents

c8826014
