onnxruntime
ad382120 - [CUDA] enable causal in MultiHeadAttention (#21852)

[CUDA] enable causal in MultiHeadAttention (#21852)

### Description
Enable causal attention in the MultiHeadAttention CUDA operator. All input formats (Q_K_V_BSNH_BSNH_BSNH, Q_K_V_BSNH_BNSH_BNSH, Q_KV_BSNH_BSN2H and QKV_BSN3H) now support causal. Internally, causal attention is dispatched to the flash attention, efficient attention, or unfused attention kernel.

### Motivation and Context
Currently, MultiHeadAttention supports causal in the CPU EP but not in the CUDA EP. This can cause issues in ONNX conversion: some models run on CPU but fail on CUDA. Enabling causal in CUDA narrows the gap between the CPU and CUDA support matrices.
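For reference, "causal" here means each query position may attend only to itself and earlier key positions. A minimal NumPy sketch of what the unfused causal path computes for a single head (illustrative only; `causal_attention` is a hypothetical helper, not onnxruntime kernel code):

```python
import numpy as np

def causal_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) for a single attention head.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                         # (seq, seq)
    # Mask out scores above the diagonal so position i cannot
    # attend to positions j > i (the causal constraint).
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

A useful sanity check: the first output row equals `v[0]` exactly, because position 0 can only attend to itself.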