onnxruntime
5eac2c1f - relational attention bias cuda op (#14149)

Commit

3 years ago

relational attention bias cuda op (#14149)

### Description

This CUDA op implements the compute_bias() method of T5 attention, including the permutation.

Notes:

1. bias_table must be saved in column-major order. Be careful when implementing the fusion script.
2. The second input (sequence length) is placed on the CPU. Using a Shape node's output should work.
3. The first dimension of the output is 1, so extra_add_qk in attention should support broadcasting.
4. compute_bias() is only used in self-attention in T5.

TODO: docs change will be applied later.

### Motivation and Context

This is part of the work to optimize T5 attention and T5-based generation models.

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
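For orientation, here is a minimal NumPy sketch of what a T5-style compute_bias() computes for self-attention. It is not the CUDA kernel from this commit. The function names, the default `max_distance=128`, and the reading of "col-major" as a table stored with shape `(num_heads, num_buckets)` are assumptions made for illustration only.

```python
import numpy as np


def relative_position_bucket(relative_position, bidirectional=True,
                             num_buckets=32, max_distance=128):
    """Map signed relative positions to bucket ids (T5-style bucketing)."""
    buckets = np.zeros_like(relative_position)
    if bidirectional:
        # Half of the buckets are reserved for positive offsets.
        num_buckets //= 2
        buckets += (relative_position > 0).astype(np.int64) * num_buckets
        relative_position = np.abs(relative_position)
    else:
        # Only non-positive offsets are distinguished; positive ones clamp to 0.
        relative_position = -np.minimum(relative_position, 0)

    # Small distances get one bucket each; larger ones are binned logarithmically.
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    # Clamp to 1 so log() never sees 0; those entries are discarded by np.where.
    safe = np.maximum(relative_position, 1).astype(np.float64)
    large = max_exact + (
        np.log(safe / max_exact) / np.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).astype(np.int64)
    large = np.minimum(large, num_buckets - 1)
    return buckets + np.where(is_small, relative_position, large)


def compute_bias(bias_table_t, seq_len, bidirectional=True, max_distance=128):
    """Return a bias of shape (1, num_heads, seq_len, seq_len).

    bias_table_t is ASSUMED to have shape (num_heads, num_buckets), i.e. the
    transpose of the usual (num_buckets, num_heads) embedding table.
    """
    num_buckets = bias_table_t.shape[1]
    context = np.arange(seq_len, dtype=np.int64)[:, None]  # query positions
    memory = np.arange(seq_len, dtype=np.int64)[None, :]   # key positions
    buckets = relative_position_bucket(
        memory - context, bidirectional, num_buckets, max_distance)
    # Gathering along the bucket axis yields (num_heads, seq_len, seq_len)
    # directly, so no separate permutation is needed with this layout.
    bias = bias_table_t[:, buckets]
    return bias[None]  # leading dimension of 1, broadcast over the batch
```

The leading dimension of 1 is why a consumer that adds this bias to the attention scores needs broadcasting over the batch dimension, as the third note points out.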

References

#14149 - relational attention bias cuda op

Author

wangyems

Parents

8e216301
