onnxruntime
c7250f4d - [CPU] GQA supports attention scores output (#25319)

Commit
155 days ago
### Description

1. Add an optional output to the CPU implementation of the GQA op for storing attention scores (QK). The buffer has shape (B, N, S, T) and is either fp16 or fp32, depending on the type of the other inputs.
2. Add a `qk_output` attribute to GQA, which controls whether attention scores are saved before or after softmax is applied.
3. Add unit tests to cover this use case.
4. Add asserts on other EPs if this feature is used.
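To make the new output concrete, here is a minimal pure-Python sketch of what a (B, N, S, T) attention-score buffer could contain. It is not the onnxruntime kernel. It assumes B is batch size, N the number of query heads, S the query sequence length, and T the total key sequence length; the commit message does not spell these out. The function `gqa_scores` and its `when` parameter are hypothetical stand-ins for illustration and do not reflect the actual encoding of the `qk_output` attribute. Causal masking and fp16 handling are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_scores(q, k, when="before_softmax"):
    """Return attention scores of shape [B][N][S][T] (hypothetical helper).

    q: nested lists of shape [B][N][S][H]      (N query heads)
    k: nested lists of shape [B][N_kv][T][H]   (N_kv key heads, N % N_kv == 0)
    when: "before_softmax" returns scaled Q.K^T,
          "after_softmax" returns the softmax of each row.
    """
    B, N, S, H = len(q), len(q[0]), len(q[0][0]), len(q[0][0][0])
    n_kv, T = len(k[0]), len(k[0][0])
    group = N // n_kv              # query heads sharing one key head
    scale = 1.0 / math.sqrt(H)     # common default scaling, assumed here

    out = []
    for b in range(B):
        heads = []
        for n in range(N):
            kv = n // group        # key head used by query head n
            rows = []
            for s in range(S):
                row = [
                    scale * sum(q[b][n][s][h] * k[b][kv][t][h] for h in range(H))
                    for t in range(T)
                ]
                if when == "after_softmax":
                    row = softmax(row)
                rows.append(row)
            heads.append(rows)
        out.append(heads)
    return out
```

With two query heads sharing one key head, both modes return a buffer with the same (B, N, S, T) layout; only the values differ, and in the post-softmax case each row of length T sums to 1.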