onnxruntime
cd5f91fe - [CPU] GQA supports head_sink input for smooth softmax (#25269)

### Description

This is an extension of the [Smooth Softmax](https://github.com/microsoft/onnxruntime/pull/21867) feature. The difference is that each head now has a learnable smooth factor that is added to the denominator of the softmax. The smooth factor acts like one extra element that joins the softmax:

```math
\mathrm{softmax}_{i} = \frac{\exp(x_{i})}{\exp(s) + \sum_{j} \exp(x_{j})}
```

The `head_sink` input is a float tensor whose length equals the number of attention heads. For the h-th head, `head_sink[h]` is used as the smooth factor s. When `head_sink` is not provided, the constant 0 is used as the smooth factor.

Changes:
- [x] Update the operator spec to add an optional new input `head_sink`.
- [x] Implement the CPU (MLAS) kernel.
- [x] Update test_gqa_cpu.py to test it.

The CUDA kernel will be updated later in a separate PR.
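For illustration, here is a minimal NumPy sketch of the formula above, assuming attention scores shaped `(num_heads, seq_len, total_seq_len)`; the function name `smooth_softmax` and the max-subtraction stability trick are illustrative, not the actual MLAS kernel API:

```python
import numpy as np


def smooth_softmax(scores: np.ndarray, head_sink: np.ndarray | None = None) -> np.ndarray:
    """Reference smooth softmax over the last axis (hypothetical helper, not the ORT kernel).

    scores:    attention scores of shape (num_heads, seq_len, total_seq_len)
    head_sink: optional per-head smooth factors of shape (num_heads,);
               when absent, a constant 0 is used, matching the spec.
    """
    num_heads = scores.shape[0]
    s = np.zeros(num_heads, dtype=scores.dtype) if head_sink is None else head_sink.astype(scores.dtype)
    # Subtract the running max (including the sink value) for numerical stability.
    m = np.maximum(scores.max(axis=-1, keepdims=True), s[:, None, None])
    exp_scores = np.exp(scores - m)
    # The sink joins the denominator as one extra (virtual) element per head.
    denom = np.exp(s[:, None, None] - m) + exp_scores.sum(axis=-1, keepdims=True)
    return exp_scores / denom
```

Note that, unlike a standard softmax, each row sums to slightly less than 1; the remaining probability mass is absorbed by the sink element.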