onnxruntime
93d31cfd - Support softcap and softmax_precision in Attention(CUDA) (#27714)

Support softcap and softmax_precision in Attention(CUDA) (#27714)

Fixes #27712. This pull request improves support and validation for the `softcap` and `softmax_precision` attributes in the CUDA Attention operator, updates kernel eligibility and fallback logic, and expands test coverage for these features. The changes ensure that only valid values are accepted, propagate the new parameters to eligible kernels, and clarify backend capabilities in code comments and tests.

**CUDA Attention operator improvements:**

* Added validation to enforce that `softcap` is non-negative and that `softmax_precision` is one of the supported TensorProto types (0, 1, 10, or 16).
* Updated code comments and eligibility checks to clarify that `softcap` is now supported natively in the Flash and Memory Efficient Attention (MEA) kernels, and that `softmax_precision` is inherently satisfied (softmax is always computed in FP32 on CUDA). [[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL174-R183) [[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL548-R556) [[3]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL824-R834)
* Propagated the `softcap` parameter to the MEA kernel invocation to enable native support. [[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR696) [[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR746)
* Modified fallback and rejection logic: unfused attention now explicitly rejects `softcap` with a clear error message, while `softmax_precision` is always considered satisfied. [[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL1096-R1110) [[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR1179-R1186)

**Testing improvements:**

* Added a new test verifying that `softmax_precision=1` (FLOAT) produces results identical to the default, since all CUDA backends compute softmax in FP32.
* Clarified in existing softcap-related tests that certain configurations are not supported by CUDA unfused attention and require Flash or MEA; updated test comments for clarity. [[1]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777R1088-R1089) [[2]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777R1118-R1119)
* Expanded Python test cases for GQA (grouped-query attention) to include nonzero `softcap` values, increasing coverage of this feature. [[1]](diffhunk://#diff-8795174e6967f83c53fcd5de6d7bfe55782a1ae05cf720378b33b7a7c4cee7dcL613-R613) [[2]](diffhunk://#diff-8795174e6967f83c53fcd5de6d7bfe55782a1ae05cf720378b33b7a7c4cee7dcL648-R648)

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
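For readers unfamiliar with the attribute, here is a minimal NumPy sketch of what softcapping does to attention logits before softmax. This is the commonly used `softcap * tanh(scores / softcap)` formulation (as in Gemma-2-style attention); the function name and shapes are illustrative, not onnxruntime symbols:

```python
import numpy as np

def softcapped_scores(scores: np.ndarray, softcap: float) -> np.ndarray:
    """Illustrative sketch: squash attention logits through tanh so they
    lie strictly inside (-softcap, softcap) before the softmax."""
    if softcap > 0.0:
        return softcap * np.tanh(scores / softcap)
    return scores  # softcap == 0 conventionally means "no capping"

scores = np.array([[-100.0, 0.0, 100.0]])
capped = softcapped_scores(scores, softcap=30.0)
# every capped logit now lies strictly inside (-30, 30); 0 maps to 0
```

Because tanh is bounded, softcapping prevents a few extreme logits from saturating the softmax, which is why fused kernels (Flash/MEA) apply it inline rather than leaving it to a separate pass.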
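The validation rules described above can be sketched as follows. This is a hypothetical Python rendering of the checks (the real implementation is C++ inside the CUDA Attention operator); the function and constant names are invented for illustration, while the accepted values 0, 1, 10, and 16 are the TensorProto data-type codes for UNDEFINED, FLOAT, FLOAT16, and BFLOAT16:

```python
# Hypothetical sketch of the attribute validation; names are illustrative,
# not actual onnxruntime symbols.
VALID_SOFTMAX_PRECISIONS = {0, 1, 10, 16}  # UNDEFINED, FLOAT, FLOAT16, BFLOAT16

def validate_attention_attrs(softcap: float, softmax_precision: int) -> None:
    # softcap must be non-negative; 0 disables capping.
    if softcap < 0.0:
        raise ValueError("softcap must be non-negative")
    # softmax_precision must be one of the supported TensorProto types.
    if softmax_precision not in VALID_SOFTMAX_PRECISIONS:
        raise ValueError(
            f"softmax_precision must be one of {sorted(VALID_SOFTMAX_PRECISIONS)}"
        )
```

Note that on CUDA the `softmax_precision` request is trivially satisfied regardless of the accepted value, because every backend already computes softmax in FP32.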