Support softcap and softmax_precision in Attention(CUDA) (#27714)
Fix #27712
This pull request improves support and validation for the `softcap` and
`softmax_precision` attributes in the CUDA Attention operator, updates
kernel eligibility and fallback logic, and enhances test coverage for
these features. The changes ensure that only valid values are accepted,
propagate new parameters to eligible kernels, and clarify backend
capabilities in code comments and tests.
**CUDA Attention operator improvements:**
* Added validation to enforce that `softcap` is non-negative and that
`softmax_precision` is one of the supported TensorProto types (0 =
UNDEFINED, 1 = FLOAT, 10 = FLOAT16, or 16 = BFLOAT16).
* Updated code comments and eligibility checks to clarify that `softcap`
is now supported natively in the Flash and Memory Efficient Attention
(MEA) kernels, and that `softmax_precision` is inherently satisfied
because softmax is always computed in FP32 on CUDA.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL174-R183)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL548-R556)
[[3]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL824-R834)
* Propagated the `softcap` parameter to the MEA kernel invocation to
enable native support.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR696)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR746)
* Modified fallback and rejection logic: unfused attention now
explicitly rejects `softcap` with a clear error message, while
`softmax_precision` is always considered satisfied.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL1096-R1110)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR1179-R1186)
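The attribute validation described above can be sketched in Python (the actual check is implemented in the C++ CUDA operator; the function name here is illustrative):

```python
# Illustrative sketch of the attribute validation added to the CUDA
# Attention operator. The real implementation is C++; this function
# name is hypothetical.

# Supported TensorProto data-type enum values for softmax_precision:
# 0 = UNDEFINED, 1 = FLOAT, 10 = FLOAT16, 16 = BFLOAT16
SUPPORTED_SOFTMAX_PRECISIONS = {0, 1, 10, 16}


def validate_attention_attributes(softcap: float, softmax_precision: int) -> None:
    """Raise if the attributes fall outside the accepted ranges."""
    if softcap < 0:
        raise ValueError(f"softcap must be non-negative, got {softcap}")
    if softmax_precision not in SUPPORTED_SOFTMAX_PRECISIONS:
        raise ValueError(
            f"softmax_precision must be one of "
            f"{sorted(SUPPORTED_SOFTMAX_PRECISIONS)}, got {softmax_precision}"
        )
```

On CUDA the precision request is then trivially satisfied, since softmax is computed in FP32 regardless of the requested type.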
**Testing improvements:**
* Added a new test to verify that `softmax_precision=1` (FLOAT) produces
identical results to the default, since all CUDA backends compute
softmax in FP32.
* Updated comments in existing softcap-related tests to note that
certain configurations are not supported by CUDA unfused attention and
require the Flash or MEA kernels.
[[1]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777R1088-R1089)
[[2]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777R1118-R1119)
* Expanded Python test cases for GQA (grouped-query attention) to
include nonzero `softcap` values, increasing coverage of this feature.
[[1]](diffhunk://#diff-8795174e6967f83c53fcd5de6d7bfe55782a1ae05cf720378b33b7a7c4cee7dcL613-R613)
[[2]](diffhunk://#diff-8795174e6967f83c53fcd5de6d7bfe55782a1ae05cf720378b33b7a7c4cee7dcL648-R648)
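The behaviors these tests exercise can be sketched in NumPy. This assumes the common softcap definition `softcap * tanh(scores / softcap)`; it is an illustration of the semantics, not the kernel code:

```python
import numpy as np


def apply_softcap(scores: np.ndarray, softcap: float) -> np.ndarray:
    # Assumed softcap definition: bounds raw attention scores to
    # (-softcap, softcap) before softmax. softcap == 0 disables it.
    if softcap > 0:
        return softcap * np.tanh(scores / softcap)
    return scores


def softmax_fp32(scores: np.ndarray) -> np.ndarray:
    # Mirrors the CUDA behavior described above: softmax is accumulated
    # in FP32 even for FP16/BF16 inputs, so requesting
    # softmax_precision=1 (FLOAT) changes nothing.
    x = scores.astype(np.float32)
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Because the FP32 upcast happens unconditionally, comparing `softmax_precision=1` against the default should yield identical outputs, which is what the new test checks.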
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>