onnxruntime
14a6c9e9 - Fix GroupQueryAttention right-padded rotary prefill CUDA test (#29218)

Commit
8 days ago
Fix GroupQueryAttention right-padded rotary prefill CUDA test (#29218)

### Description

The `GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA` test (added in #29002) fed **fp32** inputs via `AddInput<float>`. The CUDA (and WebGPU) GroupQueryAttention kernels only register for `MLFloat16`/`BFloat16`, so the fp32 node silently fell back to the **CPU EP**. As a result, the `_CUDA` test never actually exercised the CUDA kernel it is named for. This surfaced as a CI failure on the CUDA test leg after #29002 and #29046 merged.

This PR makes `RunGQAPackedQKVRotaryPrefill` feed **fp16** tensors when targeting CUDA EP, matching the existing `RunGQASharedKVFp16` convention and the test's own "loose enough for fp16 rounding" tolerance. The CPU code path is unchanged.

### Key Changes

- `RunGQAPackedQKVRotaryPrefill` now branches on the target EP:
  - CUDA EP: inputs/outputs use `MLFloat16` (converted via `ToFloat16`), so the node is placed on the real GPU kernel.
  - WebGPU/CPU EP: unchanged (`float`).
- Output is converted back to `float` for the existing comparison logic.

### Testing

- `onnxruntime_provider_test --gtest_filter='GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA'` → **PASSED** (now runs on the CUDA fp16 kernel).
- Full `GroupQueryAttentionTest.*` suite → 47 passed, WebGPU-only tests skipped locally (no WebGPU EP), no regressions.

### Motivation and Context

Restores genuine CUDA kernel coverage for the right-padded rotary prefill scenario and fixes the CI failure. Related: #29002, #29046.
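The fallback described above can be illustrated with a toy model. This is a minimal sketch, not onnxruntime's actual kernel registry or graph partitioner: `ElemType`, `KernelRegistry`, `MakeToyRegistry`, `AssignProvider`, and `InputTypeFor` are all hypothetical names, and the per-EP type sets are assumptions taken from the description (CUDA registers only fp16/bf16, CPU handles fp32).

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model only: these types do not exist in onnxruntime.
enum class ElemType { kFloat, kFloat16, kBFloat16 };

// Maps an execution provider name to the element types its kernel accepts.
using KernelRegistry = std::map<std::string, std::set<ElemType>>;

// Assumed registrations, following the commit description.
KernelRegistry MakeToyRegistry() {
  return {
      {"CUDA", {ElemType::kFloat16, ElemType::kBFloat16}},
      {"CPU", {ElemType::kFloat}},
  };
}

// Returns the first provider, in priority order, that has a kernel
// registered for `type`. An unsupported type is skipped without any error,
// which is how an fp32 node can quietly end up on a lower-priority provider.
std::string AssignProvider(const KernelRegistry& registry,
                           const std::vector<std::string>& priority,
                           ElemType type) {
  for (const auto& ep : priority) {
    const auto it = registry.find(ep);
    if (it != registry.end() && it->second.count(type) > 0) {
      return ep;
    }
  }
  return "unassigned";
}

// Hypothetical version of the branch added to the test helper:
// fp16 inputs when targeting CUDA, fp32 otherwise.
ElemType InputTypeFor(const std::string& target_ep) {
  return target_ep == "CUDA" ? ElemType::kFloat16 : ElemType::kFloat;
}
```

In this model, calling `AssignProvider(MakeToyRegistry(), {"CUDA", "CPU"}, ElemType::kFloat)` yields `"CPU"`, mirroring the old test's behavior, while passing `InputTypeFor("CUDA")` yields `"CUDA"`, which is the placement the fix is meant to achieve.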