Fix GroupQueryAttention right-padded rotary prefill CUDA test (#29218)
### Description
The `GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA` test
(added in #29002) fed **fp32** inputs via `AddInput<float>`. The CUDA
(and WebGPU) GroupQueryAttention kernels only register for
`MLFloat16`/`BFloat16`, so the fp32 node silently fell back to the **CPU
EP**, and the `_CUDA` test never actually exercised the CUDA kernel it is
named for. This surfaced as a CI failure on the CUDA test leg after
#29002 and #29046 merged.
This PR makes `RunGQAPackedQKVRotaryPrefill` feed **fp16** tensors when
targeting the CUDA EP, matching the existing `RunGQASharedKVFp16` convention
and the test's own "loose enough for fp16 rounding" tolerance. The CPU
code path is unchanged.
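
For reviewers unfamiliar with the fallback described above, here is a
minimal sketch of the idea. It is a simplified, hypothetical model (the
names `KernelRegistry`, `ResolveEp`, and `MakeGqaRegistry` are
illustrative and are not ONNX Runtime APIs); it only assumes the
registrations stated in this description, namely that the CUDA kernel
handles fp16/bf16 and that the CPU EP picked up the fp32 node.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified model of kernel lookup; not ONNX Runtime's API.
enum class DType { kFloat, kFloat16, kBFloat16 };

// Maps an execution provider name to the element types it has kernels for.
using KernelRegistry = std::map<std::string, std::set<DType>>;

// Returns the first EP in priority order that has a kernel for `dtype`.
// In this model, "CPU" is returned unconditionally as the last resort.
std::string ResolveEp(const KernelRegistry& registry,
                      const std::vector<std::string>& priority,
                      DType dtype) {
  for (const auto& ep : priority) {
    const auto it = registry.find(ep);
    if (it != registry.end() && it->second.count(dtype) > 0) return ep;
  }
  return "CPU";
}

// Registrations as described in this PR (assumed, not read from the source).
KernelRegistry MakeGqaRegistry() {
  return {
      {"CUDA", {DType::kFloat16, DType::kBFloat16}},
      {"CPU", {DType::kFloat}},
  };
}
```

Under this model, `ResolveEp(MakeGqaRegistry(), {"CUDA", "CPU"},
DType::kFloat)` resolves to `"CPU"`, while the same call with
`DType::kFloat16` resolves to `"CUDA"`, which is the behavior the fix
relies on.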
### Key Changes
- `RunGQAPackedQKVRotaryPrefill` now branches on the target EP:
  - CUDA EP: inputs/outputs use `MLFloat16` (converted via `ToFloat16`),
    so the node is placed on the real GPU kernel.
  - WebGPU/CPU EP: unchanged (`float`).
- Output is converted back to `float` for the existing comparison logic.
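
The branch above can be pictured with the following rough sketch. It
does not use the real test helpers: `RoundToHalfPrecision`, `TargetEp`,
and `PrepareInputs` are hypothetical stand-ins, and the rounding
function only emulates fp16's 11 significant bits on a `float` (range
limits and subnormals are ignored), whereas the actual test converts to
`MLFloat16` via `ToFloat16`. It is meant to show why the comparison
needs a tolerance "loose enough for fp16 rounding".

```cpp
#include <cmath>
#include <vector>

// Emulates fp16 precision on a float by keeping 11 significant bits
// (10 stored + 1 implicit). Overflow and subnormals are NOT modeled.
float RoundToHalfPrecision(float x) {
  if (x == 0.0f || !std::isfinite(x)) return x;
  int exp = 0;
  const float mantissa = std::frexp(x, &exp);     // |mantissa| in [0.5, 1)
  const float scaled = std::ldexp(mantissa, 11);  // |scaled| in [1024, 2048)
  return std::ldexp(std::nearbyint(scaled), exp - 11);
}

// Hypothetical stand-in for the EP the test helper is targeting.
enum class TargetEp { kCpu, kCuda };

// Narrows inputs only when targeting CUDA; other EPs keep fp32 values.
std::vector<float> PrepareInputs(const std::vector<float>& data, TargetEp ep) {
  if (ep != TargetEp::kCuda) return data;
  std::vector<float> narrowed;
  narrowed.reserve(data.size());
  for (float v : data) narrowed.push_back(RoundToHalfPrecision(v));
  return narrowed;
}
```

With this emulation each narrowed value differs from its fp32 source by
a relative error of at most 2^-11, so a value such as `0.1f` changes
slightly on the CUDA path while exactly representable values such as
`1.0f` do not.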
### Testing
- `onnxruntime_provider_test
--gtest_filter='GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA'`
→ **PASSED** (now runs on the CUDA fp16 kernel).
- Full `GroupQueryAttentionTest.*` suite → 47 passed, WebGPU-only tests
skipped locally (no WebGPU EP), no regressions.
### Motivation and Context
Restores genuine CUDA kernel coverage for the right-padded rotary
prefill scenario and fixes the CI failure. Related: #29002, #29046.