Fix CUDA EP: add opset 24 kernel registrations for Reshape and Cast
ONNX opset 24 bumped Reshape and Cast (both added the float8e8m0 type).
ORT's CUDA EP only had registrations up to opset 23, so both ops fell
back to CPUExecutionProvider on opset 24 models. The fallback inserted
~280 MemcpyFromHost/MemcpyToHost nodes that cascaded through the entire
attention pipeline.
Fix: Version the existing opset 23 registrations to (23, 23) and add
new non-versioned opset 24 registrations for both Reshape and Cast. The
kernel implementations are unchanged; only the registration metadata
is updated.
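
Why a registration that starts at opset 23 does not cover opset 24 can
be sketched with a toy registry. This is a simplified model, not ORT
code: KernelReg, Matches, HasKernel and the two Registry* helpers are
hypothetical names, and the matching rule (an open-ended registration
matches only its exact since-version) is an assumption about how kernel
lookup behaves.

```cpp
#include <cassert>
#include <climits>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a kernel registration: an op name plus the
// opset range [start, end] it claims. end == INT_MAX means "non-versioned".
struct KernelReg {
  std::string op;
  int start;
  int end;
  KernelReg(std::string o, int s, int e = INT_MAX)
      : op(std::move(o)), start(s), end(e) {
    assert(start <= end);
  }
};

// Assumed rule: a non-versioned registration matches only the exact
// since-version of the node's schema; a versioned registration matches
// any since-version inside its closed range.
bool Matches(const KernelReg& k, const std::string& op, int since_version) {
  if (k.op != op) return false;
  if (k.end == INT_MAX) return k.start == since_version;
  return k.start <= since_version && since_version <= k.end;
}

bool HasKernel(const std::vector<KernelReg>& registry, const std::string& op,
               int since_version) {
  for (const auto& k : registry) {
    if (Matches(k, op, since_version)) return true;
  }
  return false;
}

// Before the fix: only non-versioned opset 23 registrations.
std::vector<KernelReg> RegistryBeforeFix() {
  return {KernelReg("Reshape", 23), KernelReg("Cast", 23)};
}

// After the fix: opset 23 pinned to (23, 23), plus non-versioned opset 24.
std::vector<KernelReg> RegistryAfterFix() {
  return {KernelReg("Reshape", 23, 23), KernelReg("Reshape", 24),
          KernelReg("Cast", 23, 23), KernelReg("Cast", 24)};
}
```

Under this model, an opset 24 node finds no kernel in RegistryBeforeFix()
and would be assigned to another provider, while RegistryAfterFix()
serves both opset 23 and opset 24 nodes.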
Also fix CUTLASS FMHA BiasLoader alignment: use kAlignmentA instead
of hardcoded 128-bit loads so the unaligned kernel path works with
bias strides that are multiples of 4 elements (not 8).
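
The alignment arithmetic behind this can be sketched as follows. The
helpers are hypothetical and assume a 16-bit element type and an
aligned base pointer: a 128-bit load then covers 8 elements, so a row
stride that is a multiple of 4 but not of 8 leaves some rows
misaligned for it.

```cpp
#include <cassert>
#include <cstddef>

// True when a loader issuing loads of load_elems elements can start at
// every row, given the row stride in elements (base assumed aligned).
bool RowsAlignedFor(std::size_t stride_elems, int load_elems) {
  assert(load_elems > 0);
  return stride_elems % static_cast<std::size_t>(load_elems) == 0;
}

// Widest power-of-two load (in elements) that keeps every row start
// aligned, capped at max_bits per load.
int MaxAlignedElems(std::size_t stride_elems, int elem_bits,
                    int max_bits = 128) {
  assert(elem_bits > 0 && max_bits >= elem_bits);
  int width = max_bits / elem_bits;
  while (width > 1 && !RowsAlignedFor(stride_elems, width)) {
    width /= 2;
  }
  return width;
}
```

For example, with a stride of 4100 fp16 elements, RowsAlignedFor(4100, 8)
is false but RowsAlignedFor(4100, 4) is true, which is the case a
hardcoded 128-bit loader cannot handle.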
Also fix MEA dispatch: skip MEA when head_size != v_head_size in GQA
mode (LaunchUngroup and LaunchConcatNewToPastKV require matching dims).
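
A minimal sketch of the head-size guard, assuming a hypothetical
AttnShape struct and function name; the real dispatch logic checks
more conditions than this one.

```cpp
#include <cassert>

// Hypothetical subset of the attention parameters relevant to the guard.
struct AttnShape {
  int head_size;    // per-head dimension of Q and K
  int v_head_size;  // per-head dimension of V
  bool is_gqa;      // grouped-query attention mode
};

// Returns false only for the case described above: GQA mode with
// mismatched Q/K and V head sizes. Other shapes pass this guard.
bool MeaAllowedByHeadSizes(const AttnShape& s) {
  assert(s.head_size > 0 && s.v_head_size > 0);
  if (s.is_gqa && s.head_size != s.v_head_size) return false;
  return true;
}
```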
Result: memcpy nodes drop from 282 to 4 for the Gemma4 opset 24 model
on the CUDA EP.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>