onnxruntime
0d0f436c - Fix CUDA EP: add opset 24 kernel registrations for Reshape and Cast

Commit

83 days ago

Fix CUDA EP: add opset 24 kernel registrations for Reshape and Cast

ONNX opset 24 bumped Reshape (added float8e8m0 type) and Cast (added float8e8m0 type). ORT's CUDA EP only had registrations up to opset 23, causing these ops to fall back to CPUExecutionProvider on opset 24 models. This produced ~280 MemcpyFromHost/MemcpyToHost nodes that cascaded through the entire attention pipeline.

Fix: Version existing opset 23 registrations to (23, 23) and add new non-versioned opset 24 registrations for both Reshape and Cast. The kernel implementations are unchanged; only the registration metadata is updated.

Also fix CUTLASS FMHA BiasLoader alignment: use kAlignmentA instead of hardcoded 128-bit loads so the unaligned kernel path works with bias strides that are multiples of 4 elements (not 8).

Also fix MEA dispatch: skip MEA when head_size != v_head_size in GQA mode (LaunchUngroup and LaunchConcatNewToPastKV require matching dims).

Result: 282 memcpy → 4 memcpy for Gemma4 opset 24 CUDA EP model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
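The registration change can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not ORT code: `KernelReg`, `matches`, and `has_kernel` are hypothetical names, and the matching rule (a kernel matches when its start version equals the node's since-version, or when a closed version range covers it) is an assumption about how the kernel lookup behaves, chosen because it is consistent with the fallback described in the commit message.

```python
import sys
from dataclasses import dataclass

# Hypothetical stand-in for "no end version", i.e. a non-versioned registration.
OPEN_ENDED = sys.maxsize


@dataclass(frozen=True)
class KernelReg:
    """Toy kernel registration: op name plus an inclusive opset range."""
    op: str
    start: int
    end: int = OPEN_ENDED


def matches(reg: KernelReg, op: str, node_since_version: int) -> bool:
    """Assumed rule: exact start-version match, or a closed range covering the node."""
    if reg.op != op:
        return False
    if reg.start == node_since_version:
        return True
    return (
        reg.start < node_since_version
        and reg.end != OPEN_ENDED
        and reg.end >= node_since_version
    )


def has_kernel(registry: list[KernelReg], op: str, node_since_version: int) -> bool:
    """True if any registration in the toy registry can serve this node."""
    return any(matches(reg, op, node_since_version) for reg in registry)


# Before the fix: the newest registrations are non-versioned, starting at opset 23.
before = [KernelReg("Reshape", 23), KernelReg("Cast", 23)]

# After the fix: opset 23 is closed as (23, 23) and opset 24 is non-versioned.
after = [
    KernelReg("Reshape", 23, 23), KernelReg("Reshape", 24),
    KernelReg("Cast", 23, 23), KernelReg("Cast", 24),
]

print(has_kernel(before, "Reshape", 24))  # False: the node would fall back to another EP
print(has_kernel(after, "Reshape", 24))   # True
print(has_kernel(after, "Cast", 23))      # True: opset 23 models keep working
```

Under this assumed rule, a non-versioned registration starting at 23 does not cover a node whose schema was introduced in opset 24, which is why closing the old range and adding a new registration is enough without touching the kernel implementations.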

References

fix-attention-head-size-mismatch

Author

justinchuby

Committer

justinchuby

Parents

a1aa3bbf
