onnxruntime
93cc52b1 - Fix CUDA EP: add opset 24 kernel registrations + CUTLASS BiasLoader alignment

Commit

71 days ago

Fix CUDA EP: add opset 24 kernel registrations + CUTLASS BiasLoader alignment

Two fixes for CUDA EP models using ONNX opset 24:

1. Add opset 24 CUDA kernel registrations for Reshape and Cast.

   ONNX opset 24 bumped these ops (added float8e8m0 type support). Without opset 24 registrations, these ops fall to CPUExecutionProvider, producing ~280 MemcpyFromHost/MemcpyToHost nodes that cascade through the entire model.

   Version existing opset 23 registrations to (23, 23) and add new non-versioned opset 24 registrations. Same kernel code.

   Result: 282 memcpy -> 4 memcpy for opset 24 models.

2. Fix CUTLASS FMHA BiasLoader vectorized load alignment.

   BiasLoader hardcoded 128-bit (8 fp16 element) vectorized loads via `ElementsPerAccess = 128 / sizeof_bits<scalar_t>` regardless of the isAligned template parameter. When the attention bias stride (total_sequence_length) was not a multiple of 8 elements, the unaligned kernel was selected but still used 128-bit loads on the bias, causing cudaErrorMisalignedAddress.

   Fix: Use kAlignmentA (which is kMinimumAlignment=4 for the unaligned path, kAlignmentA=8 for the aligned path) as BiasLoader's ElementsPerAccess. This allows the unaligned kernel to use 64-bit loads for the bias while the aligned kernel continues with 128-bit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
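The alignment arithmetic behind the second fix can be sketched in plain C++. This is a simplified model, not the CUTLASS or ONNX Runtime code: `ElementsPerAccess` and `RowStartIsAligned` are hypothetical helpers introduced here for illustration, and the model assumes the bias base pointer is itself aligned and that rows are laid out contiguously with a stride of total_sequence_length fp16 elements.

```cpp
#include <cstdint>

// Hypothetical helper (not a CUTLASS API): number of elements covered by
// one vectorized load that is `access_bits` wide.
constexpr int ElementsPerAccess(int access_bits, int element_bits) {
  return access_bits / element_bits;
}

// Hypothetical helper: true if the first element of `row` sits on an
// `access_bytes` boundary. Assumes the base pointer is already aligned.
constexpr bool RowStartIsAligned(std::int64_t row,
                                 std::int64_t stride_elements,
                                 int element_bytes,
                                 int access_bytes) {
  return (row * stride_elements * element_bytes) % access_bytes == 0;
}

constexpr int kFp16Bits = 16;
constexpr int kFp16Bytes = 2;

// 128-bit loads cover 8 fp16 elements; 64-bit loads cover 4.
static_assert(ElementsPerAccess(128, kFp16Bits) == 8, "128-bit = 8 fp16");
static_assert(ElementsPerAccess(64, kFp16Bits) == 4, "64-bit = 4 fp16");

// Example stride of 12 elements (a multiple of 4 but not of 8):
// row 1 starts at byte offset 24, which is not on a 16-byte boundary,
// so a 128-bit load there would be misaligned.
static_assert(!RowStartIsAligned(1, 12, kFp16Bytes, 16), "128-bit misaligned");

// The same row start is on an 8-byte boundary, so a 64-bit load is fine.
static_assert(RowStartIsAligned(1, 12, kFp16Bytes, 8), "64-bit aligned");

// A stride that is a multiple of 8 elements keeps 128-bit loads aligned.
static_assert(RowStartIsAligned(1, 16, kFp16Bytes, 16), "128-bit aligned");
```

In this model, a stride of 12 breaks 128-bit loads from the second row onward but remains safe for 64-bit loads, which matches the commit's description of letting the unaligned kernel use the narrower access width for the bias.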

References

fix-cuda-opset24-registrations

Author

justinchuby

Parents

07b8f395
