Fix CUDA EP: add opset 24 kernel registrations + CUTLASS BiasLoader alignment
Two fixes for CUDA EP models using ONNX opset 24:
1. Add opset 24 CUDA kernel registrations for Reshape and Cast.
ONNX opset 24 bumped these ops (added float8e8m0 type support).
Without opset 24 registrations, these ops fall back to
CPUExecutionProvider, producing ~280 MemcpyFromHost/MemcpyToHost nodes
that cascade through the entire model. Convert the existing opset 23
registrations to versioned (23, 23) registrations and add new
non-versioned opset 24 registrations that reuse the same kernel code.
Result: 282 memcpy -> 4 memcpy for opset 24 models.
2. Fix CUTLASS FMHA BiasLoader vectorized load alignment.
BiasLoader hardcoded 128-bit (8 fp16 elements) vectorized loads via
`ElementsPerAccess = 128 / sizeof_bits<scalar_t>` regardless of the
isAligned template parameter. When the attention bias stride
(total_sequence_length) was not a multiple of 8 elements, the
unaligned kernel was selected but still used 128-bit loads on the
bias, causing cudaErrorMisalignedAddress.
Fix: Use kAlignmentA as BiasLoader's ElementsPerAccess. kAlignmentA
equals kMinimumAlignment (4) on the unaligned path and 8 on the
aligned path, so for fp16 the unaligned kernel now uses 64-bit loads
for the bias while the aligned kernel continues with 128-bit loads.
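The alignment arithmetic behind fix 2 can be checked with a small
host-side sketch. This is not the CUTLASS code: `ElementsPerAccess`,
`RowsAligned`, and the stride of 12 are hypothetical names and values
chosen for illustration, and the base pointer is assumed to be aligned.

```cpp
#include <cassert>
#include <cstdint>

// Elements moved by one vectorized access of `access_bits` bits.
constexpr int ElementsPerAccess(int access_bits, int element_bits) {
  return access_bits / element_bits;
}

// With an aligned base pointer, row r of the bias starts at element
// r * stride, so every row start is aligned to the access width only
// if the stride is a multiple of the elements per access.
constexpr bool RowsAligned(std::int64_t row_stride_elements,
                           int elements_per_access) {
  return row_stride_elements % elements_per_access == 0;
}

constexpr int kFp16Bits = 16;
// Width that was hardcoded before the fix: 128 / 16 = 8 elements.
constexpr int kOldElements = ElementsPerAccess(128, kFp16Bits);
// Width on the unaligned path after the fix (kMinimumAlignment = 4).
constexpr int kUnalignedElements = 4;

static_assert(kOldElements == 8, "128-bit access holds 8 fp16 values");
// Hypothetical bias stride of 12: not a multiple of 8, so 128-bit row
// loads would be misaligned, while 64-bit (4-element) loads are fine.
static_assert(!RowsAligned(12, kOldElements), "128-bit loads misalign");
static_assert(RowsAligned(12, kUnalignedElements), "64-bit loads align");
```

A stride that is a multiple of 8 satisfies both checks, which is why
the aligned kernel can keep its 128-bit loads.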
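For fix 1, the following is a simplified model of why an open-ended
opset 23 registration does not cover an opset 24 node. The matching
rule is an assumption based on how the kernel registry is understood
to compare versions (exact match on the start version, or a bounded
range containing the node's version). `KernelVersionRange`, `Matches`,
and `AnyMatches` are hypothetical names, not ONNX Runtime APIs.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Version range of one kernel registration. end == INT_MAX models a
// non-versioned (open-ended) registration.
struct KernelVersionRange {
  int start;
  int end;
};

// Assumed matching rule: the start version equals the node's schema
// version, or the registration is a bounded range that contains it.
bool Matches(const KernelVersionRange& k, int node_since_version) {
  if (k.start == node_since_version) return true;
  return k.start < node_since_version && k.end != INT_MAX &&
         k.end >= node_since_version;
}

bool AnyMatches(const std::vector<KernelVersionRange>& kernels,
                int node_since_version) {
  for (const KernelVersionRange& k : kernels) {
    if (Matches(k, node_since_version)) return true;
  }
  return false;
}

// Before this change: a single open-ended opset 23 registration.
const std::vector<KernelVersionRange> kBefore = {{23, INT_MAX}};
// After this change: versioned (23, 23) plus open-ended opset 24.
const std::vector<KernelVersionRange> kAfter = {{23, 23}, {24, INT_MAX}};
```

Under this model, `AnyMatches(kBefore, 24)` is false, so the node has
no CUDA kernel and is assigned to the CPU provider, while `kAfter`
matches both opset 23 and opset 24 nodes.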
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>