Fix int32 overflow in CUDA Cast and UnaryElementWise kernels for tensors with >2^31 elements (#28386)
- [x] Fix `unary_elementwise_impl.cuh`: Change `CUDA_LONG` to `int64_t`
for `N` parameter and loop index in `_UnaryElementWise` kernel, and fix
`blocksPerGrid` calculation
- [x] Fix `cast_op.cu`: Change `CUDA_LONG` to `int64_t` for `N`
parameter and loop index in `CastKernelStd`, `CastKernelSat`, and
`CudaCastPairwiseKernel` kernels, and remove `static_cast<int>`
truncation
- [x] Use `size_t` for `pair_count` in `CudaCastPairwise` to avoid double
conversion (review feedback)
- [x] Rename test to `CastKernelCorrectness_ModerateSize` and add
`CastKernel_Int64IndexArithmetic_NoOverflow` host-side test (review
feedback)
- [x] Merge from main to resolve conflicts with Float8E8M0 tests
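
For reviewers, here is a minimal host-side sketch of the index arithmetic this change is about. It is not the code from `cast_op.cu` or `unary_elementwise_impl.cuh`: the constants `kThreadsPerBlock` and `kElementsPerThread` and all helper names are hypothetical, chosen only for illustration. The 32-bit path is simulated with unsigned arithmetic, because signed overflow is undefined behavior in C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical launch parameters, for illustration only.
constexpr int kThreadsPerBlock = 256;
constexpr int kElementsPerThread = 4;
constexpr int64_t kPerBlock =
    static_cast<int64_t>(kThreadsPerBlock) * kElementsPerThread;  // 1024

// Ceil-divide in 64-bit so the block count stays correct when N > INT32_MAX.
constexpr int64_t BlocksPerGrid64(int64_t n) {
  return (n + kPerBlock - 1) / kPerBlock;
}

// First element index handled by a thread, computed entirely in 64-bit.
constexpr int64_t StartIndex64(int64_t block_idx, int64_t thread_idx) {
  return block_idx * kPerBlock + thread_idx;
}

// The same computation in 32-bit. Unsigned types are used so the wraparound
// is well defined; this only simulates what a 32-bit index would do.
constexpr uint32_t StartIndex32Wrapped(uint32_t block_idx, uint32_t thread_idx) {
  return block_idx * static_cast<uint32_t>(kPerBlock) + thread_idx;
}

// Narrowing an element count to 32 bits silently drops the high bits.
constexpr uint32_t TruncatedCount32(int64_t n) {
  return static_cast<uint32_t>(n);
}

// Compile-time checks with N = 2^31 + 5 elements.
constexpr int64_t kLargeN = (int64_t{1} << 31) + 5;
static_assert(BlocksPerGrid64(kLargeN) == 2097153, "2^21 full blocks plus one partial block");
static_assert(StartIndex64(2097152, 3) == kLargeN - 2, "index past INT32_MAX is preserved");
static_assert(StartIndex32Wrapped(4194304u, 3u) == 3u, "2^22 * 1024 wraps to 0 in 32-bit");
static_assert(TruncatedCount32((int64_t{1} << 32) + 7) == 7u, "high bits lost on narrowing");
```

The `static_assert` lines show the intent of the fix: with 64-bit `N` and loop index, both the block count and the per-thread start index remain exact past 2^31 elements, whereas 32-bit arithmetic wraps or truncates.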
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>