[CUDA] Extend Pad support through opset 25 with wrap mode (#27774)
### Description
This PR consolidates PRs #27416 and #27708 to extend CUDA Pad kernel
support through opset 25, including wrap mode implementation.
### Motivation and Context
The CUDA execution provider previously registered the Pad kernel only up
to opset 18 and did not implement wrap mode. When an ONNX model exported
with opset 19 or later ran on the CUDA execution provider, the Pad node
fell back to the CPU execution provider, resulting in significant
performance degradation. This PR aligns CUDA Pad registration with the
ONNX Pad schema evolution through opset 25 and provides a correct wrap
mode implementation.
Related issues: https://github.com/microsoft/onnxruntime/issues/26393
Related PRs: #27416, #27708
### Summary of Changes
#### Kernel registration and opset coverage
| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/tensor/pad.cc` | Adds CUDA Pad kernel registrations for opset ranges 18, 19-20, 21-22, 23, 24, and 25. |
| `onnxruntime/core/providers/cuda/cuda_execution_provider.cc` | Registers the new Pad kernel versions in the CUDA EP registry under the existing per-opset sections. |
#### CUDA Pad implementation
| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/tensor/pad_impl.h` | Extends the Pad kernel interface to pass effective sliced extents and per-axis input offsets. |
| `onnxruntime/core/providers/cuda/tensor/pad_impl.cu` | Adds CUDA wrap mode using a `WrapCoordinate` device helper with `if constexpr` compile-time specialization. Removes dead wrap code from the NCHW-specialized kernel path. |
| `onnxruntime/core/providers/cuda/tensor/pad.cc` | Computes effective sliced input extents/offsets for wrap behavior with negative pads. Bypasses the NCHW fast-path for wrap mode and routes through the generic implementation. |
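
The wrap-mode index mapping described above can be illustrated with a minimal host-side C++17 sketch. It is not the code in `pad_impl.cu`: only the name `WrapCoordinate` and the use of `if constexpr` come from this description, while `PadMode`, `MapToInput`, `Pad1D`, and their signatures are hypothetical names chosen for illustration. The sketch also assumes that negative pads first slice the input (yielding an effective extent and a start offset) and that positive pads are then applied to that sliced view.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mode enum for this sketch (not the ORT enum).
enum class PadMode { Edge, Wrap };

// Maps any integer coordinate into [0, extent). Requires extent > 0.
inline int64_t WrapCoordinate(int64_t coord, int64_t extent) {
  const int64_t r = coord % extent;  // C++ remainder takes the sign of coord
  return r < 0 ? r + extent : r;
}

// Maps one output index to an input index along a single axis.
// lower_pad is the non-negative begin pad; extent and offset describe the
// sliced input view (assumption: negative pads slice before padding).
template <PadMode Mode>
int64_t MapToInput(int64_t out_idx, int64_t lower_pad, int64_t extent,
                   int64_t offset) {
  const int64_t rel = out_idx - lower_pad;  // position relative to the view
  if constexpr (Mode == PadMode::Wrap) {
    return offset + WrapCoordinate(rel, extent);
  } else {
    // Edge: clamp to the first or last element of the view.
    const int64_t clamped = rel < 0 ? 0 : (rel >= extent ? extent - 1 : rel);
    return offset + clamped;
  }
}

// 1-D reference pad; begin/end may be negative. The sliced extent must be > 0.
template <PadMode Mode>
std::vector<int> Pad1D(const std::vector<int>& in, int64_t begin, int64_t end) {
  const int64_t n = static_cast<int64_t>(in.size());
  const int64_t offset = begin < 0 ? -begin : 0;             // elements cut at the front
  const int64_t extent = n - offset - (end < 0 ? -end : 0);  // effective sliced extent
  const int64_t lower = begin > 0 ? begin : 0;
  const int64_t upper = end > 0 ? end : 0;
  std::vector<int> out(static_cast<std::size_t>(lower + extent + upper));
  for (int64_t i = 0; i < static_cast<int64_t>(out.size()); ++i) {
    const int64_t src = MapToInput<Mode>(i, lower, extent, offset);
    out[static_cast<std::size_t>(i)] = in[static_cast<std::size_t>(src)];
  }
  return out;
}
```

Under these assumptions, `Pad1D<PadMode::Wrap>({0, 1, 2}, 2, 1)` yields `{1, 2, 0, 1, 2, 0}`, and with a negative begin pad `Pad1D<PadMode::Wrap>({0, 1, 2, 3, 4}, -1, 2)` yields `{1, 2, 3, 4, 1, 2}`. Selecting the mode through a template parameter means the branch is resolved at compile time, which is the motivation usually given for `if constexpr` in per-element kernels.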
#### Documentation
| File | Change |
|------|--------|
| `docs/OperatorKernels.md` | Updates the CUDA Pad kernel opset coverage to reflect the new version splits (25+, 24, 23, [21,22], [19,20], 18). |
#### Test coverage
| File | Change |
|------|--------|
| `onnxruntime/test/providers/cpu/tensor/pad_test.cc` | Adds CUDA-only Pad coverage for `edge` across opsets 18-25 and `wrap` across opsets 19-25. Updates the existing wrap test comment. |
### Checklist
- [x] Tests added/updated
- [x] No breaking changes
---------
Co-authored-by: Shirasawa <764798966@qq.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>