[Cuda Plugin] Refactor CUDA ops — Move shared CPU/CUDA helper code from .cc to headers (#27617)
## Description
This PR refactors several CPU operator helper functions by moving their
implementations from `.cc` files into `.h` headers, using the `#ifdef
SHARED_PROVIDER` / `#else` inline pattern. This is a prerequisite for
the **CUDA Plugin EP** work, where CUDA kernels are built into a
standalone shared library (`libonnxruntime_providers_cuda_plugin.so`)
that cannot link against the CPU provider's `.cc` object files.
### Why This Refactoring Is Needed
The CUDA Plugin EP compiles CUDA operator kernels into a separate shared
library that communicates with the ORT core through the ORT EP Plugin
API. In this architecture, kernel source files **cannot** depend on
framework-internal symbols that live in the CPU provider static library
(`libonnxruntime_providers.a`). Many CUDA kernels inherit from CPU base
classes and call shared helper/validation methods (e.g.,
`SliceBase::PrepareForCompute`, `SplitBase::PrepareForCompute`,
`ScatterND::ValidateShapes`, `TileOp::IsTileMemcpy`,
`PadBase::ComputePads`) whose implementations currently live in CPU
`.cc` files.
In the in-tree CUDA EP build (`SHARED_PROVIDER` mode), these helpers are
accessed through the `ProviderHostCPU` DLL-boundary virtual table
bridge. However, the plugin EP does not use this bridge — it uses EP API
adapters and force-included headers instead. To make these helpers
available in the plugin build without duplicating code, this PR moves
the implementations into headers as `inline` functions under `#ifndef
SHARED_PROVIDER` guards. The `SHARED_PROVIDER` (in-tree) build path
retains the existing declaration-only signatures that route through
`ProviderHostCPU`.
This pattern has already been successfully applied to other operators
(e.g., `Einsum`). This PR extends it to the remaining operators that
need it.
## Summary of Changes
### Helper functions moved from `.cc` to `.h` (inline under `#ifndef SHARED_PROVIDER`)

| Operator | File | Functions Moved |
|----------|------|-----------------|
| **Slice** | `cpu/tensor/slice.h` | `SliceBase::FlattenOutputDims`, `SliceBase::PrepareForCompute` (both overloads), `SliceBase::FillVectorsFromInput`, `slice_detail::CopyInputData<T>` |
| **Split** | `cpu/tensor/split.h` | `SplitBase::PrepareForCompute` |
| **ScatterND** | `cpu/tensor/scatter_nd.h` | `ScatterND::ValidateShapes` |
| **Tile** | `cpu/tensor/tile.h` | `TileOp::IsTileMemcpy` |
| **Pad** | `cpu/tensor/padbase.h` | `PadBase::ComputePadsImpl` (new template method replacing `ComputePads` for cross-context compatibility) |
| **BiasGelu** | `contrib_ops/cpu/bert/bias_gelu_helper.h` | `bias_gelu_helper::CheckInputs` (templatized on context type) |
| **EmbedLayerNorm** | `contrib_ops/cpu/bert/embed_layer_norm_helper.h` | `embed_layer_norm::CheckInputs` (templatized on context type) |
| **NonMaxSuppression** | `cpu/object_detection/non_max_suppression.h` + new `non_max_suppression_helper.h` | `NonMaxSuppressionBase` refactored into `NonMaxSuppressionBaseImpl<KernelInfoType, KernelContextType>` template for plugin compatibility |
### Deleted `.cc` files (implementations moved to headers)
- `contrib_ops/cpu/bert/bias_gelu_helper.cc`
- `contrib_ops/cpu/bert/embed_layer_norm_helper.cc`
### Provider bridge additions
- Added `Tensor::DataAsSpan<int32_t>()` support through the shared
provider interface (`provider_interfaces.h`, `provider_wrappedtypes.h`,
`provider_bridge_ort.cc`). This was needed because
`slice_detail::CopyInputData<int32_t>` calls
`Tensor::DataAsSpan<int32_t>()`, which was not previously bridged.
### CUDA-side updates
- `cuda/tensor/slice.h`: Updated `Slice` constructor to use the new
`SliceBase(info, dynamic, 0)` overload (template-based constructor
compatible with both adapter and real `OpKernelInfo`).
- `cuda/tensor/pad.cc`: Updated call from `PadBase::ComputePads` to
`PadBase::ComputePadsImpl`.
- `cuda/tensor/scatter_nd.cc`: Templatized
`InitializeElementCountsAndInputDimsSpanOrGpu` on `KernelContextType`
(also fixed typo: `InitiliazeElement...` → `InitializeElement...`).
- `cuda/object_detection/non_max_suppression.h`: Updated to use
`NonMaxSuppressionBaseImpl<OpKernelInfo, OpKernelContext>` instead of
`NonMaxSuppressionBase`.
### New file
- `cpu/object_detection/non_max_suppression_helper.h`: Contains the
template-based `NonMaxSuppressionBaseImpl` class, separating it from the
CPU-specific `NonMaxSuppression` kernel registration.
## Testing
- Existing unit tests cover all affected operators (Slice, Split,
ScatterND, Tile, Pad, BiasGelu, EmbedLayerNorm, NonMaxSuppression).
- No behavioral changes — all function logic is identical; only the
location (header vs. source) and linkage (inline vs. external) changed.
- The `SHARED_PROVIDER` code path (in-tree CUDA EP build) is unchanged —
declarations remain and route through the existing `ProviderHostCPU`
bridge.
## Motivation and Context
This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels
as a standalone shared library that can be updated independently of the
ORT core. The refactoring enables ~10 additional CUDA operators to
compile in the plugin build by making their CPU-side validation and
preparation helpers available as header-inline functions.