[Cuda Plugin] Refactor CUDA ops — Move shared CPU/CUDA helper code from .cc to headers (#27617)
## Description
This PR refactors several CPU operator helper functions by moving their
implementations from `.cc` files into `.h` headers, using the `#ifdef
SHARED_PROVIDER` / `#else` inline pattern. This is a prerequisite for
the **CUDA Plugin EP** work, where CUDA kernels are built into a
standalone shared library (`libonnxruntime_providers_cuda_plugin.so`)
that cannot link against the CPU provider's `.cc` object files.
### Why This Refactoring Is Needed
The CUDA Plugin EP compiles CUDA operator kernels into a separate shared
library that communicates with the ORT core through the ORT EP Plugin
API. In this architecture, kernel source files **cannot** depend on
framework-internal symbols that live in the CPU provider static library
(`libonnxruntime_providers.a`). Many CUDA kernels inherit from CPU base
classes and call shared helper/validation methods (e.g.,
`SliceBase::PrepareForCompute`, `SplitBase::PrepareForCompute`,
`ScatterND::ValidateShapes`, `TileOp::IsTileMemcpy`,
`PadBase::ComputePads`) whose implementations currently live in CPU
`.cc` files.
In the in-tree CUDA EP build (`SHARED_PROVIDER` mode), these helpers are
accessed through the `ProviderHostCPU` DLL-boundary virtual table
bridge. However, the plugin EP does not use this bridge — it uses EP API
adapters and force-included headers instead. To make these helpers
available in the plugin build without duplicating code, this PR moves
the implementations into headers as `inline` functions under `#ifndef
SHARED_PROVIDER` guards. The `SHARED_PROVIDER` (in-tree) build path
retains the existing declaration-only signatures that route through
`ProviderHostCPU`.
This pattern has already been successfully applied to other operators
(e.g., `Einsum`). This PR extends it to the remaining operators that
need it.
## Summary of Changes
### Helper functions moved from `.cc` to `.h` (inline under `#ifndef SHARED_PROVIDER`)

| Operator | File | Functions Moved |
|----------|------|-----------------|
| **Slice** | `cpu/tensor/slice.h` | `SliceBase::FlattenOutputDims`, `SliceBase::PrepareForCompute` (both overloads), `SliceBase::FillVectorsFromInput`, `slice_detail::CopyInputData<T>` |
| **Split** | `cpu/tensor/split.h` | `SplitBase::PrepareForCompute` |
| **ScatterND** | `cpu/tensor/scatter_nd.h` | `ScatterND::ValidateShapes` |
| **Tile** | `cpu/tensor/tile.h` | `TileOp::IsTileMemcpy` |
| **Pad** | `cpu/tensor/padbase.h` | `PadBase::ComputePadsImpl` (new template method replacing `ComputePads` for cross-context compatibility) |
| **BiasGelu** | `contrib_ops/cpu/bert/bias_gelu_helper.h` | `bias_gelu_helper::CheckInputs` (templatized on context type) |
| **EmbedLayerNorm** | `contrib_ops/cpu/bert/embed_layer_norm_helper.h` | `embed_layer_norm::CheckInputs` (templatized on context type) |
| **NonMaxSuppression** | `cpu/object_detection/non_max_suppression.h` + new `non_max_suppression_helper.h` | `NonMaxSuppressionBase` refactored into `NonMaxSuppressionBaseImpl<KernelInfoType, KernelContextType>` template for plugin compatibility |
### Deleted `.cc` files (implementations moved to headers)
- `contrib_ops/cpu/bert/bias_gelu_helper.cc`
- `contrib_ops/cpu/bert/embed_layer_norm_helper.cc`
### Provider bridge additions
- Added `Tensor::DataAsSpan<int32_t>()` support through the shared
provider interface (`provider_interfaces.h`, `provider_wrappedtypes.h`,
`provider_bridge_ort.cc`). This was needed because
`slice_detail::CopyInputData<int32_t>` calls
`Tensor::DataAsSpan<int32_t>()`, which was not previously bridged.
### CUDA-side updates
- `cuda/tensor/slice.h`: Updated `Slice` constructor to use the new
`SliceBase(info, dynamic, 0)` overload (template-based constructor
compatible with both adapter and real `OpKernelInfo`).
- `cuda/tensor/pad.cc`: Updated call from `PadBase::ComputePads` to
`PadBase::ComputePadsImpl`.
- `cuda/tensor/scatter_nd.cc`: Templatized
`InitializeElementCountsAndInputDimsSpanOrGpu` on `KernelContextType`
(also fixed typo: `InitiliazeElement...` → `InitializeElement...`).
- `cuda/object_detection/non_max_suppression.h`: Updated to use
`NonMaxSuppressionBaseImpl<OpKernelInfo, OpKernelContext>` instead of
`NonMaxSuppressionBase`.
### New file
- `cpu/object_detection/non_max_suppression_helper.h`: Contains the
template-based `NonMaxSuppressionBaseImpl` class, separating it from the
CPU-specific `NonMaxSuppression` kernel registration.
## Testing
- Existing unit tests cover all affected operators (Slice, Split,
ScatterND, Tile, Pad, BiasGelu, EmbedLayerNorm, NonMaxSuppression).
- No behavioral changes — all function logic is identical; only the
location (header vs. source) and linkage (inline vs. external) changed.
- The `SHARED_PROVIDER` code path (in-tree CUDA EP build) is unchanged —
declarations remain and route through the existing `ProviderHostCPU`
bridge.
## Motivation and Context
This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels
as a standalone shared library that can be updated independently of the
ORT core. The refactoring enables ~10 additional CUDA operators to
compile in the plugin build by making their CPU-side validation and
preparation helpers available as header-inline functions.