CUDA Plugin EP: NHWC Cleanup & Hardening (#28612)
## Summary
Unifies the NHWC-eligible op allowlist between the bundled CUDA EP and
the CUDA plugin EP into a single shared header, adds kernel-miss
diagnostics, and expands NHWC test coverage from 4 ops to 11.
## Motivation
The bundled EP (`cuda_execution_provider.cc`) and the plugin EP
(`plugin/cuda_ep.cc`) independently maintained their own copies of the
NHWC allowlist. This created a maintenance hazard where ops could be
added to one but not the other, leading to silent divergence.
Additionally, there was no runtime diagnostic when the framework rewrote
a node to the NHWC domain but the plugin EP lacked a matching kernel —
failures were silent fallbacks to CPU.
## Key Changes
### Shared NHWC Allowlist (`cuda_nhwc_ops.h`)
| Item | Detail |
|------|--------|
| New file | `onnxruntime/core/providers/cuda/cuda_nhwc_ops.h` |
| Contents | `IsNhwcEligibleOnnxOp()`, `IsNhwcEligibleMsOp()`,
`IsNhwcEligible()` inline functions |
| Ops covered | AveragePool, BatchNormalization, Conv, ConvTranspose,
DepthToSpace, GlobalAveragePool, GlobalMaxPool, GridSample, LRN,
MaxPool, SpaceToDepth (+ MS-domain GridSample) |
### Bundled EP Refactor (`cuda_execution_provider.cc`)
- Removed the static `std::unordered_set<std::string_view>
cuda_nhwc_onnx_ops` and the inline domain check logic.
- Replaced with a single call to `cuda::IsNhwcEligible(node_domain,
node_op_type)`.
### Plugin EP Refactor & Diagnostics (`plugin/cuda_ep.cc`)
- `ShouldConvertDataLayoutForOpImpl`: Replaced ~20 lines of static set +
domain checks with a single `cuda::IsNhwcEligible()` call.
- `GetCapabilityImpl`: Added a WARNING-level diagnostic in the `else`
branch (kernel not found). When a node in the `com.ms.internal.nhwc`
domain has no registered kernel, the log emits the op type, domain,
version, and node name — making future NHWC registration gaps
immediately visible at session creation.
### Expanded NHWC Test Coverage (`test_cuda_plugin_ep.py`)
- Added `_assert_nhwc_domain_assigned()` helper that verifies NHWC
layout transformation occurred by checking for framework-inserted
Transpose nodes in the EP's assignment info.
- Added `_run_nhwc_model_test()` helper combining domain assertion +
numerical validation.
- Updated 4 existing NHWC tests (Conv, BatchNormalization, MaxPool,
AveragePool) to include structural assertions.
- Added 7 new NHWC test methods:
- `test_nhwc_conv_transpose`
- `test_nhwc_global_max_pool`
- `test_nhwc_global_average_pool`
- `test_nhwc_depth_to_space`
- `test_nhwc_space_to_depth`
- `test_nhwc_lrn`
- `test_nhwc_grid_sample`
## Testing Notes
Run the full CUDA plugin EP test suite with NHWC enabled:
```bash
bash .env/cuda13_plugin.sh --build --install --test_plugin
```
Or run only the NHWC tests directly:
```bash
cd onnxruntime/test/python/transformers
ORT_TEST_CUDA_PLUGIN_EP=1 python -m unittest \
test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_conv \
test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_batch_normalization \
test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_maxpool \
test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_avgpool \
test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_conv_transpose \
test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_global_max_pool \
test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_global_average_pool \
test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_depth_to_space \
test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_space_to_depth \
test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_lrn \
test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_grid_sample
```
All 86 tests in the suite pass (11 NHWC + 75 existing), with no
regressions.