Add LabelEncoder CUDA execution provider for numeric types (#28045)
### Description
Implements `ai.onnx.ml.LabelEncoder` on the CUDA execution provider for
numeric key/value types using sorted arrays + binary search (O(log n)
per element).
**New files** (`onnxruntime/core/providers/cuda/ml/`):
- `label_encoder_impl.cu` / `.h` — CUDA kernel: per-thread binary search
on sorted keys, NaN-aware for float/double (a minimal sketch follows this list)
- `label_encoder.cc` / `.h` — Host-side op classes (`CudaLabelEncoder`
for opset 2-3, `CudaLabelEncoder_4` for opset 4+). Constructor sorts
keys, copies to GPU; `ComputeInternal` launches kernel.
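A minimal sketch of the kernel approach, for readers who have not opened the new files. This is illustrative only: the function name, parameter layout, and helper are assumptions and may differ from the actual `label_encoder_impl.cu`.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// NaN check that is trivially false for integer key types.
template <typename T>
__device__ __forceinline__ bool IsNaNKey(T) { return false; }
__device__ __forceinline__ bool IsNaNKey(float v) { return isnan(v); }
__device__ __forceinline__ bool IsNaNKey(double v) { return isnan(v); }

// One thread per input element: binary search over keys the host has already
// sorted ascending, with a NaN key (if any) placed in the last slot.
template <typename TKey, typename TValue>
__global__ void LabelEncodeKernel(const TKey* sorted_keys, const TValue* mapped_values,
                                  int64_t num_keys, TValue default_value,
                                  const TKey* input, TValue* output, int64_t num_elements) {
  int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
  if (i >= num_elements) return;

  const TKey x = input[i];
  const bool has_nan_key = num_keys > 0 && IsNaNKey(sorted_keys[num_keys - 1]);

  // NaN never compares equal to anything, so short-circuit before searching.
  if (IsNaNKey(x)) {
    output[i] = has_nan_key ? mapped_values[num_keys - 1] : default_value;
    return;
  }

  // Binary search over the non-NaN prefix: O(log n) per element.
  int64_t lo = 0;
  int64_t hi = has_nan_key ? num_keys - 1 : num_keys;
  TValue result = default_value;
  while (lo < hi) {
    const int64_t mid = lo + (hi - lo) / 2;
    if (sorted_keys[mid] < x) {
      lo = mid + 1;
    } else if (x < sorted_keys[mid]) {
      hi = mid;
    } else {
      result = mapped_values[mid];
      break;
    }
  }
  output[i] = result;
}
```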
**Modified files**:
- `cuda_execution_provider.cc` — Register 11 kernel variants (4
versioned opset 2-3, 7 opset 4+); the registration pattern is sketched after this list
- `provider_api.h` — Add missing `kMLDomain` constant (first ML-domain
op on CUDA EP)
- `docs/OperatorKernels.md` — Add `ai.onnx.ml` section to CUDA provider
table
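For reference, one of the eleven variants would be registered roughly as below. This is a sketch using the usual ONNX Runtime typed-kernel macros; the type token, the `T1`/`T2` constraint names, and the `CudaLabelEncoder<int64_t, float>` instantiation are assumptions, not a copy of the actual change.

```cpp
// Hypothetical registration of the opset 2-3 int64-key / float-value variant.
ONNX_OPERATOR_VERSIONED_TYPED_KERNEL_EX(
    LabelEncoder,
    kMLDomain,
    2, 3,                       // opset range handled by CudaLabelEncoder
    int64_t_float,              // token identifying the key/value combination
    kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .TypeConstraint("T1", DataTypeImpl::GetTensorType<int64_t>())
        .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
    CudaLabelEncoder<int64_t, float>);
```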
**Supported type combinations**:
| Opset | Types |
|-------|-------|
| 2-3 | `int64↔float`, `int64↔int64`, `float↔float` |
| 4+ | Above + `double↔double`, `double↔int64`, `int64↔double` |
String types remain CPU-only. A NaN key, if present, is placed at the end of
the sorted array, and NaN inputs are handled by a short-circuit check before the binary search.
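The host-side preparation this relies on can be pictured roughly as follows (hypothetical helper, not the actual `label_encoder.cc` code): sort key/value pairs by key with NaN treated as the largest key, so values stay aligned and a NaN key lands in the last slot.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: sort keys ascending with NaN last, reordering values in lockstep.
template <typename TKey, typename TValue>
void SortKeysNaNLast(std::vector<TKey>& keys, std::vector<TValue>& values) {
  std::vector<std::pair<TKey, TValue>> pairs(keys.size());
  for (size_t i = 0; i < keys.size(); ++i) pairs[i] = {keys[i], values[i]};

  std::sort(pairs.begin(), pairs.end(), [](const auto& a, const auto& b) {
    const bool a_nan = std::isnan(static_cast<double>(a.first));
    const bool b_nan = std::isnan(static_cast<double>(b.first));
    if (a_nan != b_nan) return b_nan;  // any non-NaN key sorts before a NaN key
    if (a_nan) return false;           // treat two NaNs as equivalent
    return a.first < b.first;
  });

  for (size_t i = 0; i < pairs.size(); ++i) {
    keys[i] = pairs[i].first;     // ascending keys, NaN (if any) in the last slot
    values[i] = pairs[i].second;  // values reordered to match
  }
}
```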
**Tests**: 5 new test cases covering NaN-key-to-numeric-value mappings
and double type combinations. Existing numeric tests
(`FloatToInt64Opset2`, `Int64ToFloatOpset2`, etc.) will automatically
run on CUDA via `OpTester::Run()`.
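One of the new NaN cases might look roughly like the sketch below. The test name, values, and exact cases are illustrative; only the attribute names follow the `ai.onnx.ml` LabelEncoder-2 schema.

```cpp
#include <cstdint>
#include <limits>
#include <vector>
#include "gtest/gtest.h"
#include "test/providers/provider_test_utils.h"  // OpTester

namespace onnxruntime {
namespace test {

// Hypothetical test: a NaN float key mapping to an int64 value (LabelEncoder opset 2).
TEST(LabelEncoder, NaNFloatKeyToInt64) {
  OpTester test("LabelEncoder", 2, kMLDomain);
  test.AddAttribute("keys_floats",
                    std::vector<float>{1.0f, 2.0f, std::numeric_limits<float>::quiet_NaN()});
  test.AddAttribute("values_int64s", std::vector<int64_t>{10, 20, 99});
  test.AddAttribute("default_int64", static_cast<int64_t>(-1));

  test.AddInput<float>("X", {4}, {2.0f, std::numeric_limits<float>::quiet_NaN(), 1.0f, 5.0f});
  test.AddOutput<int64_t>("Y", {4}, {20, 99, 10, -1});

  // Run() exercises every registered execution provider, so the CUDA kernel is covered too.
  test.Run();
}

}  // namespace test
}  // namespace onnxruntime
```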
### Motivation and Context
Models with large LabelEncoder nodes (>100k entries) force a CPU
round-trip when all other nodes run on GPU. This adds the CUDA
implementation to eliminate that data transfer bottleneck.
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>