[CUDA] Optimize QMoE SoftmaxTopK router for small-batch decode (#28980)
### Description
Optimizes the CUDA QMoE router top-k (`LaunchSoftmaxTopK`) for
small-batch / autoregressive decode by replacing the old
one-thread-per-row hot path with parallel CUB and warp-level top-k
kernels. The dispatch now uses the fastest specialized path for common
MoE expert counts while preserving the existing softmax normalization
and deterministic lower-index tie-breaking semantics.
This PR also factors the warp-level top-k sorting code into a reusable
CUDA helper header and adds direct CUDA-internal tests so the new
routing paths are covered independently of higher-level QMoE tests.
### Motivation and Context
The previous router path launched a 256-thread block per row but did all
top-k work in a single thread. In decode scenarios such as
`num_rows == 1`, that made the router latency-bound on a serial scan of
all expert logits and turned `SoftmaxTopKKernel` into a major MoE decode
bottleneck.
For a Qwen3-style MoE workload with 256 experts, top-8 routing, and 40
MoE layers, the original router accounted for roughly 50% of decode GPU
time. Moving the work to block/warp-parallel kernels removes that
bottleneck while keeping the same output ordering and scaling behavior.
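For readers who want the routing semantics spelled out, here is a minimal host-side sketch of what a serial softmax top-k with lower-index tie-breaking computes. It is not the kernel from this PR: the function name `ReferenceSoftmaxTopK` and the `renormalize` flag are hypothetical, and whether the real kernel renormalizes the selected weights depends on its configuration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical reference (illustration only): returns (expert index, weight)
// pairs, best first. Ties on weight resolve to the lower expert index.
std::vector<std::pair<int, float>> ReferenceSoftmaxTopK(
    const std::vector<float>& logits, int k, bool renormalize) {
  assert(!logits.empty() && k >= 0);
  const int n = static_cast<int>(logits.size());
  k = std::min(k, n);

  // Numerically stable softmax over all experts.
  const float max_logit = *std::max_element(logits.begin(), logits.end());
  std::vector<float> probs(n);
  float sum = 0.0f;
  for (int i = 0; i < n; ++i) {
    probs[i] = std::exp(logits[i] - max_logit);
    sum += probs[i];
  }
  for (float& p : probs) p /= sum;

  // Select the k best experts: higher probability first, lower index on ties.
  std::vector<int> order(n);
  for (int i = 0; i < n; ++i) order[i] = i;
  std::partial_sort(order.begin(), order.begin() + k, order.end(),
                    [&probs](int a, int b) {
                      if (probs[a] != probs[b]) return probs[a] > probs[b];
                      return a < b;
                    });

  // Optionally rescale the selected weights so they sum to 1 (assumed flag).
  float topk_sum = 0.0f;
  for (int j = 0; j < k; ++j) topk_sum += probs[order[j]];
  const float scale =
      (renormalize && topk_sum > 0.0f) ? 1.0f / topk_sum : 1.0f;

  std::vector<std::pair<int, float>> result;
  result.reserve(k);
  for (int j = 0; j < k; ++j) {
    result.emplace_back(order[j], probs[order[j]] * scale);
  }
  return result;
}
```

Any parallel path has to reproduce this ordering, which is why the tie-breaking rule matters: a sort that ignores the index could return tied experts in a different order.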
### Key Changes
| Area | Change |
|---|---|
| QMoE router dispatch | Adds `DispatchSoftmaxTopK` routing for `k <= 64` and `num_experts <= 1024`, with a fallback to the original scalar kernel for larger or uncommon shapes. |
| Tiny expert counts | Adds `SoftmaxTopKWarpBitonicKernel` for `num_experts <= 32`, using one warp per row and in-register bitonic sorting via warp shuffles. |
| Small expert counts | Adds `SoftmaxTopKWarpMergeKernel` for `32 < num_experts <= 64`, using a single warp and CUB warp merge sort. |
| Larger common MoE counts | Uses `SoftmaxTopKMergeKernel` with CUB block merge sort for `num_experts <= 128`, `<= 256`, `<= 512`, and `<= 1024`. |
| Reusable top-k helpers | Adds `onnxruntime/core/providers/cuda/cu_inc/topk_warp_sort.cuh` with reusable warp bitonic and warp merge sort helpers. |
| Stable tie-breaking | Packs `(score, index)` into a `uint64_t` stable sort key for the CUB merge paths, matching onnxruntime-genai's lower-index tie-breaking and avoiding compound comparators. |
| Softmax cleanup | Factors shared softmax scale, safe reciprocal, top-k normalization, warp reduction, and CUB block reduction helpers to keep the optimized kernels consistent. |
| Tests | Adds CUDA-internal `SoftmaxTopK_*` tests covering warp bitonic, warp merge, block merge, stable ties, normalization, `float`, `half`, and `bfloat16`. |
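The thresholds in the table can be read as a simple selection function. The sketch below only illustrates that reading; `RouterPath`, `RouterChoice`, `ChooseRouterPath`, and the `capacity` field are hypothetical names, and the real `DispatchSoftmaxTopK` may treat other "uncommon shapes" as fallbacks that are not modeled here.

```cpp
#include <cassert>

// Hypothetical labels for the four paths described in the table.
enum class RouterPath { kWarpBitonic, kWarpMerge, kBlockMerge, kScalarFallback };

struct RouterChoice {
  RouterPath path;
  int capacity;  // Assumed expert-count bucket the chosen kernel is sized for.
};

// Illustrative selection logic built only from the thresholds in the table.
inline RouterChoice ChooseRouterPath(int num_experts, int k) {
  assert(num_experts > 0 && k > 0);
  if (k > 64 || num_experts > 1024) {
    return {RouterPath::kScalarFallback, num_experts};
  }
  if (num_experts <= 32) return {RouterPath::kWarpBitonic, 32};
  if (num_experts <= 64) return {RouterPath::kWarpMerge, 64};

  // Block merge sort buckets: pick the smallest bucket that fits.
  constexpr int kBuckets[] = {128, 256, 512, 1024};
  for (int bucket : kBuckets) {
    if (num_experts <= bucket) return {RouterPath::kBlockMerge, bucket};
  }
  return {RouterPath::kScalarFallback, num_experts};  // Not reached.
}
```

For the 256-expert, top-8 workload described under Motivation and Context, this selection lands on the block merge path with the 256 bucket.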
### Performance
H200 measurements for the target QMoE decode scenario showed the router
cost dropping from roughly `5.56 ms/token` to `0.17 ms/token`, improving
end-to-end Qwen3.6-35B-A3B INT4 decode throughput from about `80 tok/s`
to `113 tok/s`.
Additional profiling of the `32 < num_experts <= 64` warp merge path
showed the packed `uint64_t` stable sort key is consistently faster than
a `{float, int}` struct comparator on H200. Ratios are packed time
relative to struct time, so values below `1x` favor the packed key:
| Experts | Sort-only time (packed / struct) | Full softmax+top-k time (packed / struct) |
|---:|---:|---:|
| 33 | 0.680x | 0.704x |
| 48 | 0.672x | 0.695x |
| 64 | 0.673x | 0.696x |
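To make the packed-key idea in the comparison above concrete, here is a host-side sketch of one possible `(score, index)` encoding. The exact bit layout used by the PR is not shown in this description, so the layout below (order-preserving float bits in the high word, inverted index in the low word) and the names `OrderedBits`, `PackKey`, `UnpackIndex`, and `TopKIndices` are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Map a float to a uint32 whose unsigned order matches the float order
// (for non-NaN inputs): flip all bits of negatives, set the sign bit otherwise.
inline std::uint32_t OrderedBits(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

// Assumed layout: score in the high 32 bits, inverted index in the low 32.
// Sorting keys in descending order then yields higher scores first and,
// among equal scores, lower indices first, with a plain integer compare.
inline std::uint64_t PackKey(float score, std::uint32_t index) {
  return (static_cast<std::uint64_t>(OrderedBits(score)) << 32) |
         static_cast<std::uint64_t>(0xFFFFFFFFu - index);
}

inline std::uint32_t UnpackIndex(std::uint64_t key) {
  return 0xFFFFFFFFu - static_cast<std::uint32_t>(key & 0xFFFFFFFFu);
}

// Host stand-in for the sort step: top-k indices via a single-key sort.
std::vector<std::uint32_t> TopKIndices(const std::vector<float>& scores, int k) {
  assert(k >= 0 && static_cast<std::size_t>(k) <= scores.size());
  std::vector<std::uint64_t> keys(scores.size());
  for (std::size_t i = 0; i < scores.size(); ++i) {
    keys[i] = PackKey(scores[i], static_cast<std::uint32_t>(i));
  }
  std::partial_sort(keys.begin(), keys.begin() + k, keys.end(),
                    std::greater<std::uint64_t>());
  std::vector<std::uint32_t> indices(k);
  for (int j = 0; j < k; ++j) indices[j] = UnpackIndex(keys[j]);
  return indices;
}
```

Because no two experts share an index, every key is unique, so the sort does not need to be stable and the comparison is a single 64-bit integer compare instead of a two-field comparator.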
### Testing
- `lintrunner -a`
- `ninja onnxruntime_providers_cuda_ut`
- `ninja onnxruntime_provider_test`
- `GTEST_FILTER='CUDA_EP_Unittest.SoftmaxTopK_*' ./onnxruntime_provider_test --gtest_filter='CUDA_EP_Unittest.All'`
- `onnxruntime/test/python/transformers/test_qmoe_cuda.py -k parity`
(`44 passed`)