onnxruntime
dbf95cfb - [CUDA] Optimize QMoE SoftmaxTopK router for small-batch decode (#28980)

Commit

26 days ago

[CUDA] Optimize QMoE SoftmaxTopK router for small-batch decode (#28980)

### Description

Optimizes the CUDA QMoE router top-k (`LaunchSoftmaxTopK`) for small-batch / autoregressive decode by replacing the old one-thread-per-row hot path with parallel CUB and warp-level top-k kernels. The dispatch now uses the fastest specialized path for common MoE expert counts while preserving the existing softmax normalization and deterministic lower-index tie-breaking semantics.

This PR also factors the warp-level top-k sorting code into a reusable CUDA helper header and adds direct CUDA-internal tests so the new routing paths are covered independently of higher-level QMoE tests.

### Motivation and Context

The previous router path launched a 256-thread block per row but did all top-k work in a single thread. In decode scenarios such as `num_rows == 1`, that made the router latency-bound on a serial scan of all expert logits and turned `SoftmaxTopKKernel` into a major MoE decode bottleneck.

For a Qwen3-style MoE workload with 256 experts, top-8 routing, and 40 MoE layers, the original router accounted for roughly 50% of decode GPU time. Moving the work to block/warp-parallel kernels removes that bottleneck while keeping the same output ordering and scaling behavior.

### Key Changes

| Area | Change |
|---|---|
| QMoE router dispatch | Adds `DispatchSoftmaxTopK` routing for `k <= 64` and `num_experts <= 1024`, with a fallback to the original scalar kernel for larger or uncommon shapes. |
| Tiny expert counts | Adds `SoftmaxTopKWarpBitonicKernel` for `num_experts <= 32`, using one warp per row and in-register bitonic sorting via warp shuffles. |
| Small expert counts | Adds `SoftmaxTopKWarpMergeKernel` for `32 < num_experts <= 64`, using a single warp and CUB warp merge sort. |
| Larger common MoE counts | Uses `SoftmaxTopKMergeKernel` with CUB block merge sort for `num_experts <= 128`, `256`, `512`, and `1024`. |
| Reusable top-k helpers | Adds `onnxruntime/core/providers/cuda/cu_inc/topk_warp_sort.cuh` with reusable warp bitonic and warp merge sort helpers. |
| Stable tie-breaking | Packs `(score, index)` into a `uint64_t` stable sort key for the CUB merge paths, matching onnxruntime-genai's lower-index tie-breaking and avoiding compound comparators. |
| Softmax cleanup | Factors shared softmax scale, safe reciprocal, top-k normalization, warp reduction, and CUB block reduction helpers to keep the optimized kernels consistent. |
| Tests | Adds CUDA-internal `SoftmaxTopK_*` tests covering warp bitonic, warp merge, block merge, stable ties, normalization, `float`, `half`, and `bfloat16`. |

### Performance

H200 measurements for the target QMoE decode scenario showed the router cost dropping from roughly `5.56 ms/token` to `0.17 ms/token`, improving end-to-end Qwen3.6-35B-A3B INT4 decode throughput from about `80 tok/s` to `113 tok/s`.

Additional profiling of the `32 < num_experts <= 64` warp merge path showed the packed `uint64_t` stable sort key is consistently faster than a `{float, int}` struct comparator on H200:

| Experts | Sort-only packed/struct | Full softmax+top-k packed/struct |
|---:|---:|---:|
| 33 | 0.680x | 0.704x |
| 48 | 0.672x | 0.695x |
| 64 | 0.673x | 0.696x |

### Testing

- `lintrunner -a`
- `ninja onnxruntime_providers_cuda_ut`
- `ninja onnxruntime_provider_test`
- `GTEST_FILTER='CUDA_EP_Unittest.SoftmaxTopK_*' ./onnxruntime_provider_test --gtest_filter='CUDA_EP_Unittest.All'`
- `onnxruntime/test/python/transformers/test_qmoe_cuda.py -k parity` (`44 passed`)
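### Illustration: packed stable sort key

The "Stable tie-breaking" change packs `(score, index)` into a single `uint64_t` so that one integer comparison orders by score and breaks ties toward the lower index. The following is a minimal host-side C++ sketch of that idea, not the code from this PR. The bit layout (order-preserving score bits in the high word, inverted index in the low word) and the names `OrderedBits`, `PackKey`, `UnpackIndex`, and `TopKIndices` are assumptions made for illustration; the kernels in the PR may use a different layout, and `std::partial_sort` stands in here for the CUB merge sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Map a float to a uint32_t whose unsigned order matches the float order.
// Negative values have all bits flipped; non-negative values get the sign bit set.
// NaN handling is out of scope for this sketch.
inline uint32_t OrderedBits(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

// Hypothetical layout: score in the high 32 bits, inverted index in the low 32 bits.
// A larger key means a higher score, or an equal score with a lower index.
inline uint64_t PackKey(float score, uint32_t index) {
  return (static_cast<uint64_t>(OrderedBits(score)) << 32) |
         static_cast<uint64_t>(0xFFFFFFFFu - index);
}

// Recover the original index from the low word of the key.
inline uint32_t UnpackIndex(uint64_t key) {
  return 0xFFFFFFFFu - static_cast<uint32_t>(key & 0xFFFFFFFFu);
}

// Return the indices of the k largest scores, with ties resolved to the lower index.
inline std::vector<uint32_t> TopKIndices(const std::vector<float>& scores, size_t k) {
  assert(k <= scores.size());
  std::vector<uint64_t> keys(scores.size());
  for (size_t i = 0; i < scores.size(); ++i) {
    keys[i] = PackKey(scores[i], static_cast<uint32_t>(i));
  }
  // Plain integer comparison in descending order; no compound comparator needed.
  std::partial_sort(keys.begin(), keys.begin() + k, keys.end(),
                    std::greater<uint64_t>());
  std::vector<uint32_t> out(k);
  for (size_t i = 0; i < k; ++i) {
    out[i] = UnpackIndex(keys[i]);
  }
  return out;
}
```

Because each key is unique per index, any sort over the keys, stable or not, yields the same deterministic order. For example, `TopKIndices({0.1f, 0.5f, 0.5f, 0.2f, 0.5f}, 3)` selects indices 1, 2, and 4 in that order.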

References

#28980 - [CUDA] Optimize QMoE SoftmaxTopK router for small-batch decode

Author

tianleiwu

Parents

b823aecc
