Add optional router_weights input to QMoE for separate selection/aggregation routing (#27687)
### Description
Adds optional input `router_weights` (index 14) to `com.microsoft.QMoE`
to decouple Top-K expert selection from output aggregation weighting.
When `router_weights` is provided:
- `router_probs` → Top-K expert selection only
- `router_weights` → gathered at the selected expert indices and used as mixing weights
When omitted, existing softmax-of-`router_probs` behavior is preserved
(backward compatible).
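A minimal NumPy sketch of the intended routing semantics follows; the sum-to-one normalization under `normalize_routing_weights` is my reading of the description above, not code copied from the kernel:

```python
import numpy as np

def route(router_probs, router_weights, k, normalize_routing_weights=False):
    """Sketch: select experts from router_probs, take mixing weights from router_weights."""
    # Top-K selection per token uses router_probs only.
    topk_idx = np.argsort(-router_probs, axis=-1)[:, :k]             # (num_tokens, k)
    # Mixing weights are gathered from router_weights at the selected indices.
    weights = np.take_along_axis(router_weights, topk_idx, axis=-1)  # (num_tokens, k)
    if normalize_routing_weights:
        weights = weights / weights.sum(axis=-1, keepdims=True)
    return topk_idx, weights
```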
**Changes:**
- **Schema** (`contrib_defs.cc`): New optional input 14 `router_weights`, type T, shape `(num_tokens, num_experts)`; a wiring sketch follows this list
- **CPU provider** (`moe_quantization_cpu.cc`): Implements the separate routing path with MLFloat16/float support and optional `normalize_routing_weights` normalization
- **CUDA provider** (`moe_quantization.cc`): Reads the input and fails with a not-implemented error when it is provided
- **WebGPU provider** (`qmoe.cc`): Same not-implemented guard
- **Tests** (`moe_test.cc`): `QMoETest_CPU_RouterWeights` covers both the normalized and unnormalized paths; an FC2 bias makes the expected outputs non-zero so the aggregation weights are actually exercised
- **Docs** (`OperatorKernels.md`): Updated CPU and CUDA entries
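For illustration, a hedged sketch of wiring the new input when building a graph with the `onnx` helper. The input list is a placeholder, not the real schema: positions 2–13 are elided with `""` entries so that `router_weights` lands at index 14.

```python
from onnx import helper

# Hypothetical wiring: "" entries stand in for the existing QMoE
# weight/scale/bias inputs at positions 2..13 (elided here).
node = helper.make_node(
    "QMoE",
    inputs=["input", "router_probs"] + [""] * 12 + ["router_weights"],  # index 14
    outputs=["output"],
    domain="com.microsoft",
    k=2,
    normalize_routing_weights=1,
)
```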
This pattern matches DeepSeek-V2/V3/R1 routing, where `sigmoid(logits)` is
used for aggregation while `sigmoid(logits) + bias` with group masking
drives selection:
```python
# DeepSeek-style: different tensors for selection vs aggregation
import torch

router_logits = torch.randn(4, 8).sigmoid()          # sigmoid scores, (num_tokens, num_experts)
bias = torch.randn(8)                                # stands in for e_score_correction_bias
scores_for_choice = router_logits + bias             # selection scores (group masking omitted)
topk_indices = torch.topk(scores_for_choice, k=2, dim=-1)[1]  # selection from modified scores
topk_weights = router_logits.gather(1, topk_indices)          # aggregation from original sigmoid
```
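In QMoE terms, a DeepSeek-style exporter would feed `scores_for_choice` as `router_probs` and the sigmoid scores as `router_weights`.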
### Motivation and Context
`QMoE` previously required the same tensor for both routing and
weighting, blocking DeepSeek-style `noaux_tc` MoE models where these are
intentionally separate. This unblocks ONNX Runtime export/serving of
DeepSeek-V2/V3/R1 MoE architectures.
<details>
<summary>Original prompt</summary>
>
> ----
>
> *This section details the original issue you should resolve*
>
> <issue_title>[Feature Request] Support noaux_tc MoE routing in com.microsoft.QMoE via separate router_weights</issue_title>
> <issue_description>### Describe the feature request
>
> `com.microsoft.QMoE` currently accepts a single routing tensor (commonly `router_probs`) that is used both for:
>
> * Top‑K expert selection (routing / dispatch), and
> * weighting the outputs of selected experts (aggregation).
>
> This design makes it impossible to represent DeepSeek‑style `noaux_tc` MoE routing, where different tensors are intentionally used for:
>
> * expert selection (Top‑K routing), and
> * expert output weighting (mixing).
>
> This issue proposes adding an optional input `router_weights` to `com.microsoft.QMoE` so that:
>
> * `router_probs` is used only for Top‑K selection, and
> * `router_weights` is used only for multiplying / aggregating expert outputs.
>
> The change is backward compatible. It also allows any future methodology where different tensors are used for selection and aggregation.
>
> ### Describe scenario use case
>
> Enables exporting and serving DeepSeek‑V2/V3/R1‑style MoE models in ONNX Runtime.</issue_description>
>
> <agent_instructions>Please update the operator spec and implement it in the CPU provider. For the CUDA provider, it is fine to throw a not-implemented exception for now.
>
> An example DeepSeek MoE script can be found in https://github.com/huggingface/transformers/blob/75c836b7853cb65f48ab2ce13cddfb12d14ecf5a/src/transformers/models/deepseek_v3/modular_deepseek_v3.py, like the following:
>
> ```python
> import torch
> from torch import nn
>
> # DeepseekV3NaiveMoe, DeepseekV3TopkRouter and DeepseekV3MLP are defined in the linked file.
>
>
> class DeepseekV3MoE(nn.Module):
>     """
>     A mixed expert module containing shared experts.
>     """
>
>     def __init__(self, config):
>         super().__init__()
>         self.config = config
>         self.experts = DeepseekV3NaiveMoe(config)
>         self.gate = DeepseekV3TopkRouter(config)
>         self.shared_experts = DeepseekV3MLP(
>             config=config, intermediate_size=config.moe_intermediate_size * config.n_shared_experts
>         )
>         self.n_routed_experts = config.n_routed_experts
>         self.n_group = config.n_group
>         self.topk_group = config.topk_group
>         self.norm_topk_prob = config.norm_topk_prob
>         self.routed_scaling_factor = config.routed_scaling_factor
>         self.top_k = config.num_experts_per_tok
>
>     def route_tokens_to_experts(self, router_logits):
>         router_logits = router_logits.sigmoid()
>         router_logits_for_choice = router_logits + self.gate.e_score_correction_bias
>         group_scores = (
>             router_logits_for_choice.view(-1, self.n_group, self.n_routed_experts // self.n_group)
>             .topk(2, dim=-1)[0]
>             .sum(dim=-1)
>         )
>         group_idx = torch.topk(group_scores, k=self.topk_group, dim=-1, sorted=False)[1]
>         group_mask = torch.zeros_like(group_scores)
>         group_mask.scatter_(1, group_idx, 1)
>         score_mask = (
>             group_mask.unsqueeze(-1)
>             .expand(-1, self.n_group, self.n_routed_experts // self.n_group)
>             .reshape(-1, self.n_routed_experts)
>         )
>         scores_for_choice = router_logits_for_choice.masked_fill(~score_mask.bool(), 0.0)
>         topk_indices = torch.topk(scores_for_choice, k=self.top_k, dim=-1, sorted=False)[1]
>         topk_weights = router_logits.gather(1, topk_indices)
>         if self.norm_topk_prob:
>             denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20
>             topk_weights /= denominator
>         topk_weights = topk_weights * self.routed_scaling_factor
>         return topk_indices, topk_weights
>
>     def forward(self, hidden_states):
>         residuals = hidden_states
>         orig_shape = hidden_states.shape
>         router_logits = self.gate(hidden_states)
>         topk_indices, topk_weights = self.route_tokens_to_experts(router_logits)
>         hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
>         hidden_states = self.experts(hidden_states, topk_indices, topk_weights).view(*orig_shape)
>         hidden_states = hidden_states + self.shared_experts(residuals)
>         return hidden_states
> ```
>
> </agent_instructions>
>
</details>
- Fixes microsoft/onnxruntime#27675
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>