onnxruntime
5bba3573 - Add optional router_weights input to QMoE for separate selection/aggregation routing (#27687)

### Description

Adds optional input `router_weights` (index 14) to `com.microsoft.QMoE` to decouple Top-K expert selection from output aggregation weighting.

When `router_weights` is provided:

- `router_probs` → Top-K expert selection only
- `router_weights` → values gathered at the selected expert indices are used as mixing weights

When omitted, the existing softmax-of-`router_probs` behavior is preserved (backward compatible).

**Changes:**

- **Schema** (`contrib_defs.cc`): new optional input 14 `router_weights`, type T, shape `(num_tokens, num_experts)`
- **CPU provider** (`moe_quantization_cpu.cc`): implements the separate routing path with MLFloat16/float support and optional `normalize_routing_weights` normalization
- **CUDA provider** (`moe_quantization.cc`): reads the input and raises a not-implemented error if it is provided
- **WebGPU provider** (`qmoe.cc`): same not-implemented guard
- **Tests** (`moe_test.cc`): `QMoETest_CPU_RouterWeights` covers both the normalized and unnormalized paths, using non-zero expected outputs via the FC2 bias to validate that the aggregation weights are applied correctly
- **Docs** (`OperatorKernels.md`): updated CPU and CUDA entries

This pattern matches DeepSeek-V2/V3/R1 routing, where `sigmoid(logits)` is used for aggregation while `logits + bias` with group masking drives selection:

```python
# DeepSeek-style: different tensors for selection vs. aggregation
topk_indices = torch.topk(scores_for_choice, k=top_k)[1]  # selection from modified logits
topk_weights = router_logits.gather(1, topk_indices)      # aggregation from original sigmoid scores
```

### Motivation and Context

`QMoE` previously required the same tensor for both routing and weighting, which blocked DeepSeek-style `noaux_tc` MoE models where these are intentionally separate. This change unblocks ONNX Runtime export/serving of DeepSeek-V2/V3/R1 MoE architectures.
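To make the two routing paths concrete, here is a minimal NumPy sketch of the semantics described above. It is illustrative only, not the actual kernel: the function name is made up, and the exact softmax-then-gather ordering in the legacy branch is an assumption (the selected indices are the same either way, since softmax is monotonic).

```python
import numpy as np

def qmoe_routing_sketch(router_probs, router_weights=None, top_k=2,
                        normalize_routing_weights=False):
    """Illustrative sketch (not the real kernel) of QMoE routing.

    router_probs:   (num_tokens, num_experts) -- drives Top-K selection.
    router_weights: (num_tokens, num_experts) -- optional; if given, mixing
                    weights are gathered from it instead of softmax(router_probs).
    Returns (indices, weights), each of shape (num_tokens, top_k).
    """
    # Top-K expert selection always comes from router_probs.
    indices = np.argsort(-router_probs, axis=-1)[:, :top_k]

    if router_weights is None:
        # Legacy path: softmax over router_probs, gathered at the selected experts.
        e = np.exp(router_probs - router_probs.max(axis=-1, keepdims=True))
        probs = e / e.sum(axis=-1, keepdims=True)
        weights = np.take_along_axis(probs, indices, axis=-1)
    else:
        # New path: mixing weights come from the separate router_weights tensor.
        weights = np.take_along_axis(router_weights, indices, axis=-1)

    if normalize_routing_weights:
        # Optional renormalization over the selected experts.
        weights = weights / weights.sum(axis=-1, keepdims=True)
    return indices, weights
```

Under this sketch, passing `router_weights = sigmoid(logits)` while `router_probs` carries the biased, group-masked selection scores reproduces the DeepSeek gather shown above.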
<details>
<summary>Original prompt</summary>

> <issue_title>[Feature Request] Support noaux_tc MoE routing in com.microsoft.QMoE via separate router_weights</issue_title>
>
> <issue_description>
> ### Describe the feature request
>
> `com.microsoft.QMoE` currently accepts a single routing tensor (commonly `router_probs`) that is used both for:
>
> * Top-K expert selection (routing/dispatch), and
> * weighting the outputs of the selected experts (aggregation).
>
> This design makes it impossible to represent DeepSeek-style `noaux_tc` MoE routing, where different tensors are intentionally used for:
>
> * expert selection (Top-K routing), and
> * expert output weighting (mixing).
>
> This issue proposes adding an optional input `router_weights` to `com.microsoft.QMoE` so that:
>
> * `router_probs` is used only for Top-K selection, and
> * `router_weights` is used only for multiplying/aggregating expert outputs.
>
> The change is backward compatible. It also allows for any other methodology in the future where different tensors are used for selection/aggregation.
>
> ### Describe scenario use case
>
> Enables exporting and serving DeepSeek-V2/V3/R1-style MoE models in ONNX Runtime.
> </issue_description>
>
> <agent_instructions>Please update the operator spec and implement it in the CPU provider. For the CUDA provider, it is fine to throw a not-implemented exception for now.
>
> An example DeepSeek MoE script can be found in https://github.com/huggingface/transformers/blob/75c836b7853cb65f48ab2ce13cddfb12d14ecf5a/src/transformers/models/deepseek_v3/modular_deepseek_v3.py, like the following:
>
> ```python
> class DeepseekV3MoE(nn.Module):
>     """
>     A mixed expert module containing shared experts.
>     """
>
>     def __init__(self, config):
>         super().__init__()
>         self.config = config
>         self.experts = DeepseekV3NaiveMoe(config)
>         self.gate = DeepseekV3TopkRouter(config)
>         self.shared_experts = DeepseekV3MLP(
>             config=config, intermediate_size=config.moe_intermediate_size * config.n_shared_experts
>         )
>         self.n_routed_experts = config.n_routed_experts
>         self.n_group = config.n_group
>         self.topk_group = config.topk_group
>         self.norm_topk_prob = config.norm_topk_prob
>         self.routed_scaling_factor = config.routed_scaling_factor
>         self.top_k = config.num_experts_per_tok
>
>     def route_tokens_to_experts(self, router_logits):
>         router_logits = router_logits.sigmoid()
>         router_logits_for_choice = router_logits + self.gate.e_score_correction_bias
>         group_scores = (
>             router_logits_for_choice.view(-1, self.n_group, self.n_routed_experts // self.n_group)
>             .topk(2, dim=-1)[0]
>             .sum(dim=-1)
>         )
>         group_idx = torch.topk(group_scores, k=self.topk_group, dim=-1, sorted=False)[1]
>         group_mask = torch.zeros_like(group_scores)
>         group_mask.scatter_(1, group_idx, 1)
>         score_mask = (
>             group_mask.unsqueeze(-1)
>             .expand(-1, self.n_group, self.n_routed_experts // self.n_group)
>             .reshape(-1, self.n_routed_experts)
>         )
>         scores_for_choice = router_logits_for_choice.masked_fill(~score_mask.bool(), 0.0)
>         topk_indices = torch.topk(scores_for_choice, k=self.top_k, dim=-1, sorted=False)[1]
>         topk_weights = router_logits.gather(1, topk_indices)
>         if self.norm_topk_prob:
>             denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20
>             topk_weights /= denominator
>         topk_weights = topk_weights * self.routed_scaling_factor
>         return topk_indices, topk_weights
>
>     def forward(self, hidden_states):
>         residuals = hidden_states
>         orig_shape = hidden_states.shape
>         router_logits = self.gate(hidden_states)
>         topk_indices, topk_weights = self.route_tokens_to_experts(router_logits)
>         hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
>         hidden_states = self.experts(hidden_states, topk_indices, topk_weights).view(*orig_shape)
>         hidden_states = hidden_states + self.shared_experts(residuals)
>         return hidden_states
> ```
> </agent_instructions>

</details>

Fixes microsoft/onnxruntime#27675

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
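For reference, under the new schema a DeepSeek-style exporter could derive the two QMoE routing inputs from the gate logits roughly as follows. This is a sketch of the mapping implied by `route_tokens_to_experts` in the quoted script; the function name is illustrative, and the group-masking step is deliberately elided.

```python
import torch

def deepseek_qmoe_routing_inputs(router_logits: torch.Tensor,
                                 e_score_correction_bias: torch.Tensor):
    """Map DeepSeek-style gate outputs onto QMoE's two routing inputs.

    router_logits: (num_tokens, num_experts) raw gate outputs.
    Group masking from the quoted script is elided here for brevity.
    """
    # Aggregation weights: plain sigmoid of the gate logits.
    scores = router_logits.sigmoid()
    # Selection scores: sigmoid scores plus the per-expert correction bias.
    # (The full DeepSeek router additionally applies group masking on top.)
    scores_for_choice = scores + e_score_correction_bias
    # scores_for_choice -> router_probs (Top-K selection only)
    # scores            -> router_weights (mixing weights only)
    return scores_for_choice, scores
```

Feeding `scores_for_choice` as `router_probs` and `scores` as `router_weights` makes QMoE's Top-K selection and gather correspond to `topk_indices`/`topk_weights` in the quoted `route_tokens_to_experts` (up to group masking and the `routed_scaling_factor`).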