DeepSpeed
8a77f381 - Add Gram Newton-Schulz orthogonalization for Muon optimizer (#7953)

Commit
62 days ago
Authors: @delock and @PKUWZP

## Summary

Integrate Gram Newton-Schulz (Gram NS) as the default orthogonalization method for the Muon optimizer, with a configurable `ns_method` switch to fall back to the original iteration when needed.

Based on the Gram Newton-Schulz method from https://tridao.me/blog/2026/gram-newton-schulz/

## Motivation

Standard Newton-Schulz iterates on the full rectangular matrix X (n × m). Gram NS iterates on the much smaller Gram matrix R = X @ X.T (n × n), which is significantly cheaper when m >> n. That is the common case for transformer weight matrices (typical aspect ratio α ≈ 5).

## Changes

- Add `zeropower_via_gram_newtonschulz` in `original_muon.py` with fp16 compute (better precision than bf16 at the same cost) and a restart at iteration 2 for half-precision stability
- Add `ns_method` parameter (`"gram"` | `"standard"`) to `muon_update` and all Muon optimizer classes
- Thread `ns_method` through ZeRO Stage 1/2/3 call sites and DeepSpeed JSON config
- Automatic fallback to standard NS for square matrices (m ≤ n) where Gram NS has no FLOP advantage
- Documentation and unit tests for both methods across ZeRO Stage 1, 2, and 3

## Usage

```json
"optimizer": {
  "type": "Muon",
  "params": {
    "ns_method": "gram"
  }
}
```

Set `"ns_method": "standard"` to disable Gram NS and revert to the original behavior (e.g., for debugging convergence issues).

Performance improvement:

<img width="1630" height="409" alt="image" src="https://github.com/user-attachments/assets/66364bb0-3a99-4cab-a428-10f31b7ae5fa" />

---------

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
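
To illustrate the idea in the Motivation section, here is a minimal NumPy sketch of why iterating on the Gram matrix can reproduce the standard iteration. It is not the code from `original_muon.py`: it runs in float64, omits the fp16 compute and the restart at iteration 2, and assumes a wide input (n ≤ m). The function names (`ns_standard`, `ns_gram`, `flops_standard`, `flops_gram`) are hypothetical, and the quintic coefficients are the ones commonly seen in public Muon implementations, assumed here rather than taken from this commit. The FLOP formulas count only dense matrix products in this particular sketch and ignore any symmetry savings.

```python
import numpy as np

# Quintic coefficients commonly used in public Muon code (an assumption here).
A, B, C = 3.4445, -4.7750, 2.0315


def ns_standard(G, steps=5, eps=1e-7):
    """Standard Newton-Schulz: every step touches the full n x m matrix."""
    X = G / (np.linalg.norm(G) + eps)  # Frobenius normalization
    for _ in range(steps):
        R = X @ X.T                    # n x n Gram matrix
        Z = B * R + C * (R @ R)
        X = A * X + Z @ X              # n x m product on every step
    return X


def ns_gram(G, steps=5, eps=1e-7):
    """Gram variant: iterate on n x n matrices, apply to X once at the end."""
    X0 = G / (np.linalg.norm(G) + eps)
    eye = np.eye(X0.shape[0])
    R = X0 @ X0.T                      # the only Gram product against X
    Q = eye                            # accumulated n x n transform
    for _ in range(steps):
        Z = A * eye + B * R + C * (R @ R)
        Q = Z @ Q                      # X_{k+1} = Z X_k  =>  Q_{k+1} = Z Q_k
        R = Z @ R @ Z                  # Gram matrix of X_{k+1} (Z is symmetric)
    return Q @ X0                      # single n x m product


def flops_standard(n, m, steps=5):
    # Per step: X @ X.T (2n^2 m), R @ R (2n^3), Z @ X (2n^2 m).
    return steps * (4 * n * n * m + 2 * n ** 3)


def flops_gram(n, m, steps=5):
    # Once: X0 @ X0.T and Q @ X0 (2n^2 m each). Per step: four n x n products.
    return 4 * n * n * m + steps * 8 * n ** 3


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((64, 320))   # aspect ratio 5, as in the text
    diff = np.abs(ns_standard(G) - ns_gram(G)).max()
    print("max abs difference:", diff)
    print("FLOP ratio (gram / standard), m = 5n:",
          flops_gram(1024, 5120) / flops_standard(1024, 5120))
```

In exact arithmetic the two functions return the same matrix, because each step multiplies X by a polynomial in its own Gram matrix. In this sketch's accounting the Gram variant needs fewer FLOPs at m = 5n and more at m = n, which is consistent with the fallback to standard NS described in the Changes list. In half precision the accumulated transform can lose accuracy, which is presumably what the restart in the actual implementation addresses.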