Add Gram Newton-Schulz orthogonalization for Muon optimizer (#7953)
Authors: @delock and @PKUWZP
## Summary
Integrate Gram Newton-Schulz (Gram NS) as the default orthogonalization
method for the Muon optimizer, with a configurable `ns_method` switch to
fall back to the original iteration when needed.
Based on the Gram Newton-Schulz method from
https://tridao.me/blog/2026/gram-newton-schulz/
## Motivation
Standard Newton-Schulz iterates on the full rectangular matrix X (n ×
m). Gram NS instead iterates on the much smaller Gram matrix
R = X @ X.T (n × n), which is significantly cheaper when m >> n. That is
the common case for transformer weight matrices (typical aspect ratio
α = m/n ≈ 5).
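
To illustrate why iterating on the Gram matrix can reproduce the standard iteration, here is a minimal plain-Python sketch. It is not the code added in this PR: it runs in float64 with no restart and no fp16, the helper names are made up for the example, and the coefficients are the quintic values commonly used with Muon, assumed here because this description does not state them.

```python
# Illustrative only: list-of-rows matrices, float64, no restart, no fp16.
# The coefficients are an assumption (values commonly used with Muon).
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315


def matmul(p, q):
    """Return the matrix product p @ q for lists of rows."""
    q_cols = list(zip(*q))
    return [[sum(x * y for x, y in zip(row, col)) for col in q_cols] for row in p]


def transpose(p):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*p)]


def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def poly(r):
    """Return a*I + b*R + c*(R @ R) for a square matrix R."""
    n = len(r)
    r2 = matmul(r, r)
    return [[A_COEF * (1.0 if i == j else 0.0) + B_COEF * r[i][j] + C_COEF * r2[i][j]
             for j in range(n)] for i in range(n)]


def normalize(x):
    """Scale X to unit Frobenius norm before iterating."""
    norm = sum(v * v for row in x for v in row) ** 0.5
    return [[v / norm for v in row] for row in x]


def standard_ns(x, steps=5):
    """Iterate on the full n x m matrix: X <- p(X @ X.T) @ X."""
    x = normalize(x)
    for _ in range(steps):
        x = matmul(poly(matmul(x, transpose(x))), x)
    return x


def gram_ns(x, steps=5):
    """Iterate on n x n matrices only, then apply the result to X once."""
    x = normalize(x)
    r = matmul(x, transpose(x))      # Gram matrix R, n x n
    q = identity(len(x))             # accumulated polynomial Q, n x n
    for _ in range(steps):
        p = poly(r)
        q = matmul(p, q)             # Q <- p(R) @ Q
        r = matmul(matmul(p, r), p)  # R <- p(R) @ R @ p(R), the next Gram matrix
    return matmul(q, x)              # the only n x m product after setup
```

In exact arithmetic both functions return the same matrix, because each standard step multiplies X on the left by a polynomial in the current Gram matrix. In half precision the accumulated products drift, which is presumably what the restart mentioned under Changes addresses.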
## Changes
- Add `zeropower_via_gram_newtonschulz` in `original_muon.py`, using
fp16 compute (better precision than bf16 at the same cost) and a
restart at iteration 2 for half-precision stability
- Add `ns_method` parameter (`"gram"` | `"standard"`) to `muon_update`
and all Muon optimizer classes
- Thread `ns_method` through ZeRO Stage 1/2/3 call sites and DeepSpeed
JSON config
- Fall back automatically to standard NS for square matrices (m ≤ n),
where Gram NS has no FLOP advantage
- Add documentation and unit tests for both methods across ZeRO Stage 1,
2, and 3
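
The fallback rule in the list above can be sketched with a rough FLOP model. Everything here is an assumption made for illustration: the function names are hypothetical, the counts cover dense matmuls only (no savings from symmetric products), and the step and restart counts are defaults chosen for the example rather than values taken from the PR.

```python
def standard_ns_flops(n, m, steps=5):
    """Rough matmul FLOPs for standard NS on an n x m matrix."""
    per_step = 2 * n * n * m   # A = X @ X.T
    per_step += 2 * n ** 3     # A @ A
    per_step += 2 * n * n * m  # (b*A + c*A@A) @ X
    return steps * per_step


def gram_ns_flops(n, m, steps=5, restarts=1):
    """Rough matmul FLOPs for a Gram-matrix iteration with restarts."""
    total = 2 * n * n * m                    # initial R = X @ X.T
    total += steps * 4 * (2 * n ** 3)        # R@R, P@Q, P@R, (P@R)@P per step
    total += 2 * n * n * m                   # final Q @ X
    total += restarts * 2 * (2 * n * n * m)  # each restart: Q @ X, then X @ X.T
    return total


def resolve_ns_method(shape, ns_method="gram"):
    """Hypothetical helper: return the method that would run for a shape."""
    if ns_method not in ("gram", "standard"):
        raise ValueError(f"unknown ns_method: {ns_method!r}")
    n, m = min(shape), max(shape)  # assume the shorter side plays the role of n
    if ns_method == "gram" and m <= n:
        return "standard"          # square matrix: no FLOP advantage
    return ns_method
```

Under this model the Gram variant needs fewer FLOPs for a 1024 × 5120 matrix and more for a 1024 × 1024 one, which matches the motivation for falling back on square matrices.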
## Usage
```json
"optimizer": {
"type": "Muon",
"params": {
"ns_method": "gram"
}
}
```
Set `"ns_method": "standard"` to disable Gram NS and revert to the original behavior (e.g., for debugging convergence issues).
Performance improvement:
<img width="1630" height="409" alt="image"
src="https://github.com/user-attachments/assets/66364bb0-3a99-4cab-a428-10f31b7ae5fa"
/>
---------
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>