Fold transpose into matmul in Gram NS for tall matrices
Replace (Q @ X).mT.contiguous() with X.mT @ Q.mT, which produces a
contiguous result directly. cuBLAS handles transposed inputs natively
via transpose flags, so the matmul cost is identical while the extra
memcpy from .contiguous() is eliminated.
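A minimal sketch of the identity being exploited, (A @ B).T == B.T @ A.T;
the shapes below are illustrative, not taken from the actual Gram NS code:

```python
import torch

# Hypothetical shapes: Q square (m x m), X tall (m x n) with m > n.
# float64 keeps the numerical comparison exact enough for allclose.
m, n = 128, 32
Q = torch.randn(m, m, dtype=torch.float64)
X = torch.randn(m, n, dtype=torch.float64)

# Before: matmul, then a transposed (non-contiguous) view,
# then an extra copy to materialize it contiguously.
before = (Q @ X).mT.contiguous()

# After: fold the transpose into the matmul via (AB)^T = B^T A^T.
# The BLAS backend consumes the transposed views via transpose flags,
# and the fresh n x m output is already contiguous, so no memcpy.
after = X.mT @ Q.mT

assert after.is_contiguous()
assert torch.allclose(before, after)
```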
Benchmark (Qwen2.5-3B, 2xA100, ZeRO-2, avg of 3 runs):
Before: 936.6ms/step (backward: 628.4ms)
After: 931.5ms/step (backward: 612.8ms)
Speedup vs standard NS: 11.2% -> 11.7%
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>