No Muon optimizer for embedding and lm_head layers (#7641)
This PR follows the suggestion in this article
https://kellerjordan.github.io/posts/muon/#empirical-considerations that
non-hidden layers ('embedding' and 'lm_head') should be excluded from
the Muon optimizer. It checks each parameter name for `embed` and
`lm_head` and does not set the `'use_muon'` attribute if either of
these strings is present in the name.
Note that nanochat takes the same approach (it puts the embedding and
lm_head parameters in `adam_groups` to avoid using Muon):
https://github.com/karpathy/nanochat/blob/2e938530ce7f38d51052b4e5b37cf5613d0a45fb/nanochat/gpt.py#L226
so this appears to be common practice.
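The name-based filtering rule can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `mark_muon_params` and the flat `(name, param)` input are assumptions; only the substrings `embed` and `lm_head` and the `use_muon` attribute come from the PR itself.

```python
# Sketch of the exclusion rule described above: parameters whose names
# contain "embed" or "lm_head" are not tagged for Muon and therefore
# fall back to the default optimizer (e.g. Adam/AdamW).

EXCLUDED_SUBSTRINGS = ("embed", "lm_head")


def mark_muon_params(named_params):
    """Set a `use_muon` attribute on each parameter based on its name.

    `named_params` is an iterable of (name, param) pairs, as returned by
    e.g. `model.named_parameters()` in PyTorch.
    """
    for name, param in named_params:
        # Hidden-layer weights get Muon; embedding and lm_head do not.
        param.use_muon = not any(s in name for s in EXCLUDED_SUBSTRINGS)
```

With this rule, a parameter named `model.embed_tokens.weight` or `lm_head.weight` ends up with `use_muon = False`, while a hidden weight such as `model.layers.0.mlp.weight` gets `use_muon = True`.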
Signed-off-by: Guokai Ma <guokai.ma@intel.com>