No Muon optimizer for embedding and lm_head layers (#7641)
This PR follows the suggestion in this article
https://kellerjordan.github.io/posts/muon/#empirical-considerations that
non-hidden layers ('embedding' and 'lm_head') should be excluded from
the Muon optimizer. It checks each parameter name for `embed` and
`lm_head` and does not set the `'use_muon'` attribute if either of
these strings is present in the name.
Note that nanochat takes the same approach (it puts the embedding and
lm_head parameters in `adam_groups` to avoid using Muon):
https://github.com/karpathy/nanochat/blob/2e938530ce7f38d51052b4e5b37cf5613d0a45fb/nanochat/gpt.py#L226
so this appears to be common practice.
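The name-based filtering rule can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `mark_muon_params` and the flat `(name, param)` input are assumptions; only the substrings `embed` and `lm_head` and the `use_muon` attribute come from the PR itself.

```python
# Sketch of the exclusion rule described above: parameters whose names
# contain "embed" or "lm_head" are not tagged for Muon and therefore
# fall back to the default optimizer (e.g. Adam/AdamW).

EXCLUDED_SUBSTRINGS = ("embed", "lm_head")


def mark_muon_params(named_params):
    """Set a `use_muon` attribute on each parameter based on its name.

    `named_params` is an iterable of (name, param) pairs, as returned by
    e.g. `model.named_parameters()` in PyTorch.
    """
    for name, param in named_params:
        # Hidden-layer weights get Muon; embedding and lm_head do not.
        param.use_muon = not any(s in name for s in EXCLUDED_SUBSTRINGS)
```

With this rule, a parameter named `model.embed_tokens.weight` or `lm_head.weight` ends up with `use_muon = False`, while a hidden weight such as `model.layers.0.mlp.weight` gets `use_muon = True`.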
Signed-off-by: Guokai Ma <guokai.ma@intel.com>