Put Muon optimizer momentum buffer on GPU (#7648)
This PR puts the Muon optimizer's momentum buffer on the GPU, which
makes the optimizer run much faster: when fine-tuning Qwen2.5-3B on
2x A100 cards, iteration time drops from 1500 ms to 910 ms. Previously
this buffer was kept on the CPU.
---------
Signed-off-by: Guokai Ma <guokai.ma@intel.com>