Fix: only add parameter with grads to parameter group (#7869)
This PR fixes a bug that occurs when the Muon optimizer is used while training only part of the model parameters.
When part of the model parameters are trained (and all others frozen), it can happen that all trainable parameters use the Muon optimizer and none of them use AdamW, or vice versa. One of `muon_params` and `non_muon_params` then contains only non-trainable parameters, which eventually causes the failure below.
A reasonable fix is to only add parameters with grads to `muon_params` and `non_muon_params`, so that in the case above one of the parameter groups becomes empty and is filtered out immediately.
```
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/gma/transfer_qwen/finetune_moe.py", line 904, in <module>
[rank3]: main(args)
[rank3]: File "/home/gma/transfer_qwen/finetune_moe.py", line 709, in main
[rank3]: model_engine, optimizer, train_dataloader, lr_scheduler = deepspeed.initialize(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/gma/DeepSpeed/deepspeed/__init__.py", line 214, in initialize
[rank3]: engine = DeepSpeedEngine(args=args,
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/gma/DeepSpeed/deepspeed/runtime/engine.py", line 363, in __init__
[rank3]: self._configure_optimizer(optimizer, model_parameters)
[rank3]: File "/home/gma/DeepSpeed/deepspeed/runtime/engine.py", line 1585, in _configure_optimizer
[rank3]: self.optimizer = self._configure_zero_optimizer(basic_optimizer)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/gma/DeepSpeed/deepspeed/runtime/engine.py", line 1893, in _configure_zero_optimizer
[rank3]: optimizer = Stage1And2ZeroOptimizer(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/gma/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 403, in __init__
[rank3]: flattened_buffer = self.flatten_dense_tensors_aligned(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/gma/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 1044, in flatten_dense_tensors_aligned
[rank3]: return self.flatten(align_dense_tensors(tensor_list, alignment))
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/raid/gma/miniforge3/envs/ds/lib/python3.12/site-packages/torch/_utils.py", line 571, in _flatten_dense_tensors
[rank3]: return torch._C._nn.flatten_dense_tensors(tensors)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: ValueError: torch.cat(): expected a non-empty list of Tensors
```
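The fix described above can be sketched roughly as follows. This is a simplified, hypothetical helper (`build_param_groups` and its group keys are illustrative, not the actual DeepSpeed code): frozen parameters (`requires_grad=False`) are excluded from both groups, and any group that ends up empty is dropped so the ZeRO flattening step never receives an empty tensor list.

```python
import torch

def build_param_groups(model, muon_lr=0.02, adamw_lr=1e-3):
    # Split trainable parameters into Muon and non-Muon (AdamW) groups,
    # skipping frozen parameters so neither group ends up holding only
    # non-trainable tensors.
    muon_params, non_muon_params = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue  # frozen parameter: exclude from both groups
        # Muon is typically applied to 2-D (or higher) weight matrices only
        if p.ndim >= 2:
            muon_params.append(p)
        else:
            non_muon_params.append(p)
    groups = [
        {"params": muon_params, "use_muon": True, "lr": muon_lr},
        {"params": non_muon_params, "use_muon": False, "lr": adamw_lr},
    ]
    # Drop empty groups so downstream flattening never sees an empty list
    return [g for g in groups if g["params"]]
```

With this filtering, a model whose only trainable parameters are 2-D weights yields a single Muon group instead of a Muon group plus an empty AdamW group.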
---------
Signed-off-by: Ma, Guokai <guokai.ma@intel.com>