Skip empty parameters in gradient reduction (#7789)
#7736 fixed an issue with NaN propagation in OnebitLamb. With that fix, the
optimizer correctly filters out empty parameters, but the DeepSpeed engine's
gradient allreduce operation (which runs separately from the optimizer)
still includes the gradients of empty parameters.
This PR addresses the issue by skipping empty parameters (numel=0) in
`_get_gradients_for_reduction()`.
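
A minimal sketch of the filtering logic, assuming the helper walks the module's parameters and returns the gradient tensors handed to the engine's allreduce. The standalone signature below is illustrative, not DeepSpeed's actual method; only the `numel() == 0` check reflects this PR.

```python
import torch
from typing import Iterable, List


def get_gradients_for_reduction(
    params: Iterable[torch.nn.Parameter],
) -> List[torch.Tensor]:
    """Collect gradients for the engine-side allreduce.

    Empty parameters (numel=0) are skipped so they are excluded from
    the allreduce, mirroring how the optimizer filters them after #7736.
    """
    grads = []
    for param in params:
        # Parameters that did not participate in the backward pass
        # have no gradient to reduce.
        if param.grad is None:
            continue
        # Skip empty parameters (numel=0); this is the change in this PR.
        if param.numel() == 0:
            continue
        grads.append(param.grad)
    return grads
```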
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>