[Zero++ qgZ] Fall back to reduce_scatter if `tensor.numel() % (2 * global_world_size) != 0` (#5056)
## Why?
See https://github.com/microsoft/DeepSpeed/issues/5054. The underlying rule is that qgZ does not work if `tensor.numel() % (2 * global_world_size) != 0`; the analysis below explains why. This typically happens when the tensor size is odd or `global_world_size` is odd.
## What?
1. Fall back to reduce_scatter if `tensor.numel() % (2 * global_world_size) != 0`, because the all-to-all would otherwise fail with a size-mismatch error (see the sketch after this list).
2. Add a warning log when falling back, so users know qgZ is not taking effect.
3. Add a test for the fallback cases. The non-fallback all-to-all path cannot be tested because we do not support multi-node testing for now.
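In the PR, the check lives inside `all_to_all_quant_reduce` in `deepspeed/runtime/comm/coalesced_collectives.py` (see the warning lines in the logs below). The following is only a minimal sketch of that guard, factored into a standalone helper; the helper name is hypothetical, and Python's built-in logging is used here instead of DeepSpeed's logger.

```python
import logging

import torch

logger = logging.getLogger(__name__)


def should_fall_back_to_reduce_scatter(tensor: torch.Tensor, global_world_size: int) -> bool:
    """Return True when qgZ cannot be applied and reduce_scatter should be used instead."""
    if tensor.numel() % (2 * global_world_size) != 0:
        # Mirrors the warning emitted by all_to_all_quant_reduce when it falls back.
        logger.warning(
            "qgZ falls back to reduce_scatter because tensor size = %d is not "
            "divisible by (2 * global_world_size) = %d.",
            tensor.numel(),
            2 * global_world_size,
        )
        return True
    return False
```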
## Analysis?
In
[all_to_all_quant_reduce](https://github.com/microsoft/DeepSpeed/blob/93e9537d4ccf0e54042ce98a910dcbc125bb8485/deepspeed/runtime/comm/coalesced_collectives.py#L31):
1. The initial tensor has shape (dim_1, dim_2, ..., dim_n), and its numel is A (= dim_1 * dim_2 * ... * dim_n).
2. After `swizzle_quant`, `intra_quant_int4` has size (A // 2) if A % 2 == 0: the tensor is quantized from `fp16/bf16` to `int4`, but `intra_quant_int4` is actually stored as `int8`, meaning every two `int4` values are packed into one `int8` element. Note that if A % 2 != 0, the quantization can still proceed, but the resulting size varies case by case (I could not find the underlying rule).
3. At `all_to_all_single(local_output, intra_quant_int4, group=groups[f'local_{intra_idx}'])`, we need `intra_quant_int4.numel() % local_world_size == 0`, i.e. `(A // 2) % local_world_size == 0`.
4. At `quantized_reduction`, `intra_quant_int4` is chunked into `local_world_size` pieces, which are reduced together.
5. After the reduction, `global_input_tensor` has size `A // (2 * local_world_size)`.
6. At `all_to_all_single(global_output, global_input_tensor, group=groups[f'global_{inter_idx}'])`, we need `global_input_tensor.numel() % n_nodes == 0`, i.e. `(A // (2 * local_world_size)) % n_nodes == 0`. The arithmetic of steps 2-6 is summarized in the sketch below.
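The size bookkeeping in steps 2-6 reduces to plain arithmetic. A minimal sketch, assuming the `int4` packing exactly halves the element count (the A % 2 == 0 case above); the function name is hypothetical and not part of the PR:

```python
def qgz_sizes_are_compatible(numel: int, local_world_size: int, n_nodes: int) -> bool:
    """Check the divisibility constraints from steps 2-6 for a tensor of `numel` elements."""
    if numel % 2 != 0:                    # step 2: two int4 values pack into one int8
        return False
    packed = numel // 2                   # numel of intra_quant_int4
    if packed % local_world_size != 0:    # step 3: intra-node all_to_all_single
        return False
    reduced = packed // local_world_size  # step 5: size of global_input_tensor
    if reduced % n_nodes != 0:            # step 6: inter-node all_to_all_single
        return False
    return True

# With global_world_size = local_world_size * n_nodes, the three checks above are
# equivalent to the single condition: numel % (2 * global_world_size) == 0.
```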
We can conclude that if `A % (2 * global_world_size) == 0`, then the steps above can run safely. Otherwise, unexpected things may happen (size mismatch, CUDA bad address, etc.). Therefore, to be safe, we only use qgZ when the condition is satisfied.
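Plugging in the topologies from the end-to-end tests below makes the condition concrete (tensor sizes taken from the warning logs):

```python
>>> 1024 % (2 * 6)       # dummy nn, 3 nodes x 2 GPUs: global_world_size = 6 -> fallback
4
>>> 1024 % (2 * 4)       # dummy nn, 2 nodes x 2 GPUs: global_world_size = 4 -> qgZ runs
0
>>> 45088768 % (2 * 6)   # llama-7b tensor, 3 nodes x 2 GPUs -> fallback
4
```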
@GuanhuaWang and I also discussed adding padding before the all-to-all, but that would have correctness issues and might involve CUDA-level changes. Thus, the best solution for now is to fall back.
## End-to-end test
1. Tested with a dummy nn, 3 nodes with 2 GPUs each (fallback).
Ended normally, but with fallback warnings in the middle:
```
[1] LOSS: 3.6015625
[0] LOSS: 3.6015625
[2024-02-02 00:46:10,692] [WARNING] [coalesced_collectives.py:52:all_to_all_quant_reduce] gqZ falls back to reduce_scatter because tensor size = 1024 is not divisible by (2 * global_world_size) = 12. Please consider allocating a new world to enable gqZ
[2024-02-02 00:46:10,692] [WARNING] [coalesced_collectives.py:52:all_to_all_quant_reduce] gqZ falls back to reduce_scatter because tensor size = 1024 is not divisible by (2 * global_world_size) = 12. Please consider allocating a new world to enable gqZ
```
2. Tested with a dummy nn, 2 nodes with 2 GPUs each (non-fallback).
Ended normally with no warnings:
```
[1] LOSS: 3.544921875
[0] LOSS: 3.58984375
[2024-02-02 00:37:05,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=32, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2024-02-02 00:37:05,388] [INFO] [timer.py:260:stop] epoch=0/micro_step=32/global_step=32, RunningAvgSamplesPerSec=53506.35861091424, CurrSamplesPerSec=54696.98861519354, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[1] LOSS: 3.431640625
[0] LOSS: 3.572265625
[2024-02-02 00:37:05,393] [INFO] [logging.py:96:log_dist] [Rank 0] step=33, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2024-02-02 00:37:05,393] [INFO] [timer.py:260:stop] epoch=0/micro_step=33/global_step=33, RunningAvgSamplesPerSec=53531.621993677356, CurrSamplesPerSec=54300.77616234266, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
```
3. Tested with llama-7b, 3 nodes with 2 GPUs each (fallback).
Ended normally, but with fallback warnings in the middle:
```
size) = 12. Please consider allocating a new world to enable gqZ
[2024-02-02 00:52:04,819] [WARNING] [coalesced_collectives.py:52:all_to_all_quant_reduce] gqZ falls back to reduce_scatter because tensor size = 45088768 is not divisible by (2 * global_world_size) = 12. Please consider allocating a new world to enable gqZ
[2024-02-02 00:52:04,820] [WARNING] [coalesced_collectives.py:52:all_to_all_quant_reduce] gqZ falls back to reduce_scatter because tensor size = 45088768 is not divisible by (2 * global_world_size) = 12. Please consider allocating a new world to enable gqZ
```
4. Tested with llama-7b, 2 nodes with 2 GPUs each (non-fallback).
Ended normally with no warnings:
```
0%| | 1/6241 [01:26<150:33:40, 86.86s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 2.4093, 'learning_rate': 0, 'epoch': 0.0, 'max_steps': 6241, 'global_step': 1, 'current_step_time_seconds': 88.56739020347595, 'average_step_time_seconds': 88.56739115715027, 'estimated_time_to_completion_seconds': 552660.5208206177, 'estimated_total_training_time_seconds': 552749.0882117748}
0%| | 2/6241 [02:19<115:32:25, 66.67s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 2.6963, 'learning_rate': 0, 'epoch': 0.0, 'max_steps': 6241, 'global_step': 2, 'current_step_time_seconds': 52.532806396484375, 'average_step_time_seconds': 70.55009877681732, 'estimated_time_to_completion_seconds': 440162.06626856327, 'estimated_total_training_time_seconds': 440303.1664661169}
0%|
```
## Unit test
Added `TestAllToAllQuantReduceFallback`, covering the fallback cases.
---------
Signed-off-by: byhsu <byhsu@linkedin.com>
Co-authored-by: byhsu <byhsu@linkedin.com>