DeepSpeed
592325ab - [Zero++ qgZ] Fall back to reduce_scatter if `tensor.numel() % (2 * global_world_size) != 0` (#5056)

Commit · 1 year ago
[Zero++ qgZ] Fall back to reduce_scatter if `tensor.numel() % (2 * global_world_size) != 0` (#5056)

## Why?

See https://github.com/microsoft/DeepSpeed/issues/5054. The actual rule is that qgZ does not work if `tensor.numel() % (2 * global_world_size) != 0` (explained in the Analysis section below). This typically happens when the tensor size is odd or `global_world_size` is odd.

## What?

1. Fall back to reduce_scatter if `tensor.numel() % (2 * global_world_size) != 0`, because the all-to-all would otherwise hit a size-mismatch error (a minimal sketch of this guard appears at the end of this message).
2. Add logging when falling back, to inform users that qgZ is not taking effect.
3. Add a test for the fallback cases. The non-fallback all-to-all path cannot be tested because we do not support multi-node testing for now.

## Analysis?

In [all_to_all_quant_reduce](https://github.com/microsoft/DeepSpeed/blob/93e9537d4ccf0e54042ce98a910dcbc125bb8485/deepspeed/runtime/comm/coalesced_collectives.py#L31):

1. The initial tensor has shape (dim_1, dim_2, ..., dim_n), and its numel is A (= dim_1 * dim_2 * ... * dim_n).
2. After swizzle_quant, `intra_quant_int4` has size (A // 2) if A % 2 == 0: the tensor is quantized from `fp16/bf16` to `int4`, but `intra_quant_int4` is actually stored as `int8`, so every two int4 values are packed into one int8 element. Note that if A % 2 != 0, the quantization still runs, but the resulting size differs case by case (I could not find the underlying rule).
3. At `all_to_all_single(local_output, intra_quant_int4, group=groups[f'local_{intra_idx}'])`, we need `intra_quant_int4.numel() % local_world_size == 0`, i.e. (A // 2) % `local_world_size` == 0.
4. At quantized_reduction, `intra_quant_int4` is chunked into `local_world_size` pieces, which are reduced together.
5. After the reduction, global_input_tensor has size A // (2 * `local_world_size`).
6. At `all_to_all_single(global_output, global_input_tensor, group=groups[f'global_{inter_idx}'])`, we need `global_input_tensor.numel() % n_nodes == 0`, i.e. `(A // (2 * local_world_size)) % n_nodes == 0`.

We can conclude that if `A % (2 * global_world_size) == 0`, all of the above steps run safely. Otherwise, unexpected things may happen (size mismatch, CUDA bad address, etc.). Therefore, to be safe, we only use qgZ when the condition is satisfied.

@GuanhuaWang and I also discussed adding padding before the all-to-all, but that has correctness issues and might involve CUDA-level changes. Thus, the best solution for now is to fall back.
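To make the divisibility argument concrete, here is a standalone sketch (not part of the PR) that replays the buffer-size arithmetic from the steps above without calling DeepSpeed, and checks that `A % (2 * global_world_size) == 0` is sufficient for both all-to-all stages. The tensor sizes and world sizes are taken from the end-to-end tests below.

```python
# Standalone sketch: replays the size arithmetic of the analysis above;
# it does not call DeepSpeed and is not the implementation.
def qgz_sizes_ok(A: int, local_world_size: int, n_nodes: int) -> bool:
    """True if both all-to-all stages can split their buffers evenly."""
    if A % 2 != 0:
        return False                               # int4 packing needs an even numel
    intra_numel = A // 2                           # step 2: two int4 values per int8
    intra_ok = intra_numel % local_world_size == 0             # step 3
    global_numel = A // (2 * local_world_size)                  # step 5
    inter_ok = global_numel % n_nodes == 0                      # step 6
    return intra_ok and inter_ok


if __name__ == "__main__":
    # World and tensor sizes from the end-to-end tests: 2x2 and 3x2 GPUs,
    # dummy-NN tensor of 1024 elements and a llama-7b tensor of 45088768.
    for local_world_size, n_nodes in [(2, 2), (2, 3)]:
        global_world_size = local_world_size * n_nodes
        for A in (1024, 45088768):
            divisible = A % (2 * global_world_size) == 0
            # Whenever the condition holds, both stages are guaranteed safe.
            assert not divisible or qgz_sizes_ok(A, local_world_size, n_nodes)
            print(f"A={A}, global_world_size={global_world_size}: "
                  f"{'qgZ path' if divisible else 'reduce_scatter fallback'}")
```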
## End-to-end test

1. Tested with a dummy NN on 3 nodes with 2 GPUs each (fallback). Ended normally, but with a fallback warning in the middle:

```
[1] LOSS: 3.6015625
[0] LOSS: 3.6015625
[2024-02-02 00:46:10,692] [WARNING] [coalesced_collectives.py:52:all_to_all_quant_reduce] gqZ falls back to reduce_scatter because tensor size = 1024 is not divisible by (2 * global_world_size) = 12. Please consider allocating a new world to enable gqZ
[2024-02-02 00:46:10,692] [WARNING] [coalesced_collectives.py:52:all_to_all_quant_reduce] gqZ falls back to reduce_scatter because tensor size = 1024 is not divisible by (2 * global_world_size) = 12. Please consider allocating a new world to enable gqZ
```

2. Tested with a dummy NN on 2 nodes with 2 GPUs each (non-fallback). Ended normally with no warning:

```
[1] LOSS: 3.544921875
[0] LOSS: 3.58984375
[2024-02-02 00:37:05,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=32, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2024-02-02 00:37:05,388] [INFO] [timer.py:260:stop] epoch=0/micro_step=32/global_step=32, RunningAvgSamplesPerSec=53506.35861091424, CurrSamplesPerSec=54696.98861519354, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[1] LOSS: 3.431640625
[0] LOSS: 3.572265625
[2024-02-02 00:37:05,393] [INFO] [logging.py:96:log_dist] [Rank 0] step=33, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2024-02-02 00:37:05,393] [INFO] [timer.py:260:stop] epoch=0/micro_step=33/global_step=33, RunningAvgSamplesPerSec=53531.621993677356, CurrSamplesPerSec=54300.77616234266, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
```

3. Tested with llama-7b on 3 nodes with 2 GPUs each (fallback). Ended normally, but with a fallback warning in the middle:

```
size) = 12. Please consider allocating a new world to enable gqZ
[2024-02-02 00:52:04,819] [WARNING] [coalesced_collectives.py:52:all_to_all_quant_reduce] gqZ falls back to reduce_scatter because tensor size = 45088768 is not divisible by (2 * global_world_size) = 12. Please consider allocating a new world to enable gqZ
[2024-02-02 00:52:04,820] [WARNING] [coalesced_collectives.py:52:all_to_all_quant_reduce] gqZ falls back to reduce_scatter because tensor size = 45088768 is not divisible by (2 * global_world_size) = 12. Please consider allocating a new world to enable gqZ
```

4. Tested with llama-7b on 2 nodes with 2 GPUs each (non-fallback). Ended normally with no warning:

```
  0%| | 1/6241 [01:26<150:33:40, 86.86s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 2.4093, 'learning_rate': 0, 'epoch': 0.0, 'max_steps': 6241, 'global_step': 1, 'current_step_time_seconds': 88.56739020347595, 'average_step_time_seconds': 88.56739115715027, 'estimated_time_to_completion_seconds': 552660.5208206177, 'estimated_total_training_time_seconds': 552749.0882117748}
  0%| | 2/6241 [02:19<115:32:25, 66.67s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 2.6963, 'learning_rate': 0, 'epoch': 0.0, 'max_steps': 6241, 'global_step': 2, 'current_step_time_seconds': 52.532806396484375, 'average_step_time_seconds': 70.55009877681732, 'estimated_time_to_completion_seconds': 440162.06626856327, 'estimated_total_training_time_seconds': 440303.1664661169}
  0%|
```

## Unit test

`TestAllToAllQuantReduceFallback`

---------

Signed-off-by: byhsu <byhsu@linkedin.com>
Co-authored-by: byhsu <byhsu@linkedin.com>
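As promised in the "What?" section, here is a minimal sketch of the fallback guard, assuming `quantized_all_to_all_reduce` and `plain_reduce_scatter` as hypothetical stand-ins for the real qgZ path and the reduce_scatter fallback; it is not the DeepSpeed implementation.

```python
# Minimal sketch, not the DeepSpeed implementation: detect the unsupported
# size, warn, and route the tensor to an ordinary reduce-scatter instead of
# the quantized qgZ path. The two callables are hypothetical stand-ins.
import logging
from typing import Callable

import torch
import torch.distributed as dist

logger = logging.getLogger(__name__)


def reduce_with_qgz_fallback(
        tensor: torch.Tensor,
        quantized_all_to_all_reduce: Callable[[torch.Tensor], torch.Tensor],
        plain_reduce_scatter: Callable[[torch.Tensor], torch.Tensor]) -> torch.Tensor:
    global_world_size = dist.get_world_size()
    if tensor.numel() % (2 * global_world_size) != 0:
        # Same condition as the PR: the packed-int4 buffer could not be split
        # evenly across ranks, so qgZ would fail with a size mismatch.
        logger.warning(f"qgZ falls back to reduce_scatter because tensor size = {tensor.numel()} "
                       f"is not divisible by (2 * global_world_size) = {2 * global_world_size}")
        return plain_reduce_scatter(tensor)
    return quantized_all_to_all_reduce(tensor)
```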
Files changed:
  • deepspeed/runtime/comm/coalesced_collectives.py
  • tests/unit/runtime/comm/test_coalesced_collectives.py
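For illustration, a fallback unit test in the spirit of `TestAllToAllQuantReduceFallback` might look roughly like the sketch below. This is not the test added by the PR: the import paths follow DeepSpeed's usual unit-test layout, and passing an empty `groups` dict assumes the fallback branch never reaches the quantized all-to-all.

```python
# Rough sketch only; not the actual TestAllToAllQuantReduceFallback from this
# PR. Assumes DeepSpeed's DistributedTest harness (tests/unit/common.py) and
# that the fallback path of all_to_all_quant_reduce never touches `groups`,
# so an empty dict is enough to exercise it on a single node.
import torch

from deepspeed.accelerator import get_accelerator
from deepspeed.runtime.comm.coalesced_collectives import all_to_all_quant_reduce
from unit.common import DistributedTest


class TestQuantReduceFallbackSketch(DistributedTest):
    world_size = 2

    def test(self):
        device = get_accelerator().current_device_name()
        # 21 elements is not divisible by 2 * world_size (= 4), so the call
        # should log the fallback warning and go through reduce_scatter
        # instead of erroring inside the quantized all-to-all.
        input_tensor = torch.randn(3, 7, dtype=torch.half, device=device)
        output = all_to_all_quant_reduce([input_tensor], {})[0]
        assert output is not None
```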