Avoid zero-sized microbatches for incomplete minibatches when doing curriculum learning (#5118)
Related to curriculum learning and the data efficiency module.
The `get_start_end_idx()` method, which computes which batch indices to
allocate to each data parallel rank, assumes the batch size is
`micro-batch size * data_parallel_size` and allocates sequential subsets
of indices across data loader processes.
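
A minimal sketch of that allocation scheme (hypothetical names and signature, not the exact DeepSpeed code): each rank takes a contiguous `micro_batch_size`-sized slice, which only covers every rank when the batch is complete.

```python
def get_start_end_idx(micro_batch_size, data_parallel_rank):
    # Each data parallel rank takes a fixed, contiguous slice of the batch.
    # This only works out when len(batch) == micro_batch_size * data_parallel_size.
    start_idx = data_parallel_rank * micro_batch_size
    end_idx = start_idx + micro_batch_size
    return start_idx, end_idx
```

For example, with `micro_batch_size=4`, 4 data parallel ranks, and an incomplete final batch of 6 samples, slicing the batch with these indices gives ranks 0 and 1 microbatches of 4 and 2 samples, while ranks 2 and 3 receive empty slices.
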
When `drop_last=False`, the final global batch will very likely be
smaller than `micro-batch size * data_parallel_size`, and
`get_start_end_idx()` will give a full `self.microbatch_size`-sized
batch to the first few ranks while the remaining ones get a
zero-sized microbatch. This leads to load imbalance and (probably)
incorrect updates, since gradients are averaged across different
microbatch sizes.
This PR fixes that by distributing roughly the same number of samples
(±1) across all data loader ranks when the batch is incomplete.
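
A minimal sketch of the balanced split (hypothetical helper name, assuming the batch length may be smaller than `micro-batch size * data_parallel_size`): every rank receives either `floor(batch_len / data_parallel_size)` samples or one more.

```python
def get_balanced_start_end_idx(batch_len, data_parallel_size, data_parallel_rank):
    # Spread batch_len samples across ranks so sizes differ by at most one.
    base, remainder = divmod(batch_len, data_parallel_size)
    # The first `remainder` ranks each take one extra sample.
    start_idx = data_parallel_rank * base + min(data_parallel_rank, remainder)
    end_idx = start_idx + base + (1 if data_parallel_rank < remainder else 0)
    return start_idx, end_idx
```

With the same 6-sample final batch across 4 ranks, this yields per-rank microbatch sizes of 2, 2, 1, 1 instead of 4, 2, 0, 0.
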
---------
Co-authored-by: Conglong Li <conglong.li@gmail.com>