Avoid zero-sized microbatches for incomplete minibatches when doing curriculum learning (#5118)
Related to curriculum learning and the data efficiency module.
The `get_start_end_idx()` method, which computes which batch indices to
allocate to each data parallel rank, assumes the batch size is
`micro-batch size * data_parallel_size` and allocates sequential subsets
of indices across data loader processes.
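
A minimal sketch of that allocation scheme (hypothetical names and signature, not the exact DeepSpeed code): each rank takes a contiguous `micro_batch_size`-sized slice, which only covers every rank when the batch is complete.

```python
def get_start_end_idx(micro_batch_size, data_parallel_rank):
    # Each data parallel rank takes a fixed, contiguous slice of the batch.
    # This only works out when len(batch) == micro_batch_size * data_parallel_size.
    start_idx = data_parallel_rank * micro_batch_size
    end_idx = start_idx + micro_batch_size
    return start_idx, end_idx
```

For example, with `micro_batch_size=4`, 4 data parallel ranks, and an incomplete final batch of 6 samples, slicing the batch with these indices gives ranks 0 and 1 microbatches of 4 and 2 samples, while ranks 2 and 3 receive empty slices.
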
When `drop_last=False`, the final global batch will very likely be
smaller than `micro-batch size * data_parallel_size`, and
`get_start_end_idx()` will give a full `self.microbatch_size`-sized
batch to the first few ranks while the remaining ones get a
zero-sized microbatch. This leads to load imbalance and (probably)
incorrect updates, since gradients are averaged across different
microbatch sizes.
This PR fixes that by distributing roughly the same number of samples
(±1) across all data loader ranks when the batch is incomplete.
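
A minimal sketch of the balanced split (hypothetical helper name, assuming the batch length may be smaller than `micro-batch size * data_parallel_size`): every rank receives either `floor(batch_len / data_parallel_size)` samples or one more.

```python
def get_balanced_start_end_idx(batch_len, data_parallel_size, data_parallel_rank):
    # Spread batch_len samples across ranks so sizes differ by at most one.
    base, remainder = divmod(batch_len, data_parallel_size)
    # The first `remainder` ranks each take one extra sample.
    start_idx = data_parallel_rank * base + min(data_parallel_rank, remainder)
    end_idx = start_idx + base + (1 if data_parallel_rank < remainder else 0)
    return start_idx, end_idx
```

With the same 6-sample final batch across 4 ranks, this yields per-rank microbatch sizes of 2, 2, 1, 1 instead of 4, 2, 0, 0.
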
---------
Co-authored-by: Conglong Li <conglong.li@gmail.com>