Add a `run_map_reduce` method to fix errors when running `run_map` followed by `run_reduce` (#5131)
In the map-reduce used for data analysis, `run_reduce` merges several
partial-result files into one. There are two open issues (illustrated
in the sketch after this list):
- when running `run_map` followed by `run_reduce`, `run_reduce` may
start before all nodes have finished `run_map`, leading to nodes
loading files that have not yet been populated/flushed (zero-sized
file errors);
- when running `run_reduce`, all nodes load the partial-result files
output by every node, and all nodes write the same merged output
files. This leads to spurious IO errors. `run_reduce` should be run by
only one node, and all nodes should wait for `run_reduce` to finish
before they feed the dataset to `deepspeed.initialize()`.
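
For illustration, the racy pattern that every rank would otherwise
execute looks roughly like this (`analyzer` is a stand-in for the data
analyzer instance; names are illustrative, not the exact API):

```python
def racy_map_reduce(analyzer):
    """The problematic pattern: executed identically by every rank."""
    analyzer.run_map()     # writes this rank's partial-result files
    # issue 1: no barrier, so a fast rank may start reducing while
    # slower ranks are still writing (zero-sized file reads)
    analyzer.run_reduce()  # issue 2: every rank merges and writes the
                           # same output files, causing IO conflicts
```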
This PR fixes both issues when running `run_map` followed by
`run_reduce` by providing a `run_map_reduce` method that implements
the correct logic: it adds a `dist.barrier` to both steps (the barrier
runs on a user-specified `comm_group`) and stops non-master ranks from
running the reduce operation.
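
In outline, the new method implements the control flow below (a
simplified sketch assuming an initialized `torch.distributed` process
group; the actual method lives on the analyzer class and its signature
may differ):

```python
import torch.distributed as dist

def run_map_reduce(analyzer, comm_group=None):
    """Minimal sketch of the fixed map-reduce control flow."""
    analyzer.run_map()
    # wait until every rank has finished writing its partial files
    dist.barrier(group=comm_group)
    # only the master rank performs the merge, avoiding write conflicts
    if dist.get_rank(group=comm_group) == 0:
        analyzer.run_reduce()
    # all ranks wait for the merged files to exist before continuing,
    # e.g. before feeding the dataset to deepspeed.initialize()
    dist.barrier(group=comm_group)
```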
**NOTE:** an alternative workaround is to not provide the
`run_map_reduce` method in this PR and instead add the correct
barriers and safeguards directly to `run_map` and `run_reduce`, as in
[#5130](https://github.com/microsoft/DeepSpeed/pull/5130).
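
Under that alternative, the safeguards would live inside the existing
methods instead, roughly as below (again a hedged sketch with the
existing bodies elided, not the code from #5130):

```python
import torch.distributed as dist

class DataAnalyzer:
    def run_map(self, comm_group=None):
        ...  # existing map logic: write this rank's partial files
        dist.barrier(group=comm_group)  # don't return until all ranks finish

    def run_reduce(self, comm_group=None):
        if dist.get_rank(group=comm_group) == 0:
            ...  # existing merge logic, restricted to one rank
        dist.barrier(group=comm_group)  # all ranks see the merged files
```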
---------
Co-authored-by: Conglong Li <conglong.li@gmail.com>