DeepSpeed
95be499a - Method `run_map_reduce` to fix errors when running `run_map` followed by `run_reduce` (#5131)

In the map-reduce used in data analysis, `run_reduce` merges several files into one. There are two open issues:

- When running `run_map` followed by `run_reduce`, the `run_reduce` may start before all nodes have finished the `run_map`, so nodes may load files that are not yet populated/flushed (zero-sized file errors).
- When running `run_reduce`, every node loads the partial result files output by all nodes, and every node writes the same merged output files. This leads to strange IO errors. `run_reduce` should be run by only one node, and all nodes should wait for `run_reduce` to finish before they feed the dataset to `deepspeed.initialize()`.

This PR fixes both issues when running `run_map` followed by `run_reduce` by providing a method `run_map_reduce` that implements this logic. It adds `dist.barrier`s to both steps (where the barrier runs on a user-specified `comm_group`) and has non-master nodes skip the reduce operation.

**NOTE:** an alternative workaround is to not provide the `run_map_reduce` method in this PR, but instead add the correct barriers and safeguards to `run_map` and `run_reduce`, as in [5130](https://github.com/microsoft/DeepSpeed/pull/5130).

---------

Co-authored-by: Conglong Li <conglong.li@gmail.com>
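The coordination pattern described above (all nodes map, barrier, one node reduces, barrier, all nodes proceed) can be illustrated with a minimal runnable sketch. Note the assumptions: Python's `threading.Barrier` stands in for `dist.barrier` on a `comm_group`, threads stand in for nodes, and the in-memory lists stand in for the partial result files; the function name `run_map_reduce` mirrors the PR, but this is a simplified model, not DeepSpeed's actual implementation.

```python
import threading

def run_map_reduce(rank, world_size, barrier, partials, result):
    # Map step: each "node" writes its own partial output.
    partials[rank] = [rank * 10 + i for i in range(3)]
    # Barrier 1: no node may start the reduce until every map
    # has finished writing (avoids the zero-sized-file error).
    barrier.wait()
    # Reduce step: only the master (rank 0) merges the partials,
    # so no two nodes write the same output concurrently.
    if rank == 0:
        result.extend(sorted(x for p in partials for x in p))
    # Barrier 2: all nodes wait for the merged result before
    # consuming it (in DeepSpeed, before deepspeed.initialize()).
    barrier.wait()

world_size = 4
barrier = threading.Barrier(world_size)
partials = [None] * world_size
result = []
threads = [
    threading.Thread(target=run_map_reduce,
                     args=(r, world_size, barrier, partials, result))
    for r in range(world_size)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result)  # [0, 1, 2, 10, 11, 12, 20, 21, 22, 30, 31, 32]
```

Without the first barrier, rank 0 could merge before slower ranks have populated `partials`; without the second, other ranks could read `result` before the merge completes. These are exactly the two failure modes the PR addresses.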