DeepSpeed
c292b03a - Improve parallel process of universal checkpoint conversion (#5343)

Commit

1 year ago

Improve parallel process of universal checkpoint conversion (#5343)

The script that converts a regular checkpoint to a universal one runs the following steps in parallel:

1. Extract the ZeRO-sharded optimizer states
2. Merge the shards

However, it passes `map()` only a small batch of tasks at a time (as many as the number of workers specified), so every batch has to wait for its slowest task to finish before the next batch can start. This PR submits all the tasks to the pool and waits until the futures are ready, which keeps all workers running.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
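The difference between the two scheduling patterns described in the commit message can be illustrated with a minimal sketch. This is not the code from the PR: the function names, the `fake_merge` task, and the use of a thread pool from `concurrent.futures` are all assumptions made for illustration, and the real conversion script may be organized differently.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_batches(fn, tasks, num_workers):
    """Batched pattern: map() over one group of num_workers tasks at a time."""
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for start in range(0, len(tasks), num_workers):
            batch = tasks[start:start + num_workers]
            # list() drains the whole batch, so the next batch cannot start
            # until the slowest task in this one has finished.
            results.extend(list(pool.map(fn, batch)))
    return results


def run_all_submitted(fn, tasks, num_workers):
    """Submit-all pattern: queue every task up front, then collect futures."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(fn, task) for task in tasks]
        # Results are gathered in submission order; a worker that finishes
        # early immediately picks up the next queued task.
        return [future.result() for future in futures]


def fake_merge(shard_id):
    # Hypothetical stand-in for extracting or merging one shard.
    return shard_id * 2


if __name__ == "__main__":
    shard_ids = list(range(10))
    print(run_in_batches(fake_merge, shard_ids, num_workers=3))
    print(run_all_submitted(fake_merge, shard_ids, num_workers=3))
```

Both functions return the same results in the same order. The difference is only in scheduling: with uneven task durations, the batched version leaves workers idle at the end of each batch, while the submit-all version keeps them occupied until the queue is empty.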

References

#5343 - Improve parallel process of universal checkpoint conversion

Author

tohtana

Parents

9b6ef9e1
