Improve parallel processing of universal checkpoint conversion (#5343)
The script that converts a regular checkpoint to a universal one
runs the following steps in parallel:
1. Extract the ZeRO-sharded optimizer states
2. Merge the shards
However, it passes `map()` only a small batch of tasks at a time (as
many as the specified number of workers). It therefore has to wait for
the slowest task in each batch to finish before starting the next batch.
This PR submits all the tasks to the pool at once and waits until the
futures are ready, which keeps all workers busy.
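
The difference between the two scheduling patterns can be sketched as follows. This is a minimal illustration, not the code from the conversion script: the function names `run_in_batches` and `run_all_submitted` are hypothetical, and a `ThreadPoolExecutor` from the standard library is assumed only so the snippet runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_batches(do_work, tasks, num_workers):
    """Batched pattern: map() over num_workers tasks at a time (hypothetical helper)."""
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for start in range(0, len(tasks), num_workers):
            batch = tasks[start:start + num_workers]
            # list() drains the batch before the next one starts, so idle
            # workers wait for the slowest task in the current batch.
            results.extend(list(pool.map(do_work, batch)))
    return results


def run_all_submitted(do_work, tasks, num_workers):
    """Submit-all pattern: queue every task, then wait on the futures (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(do_work, task) for task in tasks]
        # Collect in submission order; a worker picks up the next queued
        # task as soon as it finishes its current one.
        return [future.result() for future in futures]
```

Both helpers return results in the same order as `tasks`; only the second keeps every worker occupied while tasks remain in the queue.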
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>