DeepSpeed
2f0924a5 - Fix process hang in process-group shutdown (#7941)

5 days ago
Removing the file used as the file-store while the process group is still active is invalid, because the file is still in use.

If `reuse_dist_env` is `True`, the process group is still active and the processes will try to read from that file, waiting for it to exist. During shutdown (`destroy_process_group`) they wait for all threads to join, but at least one thread is still waiting for that file. This causes the process to hang until a PyTorch-internal timeout is reached, which is currently about 5 minutes.

The solution is to create a unique file. I chose to put it in `tmpdir` and add a suffix to differentiate it. Note that `tmpdir` alone is not enough: this method is already called once during fixture setup, so the directory is not clean when the method is called again later in the test execution.

CC @mrwyattii, author of #3850, which added this code.

---------

Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
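The idea of a unique file-store path can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the helper name `make_unique_init_method`, the `dist_file_store` prefix, and the use of `uuid` for the suffix are all assumptions made for the example.

```python
import os
import tempfile
import uuid


def make_unique_init_method(tmpdir: str, prefix: str = "dist_file_store") -> str:
    """Return a file:// URL for a fresh, not-yet-existing path inside tmpdir.

    Hypothetical helper: the name, prefix, and suffix scheme are illustrative.
    """
    # A new random suffix per call means a later call never has to reuse or
    # delete a file that a still-active process group may be reading.
    suffix = uuid.uuid4().hex
    path = os.path.join(tmpdir, f"{prefix}_{suffix}")
    return f"file://{path}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        first = make_unique_init_method(tmpdir)
        second = make_unique_init_method(tmpdir)
        # Two calls in the same directory yield two different store paths.
        print(first != second)
```

In a setup like this, the URL would presumably be computed once in the launching process and handed to every rank, since all ranks of one process group have to agree on the same file.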