DeepSpeed
66d29b0a - Graceful exit on failures for multi-node runs (#2008)

Commit
3 years ago
Graceful exit on failures for multi-node runs (#2008)

* Use Popen.terminate() to stop the child processes gracefully; kill them if terminate doesn't work.
* Popen.kill() causes the training processes to end abruptly. The child processes may then become zombies without properly communicating the kill signal to the parent process. As a result, the ssh session continues to wait for signals from the child processes and never returns to the pdsh command.

Fixes microsoft#1995

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
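
The terminate-then-kill pattern described in the commit message can be sketched roughly as follows. This is a minimal illustration using only the standard `subprocess` module, not the actual DeepSpeed launcher code; the helper name `stop_gracefully` and the `timeout` parameter are hypothetical.

```python
import subprocess
import sys


def stop_gracefully(proc, timeout=5.0):
    """Ask a child process to exit, escalating to kill() only if needed.

    Hypothetical helper for illustration; not part of DeepSpeed.
    """
    # Nothing to do if the child has already exited.
    if proc.poll() is not None:
        return proc.returncode

    # Polite request first (SIGTERM on POSIX), so the child can clean up.
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child ignored the request: force it to stop (SIGKILL on POSIX).
        proc.kill()
        # Wait again so the child is reaped and does not linger as a zombie.
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a training process: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("child exit code:", stop_gracefully(child, timeout=2.0))
```

Calling `wait()` after `kill()` matters here: it collects the child's exit status, which is what keeps a forcibly stopped process from remaining a zombie.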
Author
Jerry Mannil