pytorch
ca03f362 - Change ProcessGroupNCCL default timeout to 10 min (#110947)

Commit
1 year ago
Avoid changing default for other backends as CPU backend (GLOO) may need longer timeouts.

Motivated by trying to save cluster time when encountering collective hangs. Generally collectives should time out within seconds and 30 minutes (or 10 minutes) should provide ample headroom for edge cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110947
Approved by: https://github.com/xw285cornell, https://github.com/fduwjj
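
For callers who would rather not depend on a backend-specific default, the timeout can be passed explicitly when the process group is created. The following is a minimal sketch, not code from this commit: `pick_timeout` and `init_with_explicit_timeout` are hypothetical helper names, the 10-minute value for NCCL comes from the commit title, and the 30-minute value for other backends is assumed to be the existing default the message refers to. It relies only on the `timeout` argument of `torch.distributed.init_process_group`.

```python
from datetime import timedelta

# Assumption: 10 minutes for NCCL (from the commit title); 30 minutes for
# every other backend (the existing default the commit message alludes to).
NCCL_TIMEOUT = timedelta(minutes=10)
OTHER_TIMEOUT = timedelta(minutes=30)


def pick_timeout(backend: str) -> timedelta:
    """Hypothetical helper: choose a collective timeout for a backend name."""
    return NCCL_TIMEOUT if backend.lower() == "nccl" else OTHER_TIMEOUT


def init_with_explicit_timeout(backend: str) -> None:
    """Hypothetical helper: create the default process group with an explicit timeout.

    Expects the usual env:// rendezvous variables (RANK, WORLD_SIZE,
    MASTER_ADDR, MASTER_PORT) to be set by the launcher.
    """
    # Imported lazily so pick_timeout stays usable without torch installed.
    import torch.distributed as dist

    dist.init_process_group(backend=backend, timeout=pick_timeout(backend))
```

Passing the value explicitly keeps a job's behavior stable if the library default changes again, and lets CPU (GLOO) jobs keep a longer limit than GPU jobs.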