pytorch
96ba2099 - Fix c10d TCP store with mutex (#68499)

Commit
2 years ago
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68499

TCP store is accessed from multiple threads (the NCCL watchdog thread), but unlike FileStore and HashStore it has no mutex protection. Because enabling desync root cause analysis makes store calls more frequent, the TCP store race condition was always triggered when creating another process group such as gloo.

This change adds a mutex to TCP store, matching FileStore and HashStore.

Test Plan:
DDP benchmark with desync debug enabled, no perf regression:
https://www.internalfb.com/intern/fblearner/details/309398285?tab=Outputs

Without this diff:
https://www.internalfb.com/intern/fblearner/details/308379789?tab=Outputs

Reviewed By: mingzhe09088

Differential Revision: D32482254

fbshipit-source-id: e8f466e1c6fdcab6cfa170f44b9be70395935fb8
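
For readers unfamiliar with the pattern, the following is a minimal sketch of mutex-guarded store operations, not the actual c10d code. `ToyStore` and `hammerStore` are hypothetical names invented for illustration, and the in-memory map stands in for whatever state the real TCP store shares between the main thread and the watchdog thread. The assumption is only that every public operation takes the same lock before touching shared state.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory store used only for illustration.
// This is NOT c10d::TCPStore and does not mirror its interface.
class ToyStore {
 public:
  // Adds `delta` to the counter under `key` and returns the new value.
  // The lock makes the read-modify-write a single critical section.
  int64_t add(const std::string& key, int64_t delta) {
    std::lock_guard<std::mutex> guard(mutex_);
    int64_t& value = counters_[key];
    value += delta;
    return value;
  }

  // Returns the counter under `key`, or 0 if the key was never written.
  int64_t get(const std::string& key) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = counters_.find(key);
    return it == counters_.end() ? 0 : it->second;
  }

 private:
  std::mutex mutex_;  // guards counters_
  std::unordered_map<std::string, int64_t> counters_;
};

// Hypothetical helper: several threads call add() on the same key,
// loosely imitating a main thread and a watchdog thread sharing a store.
int64_t hammerStore(ToyStore& store, int numThreads, int iterations) {
  std::vector<std::thread> workers;
  for (int t = 0; t < numThreads; ++t) {
    workers.emplace_back([&store, iterations] {
      for (int i = 0; i < iterations; ++i) {
        store.add("counter", 1);
      }
    });
  }
  for (auto& worker : workers) {
    worker.join();
  }
  return store.get("counter");
}
```

With the lock in place, calling `hammerStore(store, 4, 1000)` on a fresh `ToyStore` leaves the counter at 4000. Without the `std::lock_guard` lines, concurrent access to the map would be a data race with undefined behavior.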