pytorch
314066bd - Making torch/csrc/cuda nccl usage safe for nccl 2.5 (#29014)

Summary:
Thanks to AddyLaddy and ptrblck for tracking this fix down.

In torch/csrc/cuda/nccl.cpp and torch/csrc/cuda/python_nccl.cpp, construction of the `AutoNcclGroup` guard (which calls `ncclGroupStart()`) [precedes](https://github.com/pytorch/pytorch/pull/29014/files#diff-3b6a42619dd44000cf58c0328b679a1cL239-L241) a possible call to `get_communicators`, which may call `ncclCommInitAll()`. According to our NCCL colleagues, calling `ncclCommInitAll()` inside a `ncclGroupStart()`/`ncclGroupEnd()` region is incorrect. It appeared to work (the relevant tests silently passed) as long as PyTorch was compiled and linked against NCCL 2.4.x, which is what is currently pinned in the third_party/nccl submodule. However, when we compiled and linked against NCCL 2.5.x in internal builds, we began to see test hangs (TestAutogradDeviceTypeCUDA.test_unused_output_device_cuda was what initially brought this to our attention). As far as we know, the present PR fixes those hangs and prevents a nasty surprise once PyTorch starts building against NCCL 2.5.

The backend affected by this PR is exposed via https://github.com/pytorch/pytorch/blob/master/torch/cuda/nccl.py. I'm not sure whether that exposure is actually used anywhere (I believe the distributed frontend is now backed by ProcessGroupNCCL in torch/lib/c10d), so this PR may affect code that is already dead or dying, but still tested. I skimmed ProcessGroupNCCL.cpp for similar vulnerabilities and didn't spot anything obvious.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/29014

Differential Revision: D18274799

Pulled By: ezyang

fbshipit-source-id: c5f88cf187960d61736be14458be01e3675c6702
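
A minimal C++ sketch of the ordering issue and the fixed pattern, assuming a lazily initialized communicator cache similar in spirit to the one in torch/csrc/cuda/nccl.cpp. The helper `get_communicators` and the `broadcast_sketch` function below are illustrative stand-ins, not the actual PyTorch code; only the NCCL/CUDA API calls are real.

```cpp
// Sketch only: shows why communicator initialization must happen before
// ncclGroupStart(), not the actual PyTorch patch.
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

// Hypothetical lazy-init helper: creates one communicator per device on
// first use. Because it may call ncclCommInitAll(), it must NOT be invoked
// while a ncclGroupStart()/ncclGroupEnd() region is open.
std::vector<ncclComm_t>& get_communicators(const std::vector<int>& devices) {
  static std::vector<ncclComm_t> comms;
  if (comms.empty()) {
    comms.resize(devices.size());
    ncclCommInitAll(comms.data(), static_cast<int>(devices.size()),
                    devices.data());
  }
  return comms;
}

void broadcast_sketch(std::vector<void*>& buffers, size_t count,
                      const std::vector<int>& devices) {
  // Problematic pattern (pre-#29014): the AutoNcclGroup guard called
  // ncclGroupStart() first, so a lazy ncclCommInitAll() inside
  // get_communicators() landed inside the open group -- tolerated by
  // NCCL 2.4.x, but hangs with NCCL 2.5.x.
  //
  // Fixed pattern: acquire (and, if needed, initialize) the communicators
  // first, then open the group and issue the collectives.
  auto& comms = get_communicators(devices);  // may call ncclCommInitAll()

  ncclGroupStart();
  for (size_t i = 0; i < comms.size(); ++i) {
    cudaSetDevice(devices[i]);
    // Datatype, root, and stream arguments are simplified for brevity;
    // nullptr means the default stream on the current device.
    ncclBcast(buffers[i], count, ncclChar, /*root=*/0, comms[i],
              /*stream=*/nullptr);
  }
  ncclGroupEnd();
}
```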