pytorch
98aad933 - [pytorch][PR] Record FutureNCCL callback stream on CUDA caching allocator (#45318)

Commit

4 years ago

[pytorch][PR] Record FutureNCCL callback stream on CUDA caching allocator (#45318) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45318 When calling `then()` from WorkNCCL, record the input data pointers in futureNCCLCallbackStream_ before the execution of the input callback. Note that the recording cannot be directly added to the lambda used by addCallback in ProcessGroupNCCL.hpp. This is because the type of future value in that context is pyobject rather than TensorList, but a type casting will require pybind and introduce Python dependency, which should not be allowed in c10d library. I have considered creating a util function in a separate file to support this type casting, and then placing it under torch/csrc directory where python dependency is allowed. However, torch/csrc has a dependency on c10d, so this will create a circular dependency. Finally, a `record_stream_cb_` member is added to FutureNCCL, and the default value is nullptr. A default `record_stream_cb_` implementation is added to `PythonFutureWrapper,` where Python dependency is allowed. In addition, a few lines are reformatted by lint. caffe2/torch/csrc/distributed/c10d/init.cpp is only reformatted. #Closes: https://github.com/pytorch/pytorch/issues/44203 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_accumulate_gradients_no_sync_allreduce_with_then_hook buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl Reviewed By: pritamdamania87 Differential Revision: D23910257 fbshipit-source-id: 66920746c41f3a27a3689f22e2a2d9709d0faa15

Author

Yi Wang

Committer

facebook-github-bot

Parents

ab28bd52

pytorch 98aad933 - [pytorch][PR] Record FutureNCCL callback stream on CUDA caching allocator (#45318)

pytorch
98aad933 - [pytorch][PR] Record FutureNCCL callback stream on CUDA caching allocator (#45318)