6e1127ea - [NCCL] Changed FutureNCCL's then callback logic for better efficiency. (#42869)

[NCCL] Changed FutureNCCL's then callback logic for better efficiency. (#42869)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869

We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, QPS was almost 50% lower than when a simple `allreduce` hook is used with no `then` callback. The main problem was that calling `work.wait()` before invoking the `then` callback synchronized `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609), stalling the backward computation.

This PR ensures that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run the `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream rather than the default stream.

ghstack-source-id: 110208431

Test Plan: Run performance benchmark tests to validate that the performance issue is resolved. Also run `python test/distributed/test_c10d.py` to catch any odd issues.

Reviewed By: pritamdamania87

Differential Revision: D23055807

fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5
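The key idea above — chain the post-`allreduce` callback onto its own execution lane so the launcher (the backward pass) never blocks on it — can be sketched with a loose, pure-Python analogy. This is not the actual C++ `FutureNCCL` code and does not use CUDA streams; the thread-pool worker here merely stands in for the side stream obtained via `at::cuda::getStreamFromPool`, and `then`, `WORLD_SIZE`, and the fake `allreduce` result are all hypothetical names for illustration:

```python
from concurrent.futures import Future, ThreadPoolExecutor

WORLD_SIZE = 4  # hypothetical process-group size


def then(fut, callback, executor):
    """Chain `callback` onto `fut` without blocking the caller.

    CPU analogy to the FutureNCCL `then` semantics described above: the
    callback runs on a pool worker (the "side stream"), and only that
    worker waits for the collective's result, so the thread that launched
    the collective (the "backward pass") is never stalled.
    """
    chained = Future()

    def _invoke(done):
        try:
            chained.set_result(callback(done.result()))
        except Exception as exc:  # propagate failures to the chained future
            chained.set_exception(exc)

    # Fires when `fut` completes; hands the callback to a worker instead of
    # running it inline on the launching thread.
    fut.add_done_callback(lambda done: executor.submit(_invoke, done))
    return chained


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Fake "allreduce" producing a summed tensor of world-size values.
        allreduce = pool.submit(lambda: [float(WORLD_SIZE)] * 3)
        # The divide-by-world_size hook from the commit message, chained
        # without blocking the launcher.
        scaled = then(allreduce, lambda t: [x / WORLD_SIZE for x in t], pool)
        print(scaled.result())  # [1.0, 1.0, 1.0]
```

The real fix is stream-level rather than thread-level: the callback's stream waits on `WorkNCCL`'s stream (instead of the default stream waiting on it), which is what keeps the backward computation moving.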