[NCCL] Dedicated stream to run all FutureNCCL callbacks. (#43447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43447
There are two main better-engineering motivations for running all FutureNCCL callbacks on a dedicated stream:
1. Previously, each time a `then` callback was invoked, we would take a stream from the pool and run the callback on it. With that approach, stream traces show many different streams and debugging becomes more complicated. With a single dedicated stream running all `then` callback operations, the trace results are much cleaner and easier to follow.
2. `getStreamFromPool` may eventually return the default stream, or a stream that is already in use by other operations, which can cause slowdowns.
Unless the `then` callback takes longer than the preceding allreduce, this approach is as performant as the previous one.
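The motivation above can be sketched with a stdlib analogy (this is illustrative Python, not the actual CUDA-stream implementation; all names here are hypothetical): a single dedicated worker plays the role of the dedicated stream, so every callback lands on one identifiable execution lane instead of a different pool worker each time.

```python
# Stdlib analogy (not PyTorch code): callbacks dispatched to a fresh worker
# from a pool are hard to trace; a single dedicated worker keeps every
# callback on one identifiable "stream".
from concurrent.futures import ThreadPoolExecutor
import threading

results = []

# A dedicated single-worker executor stands in for the dedicated CUDA
# stream: every callback runs on the same worker thread, in submission order.
callback_stream = ThreadPoolExecutor(max_workers=1)

def then_callback(tag):
    # Record which "stream" (here, thread) the callback ran on.
    results.append((tag, threading.current_thread().name))

for i in range(4):
    callback_stream.submit(then_callback, i)
callback_stream.shutdown(wait=True)

# All callbacks ran on the same worker, so a trace of their execution
# shows a single lane rather than many pool workers.
threads = {name for _, name in results}
assert len(threads) == 1
```

The trade-off matches the note above: with one lane, callbacks serialize, which is only a cost if a callback outlasts the operation that precedes it.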
ghstack-source-id: 110909401
Test Plan:
Perf trace runs to validate the desired behavior.
The dedicated stream 152 runs the `then` callback operations:
{F299759342}
I ran pytorch.benchmark.main.workflow with resnet50 on 32 GPUs, registering an allreduce comm hook with a `then` callback.
See f213777896 [traces](https://www.internalfb.com/intern/perfdoctor/results?run_id=26197585)
After updates, the same observation holds: see f214890101
Reviewed By: malfet
Differential Revision: D23277575
fbshipit-source-id: 67a89900ed7b70f3daa92505f75049c547d6b4d9