[NCCL] Dedicated stream to run all FutureNCCL callbacks. (#43447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43447
There are two main better-engineering motivations for running all FutureNCCL callbacks on a dedicated stream:
1. Previously, each time a `then` callback was invoked, we would take a stream from the pool and run the callback on it. With that approach, stream traces show many different streams and debugging becomes more complicated. With a single dedicated stream running all `then` callback operations, the trace results are much cleaner and easier to follow.
2. `getStreamFromPool` may eventually return the default stream, or a stream that is already in use by other operations, which can cause slowdowns.
Unless the `then` callback takes longer than the preceding allreduce, this approach is as performant as the previous one.
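The motivation above can be sketched with a stdlib analogy (this is illustrative Python, not the actual CUDA-stream implementation; all names here are hypothetical): a single dedicated worker plays the role of the dedicated stream, so every callback lands on one identifiable execution lane instead of a different pool worker each time.

```python
# Stdlib analogy (not PyTorch code): callbacks dispatched to a fresh worker
# from a pool are hard to trace; a single dedicated worker keeps every
# callback on one identifiable "stream".
from concurrent.futures import ThreadPoolExecutor
import threading

results = []

# A dedicated single-worker executor stands in for the dedicated CUDA
# stream: every callback runs on the same worker thread, in submission order.
callback_stream = ThreadPoolExecutor(max_workers=1)

def then_callback(tag):
    # Record which "stream" (here, thread) the callback ran on.
    results.append((tag, threading.current_thread().name))

for i in range(4):
    callback_stream.submit(then_callback, i)
callback_stream.shutdown(wait=True)

# All callbacks ran on the same worker, so a trace of their execution
# shows a single lane rather than many pool workers.
threads = {name for _, name in results}
assert len(threads) == 1
```

The trade-off matches the note above: with one lane, callbacks serialize, which is only a cost if a callback outlasts the operation that precedes it.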
ghstack-source-id: 110909401
Test Plan:
Perf trace runs to validate the desired behavior.
The dedicated stream 152 runs the `then` callback operations:
{F299759342}
I ran pytorch.benchmark.main.workflow with resnet50 on 32 GPUs, registering an allreduce comm hook with a `then` callback.
See f213777896 [traces](https://www.internalfb.com/intern/perfdoctor/results?run_id=26197585)
After updates, the same observation holds: see f214890101
Reviewed By: malfet
Differential Revision: D23277575
fbshipit-source-id: 67a89900ed7b70f3daa92505f75049c547d6b4d9