[NCCL] Fix the initialization of futureNCCLCallbackStreams (#44347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44347
Cloned from Pull Request resolved: https://github.com/pytorch/pytorch/pull/44097, because the original author, Sinan, has completed the internship and is no longer able to submit this diff.
As johnsonpaul mentioned in D23277575 (https://github.com/pytorch/pytorch/commit/7d517cf96f6d53bbe472cf1404e12b9b75230bb6), it looks like all processes were allocating memory on GPU 0.
I was able to reproduce it by running the `test_ddp_comm_hook_allreduce_with_then_hook_nccl` unit test in `test_c10d.py` and checking `nvidia-smi` while the test was running. The issue reproduced as:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   3132563      C   python                                      777MiB  |
|    0   3132564      C   python                                      775MiB  |
|    4   3132564      C   python                                      473MiB  |
+-----------------------------------------------------------------------------+
```
I realized that, as we initialize ProcessGroupNCCL, both processes were initially allocating memory on GPU 0.
We later also realized that I had forgotten the `isHighPriority` argument of `getStreamFromPool`, so `futureNCCLCallbackStreams_.push_back(std::make_shared<at::cuda::CUDAStream>(at::cuda::getStreamFromPool(device_index)));` was just creating a vector of GPU 0 streams. After I changed `at::cuda::getStreamFromPool(device_index)` to `at::cuda::getStreamFromPool(false, device_index)`, `nvidia-smi` looked like:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    673925      C   python                                      771MiB  |
|    0    673926      C   python                                      771MiB  |
|    1    673925      C   python                                      771MiB  |
|    1    673926      C   python                                      771MiB  |
|    2    673925      C   python                                      771MiB  |
|    2    673926      C   python                                      771MiB  |
|    3    673925      C   python                                      771MiB  |
|    3    673926      C   python                                      771MiB  |
|    4    673925      C   python                                      771MiB  |
|    4    673926      C   python                                      771MiB  |
|    5    673925      C   python                                      771MiB  |
|    5    673926      C   python                                      771MiB  |
|    6    673925      C   python                                      771MiB  |
|    6    673926      C   python                                      771MiB  |
|    7    673925      C   python                                      707MiB  |
|    7    673926      C   python                                      623MiB  |
+-----------------------------------------------------------------------------+
```
This confirms that we were just getting GPU 0 streams for the callback. I think this does not explain the `fp16_compress` stability issue, because we were able to reproduce that even without any `then` callback, just by copying from fp32 to fp16 before the allreduce. However, it can explain other issues where `allreduce` was not on par with `no_hook`. I'll run some additional simulations with this diff.
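For reference, here is a minimal sketch of the pitfall, assuming the ATen signature `getStreamFromPool(const bool isHighPriority = false, DeviceIndex device = -1)`; the surrounding function is illustrative and not part of the diff:
```
// Sketch only: illustrates why at::cuda::getStreamFromPool(device_index)
// always returned a GPU 0 stream, assuming the signature
//   CUDAStream getStreamFromPool(const bool isHighPriority = false,
//                                DeviceIndex device = -1);
// where device = -1 means "the current device" (GPU 0 at construction time).
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAStream.h>

void pickCallbackStream(c10::DeviceIndex device_index) {
  // Bug: the integer device_index is implicitly converted to the bool
  // isHighPriority parameter, and `device` keeps its default of -1,
  // so the returned stream always lives on the current device.
  auto buggy = at::cuda::getStreamFromPool(device_index);

  // Intended call: pass isHighPriority explicitly so device_index
  // binds to the `device` parameter.
  auto fixed = at::cuda::getStreamFromPool(/*isHighPriority=*/false, device_index);
  (void)buggy;
  (void)fixed;
}
```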
I tried to replace `getStreamFromPool` with `getDefaultCUDAStream(deviceIndex)`, and that did not cause additional memory usage. In this diff, I temporarily solved the issue by initializing null pointers for each device in the constructor and setting the callback stream for the corresponding devices inside `ProcessGroupNCCL::getNCCLComm`. After the fix, the memory issue looks resolved:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   2513142      C   python                                      745MiB  |
|    4   2513144      C   python                                      747MiB  |
+-----------------------------------------------------------------------------+
```
I could use a dictionary instead of a vector for `futureNCCLCallbackStreams_`, but since the number of devices is fixed, I don't think it is necessary. Please let me know what you think in the comments.
ghstack-source-id: 111485483
Test Plan:
`test_c10d.py` and some perf tests. I also checked `nvidia-smi` while the tests were running to validate that memory usage looks okay.
This diff also fixes the regression in HPC tests when we register a hook:
{F322730175}
See https://fb.quip.com/IGuaAbD8bnvy for details.
Reviewed By: pritamdamania87
Differential Revision: D23495436
fbshipit-source-id: ad08e1d94343252224595d7c8a279fe75e244822