Modify nccl_dependency to take dev mode (#79169)
Summary:
Modify nccl_dependency to take dev mode. The default is still the tp2 version.
Suggestions from D35919342 are incorporated here.
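For readers unfamiliar with the option, the following is a minimal sketch of how the `hpc_comms.use_nccl` values exercised in the test plan could be interpreted. It is not the actual `nccl_dependency` code: the function name `resolve_nccl_mode`, the constant `DEFAULT_NCCL`, and the returned tuple shape are hypothetical and chosen only for illustration. The one behavior taken from this change is that an unset value falls back to tp2.

```python
# Hypothetical sketch only; these names do not come from the real build macro.
DEFAULT_NCCL = "tp2"  # per this change, the default stays on the tp2 version


def resolve_nccl_mode(use_nccl=None):
    """Map a hpc_comms.use_nccl value to (mode, pinned_version_or_None)."""
    value = (use_nccl or DEFAULT_NCCL).strip()
    if value == "tp2":
        return ("tp2", None)
    if value == "dev":
        # Plain "dev": no specific dev version requested.
        return ("dev", None)
    if value.startswith("dev_"):
        # Assumed convention: text after "dev_" names a dev version,
        # e.g. "dev_v2.10.3-1" -> "v2.10.3-1".
        return ("dev", value[len("dev_"):])
    raise ValueError("unsupported hpc_comms.use_nccl value: %r" % value)
```

The four configurations in the test plan below correspond to `resolve_nccl_mode("dev")`, `resolve_nccl_mode("dev_v2.10.3-1")`, `resolve_nccl_mode("tp2")`, and `resolve_nccl_mode()` in this sketch.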
Test Plan:
NCCL TESTS
Using version dev:
Build:
hpc_comms.use_nccl=dev
```
buck build mode/opt -c hpc_comms.use_nccl=dev -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507192135 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version dev_v2.10.3-1:
Build:
hpc_comms.use_nccl=dev_v2.10.3-1
```
buck kill && buck clean && buck build mode/opt -c hpc_comms.use_nccl=dev_v2.10.3-1 -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507194570 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version tp2:
Build:
hpc_comms.use_nccl=tp2
```
buck kill && buck clean && buck build mode/opt -c hpc_comms.use_nccl=tp2 -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507195497 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version default:
Build:
hpc_comms.use_nccl not set (defaults to tp2)
```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507207374 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"
--------
RUNNING PARAM COMMS TO TEST CAFFE TORCH INTEGRATION WITH NCCL DEV LIB
Using version dev:
Build:
hpc_comms.use_nccl=dev
```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c hpc_comms.use_nccl=dev -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507214467 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version dev_v2.10.3-1:
Build:
hpc_comms.use_nccl=dev_v2.10.3-1
```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c hpc_comms.use_nccl=dev_v2.10.3-1 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507247559 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version tp2:
Build:
hpc_comms.use_nccl=tp2
```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c hpc_comms.use_nccl=tp2 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507251808 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version default:
Build:
hpc_comms.use_nccl not set (defaults to tp2)
```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507256357 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"
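--------
Each result above was confirmed by reading the "NCCL version" line from the logs. A small helper along these lines could automate that check. This is a sketch only: the function name `nccl_build_from_log` is hypothetical, and the pattern assumes the version line has the same shape as the strings quoted in the results (a `dev` suffix directly after the version number for dev builds, none for tp2 builds).

```python
import re

# Matches lines shaped like the ones quoted in the results above, e.g.
#   NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR
_VERSION_RE = re.compile(r"NCCL version (\d+\.\d+\.\d+)(dev)?\+cuda")


def nccl_build_from_log(log_text):
    """Return (version, is_dev) for the first version line, or None if absent."""
    match = _VERSION_RE.search(log_text)
    if match is None:
        return None
    return (match.group(1), match.group(2) is not None)
```

With this helper, a log from a `hpc_comms.use_nccl=dev` build would be expected to report `is_dev` as true, and a tp2 or default build would report it as false.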
Differential Revision: D36873694
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79169
Approved by: https://github.com/kingchc, https://github.com/kwen2501