pytorch
fa86874b - Fix intermittent link errors in NCCL build (#84245)

Commit
2 years ago
Fix intermittent link errors in NCCL build (#84245) Should fix #13362 and fix #83790 I think I've discovered the root cause of the intermittent nccl link failures. If we look at the variable name in the redefinition error: ``` _02021d91_11_sendrecv_cu_0bc7b9c8_11152 ``` this is the name of the file being compiled + some form of unique ID. As part of NCCL's build process, the same file is compiled multiple times with different macro definitions depending on which operator and dtype are being compiled, e.g. ``` nvcc -DNCCL_OP=0 -DNCCL_TYPE=0 -dc sendrecv.cu -o sendrecv_sum_i8.o ``` Since the filename parts are the same, then if the unique IDs also happen to collide then the entire identifier will collide and the link fails. So the fix here is to generate a unique `.cu` file for each object file. I've implemented this as a `.patch` file that gets applied from our cmake code, but if we instead fork nccl that would be cleaner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84245 Approved by: https://github.com/janeyx99, https://github.com/malfet
Author
Committer
Parents
Loading