pytorch
08a8a37f - [FSDP] Set `NCCL_DESYNC_DEBUG=0` for FSDP unit tests (#99916)

Committed 2 years ago
[FSDP] Set `NCCL_DESYNC_DEBUG=0` for FSDP unit tests (#99916)

This should fix https://github.com/pytorch/pytorch/issues/99011. With `NCCL_DESYNC_DEBUG=0`, we can run 100 iterations of

```
CUDA_LAUNCH_BLOCKING=1 NCCL_DESYNC_DEBUG=1 CUDA_VISIBLE_DEVICES=0,7 numactl -C 2 python test/distributed/fsdp/test_fsdp_core.py -v -k test_transformer_no_grad --repeat 100 2>&1 | tee out
```

without erroring, whereas with `NCCL_DESYNC_DEBUG=1`, we can reproduce the error at a high failure rate (usually within 10 iterations).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99916
Approved by: https://github.com/zhaojuanmao
Author: Andrew Gu