pytorch
a548fab8 - Add size info to collective logs (#100413)

Add size info to collective logs (#100413)

The previous timeout log did not print size info, making it hard to debug hangs caused by a message-size mismatch. (The reason is that when copying the `WorkNCCL` object during work enqueue, we don't copy `outputs_` due to reference concerns, hence `output.size()` is never triggered.) This PR logs sizes in separate fields, so it no longer relies on `outputs_`.

New timeout log:

```
[Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=209715200, NumelOut=1677721600, Timeout(ms)=10000) ran for 10957 milliseconds before timing out.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100413
Approved by: https://github.com/kumpera