Fix FutureNCCL's completed() disagreeing with wait() (#48503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48503
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
My impression is that one property of the upstream Future class is that once .wait() returns, or once a callback is invoked, .completed() should return True. This was not the case for FutureNCCL: .wait() returned immediately and callbacks were invoked inline, but .completed() could still return False if the underlying asynchronous CUDA operations hadn't finished yet.
That was odd and confusing. Since users have other ways to check the status of CUDA operations (should they really need to, which I don't think is common), it seems best to avoid checking the status of CUDA events in .completed().
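The intended invariant can be sketched with a minimal toy future. This is purely illustrative: `ToyFuture`, `set_result`, and the threading-based completion flag are hypothetical and not part of the FutureNCCL or torch.futures API; the point is only that `completed()` agrees with `wait()`.

```python
import threading

class ToyFuture:
    """Hypothetical future illustrating the invariant: once wait()
    returns (or a callback has run), completed() must return True."""

    def __init__(self):
        self._event = threading.Event()
        self._value = None

    def set_result(self, value):
        self._value = value
        self._event.set()  # mark complete before waking any waiters

    def wait(self):
        self._event.wait()
        return self._value

    def completed(self):
        return self._event.is_set()

fut = ToyFuture()
# Complete the future from another thread after a short delay.
threading.Timer(0.01, fut.set_result, args=(42,)).start()
result = fut.wait()
# Invariant: wait() has returned, so completed() must be True.
print(result, fut.completed())
```

Before this change, FutureNCCL effectively violated the analogous invariant by consulting CUDA event status inside completed(), so it could report False even after wait() had returned.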
ghstack-source-id: 118180028
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25180531
fbshipit-source-id: e1207f6b91f010f278923cc5fec1190d0fcdab30