Fix FutureNCCL's completed() disagreeing with wait() (#48503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48503
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
My impression is that one property of the upstream Future class is that once .wait() returns, or once a callback is invoked, .completed() should return True. This was not the case for FutureNCCL: .wait() returned immediately and callbacks were invoked inline, but .completed() could still return False if the underlying asynchronous CUDA operations hadn't finished yet.
That was odd and confusing. Since users have other ways to check the status of CUDA operations (should they really need to, which I don't think is common), it seems best to avoid checking the status of CUDA events in .completed().
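The intended invariant can be sketched with a minimal toy future. This is purely illustrative: `ToyFuture`, `set_result`, and the threading-based completion flag are hypothetical and not part of the FutureNCCL or torch.futures API; the point is only that `completed()` agrees with `wait()`.

```python
import threading

class ToyFuture:
    """Hypothetical future illustrating the invariant: once wait()
    returns (or a callback has run), completed() must return True."""

    def __init__(self):
        self._event = threading.Event()
        self._value = None

    def set_result(self, value):
        self._value = value
        self._event.set()  # mark complete before waking any waiters

    def wait(self):
        self._event.wait()
        return self._value

    def completed(self):
        return self._event.is_set()

fut = ToyFuture()
# Complete the future from another thread after a short delay.
threading.Timer(0.01, fut.set_result, args=(42,)).start()
result = fut.wait()
# Invariant: wait() has returned, so completed() must be True.
print(result, fut.completed())
```

Before this change, FutureNCCL effectively violated the analogous invariant by consulting CUDA event status inside completed(), so it could report False even after wait() had returned.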
ghstack-source-id: 118180028
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25180531
fbshipit-source-id: e1207f6b91f010f278923cc5fec1190d0fcdab30