Notify other threads before running callbacks (#31713)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31713
- If the callbacks are heavy or slow, other threads should be able to start working on the future's value as soon as the current thread has moved the value in and unlocked the mutex, instead of waiting for the callbacks to finish.
- `completed()` is not inlined, so avoid calling it to save the function call overhead.
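
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the actual PyTorch code: `SimpleFuture`, its `int` payload, and all member names are hypothetical stand-ins. The point is only the sequence in `markCompleted`: set the value under the lock, release the lock, notify waiters, and run the callbacks last.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical, simplified future used only to illustrate the ordering.
class SimpleFuture {
 public:
  void markCompleted(int value) {
    std::vector<std::function<void()>> callbacks;
    {
      std::unique_lock<std::mutex> lock(mutex_);
      // Read the flag directly rather than calling completed().
      assert(!completed_);
      completed_ = true;
      value_ = value;
      // Take ownership of the callbacks so they can run without the lock.
      callbacks.swap(callbacks_);
    }
    // Wake waiters first: they can use the value while callbacks still run.
    finished_cv_.notify_all();
    for (auto& cb : callbacks) {
      cb();
    }
  }

  // Blocks until the future is completed, then returns the value.
  int wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    finished_cv_.wait(lock, [this] { return completed_; });
    return value_;
  }

  // Queues the callback, or runs it immediately if already completed.
  void addCallback(std::function<void()> cb) {
    {
      std::unique_lock<std::mutex> lock(mutex_);
      if (!completed_) {
        callbacks_.push_back(std::move(cb));
        return;
      }
    }
    cb();
  }

  bool completed() const {
    std::unique_lock<std::mutex> lock(mutex_);
    return completed_;
  }

 private:
  mutable std::mutex mutex_;
  std::condition_variable finished_cv_;
  std::vector<std::function<void()>> callbacks_;
  bool completed_ = false;
  int value_ = 0;
};
```

Because the lock is released before `notify_all()` and before any callback runs, a slow callback delays neither threads blocked in `wait()` nor a callback that itself reads the value.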
ghstack-source-id: 96694593
Test Plan: TBD
Differential Revision: D5624371
fbshipit-source-id: 5762e6e894d20108ec9afedd1a6e64bcd97ee3fe