[TensorPipe] Implement join correctly (#38933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38933
From my reading of how the RPC shutdown operates and of what the ProcessGroup agent does, the `join` method is supposed to act as a barrier across all workers: it returns only once every worker has finished all of its pending work, including work triggered by nested calls or by callbacks. This diff makes the TensorPipe agent's `join` behave that way.
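To illustrate the barrier semantics described above, here is a minimal Python sketch. It is a toy simulation with threads and in-process queues, not the TensorPipe agent's actual code: `run_cluster`, `handle`, the `depth` counter that models nested calls, and the per-round `busy` flags are all hypothetical names chosen for the example. Each worker drains its own pending work, then all workers compare notes; they leave the barrier only after a round in which nobody did any work, since work handled in one round may have queued follow-up work on another worker.

```python
import queue
import threading


def run_cluster(num_workers, initial_tasks):
    """Toy model: initial_tasks is a list of (rank, depth) pairs.

    A task with depth > 0 triggers a nested task (depth - 1) on the next
    worker, standing in for nested RPCs or callbacks.
    """
    inboxes = [queue.Queue() for _ in range(num_workers)]
    barrier = threading.Barrier(num_workers)
    busy = [False] * num_workers   # per-round "I did some work" flags
    handled = [0] * num_workers    # each slot is written by one thread only

    def handle(rank, depth):
        handled[rank] += 1
        if depth > 0:
            # Nested call: queue follow-up work on another worker.
            inboxes[(rank + 1) % num_workers].put(depth - 1)

    def join(rank):
        while True:
            did_work = False
            # Drain everything that is currently pending locally.
            while True:
                try:
                    depth = inboxes[rank].get_nowait()
                except queue.Empty:
                    break
                handle(rank, depth)
                did_work = True
            busy[rank] = did_work
            barrier.wait()           # all flags for this round are published
            anyone_busy = any(busy)
            barrier.wait()           # all flags are read before being reused
            if not anyone_busy:
                return               # a fully idle round: safe to shut down

    for rank, depth in initial_tasks:
        inboxes[rank].put(depth)
    threads = [threading.Thread(target=join, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled, [q.empty() for q in inboxes]


if __name__ == "__main__":
    # One task on worker 0 that fans out into four nested follow-ups.
    print(run_cluster(3, [(0, 4)]))
```

A single "wait for my own queue to be empty" check would not be enough in this model, because a peer that is still running can enqueue nested work after the check. Repeating the round until every worker reports no activity is one simple way to close that gap.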
ghstack-source-id: 104760684
Test Plan: Before this diff, the `test_user_rrefs_confirmed` test of the RPC suite deadlocked intermittently. With this diff applied, I haven't been able to reproduce the deadlock.
Differential Revision: D21703020
fbshipit-source-id: 3d36c6544f1ba8e17ce27ef520ecfd30552045dd