Remove process group barrier and all_reduce function calls from tensorpipe agent (#65946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65946
Add a new function in agent_utils that synchronizes active call counts using the store. It is intended to replace the barrier and all_reduce calls that the process group performs during RPC shutdown.
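For illustration only, the idea of a store-based count synchronization can be sketched roughly as below. This is not the code added to agent_utils: `InMemoryStore`, `sync_call_count`, `run_workers`, and the key names are hypothetical, and threads stand in for RPC agents that would share a real store.

```python
import threading
import time


class InMemoryStore:
    """Hypothetical stand-in for a shared key-value store (not the c10d Store API)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def add(self, key, amount):
        # Atomically increment a counter and return its new value.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)


def sync_call_count(store, world_size, active_calls, round_id, poll_interval=0.001):
    """Publish the local active call count, wait for every worker, return the total."""
    count_key = f"round{round_id}/active_calls"    # assumed key name
    ready_key = f"round{round_id}/workers_ready"   # assumed key name
    # Contribute the local count before announcing arrival, so that once all
    # workers have arrived the summed count is complete.
    store.add(count_key, active_calls)
    store.add(ready_key, 1)
    # Barrier: spin until the arrival counter reaches the world size.
    while store.get(ready_key) < world_size:
        time.sleep(poll_interval)
    return store.get(count_key)


def run_workers(local_counts):
    """Simulate one synchronization round with one thread per worker."""
    store = InMemoryStore()  # every worker must see this same store
    world_size = len(local_counts)
    totals = [None] * world_size

    def worker(rank):
        totals[rank] = sync_call_count(store, world_size, local_counts[rank], round_id=0)

    threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

In this sketch every worker ends the round with the same global total, which is the information the all_reduce provided. The barrier only exits if all workers increment the same arrival counter, so workers holding different stores would wait indefinitely.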
`test_ddp_comparison` and `test_ddp_comparison_uneven_inputs` fail with these changes. The RPC agents appear not to be accessing the same store, so the total count of processes never reaches the world size and the barrier never exits. Why this happens only for these test cases still needs to be investigated. Setting clean_shutdown to false skips this code path, which allows the tests to pass.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D31762736
Pulled By: H-Huang
fbshipit-source-id: cb5d0efe196f72726c63393c4293e97ec4f18548