[ptd] make multithreaded pg wait for readiness before the 1st collective (#106954)
Summary:
This was not a problem before, because c10d collective init applied a store-based barrier.
That changed recently in https://github.com/pytorch/pytorch/pull/103033,
after which the barrier is no longer applied by default.
For normal PGs like gloo/nccl, this is not a problem as the rendezvous process is implicitly a barrier anyway.
For the threaded PG, however, dropping the store-based barrier leads to a race condition, because the local PG does not wait for the rest of the world to be ready before starting collectives.
This change fixes the issue by running a store-based barrier for each PG that is created.
The CV attempt would not work, since it would still rely on class-level variables, which break in the device mesh case. See the inline comment for details.
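
For illustration only, the store-based barrier idea can be sketched with plain Python threads. This is not the code in this PR: `InMemoryStore`, `store_based_barrier`, `run_world`, and the key name are hypothetical stand-ins, and the counter-and-poll scheme is an assumption about the general shape of such a barrier. Each rank bumps a shared counter once its PG is set up, then polls until the counter reaches the world size.

```python
import threading
import time


class InMemoryStore:
    """Hypothetical stand-in for a key/value store shared by all ranks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def add(self, key, amount):
        # Atomically add `amount` to the counter at `key` and return the new value.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]


def store_based_barrier(store, key, world_size, timeout=5.0, poll_interval=0.001):
    """Block until `world_size` callers have checked in under `key`."""
    store.add(key, 1)  # announce that this rank is ready
    deadline = time.monotonic() + timeout
    # add(key, 0) reads the counter without changing it.
    while store.add(key, 0) < world_size:
        if time.monotonic() > deadline:
            raise TimeoutError(f"barrier {key!r} timed out")
        time.sleep(poll_interval)


def run_world(world_size):
    """Run one thread per rank; return how many ranks each one saw as ready."""
    store = InMemoryStore()
    registered = []  # ranks whose (simulated) PG setup has finished
    seen = [None] * world_size

    def worker(rank):
        time.sleep(0.01 * rank)  # stagger setup so ranks finish at different times
        registered.append(rank)  # simulated PG creation for this rank
        store_based_barrier(store, "pg0/barrier", world_size)
        # Past the barrier, every rank must already be registered.
        seen[rank] = len(registered)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Without the `store_based_barrier` call, rank 0 in this sketch could read `registered` before the slower ranks have appended to it, which is the same kind of race described above for the first collective.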
Differential Revision: D48220125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106954
Approved by: https://github.com/wanchaol, https://github.com/H-Huang, https://github.com/XilunWu