[2/n] Thread PG: add class _World to distributed_c10d.py (#781) (#88471)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/781
Move a bunch of globals to instance methods and replace all use to them.
We move all PG related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control of how how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.
It almost get DDP working and the PG is missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
This change ensures BC by keeping the global variables around and have the default _World wrap it.
I have relinked this diff to a new github PR, so that I can update it. The original PR is
> Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Differential Revision: D40236769
Pulled By: yhcharles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88471
Approved by: https://github.com/gnadathur, https://github.com/rohan-varma