C10D extension to enable per-thread PG (#84153)
Move a bunch of globals to instance methods and replace all use to them.
We move all PG related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control of how how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.
It almost get DDP working and the PG is missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153
Approved by: https://github.com/rohan-varma