[MTPG] Improve all_reduce and handle bwd thread support (#95524)
This implements all reduce ops in all_reduce and handles a PG being used from a thread different from the one that created it.
We should be this >< close to getting complex training tests working.
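A minimal sketch (not from the PR, and using the regular gloo backend rather than the multi-threaded PG test infrastructure) of the thread situation this needs to handle: the process group is created on one thread, but the collective is issued from another, as happens when gradient hooks run on autograd's backward threads.

```python
import os
import threading
import torch
import torch.distributed as dist


def main() -> None:
    # Single-process "world" just to keep the sketch runnable; the setup
    # below is illustrative and not taken from the PR.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # The process group is created on the main thread.
    dist.init_process_group("gloo", rank=0, world_size=1)

    def worker() -> None:
        # Collective issued from a thread different from the one that
        # created the PG -- the case the PR adds support for.
        t = torch.ones(4)
        dist.all_reduce(t, op=dist.ReduceOp.SUM)

    th = threading.Thread(target=worker)
    th.start()
    th.join()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```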
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95524
Approved by: https://github.com/H-Huang