[PJRT] Experimental support for `torch.distributed` and DDP on TPU v2/v3 (#4520)
* Implement multithreaded XLA process group
* Fix tests
* Merge PJRT MNIST test
* Apply formatting
* Clarify random generation in test_ddp.py
* Mark some variables private
* Remove some extra comments
* Add test that uses the `env://` init method
* Explain local RNG
* Explain `--pjrt_distributed` flag