Add multi-host GPU support (#5657)
* add prints
* to be continued.
* make torchrun work on a single host
* Add an example of resnet torchrun
* add prints
* use local rank for allowed_devices.
* remove unwanted comments
* remove comments
* Add torchrun test to the CI.
* add an all_reduce test
* fix ci failures
* remove some comments
* provide an alternative way to set the port for the coordinator.
* fix test by destroying the process group after the test
* fix the single host test.
* fix single host gpu tests.
* add reduce scatter test
* fix comments
* fix a comment
* fix comments
* fix linter
* fix comments
* Use LOCAL_WORLD_SIZE for the spawn case.
* fix more comments