xla
6ea99474 - Add multi-host GPU support (#5657)

Commit
1 year ago
Add multi-host GPU support (#5657)

* add prints
* to be continued.
* make torchrun work on a single host
* Add an example of resnet torchrun
* add prints
* use local rank for allowed_devices.
* remove unwanted comments
* remove comments
* Add torchrun test to the CI.
* add an all_reduce test
* fix ci failures
* remove some comments
* provide an alternative way to set the port for coordinator.
* fix test by destroying the process group after the test
* fix the single host test.
* fix single host gpu tests.
* add reduce scatter test
* fix comments
* fix a comment
* fix comments
* fix linter
* fix comments
* Use LOCAL_WORLD_SIZE for spawn case.
* fix more comments