Add multi-host GPU support #5657
wbmc commented on 2023-09-29
miladm commented on 2023-10-10
vanbasten23 marked this pull request as ready for review 2 years ago
vanbasten23 force pushed from 0cb25b4a to 5bd99399 2 years ago
add prints
e8bf5b0e
to be continued.
be3aca3b
made torchrun work on single host
d7ced14e
Add an example of resnet torchrun
4a8157e7
add prints
ae8255b3
use local rank for allowed_devices.
b8674b83
remove unwanted comments
243fa5fd
remove comments
efb49edb
Add torchrun test to the CI.
2b1afdf6
added an all_reduce test
916996ae
fix ci failures
3f253809
remove some comments
93c03ac2
provide an alternative way to set the port for coordinator.
adffcc56
fix test by destroying the process group after the test
87cc9b85
fix the single host test.
3b8d6ed9
fix single host gpu tests.
fc06035d
add reduce scatter test
82e9d439
fix comments
e43b49b3
fix a comment
4bfb3603
fix comments
02fb0567
fix linter
9d87570b
fix comments
f3a065f3
Use LOCAL_WORLD_SIZE for spawn case.
df4e450f
vanbasten23 force pushed from 6e836ba0 to df4e450f 2 years ago
fix more comments
37a3be1e
jonb377 approved these changes on 2023-10-17
Assignees: No one assigned