xla: Add multi-host GPU support #5657 (Merged)

vanbasten23 merged 24 commits into master from multihostgpu_poc_3
vanbasten23 added DO_NOT_REVIEW_YET
JackCaoG
wbmc
wbmc commented on 2023-09-29
miladm commented on 2023-10-10
vanbasten23 marked this pull request as ready for review 2 years ago
vanbasten23 removed DO_NOT_REVIEW_YET
vanbasten23 requested a review from jonb377 2 years ago
vanbasten23 requested a review from will-cromar 2 years ago
vanbasten23 commented on 2023-10-12
jonb377 commented on 2023-10-12
yeounoh requested a review from yeounoh 2 years ago
will-cromar commented on 2023-10-12
will-cromar commented on 2023-10-12
will-cromar commented on 2023-10-12
will-cromar commented on 2023-10-12
vanbasten23 force pushed from 0cb25b4a to 5bd99399 2 years ago
will-cromar commented on 2023-10-14
vanbasten23 add prints
e8bf5b0e
vanbasten23 to be continued.
be3aca3b
vanbasten23 made torchrun works on single host
d7ced14e
vanbasten23 Add an example of resnet torchrun
4a8157e7
vanbasten23 add prints
ae8255b3
vanbasten23 use local rank for allowed_devices.
b8674b83
vanbasten23 remove unwanted comments
243fa5fd
vanbasten23 remove comments
efb49edb
vanbasten23 Add torchrun test to the CI.
2b1afdf6
vanbasten23 added a ll_reduce test
916996ae
vanbasten23 fix ci failures
3f253809
vanbasten23 remove some comments
93c03ac2
vanbasten23 provide an alternative way to set the port for coordinator.
adffcc56
vanbasten23 fix test by destroying the process group after the test
87cc9b85
vanbasten23 fix the single host test.
3b8d6ed9
vanbasten23 fix single host gpu tests.
fc06035d
vanbasten23 add reduce scatter test
82e9d439
vanbasten23 fix comments
e43b49b3
vanbasten23 fix a comment
4bfb3603
vanbasten23 fix comments
02fb0567
vanbasten23 fix linter
9d87570b
vanbasten23 fix comments
f3a065f3
will-cromar commented on 2023-10-16
will-cromar commented on 2023-10-16
will-cromar approved these changes on 2023-10-16
vanbasten23 Use Local_WORLD_SIZE for spawn case.
df4e450f
vanbasten23 force pushed from 6e836ba0 to df4e450f 2 years ago
vanbasten23 fix more comments
37a3be1e
vanbasten23 requested a review from jonb377 2 years ago
vanbasten23 requested a review from wbmc 2 years ago
vanbasten23 requested a review from miladm 2 years ago
jonb377 approved these changes on 2023-10-17
vanbasten23 merged 6ea99474 into master 2 years ago
