xla
ZeRO1: Add bucketting logic to control the size of tensors for all-gather/reduce-scatter
#6025
Merged

JackCaoG merged 15 commits into master from jeffhataws_zero1_fixes2
jeffhataws
jeffhataws force pushed from 62c5109a to e692a81c 1 year ago
jeffhataws force pushed from 7c3d92da to 84a509d0 1 year ago
jeffhataws added backport_2.2
jeffhataws requested a review from alanwaketan 1 year ago
jeffhataws requested a review from JackCaoG 1 year ago
jeffhataws force pushed from 84a509d0 to 285a766c 1 year ago
jeffhataws force pushed from a4532576 to 6022c917 1 year ago
JackCaoG
JackCaoG added backport_2.3
jeffhataws commented on 2024-03-13
jeffhataws commented on 2024-03-13
jeffhataws commented on 2024-03-15
aws-rhsoln add bucketting logic to control the size of tensors for all-gather an…
90eda151
jeffhataws Yapf lint fixes
46a069af
aws-rhsoln handle the case when groups is none
8e79997b
hgt312 update zero1
5a87467e
jeffhataws yapf lint fixes
b354c277
jeffhataws Fix missing curly brackets in assertion msg
22e29d37
amithrm Fixing FAL issue when sharded params are initialized with torch.double
96c61cd6
jeffhataws Yapf fixes
6b7ce8fa
jeffhataws Fix indices and variable names
a5de71af
jeffhataws Checking of <tensor>.numel for output tensors cause error in GPU runtime
77b2ad17
jeffhataws force pushed from a8f050e5 to 77b2ad17 1 year ago
jeffhataws commented on 2024-03-16
JackCaoG
jeffhataws Avoid passing empty input buckets
ae348b24
hgt312 commented on 2024-03-19
jeffhataws force pushed from 173ef47d to 13965fd3 1 year ago
jeffhataws Fix indent for 2 lines in ZeRO1 (shard.grad = grad_shard, index += 1)
85863703
jeffhataws force pushed from 13965fd3 to 85863703 1 year ago
jeffhataws commented on 2024-03-20
jeffhataws Refactor bucketized all-gather/reduce-scatter functions; add bucket_c…
675e7a11
jeffhataws force pushed from ec4b1e05 to 675e7a11 1 year ago
jeffhataws requested a review from hgt312 1 year ago
hgt312 approved these changes on 2024-03-20
JackCaoG commented on 2024-03-20
JackCaoG commented on 2024-03-20
JackCaoG commented on 2024-03-20
JackCaoG commented on 2024-03-20
JackCaoG commented on 2024-03-20
JackCaoG commented on 2024-03-20
jeffhataws Refactor bucketing logic into a class, shared by all-gather/reduce-sc…
d7c99588
JackCaoG
jeffhataws
jeffhataws Remove bucket-cap division logic; separate bucket cap for allgather/r…
5006388f
jeffhataws requested a review from JackCaoG 1 year ago
jeffhataws
JackCaoG approved these changes on 2024-03-21
JackCaoG
jeffhataws commented on 2024-03-22
JackCaoG merged e75677f1 into master 1 year ago
jeffhataws deleted the jeffhataws_zero1_fixes2 branch 306 days ago
