xla
ZeRO1: Add bucketing logic to control the size of tensors for all-gather/reduce-scatter
#6025

Merged

Commits

Add bucketing logic to control the size of tensors for all-gather and reduce-scatter

aws-rhsoln committed 1 year ago
Yapf lint fixes

jeffhataws committed 1 year ago
Handle the case when groups is None

aws-rhsoln committed 1 year ago
update zero1

hgt312 committed 1 year ago
yapf lint fixes

jeffhataws committed 1 year ago
Fix missing curly brackets in assertion msg

jeffhataws committed 1 year ago
Fixing FAL issue when sharded params are initialized with torch.double

amithrm committed 1 year ago
Yapf fixes

jeffhataws committed 1 year ago
Fix indices and variable names

jeffhataws committed 1 year ago
Checking <tensor>.numel for output tensors causes an error in the GPU runtime

jeffhataws committed 1 year ago
Avoid passing empty input buckets

jeffhataws committed 1 year ago
Fix indent for 2 lines in ZeRO1 (shard.grad = grad_shard, index += 1)

jeffhataws committed 1 year ago
Refactor bucketized all-gather/reduce-scatter functions; add bucket_cap_mb arg

jeffhataws committed 1 year ago
Refactor bucketing logic into a class, shared by all-gather/reduce-scatter

jeffhataws committed 1 year ago
Remove bucket-cap division logic; separate bucket cap for allgather/reducescatter

jeffhataws committed 1 year ago
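
For readers unfamiliar with the idea behind these commits, the following is a minimal sketch of size-capped bucketing, not the code from this PR. It groups tensors, represented here only by their byte sizes, into buckets that stay within a cap, so that each all-gather or reduce-scatter call would operate on a bounded amount of data. The function name `bucketize` and its signature are hypothetical; only the `bucket_cap_mb` name comes from the commit messages, and its exact semantics in the PR may differ.

```python
from typing import List, Sequence


def bucketize(sizes_in_bytes: Sequence[int],
              bucket_cap_mb: float) -> List[List[int]]:
    """Group tensor indices into buckets whose total size stays within the cap.

    Hypothetical helper for illustration. A tensor larger than the cap
    ends up in a bucket of its own.
    """
    cap_bytes = int(bucket_cap_mb * 1024 * 1024)
    buckets: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for idx, nbytes in enumerate(sizes_in_bytes):
        # Flush the current bucket if adding this tensor would exceed the cap.
        if current and current_bytes + nbytes > cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += nbytes
    # Only emit the trailing bucket if it holds something, so no empty
    # bucket is ever handed to a collective.
    if current:
        buckets.append(current)
    return buckets


MB = 1024 * 1024
sizes = [4 * MB, 3 * MB, 6 * MB, 12 * MB, 1 * MB]
print(bucketize(sizes, bucket_cap_mb=10))  # [[0, 1], [2], [3], [4]]
```

In this sketch, separate caps for all-gather and reduce-scatter would simply mean calling the helper with a different `bucket_cap_mb` value for each collective.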