Implement Fully Sharded Data Parallel (FSDP) in PyTorch XLA #3431
Merged

miladm merged 20 commits into pytorch:master from ronghanghu:xla_fsdp_rebased
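
For context on what this PR delivers: the FSDP wrapper shards module parameters across data-parallel workers, all-gathers them on demand for the forward and backward passes, and reduce-scatters the gradients. A minimal usage sketch, assuming the wrapper is exposed as `XlaFullyShardedDataParallel` under `torch_xla.distributed.fsdp` (the commits below move the module into `torch_xla.distributed`); the tiny model and dummy batch are placeholders:

```python
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP

device = xm.xla_device()
# Placeholder model; any torch.nn.Module can be wrapped.
model = FSDP(torch.nn.Linear(784, 10).to(device))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

data = torch.randn(8, 784, device=device)        # dummy batch
target = torch.randint(0, 10, (8,), device=device)

optimizer.zero_grad()
loss = F.cross_entropy(model(data), target)
loss.backward()
# Gradients are already reduced across ranks inside FSDP's backward,
# so step the optimizer directly rather than via xm.optimizer_step().
optimizer.step()
xm.mark_step()  # cut and execute the lazily traced XLA graph
```
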
ronghanghu marked this pull request as ready for review

hjm-aws requested changes on 2022-03-30
hjm-aws commented on 2022-03-31
hjm-aws requested changes on 2022-03-31
miladm commented on 2022-04-14
miladm commented on 2022-04-28

Commits (20, all by ronghanghu):

9a424948  Implement Fully Sharded Data Parallel (FSDP) in PyTorch XLA
a782a618  move the FSDP module to `torch_xla.distributed`
e202d52c  adding `mark_step_on_freeing` as a temp workaround to #3455
feac851e  check in __init__ whether the module is already FSDP; fix exception t…
6cc577fd  add `optimization_barrier_` (https://github.com/pytorch/xla/pull/3493…
86b85238  also apply `xm.optimization_barrier_` to FSDP output's gradients
406de3f8  deprecate `mark_step_on_freeing` (since we have optimization barrier …
abe7b568  add option to run a dummy forward pass in FSDP
efd98ac6  add `_shard_size_multiple` to make sharded parameters a multiple of 1…
f6119ca2  refactor optimization_barrier_ to separately apply to forward and bac…
d1d1483f  seal off more relevant ops w/ optimization_barrier_ to avoid undesire…
267f4f66  remove obsolete `mark_step_on_freeing` and `use_all_gather_via_all_re…
09806f1f  handle keyword arguments in `checkpoint_module`
ec9ee681  add gradient checkpointing option to MNIST and ImageNet FSDP examples
d2b95978  refactor `optimization_barrier` and only apply it in forward or backw…
eeca5bb0  refactor command line tool to consolidate sharded checkpoints
327c78ea  address reviewers' comments from GitHub
b12ea4d4  add more user instructions for checkpoint consolidation
4e51a277  change `flatten_parameters` default to False since it didn't bring an…
191ac9b4  documentation refinement
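
Two of the commits above concern gradient (activation) checkpointing: `checkpoint_module` gained keyword-argument support (09806f1f), and the MNIST/ImageNet examples gained a checkpointing option (ec9ee681); several others apply `xm.optimization_barrier_` (from pytorch/xla#3493) so the XLA compiler does not optimize the re-materialization away. A minimal sketch of combining the two features, assuming `checkpoint_module` is importable from the same `torch_xla.distributed.fsdp` package:

```python
import torch
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import (
    XlaFullyShardedDataParallel as FSDP,
    checkpoint_module,
)

device = xm.xla_device()
# checkpoint_module wraps a submodule so its activations are recomputed
# during the backward pass instead of stored, trading compute for memory.
block = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)
model = FSDP(checkpoint_module(block).to(device))
```
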
miladm approved these changes on 2022-05-09
miladm merged 3c83269f into master
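
After training, each rank holds only its own shard of the model, so the PR also ships a tool to merge the per-rank files back into a full checkpoint (commits eeca5bb0 and b12ea4d4). A hedged sketch, assuming a `consolidate_sharded_model_checkpoints` helper in the same package; the function name, parameters, and paths here are assumptions for illustration:

```python
from torch_xla.distributed.fsdp import consolidate_sharded_model_checkpoints

# Assumption: every rank previously saved its shard under a common prefix,
# e.g. /tmp/fsdp_ckpt_rank-00000000-of-00000008.pth. The helper is assumed
# to glob the shards, re-assemble the (possibly flattened) parameters, and
# write one consolidated full-model checkpoint.
consolidate_sharded_model_checkpoints(
    ckpt_prefix="/tmp/fsdp_ckpt",    # hypothetical shard path prefix
    ckpt_suffix="_rank-*-of-*.pth",  # hypothetical per-rank naming pattern
)
```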