Implement Fully Sharded Data Parallel (FSDP) in PyTorch XLA #3431
Merged

miladm merged 20 commits into pytorch:master from ronghanghu:xla_fsdp_rebased
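
For context on what this PR delivers: the FSDP wrapper shards module parameters across data-parallel workers, all-gathers them on demand for the forward and backward passes, and reduce-scatters the gradients. A minimal usage sketch, assuming the wrapper is exposed as `XlaFullyShardedDataParallel` under `torch_xla.distributed.fsdp` (the commits below move the module into `torch_xla.distributed`); the tiny model and dummy batch are placeholders:

```python
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP

device = xm.xla_device()
# Placeholder model; any torch.nn.Module can be wrapped.
model = FSDP(torch.nn.Linear(784, 10).to(device))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

data = torch.randn(8, 784, device=device)        # dummy batch
target = torch.randint(0, 10, (8,), device=device)

optimizer.zero_grad()
loss = F.cross_entropy(model(data), target)
loss.backward()
# Gradients are already reduced across ranks inside FSDP's backward,
# so step the optimizer directly rather than via xm.optimizer_step().
optimizer.step()
xm.mark_step()  # cut and execute the lazily traced XLA graph
```
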
ronghanghu marked this pull request as ready for review

hjm-aws requested changes on 2022-03-30
hjm-aws commented on 2022-03-31
hjm-aws requested changes on 2022-03-31
miladm commented on 2022-04-14
miladm commented on 2022-04-28

Commits (20, all by ronghanghu):

9a424948  Implement Fully Sharded Data Parallel (FSDP) in PyTorch XLA
a782a618  move the FSDP module to `torch_xla.distributed`
e202d52c  adding `mark_step_on_freeing` as a temp workaround to #3455
feac851e  check in __init__ whether the module is already FSDP; fix exception t…
6cc577fd  add `optimization_barrier_` (https://github.com/pytorch/xla/pull/3493…
86b85238  also apply `xm.optimization_barrier_` to FSDP output's gradients
406de3f8  deprecate `mark_step_on_freeing` (since we have optimization barrier …
abe7b568  add option to run a dummy forward pass in FSDP
efd98ac6  add `_shard_size_multiple` to make sharded parameters a multiple of 1…
f6119ca2  refactor optimization_barrier_ to separately apply to forward and bac…
d1d1483f  seal off more relevant ops w/ optimization_barrier_ to avoid undesire…
267f4f66  remove obsolete `mark_step_on_freeing` and `use_all_gather_via_all_re…
09806f1f  handle keyword arguments in `checkpoint_module`
ec9ee681  add gradient checkpointing option to MNIST and ImageNet FSDP examples
d2b95978  refactor `optimization_barrier` and only apply it in forward or backw…
eeca5bb0  refactor command line tool to consolidate sharded checkpoints
327c78ea  address reviewers' comments from GitHub
b12ea4d4  add more user instructions for checkpoint consolidation
4e51a277  change `flatten_parameters` default to False since it didn't bring an…
191ac9b4  documentation refinement
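
Two of the commits above concern gradient (activation) checkpointing: `checkpoint_module` gained keyword-argument support (09806f1f), and the MNIST/ImageNet examples gained a checkpointing option (ec9ee681); several others apply `xm.optimization_barrier_` (from pytorch/xla#3493) so the XLA compiler does not optimize the re-materialization away. A minimal sketch of combining the two features, assuming `checkpoint_module` is importable from the same `torch_xla.distributed.fsdp` package:

```python
import torch
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import (
    XlaFullyShardedDataParallel as FSDP,
    checkpoint_module,
)

device = xm.xla_device()
# checkpoint_module wraps a submodule so its activations are recomputed
# during the backward pass instead of stored, trading compute for memory.
block = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)
model = FSDP(checkpoint_module(block).to(device))
```
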
miladm approved these changes on 2022-05-09
miladm merged 3c83269f into master
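
After training, each rank holds only its own shard of the model, so the PR also ships a tool to merge the per-rank files back into a full checkpoint (commits eeca5bb0 and b12ea4d4). A hedged sketch, assuming a `consolidate_sharded_model_checkpoints` helper in the same package; the function name, parameters, and paths here are assumptions for illustration:

```python
from torch_xla.distributed.fsdp import consolidate_sharded_model_checkpoints

# Assumption: every rank previously saved its shard under a common prefix,
# e.g. /tmp/fsdp_ckpt_rank-00000000-of-00000008.pth. The helper is assumed
# to glob the shards, re-assemble the (possibly flattened) parameters, and
# write one consolidated full-model checkpoint.
consolidate_sharded_model_checkpoints(
    ckpt_prefix="/tmp/fsdp_ckpt",    # hypothetical shard path prefix
    ckpt_suffix="_rank-*-of-*.pth",  # hypothetical per-rank naming pattern
)
```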