[FSDP] Add grad accumulation without `no_sync()` (#73535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73535
**Overview**
- This adds support for FSDP gradient accumulation without `no_sync()`, which uses more network bandwidth but less GPU memory per worker than accumulating inside `no_sync()` (see the sketch after this list).
- This fixes a bug in the `no_sync()` testing, where the CPU offloading and backward prefetch arguments were not being propagated to the `FullyShardedDataParallel` constructor.
- This adds `p_assert()` (taken from Fairscale), which prints the assertion error message before raising the `AssertionError`. It is meant to be used in the autograd backward context, where the message is otherwise swallowed, yielding an unhelpful error like:
```
<built-in method run_backward of torch._C._EngineBase object at 0x7f1fd518dc80> returned NULL without setting an error
```
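The idea behind the helper is roughly the following (a minimal sketch, not necessarily the exact Fairscale implementation):
```python
def p_assert(cond: bool, msg: str) -> None:
    # Print the message before raising so that it is still visible when the
    # AssertionError is raised inside the autograd backward pass, where the
    # engine would otherwise swallow the message.
    if not cond:
        print(msg)
        raise AssertionError
```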
NOTE: Gradient accumulation without `no_sync()` is not currently compatible with CPU offloading.
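For illustration, the two accumulation modes look roughly like the following sketch. It assumes the process group is already initialized and that `model`, `optim`, and a list of micro-batch tensors `batches` are defined; those names are placeholders, not part of this diff.
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

fsdp_model = FSDP(model)  # `model` is a placeholder nn.Module

# (1) Accumulation without `no_sync()`: gradients are reduce-scattered on
# every backward, so each rank only holds its gradient shard (less GPU
# memory, more communication).
for batch in batches:
    fsdp_model(batch).sum().backward()
optim.step()
optim.zero_grad()

# (2) Accumulation with `no_sync()`: communication is skipped for all but
# the final micro-batch, so each rank keeps unsharded gradients (more GPU
# memory, less communication).
with fsdp_model.no_sync():
    for batch in batches[:-1]:
        fsdp_model(batch).sum().backward()
fsdp_model(batches[-1]).sum().backward()  # this backward reduces the accumulated grads
optim.step()
optim.zero_grad()
```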
**Test Plan**
I augmented the existing `no_sync()` tests to cover gradient accumulation that interleaves iterations accumulating with and without `no_sync()` (see the sketch below).
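The interleaving pattern exercised by the tests looks roughly like the following sketch; `fsdp_model`, `optim`, and `batches` are the same placeholders as above, and the alternation policy is illustrative only.
```python
import contextlib

for i, batch in enumerate(batches):
    # Alternate between accumulating inside and outside `no_sync()`.
    ctx = fsdp_model.no_sync() if i % 2 == 0 else contextlib.nullcontext()
    with ctx:
        fsdp_model(batch).sum().backward()
optim.step()
optim.zero_grad()
```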
After this diff:
- QPS (ResNet): f328439897
- QPS (RoBERTa): f328440141
- Accuracy: f328442119
Before this diff (trunk):
- QPS (ResNet): f328432756
- QPS (RoBERTa): f328436766
- Accuracy: f328437896
Test Plan: Imported from OSS
Reviewed By: zhaojuanmao
Differential Revision: D34533546
Pulled By: awgu
fbshipit-source-id: 821d762dfad5f2b1e59adcb8e5cb7c277399040c
(cherry picked from commit 746a5ea2720dcf87c376229b405a318396fe5769)