[FSDP] Add grad accumulation without `no_sync()` (#73535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73535
**Overview**
- This adds support for FSDP gradient accumulation without `no_sync()`, which uses more network bandwidth but less GPU memory per worker than accumulating inside `no_sync()` (see the sketch after this list).
- This fixes a bug in the `no_sync()` testing, where the CPU offloading and backward prefetch arguments were not being propagated to the `FullyShardedDataParallel` constructor.
- This adds `p_assert()` (taken from Fairscale), which prints the assertion error message before raising the `AssertionError`. It is meant to be used in the autograd backward context, where the message is otherwise swallowed, yielding an unhelpful error like:
```
<built-in method run_backward of torch._C._EngineBase object at 0x7f1fd518dc80> returned NULL without setting an error
```
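The idea behind the helper is roughly the following (a minimal sketch, not necessarily the exact Fairscale implementation):
```python
def p_assert(cond: bool, msg: str) -> None:
    # Print the message before raising so that it is still visible when the
    # AssertionError is raised inside the autograd backward pass, where the
    # engine would otherwise swallow the message.
    if not cond:
        print(msg)
        raise AssertionError
```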
NOTE: Gradient accumulation without `no_sync()` is not currently compatible with CPU offloading.
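For illustration, the two accumulation modes look roughly like the following sketch. It assumes the process group is already initialized and that `model`, `optim`, and a list of micro-batch tensors `batches` are defined; those names are placeholders, not part of this diff.
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

fsdp_model = FSDP(model)  # `model` is a placeholder nn.Module

# (1) Accumulation without `no_sync()`: gradients are reduce-scattered on
# every backward, so each rank only holds its gradient shard (less GPU
# memory, more communication).
for batch in batches:
    fsdp_model(batch).sum().backward()
optim.step()
optim.zero_grad()

# (2) Accumulation with `no_sync()`: communication is skipped for all but
# the final micro-batch, so each rank keeps unsharded gradients (more GPU
# memory, less communication).
with fsdp_model.no_sync():
    for batch in batches[:-1]:
        fsdp_model(batch).sum().backward()
fsdp_model(batches[-1]).sum().backward()  # this backward reduces the accumulated grads
optim.step()
optim.zero_grad()
```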
**Test Plan**
I augmented the existing `no_sync()` tests to cover gradient accumulation that interleaves iterations accumulating with and without `no_sync()` (see the sketch below).
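The interleaving pattern exercised by the tests looks roughly like the following sketch; `fsdp_model`, `optim`, and `batches` are the same placeholders as above, and the alternation policy is illustrative only.
```python
import contextlib

for i, batch in enumerate(batches):
    # Alternate between accumulating inside and outside `no_sync()`.
    ctx = fsdp_model.no_sync() if i % 2 == 0 else contextlib.nullcontext()
    with ctx:
        fsdp_model(batch).sum().backward()
optim.step()
optim.zero_grad()
```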
After this diff:
- QPS (ResNet): f328439897
- QPS (RoBERTa): f328440141
- Accuracy: f328442119
Before this diff (trunk):
- QPS (ResNet): f328432756
- QPS (RoBERTa): f328436766
- Accuracy: f328437896
Test Plan: Imported from OSS
Reviewed By: zhaojuanmao
Differential Revision: D34533546
Pulled By: awgu
fbshipit-source-id: 821d762dfad5f2b1e59adcb8e5cb7c277399040c
(cherry picked from commit 746a5ea2720dcf87c376229b405a318396fe5769)