Initial FSDP2 support (#3394)

Commit

311 days ago

Initial FSDP2 support (#3394)

* Feat: initial conversion tool draft
* Feat: add value mapping to conversion tool
* Refactor: move from os to pathlib
* Feat: add first tests
* Feat: more tests
* Feat: minor fixes + dataclass conversions
* Feat: more remapping
* Fix: namespace has no attribute version + style
* Fix: offload params behavior
* Feat: add option to only rename keys in the config file to
* Fix: wrong attr name
* Fix: partially resolve comments
* Feat: work on config command + minor fixes to reflect changes
* Refactor: style + quality
* Feat: fsdp2 initial work
* Feat: some cleanups and first running fsdp2
* Fix: version checks + mixed precision policy
* Refactor: style + quality
* Remove obsolete todos
* Feat: grad norm clipping
* Fix: tests + rename attrs
* Refactor: style + quality
* Fix: None object is not iterable
* Fix: default cpu_offload for fsdp2
* Fix: cpu offload now behaves correctly
* Feat: apply_activation_checkpointing
* Fix: append to models
* Feat: start on concept guide
* wip: concept guide
* Fix: toctree
* cleanup of the concept guide
* Fix: minor fixes + mp
* Fix: quality + | to union
* Feat: backwards compatibility + args cleanup
* Fix: style + quality
* Feat: enable dropping refs when getting named params
* Fix: memory footprint with fsdp2
* Feat: cpu ram efficient loading
* Fix: mp
* Fix: not warn about sync_modules if fsdp version is 1
* Refactor: minor changes
* Small fixes + refactors
* Feat: docs + cleanup
* Feat: saving works (not sure about optim)
* More loading/saving work
* Feat: disable local_state_dict for fsdp2
* Fix: fsdp2 convergence
* Feat: working comparison script
* Feat: memory tracking fsdp2
* Feat: memory visualizer
* Feat: more work on benchmark
* Fix: raise error if model+optimizer arent prepared together
* Minor fixes
* Style
* More warnings
* Fix: reshard_after_forward vs sharding_strategy conflict
* Refactor: clean up accelerator
* Feat: more testing in fsdp2 benchmark
* Fix: memory visualizer
* Untested: support load/save_state
* Feat: concept guide improvements
* Refactor: concept guide
* Feat: benchmark works
* Feat: more work on fsdp2 benchmark
* Fix: note syntax
* Fix: small fixes + make original tests work
* Fix: grad scaling
* Feat: reshard after forward tests
* Feat: backward prefetch tests
* Feat: tests for fsdp2
* Refactor: minor fixes
* Feat: fsdp_utils docstrings
* Feat: autodoc fsdp.md
* Docs: get_module_children_bottom_up
* Fix: remove unused images
* Refactor: benchmark cleanup
* Fix: docs
* Feat: final doc changes
* Fix: torch.distributed has no attribute tensor
* Fix: style
* Feat: tests include version in failures
* Fix: benchmark force model to load in fp32
* Fix: rename runs
* Feat: last minor fixes
* Feat: new benchmark images

References

#3394 - Initial FSDP2 support

Author

S1ro1

Parents

8ab01d32

accelerate d7c741a6 - Initial FSDP2 support (#3394)
