Initial FSDP2 support (#3394)
* Feat: initial conversion tool draft
* Feat: add value mapping to conversion tool
* Refactor: move from os to pathlib
* Feat: add first tests
* Feat: more tests
* Feat: minor fixes + dataclass conversions
* Feat: more remapping
* Fix: namespace has no attribute version + style
* Fix: offload params behavior
* Feat: add option to only rename keys in the config file to
* Fix: wrong attr name
* Fix: partially resolve comments
* Feat: work on config command + minor fixes to reflect changes
* Refactor: style + quality
* Feat: fsdp2 initial work
* Feat: some cleanups and first running fsdp2
* Fix: version checks + mixed precision policy
* Refactor: style + quality
* Remove obsolete todos
* Feat: grad norm clipping
* Fix: tests + rename attrs
* Refactor: style + quality
* Fix: None object is not iterable
* Fix: default cpu_offload for fsdp2
* Fix: cpu offload now behaves correctly
* Feat: apply_activation_checkpointing
* Fix: append to models
* Feat: start on concept guide
* WIP: concept guide
* Fix: toctree
* Cleanup of the concept guide
* Fix: minor fixes + mp
* Fix: quality + | to union
* Feat: backwards compatibility + args cleanup
* Fix: style + quality
* Feat: enable dropping refs when getting named params
* Fix: memory footprint with fsdp2
* Feat: cpu ram efficient loading
* Fix: mp
* Fix: not warn about sync_modules if fsdp version is 1
* Refactor: minor changes
* Small fixes + refactors
* Feat: docs + cleanup
* Feat: saving works (not sure about optimizer)
* More loading/saving work
* Feat: disable local_state_dict for fsdp2
* Fix: fsdp2 convergence
* Feat: working comparison script
* Feat: memory tracking fsdp2
* Feat: memory visualizer
* Feat: more work on benchmark
* Fix: raise error if model+optimizer aren't prepared together
* Minor fixes
* Style
* More warnings
* Fix: reshard_after_forward vs sharding_strategy conflict
* Refactor: clean up accelerator
* Feat: more testing in fsdp2 benchmark
* Fix: memory visualizer
* Untested: support load/save_state
* Feat: concept guide improvements
* Refactor: concept guide
* Feat: benchmark works
* Feat: more work on fsdp2 benchmark
* Fix: note syntax
* Fix: small fixes + make original tests work
* Fix: grad scaling
* Feat: reshard after forward tests
* Feat: backward prefetch tests
* Feat: tests for fsdp2
* Refactor: minor fixes
* Feat: fsdp_utils docstrings
* Feat: autodoc fsdp.md
* Docs: get_module_children_bottom_up
* Fix: remove unused images
* Refactor: benchmark cleanup
* Fix: docs
* Feat: final doc changes
* Fix: torch.distributed has no attribute tensor
* Fix: style
* Feat: tests include version in failures
* Fix: benchmark force model to load in fp32
* Fix: rename runs
* Feat: last minor fixes
* Feat: new benchmark images