[FSDP2] Added autograd/memory/overlap/frozen/2D/AC tests (#118136)
This PR adds tests for autograd (mainly backward hooks), memory, overlap, and frozen parameters.
- Autograd: unused forward output, unused forward module, non-tensor activations (common in internal models)
- Memory: expected GPU memory usage after init, forward, backward, and optimizer step
- Overlap: communication/computation overlap in forward and backward
- Frozen: expected reduce-scatter size, training parity
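For the frozen-parameter case, the expected reduce-scatter size can be worked out by hand. The sketch below is only an illustration, not the test code from this PR. It assumes that frozen parameters are excluded from the reduce-scatter and that each trainable parameter's gradient is padded on dim 0 to a multiple of the world size. The helper names `padded_numel` and `expected_reduce_scatter_numel` are hypothetical.

```python
import math


def padded_numel(shape, world_size):
    """Number of elements after padding dim 0 up to a multiple of world_size.

    Assumption: padding is applied per parameter along dim 0 only.
    """
    padded_dim0 = math.ceil(shape[0] / world_size) * world_size
    return padded_dim0 * math.prod(shape[1:])


def expected_reduce_scatter_numel(params, world_size):
    """Return (input_numel, output_numel_per_rank) for one reduce-scatter.

    `params` is a list of (shape, requires_grad) pairs. Frozen parameters
    (requires_grad=False) produce no gradient and are skipped.
    """
    total = sum(padded_numel(shape, world_size) for shape, rg in params if rg)
    return total, total // world_size


# Hypothetical module: one trainable weight, one frozen bias, and one
# trainable weight whose dim 0 (3) is not divisible by the world size (2).
params = [((8, 4), True), ((8,), False), ((3, 4), True)]
sizes = expected_reduce_scatter_numel(params, world_size=2)
```

With these inputs the frozen bias contributes nothing and the `(3, 4)` weight is padded to `(4, 4)`, so the reduce-scatter input has 32 + 16 = 48 elements and each rank receives 24.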
This PR adds some initial 2D (FSDP + TP) training and model state dict tests. The only change required to support the sharded model state dict is to ensure that parameters are in their sharded state before saving and loading.
This PR adds tests that `fully_shard` can use `torch.utils.checkpoint`, `_composable.checkpoint`, and `CheckpointWrapper` on a transformer.
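As a rough illustration of the kind of parity these activation checkpointing tests rely on, the sketch below checks that wrapping blocks in `torch.utils.checkpoint.checkpoint` leaves gradients unchanged. It runs on a single process on CPU and deliberately omits `fully_shard`, which needs an initialized process group. The `Block` module and `grads` helper are hypothetical stand-ins, not the transformer or the test code used in this PR.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Hypothetical residual feed-forward block standing in for a transformer block."""

    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.ff(x)


def grads(use_ac):
    """Run one forward/backward pass and return the parameter gradients."""
    torch.manual_seed(0)  # same initialization and input for both runs
    blocks = nn.ModuleList([Block(16) for _ in range(2)])
    x = torch.randn(4, 16)
    for block in blocks:
        if use_ac:
            # Recompute the block's activations in backward instead of saving them.
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    x.sum().backward()
    return [p.grad.clone() for p in blocks.parameters()]


ref_grads = grads(use_ac=False)
ac_grads = grads(use_ac=True)
```

Comparing `ref_grads` and `ac_grads` with `torch.allclose` is the single-process analogue of checking that training with checkpointing matches training without it.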
(I squashed all of these into one PR to save CI cost.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118136
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550