pytorch
2067b768 - [FSDP] Delay moving tensor to CPU until necessary for optim_state_dict() (#85761)

Optimizer state_dict currently moves tensors to CPU immediately after the all-gather. For sharded optimizer state_dict, however, this move is redundant: we should wait until all of the sharding is done. This PR may slightly reduce the performance of full optimizer state_dict, since it has to allocate more memory than before, but benchmarks show the extra memory allocation is light.

Differential Revision: [D39855912](https://our.internmc.facebook.com/intern/diff/D39855912/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39855912/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85761
Approved by: https://github.com/rohan-varma
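A minimal sketch of the pattern the commit describes, with hypothetical function and variable names (not the actual FSDP internals): rather than copying the full all-gathered tensor to CPU and then slicing it, slice out the local shard first and move only that shard to CPU, avoiding a redundant full-size copy for the sharded case.

```python
import torch

def shard_then_offload(full_tensor, rank, world_size):
    """Hypothetical illustration: extract this rank's shard of an
    all-gathered tensor, deferring the CPU move until after sharding."""
    # Keep the gathered tensor on its original device while sharding.
    shard = torch.chunk(full_tensor, world_size, dim=0)[rank]
    # Move to CPU only once, after sharding, so only the shard's
    # worth of memory is copied instead of the full tensor.
    return shard.cpu().clone()

if __name__ == "__main__":
    # Simulate an all-gathered optimizer state of 16 elements across 4 ranks.
    full = torch.arange(16.0)
    local = shard_then_offload(full, rank=1, world_size=4)
    print(local.tolist())  # [4.0, 5.0, 6.0, 7.0]
```

For full (unsharded) optimizer state_dict, delaying the move means the gathered tensor stays on device longer, which is the extra memory allocation the commit message mentions.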