[FSDP] CPU offload resubmit (#67249)

Commit

3 years ago

[FSDP] CPU offload resubmit (#67249)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67249

Implements CPU offload for model parameters in FSDP.
- A CPU offload class with only an `offload_params` attribute is created.
- If this is specified in the FSDP ctor, model parameters are moved back to CPU after sharding in `__init__`.
- In the forward pass, during lazy init, `p._local_shard` gets set to `p.data`, so it is on CPU. We call `pin_memory` here.
- In the forward pass, in `_rebuild_full_params`, we move `p.data` back to `self.compute_device` if necessary. Note that we don't use the device of `p._full_param_padded` because we don't always have this attr, but when we do, it's always the same as `compute_device`.
- The same logic as above applies to the beginning of the backward pass.
- At the end of fwd and the end of bwd, `_use_param_local_shard` ensures the parameters are offloaded to CPU again by pointing `p.data` to `p._local_shard`, which is always on CPU.

Regarding tests:
- We test 3 different types of init: 1) CUDA the model before wrapping with FSDP, 2) CUDA the model after wrapping with FSDP, 3) never CUDA the model.
- Case 1 is always supported. Case 2 is not supported with CPU offload and throws an error during the fwd pass. Case 3 is only supported with CPU offload at the moment.
- Verifies all params are offloaded to CPU after init.
- Verifies all params are offloaded to CPU after forward and backward.
- Note that there is an issue with verifying exact parity when CPU offloading, but it appears to be related to transferring the model back and forth between CPU and CUDA. More details in https://github.com/pytorch/pytorch/pull/66961

ghstack-source-id: 141851903

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31911085

fbshipit-source-id: 3ddf73c070b55ce383e62251868d609004fc30e7
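The parameter lifecycle described in the commit message can be illustrated with a small toy model. This is a minimal sketch in plain Python, not the FSDP implementation: `ToyTensor`, `ToyParam`, and `ToyShardedModule` are hypothetical stand-ins that track only a device string and a pinned flag, and the device name `"cuda:0"` is an assumption. The method names `_lazy_init`, `_rebuild_full_params`, and `_use_param_local_shard` mirror those mentioned in the message, but their bodies here are simplified guesses at the described behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyTensor:
    """Hypothetical stand-in for a tensor: tracks only device and pinning."""
    device: str
    pinned: bool = False

    def to(self, device: str) -> "ToyTensor":
        # Return a copy on the target device; the original is left untouched.
        return ToyTensor(device=device)


@dataclass
class ToyParam:
    """Hypothetical parameter with the two attributes the message refers to."""
    data: ToyTensor
    _local_shard: Optional[ToyTensor] = None


class ToyShardedModule:
    """Toy model of the offload lifecycle; not the real FSDP wrapper."""

    def __init__(self, params: List[ToyParam], compute_device: str = "cuda:0",
                 offload_params: bool = False) -> None:
        self.params = params
        self.compute_device = compute_device
        self.offload_params = offload_params
        self._lazy_init_done = False
        if offload_params:
            # After (imagined) sharding, move parameters back to CPU.
            for p in self.params:
                p.data = p.data.to("cpu")

    def _lazy_init(self) -> None:
        if self._lazy_init_done:
            return
        for p in self.params:
            # The local shard aliases p.data, so with offload it lives on CPU.
            p._local_shard = p.data
            if self.offload_params:
                p._local_shard.pinned = True  # models pinning host memory
        self._lazy_init_done = True

    def _rebuild_full_params(self) -> None:
        for p in self.params:
            # Move to the compute device only if the data is not already there.
            if p.data.device != self.compute_device:
                p.data = p.data.to(self.compute_device)

    def _use_param_local_shard(self) -> None:
        for p in self.params:
            # Point p.data back at the shard, which stays on CPU when offloading.
            p.data = p._local_shard

    def forward(self) -> List[str]:
        self._lazy_init()
        self._rebuild_full_params()
        devices_during_compute = [p.data.device for p in self.params]
        self._use_param_local_shard()
        return devices_during_compute

    def backward(self) -> List[str]:
        self._rebuild_full_params()
        devices_during_compute = [p.data.device for p in self.params]
        self._use_param_local_shard()
        return devices_during_compute


# Case 1 from the tests: the model is on CUDA before wrapping.
module = ToyShardedModule([ToyParam(ToyTensor("cuda:0"))], offload_params=True)
devices_after_init = [p.data.device for p in module.params]
seen_fwd = module.forward()
seen_bwd = module.backward()
```

In this toy, parameters sit on CPU after construction and after each pass, and appear on the compute device only while a pass is running, which is the property the tests in the commit are described as verifying.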

References

#68130 - Merge master

Author

rohan-varma

Committer

facebook-github-bot

Parents

06d1be24

pytorch 7f3326a6 - [FSDP] CPU offload resubmit (#67249)
