[FSDP] Fix `FSDP.clip_grad_norm_()` for `NO_SHARD` (#88955)
This PR fixes `FSDP.clip_grad_norm_()` for `NO_SHARD`, which previously "double-counted" each gradient `world_size`-many times, inflating the computed total gradient norm (see the illustration below).
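A minimal, non-distributed illustration of the accounting issue (this is not the actual FSDP internals; the variable names are illustrative). Under `NO_SHARD` every rank holds the full gradient, so reducing local squared norms across ranks as if they were shards over-counts each gradient `world_size` times:

```python
import torch

world_size = 4
g = torch.randn(8)  # a parameter's gradient

# FULL_SHARD: each rank holds a disjoint shard, so summing local squared
# norms across ranks recovers the correct global norm.
shards = g.chunk(world_size)
sharded_total = sum(s.norm(2) ** 2 for s in shards) ** 0.5  # == g.norm(2)

# NO_SHARD with the old accounting: every rank holds the same full gradient,
# so the same cross-rank reduction counts it world_size times.
buggy_total = (world_size * g.norm(2) ** 2) ** 0.5  # inflated

# Fixed accounting: a nonsharded gradient contributes its local norm once,
# without the cross-rank reduction.
fixed_total = g.norm(2)

print(sharded_total.item(), buggy_total.item(), fixed_total.item())
```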
This PR does not address any discrepancies between `FULL_SHARD` and DDP. (Note that the unit tests do show parity between `FULL_SHARD` and DDP when using `FSDP.clip_grad_norm_()` and `nn.utils.clip_grad_norm_()`, respectively, for one iteration; a sketch of that check follows.)
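A hedged sketch of that parity check (not the actual unit test; it assumes a per-rank CUDA device and launch via `torchrun`, and the module/variable names are hypothetical):

```python
import copy
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Identical initial weights for both wrappers.
base = torch.nn.Linear(8, 8).cuda(rank)
ddp_model = DDP(copy.deepcopy(base), device_ids=[rank])
fsdp_model = FSDP(copy.deepcopy(base), sharding_strategy=ShardingStrategy.FULL_SHARD)

inp = torch.randn(4, 8, device=f"cuda:{rank}")
ddp_model(inp).sum().backward()
fsdp_model(inp).sum().backward()

max_norm = 1.0
ddp_norm = torch.nn.utils.clip_grad_norm_(ddp_model.parameters(), max_norm)
fsdp_norm = fsdp_model.clip_grad_norm_(max_norm)
# Parity on one iteration between DDP + nn.utils.clip_grad_norm_ and
# FULL_SHARD + FSDP.clip_grad_norm_.
torch.testing.assert_close(ddp_norm, fsdp_norm)
```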
The added unit test exercises mixing nested FSDP instances with both `FULL_SHARD` and `NO_SHARD` to ensure that the `local_sharded_norm` and `local_nonsharded_norm` computations interoperate correctly (see the sketch below). I want to test a non-FSDP root module in the future, but that is BC-breaking since it requires making `clip_grad_norm_()` a static method, which would change the call syntax (`FSDP.clip_grad_norm_(root_module, ...)` vs. `root_module.clip_grad_norm_(...)`).
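A rough sketch of the mixed-strategy nesting that the test exercises (the module structure and names here are hypothetical, and the same distributed-launch assumptions as above apply):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

class MixedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Inner FSDP instance uses NO_SHARD, so its gradients stay unsharded.
        self.inner = FSDP(
            torch.nn.Linear(8, 8).cuda(),
            sharding_strategy=ShardingStrategy.NO_SHARD,
        )
        self.outer = torch.nn.Linear(8, 8).cuda()

    def forward(self, x):
        return self.outer(self.inner(x))

# Root FSDP instance uses FULL_SHARD; clip_grad_norm_() must combine the
# sharded and nonsharded local norms correctly.
model = FSDP(MixedModel(), sharding_strategy=ShardingStrategy.FULL_SHARD)
inp = torch.randn(4, 8, device="cuda")
model(inp).sum().backward()
total_norm = model.clip_grad_norm_(max_norm=1.0)
```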
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88955
Approved by: https://github.com/zhaojuanmao