pytorch
6b8bab8e - Fix (4 device) multi-gpu `ShardedGradScaler` Tests in `ciflow/periodic` (#99485)

Commit

1 year ago

Fix (4 device) multi-gpu `ShardedGradScaler` Tests in `ciflow/periodic` (#99485)

Fixes #99427

Given the provided CI logs, I ~~suspect~~[^1] `inf` is being hit with the initial (FSDP model) step of the [test in question](https://github.com/pytorch/pytorch/actions/runs/4707887920/jobs/8350225813#step:13:36189). The DDP loss is correct and indicative of two steps being taken but the FSDP loss is approximately half of the loss expected with the first step (suggesting a step was skipped and the scale was halved).

I'm further reducing `init_scale` in this PR in order to ~~test the hypothesis~~[^2] (error occurs with 4 device multi-gpu tests only, not the 2 device tests I can verify locally).

I'll ensure I add the label `ciflow/periodic`[^3] to future PRs I suspect could potentially exhibit divergent behavior with >2 devices. Ideally all tests would be insensitive to device scaling but I recognize for some tests imposing that design constraint might be more trouble than it's worth.

@awgu @huydhn

[^1]: Suspicion confirmed
[^2]: The relevant periodic tests are [now passing](https://github.com/pytorch/pytorch/actions/runs/4738073998/jobs/8411862508)
[^3]: Didn't know that existed, great to know!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99485

Approved by: https://github.com/huydhn
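The failure mode described in the message (an `inf` on the first step, a skipped optimizer step, and a halved scale) can be illustrated with a small pure-Python sketch. This is a toy model under stated assumptions, not the `ShardedGradScaler` implementation: `ToyGradScaler`, `run`, `FP16_MAX`, the gradient value `1.5`, and both `init_scale` values are hypothetical and chosen only for illustration. The commit does not state the actual values used in the test.

```python
import math

# Largest finite float16 value; used here as a stand-in overflow threshold.
FP16_MAX = 65504.0


class ToyGradScaler:
    """Hypothetical toy model of dynamic loss scaling (not PyTorch code)."""

    def __init__(self, init_scale=2.0**16, backoff_factor=0.5):
        self.scale = init_scale
        self.backoff_factor = backoff_factor
        self.steps_taken = 0
        self.steps_skipped = 0

    def step(self, grad):
        """Return True if the step is applied, False if it is skipped."""
        scaled = grad * self.scale
        if not math.isfinite(scaled) or abs(scaled) > FP16_MAX:
            # Overflow: skip this step and back off the scale (halve it).
            self.scale *= self.backoff_factor
            self.steps_skipped += 1
            return False
        self.steps_taken += 1
        return True


def run(init_scale, grads):
    """Feed a sequence of scalar gradients through the toy scaler."""
    scaler = ToyGradScaler(init_scale=init_scale)
    for g in grads:
        scaler.step(g)
    return scaler.steps_taken, scaler.steps_skipped, scaler.scale


# Large initial scale: the first step overflows and is skipped.
print(run(2.0**16, [1.5, 1.5]))
# Smaller initial scale: both steps are applied.
print(run(2.0**10, [1.5, 1.5]))
```

With the larger initial scale, only one of the two steps is applied, which mirrors the symptom reported above where the FSDP run appears to have taken one step fewer than the DDP run. Lowering the initial scale lets both steps go through in this toy setting.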

Author

speediedan

Committer

pytorchmergebot

Parents

b0df0cd7
