Fix OOMing periodic shards (#95246)

Commit

1 year ago

Tests have been consistently failing on the following shards with the error `RuntimeError: CUDA error: out of memory`:
- `periodic / linux-bionic-cuda11.7-py3-gcc7-slow-gradcheck / test (default, 1, 2, linux.4xlarge.nvidia.gpu)`
- `periodic / linux-bionic-cuda11.7-py3-gcc7-slow-gradcheck / test (default, 2, 2, linux.4xlarge.nvidia.gpu)`

Seeing if serializing those test files makes the periodic jobs succeed again. This feels a bit off, since so many different test files have failed and need to be serialized, which suggests a potential perf regression somewhere.

Failures on hud: https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=100&name_filter=periodic%20%2F%20linux-bionic-cuda11.7-py3-gcc7-slow-gradcheck%20%2F%20test%20(default%2C%20

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95246
Approved by: https://github.com/Skylion007, https://github.com/huydhn
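For readers unfamiliar with what "serializing" test files means here, the following is a minimal sketch of the general idea, not the code changed in this PR. The names `SERIAL_TESTS`, `partition_tests`, `run_all`, and the test file names are hypothetical and chosen only for illustration: files known to be memory-hungry are held out of the parallel pool and run one at a time, so they do not compete with other files for GPU memory.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical set of memory-hungry test files; not taken from the PyTorch repo.
SERIAL_TESTS = {"test_heavy_a", "test_heavy_b"}


def partition_tests(tests, serial_tests=SERIAL_TESTS):
    """Split tests into (parallel, serial) lists, preserving input order."""
    parallel = [t for t in tests if t not in serial_tests]
    serial = [t for t in tests if t in serial_tests]
    return parallel, serial


def run_all(tests, run_one, max_workers=2, serial_tests=SERIAL_TESTS):
    """Run parallel-safe tests concurrently, then the rest one at a time.

    `run_one` is a caller-supplied callable that runs a single test file
    and returns its result.
    """
    parallel, serial = partition_tests(tests, serial_tests)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as its input.
        for name, result in zip(parallel, pool.map(run_one, parallel)):
            results[name] = result
    # Serialized files run alone, after the parallel pool has shut down.
    for name in serial:
        results[name] = run_one(name)
    return results
```

The trade-off in a scheme like this is wall-clock time: every file moved to the serial set lengthens the job, which is one reason a long serial list can hint at an underlying regression rather than a handful of unusually large tests.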

Author

ZainRizvi

Committer

pytorchmergebot

Parents

bdb78e52

pytorch c97275ac - Fix OOMing periodic shards (#95246)
