Run libtorch in 2 shards (manual sharding) (#102554)
This is a quick way to mitigate the libtorch timeout issue on the 2nd shard when running with memory leak check, for example https://github.com/pytorch/pytorch/actions/runs/5119293905/jobs/9204880456
### Testing
* Slow gradcheck https://github.com/pytorch/pytorch/actions/runs/5128253177
  * `slow / linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 1, 4, linux.4xlarge.nvidia.gpu)`: `3h40` → `3h20`?
  * `slow / linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 2, 4, linux.4xlarge.nvidia.gpu)`: `4h30` → `3h50`
  * `linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 1, 4, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `3h35` → `3h20`
  * `linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 2, 4, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `4h20` → `4h`
* Linux GPU https://github.com/pytorch/pytorch/actions/runs/5128252752
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 1, 5, linux.4xlarge.nvidia.gpu)`: `1h40` → `1h40`
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 2, 5, linux.4xlarge.nvidia.gpu)`: `2h10` → `1h35`
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 1, 5, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `2h30` → `2h50`
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 2, 5, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `3h20` → `2h50`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102554
Approved by: https://github.com/clee2000