Run libtorch in 2 shards (manual sharding) (#102554)
This is a quick way to mitigate the libtorch timeout issue on the 2nd shard when running with memory leak check, for example https://github.com/pytorch/pytorch/actions/runs/5119293905/jobs/9204880456
### Testing
* Slow gradcheck https://github.com/pytorch/pytorch/actions/runs/5128253177
  * `slow / linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 1, 4, linux.4xlarge.nvidia.gpu)`: `3h40` → `3h20`?
  * `slow / linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 2, 4, linux.4xlarge.nvidia.gpu)`: `4h30` → `3h50`
  * `linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 1, 4, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `3h35` → `3h20`
  * `linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 2, 4, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `4h20` → `4h`
* Linux GPU https://github.com/pytorch/pytorch/actions/runs/5128252752
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 1, 5, linux.4xlarge.nvidia.gpu)`: `1h40` → `1h40`
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 2, 5, linux.4xlarge.nvidia.gpu)`: `2h10` → `1h35`
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 1, 5, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `2h30` → `2h50`
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 2, 5, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `3h20` → `2h50`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102554
Approved by: https://github.com/clee2000