Add full test suite workflow (#7795)
We have been disabled the full unit test workflow for a while. This PR
migrates the full test to our AWS test infra.
To make the tests pass, we need to merge these PRs:
- #7786
- #7788
- #7789
- #7790
- #7793
- #7794
In addition having these PRs merged, this PR has the following changes
in the full test workflow and test harness:
- Ignore flags for some known issues:
- nvme: Requires an actual NVMe device. Our CI currently doesn't have
NVMe storage configured
- GDS: GDS requires special kernel drivers and NVIDIA Magnum IO to
enable direct GPU-to-storage transfers. CI instances don't have this
configured.
- Zenflow: 1. Stage 3 bugs: The ZenFlow + ZeRO Stage 3 implementation
has pre-existing bugs that cause internal pytest errors and worker
crashes, 2. CUDA/fork incompatibility: test_zf_torch_adam.py uses
torch.optim.AdamW which does CUDA graph capture checks that fail in
forked processes (--forked flag, we can just move it to sequential
tests)
- `/mnt/aio` mount for async I/O tests
- CUTLASS installation for Evoformer tests
- Add `DS_DISABLE_REUSE_DIST_ENV` to the test harness to prevent worker
cleanup hangs
Once we merge this PR, we will be able to run the full test manually or
at scheduled times.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>