DeepSpeed
4570c508 - Fix full CI test isolation for ZeRO chmod and NVMe quantization tests (#8008)

Commit
13 days ago
Fix full CI test isolation for ZeRO chmod and NVMe quantization tests (#8008)

## Summary

This PR fixes two intermittent full-CI test isolation [failures](https://github.com/deepspeedai/DeepSpeed/actions/runs/25789145638/job/75749943219) observed in the scheduled `aws-torch-latest-full` workflow.

- Avoid TCP `env://` rendezvous port collisions in `TestZeRONonDistributed::test_chmod_exception_handling`.
- Give the NVMe int4 quantization tests per-test offload directories instead of sharing `~/tmp_offload_dir`.

## Root Cause

- The ZeRO chmod test sets `world_size = 1`, but it disabled the distributed test harness initialization while still calling `deepspeed.initialize()`. In the full CI `pytest-xdist -n 8` environment, this could fall back to TCP rendezvous and collide on the selected `MASTER_PORT`.
- The NVMe quantization tests both used the same `~/tmp_offload_dir`. When the post-init NVMe test and the quantized-initialization NVMe test ran concurrently on different xdist workers, one worker could remove or recreate rank-local swap files while the other worker was still reading them.

## Changes

- Let `TestZeRONonDistributed` use the existing file-store distributed test harness initialization.
- Add an optional `nvme_path` argument to the NVMe quantization helpers.
- Pass a `tmpdir`-scoped `nvme_offload` path from each NVMe test.

The full workflow passed with this PR branch: https://github.com/deepspeedai/DeepSpeed/actions/runs/25842039450

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
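The NVMe part of the change can be illustrated with a minimal sketch. It is not the code from this PR: `build_nvme_offload_config`, `per_test_offload_dir`, and `SHARED_DEFAULT` are hypothetical names, and `tempfile.TemporaryDirectory` stands in for pytest's `tmpdir` fixture. The sketch only assumes the documented DeepSpeed config keys `zero_optimization.offload_param.device` and `nvme_path`.

```python
import os
import tempfile
from typing import Optional

# Hypothetical default mirroring the shared directory named in the commit message.
SHARED_DEFAULT = os.path.join(os.path.expanduser("~"), "tmp_offload_dir")


def build_nvme_offload_config(nvme_path: Optional[str] = None) -> dict:
    """Hypothetical helper: build a ZeRO-3 config that offloads params to NVMe.

    With nvme_path=None every caller lands in the same shared directory,
    which is the situation that let concurrent xdist workers interfere.
    """
    path = nvme_path if nvme_path is not None else SHARED_DEFAULT
    return {
        "zero_optimization": {
            "stage": 3,
            "offload_param": {"device": "nvme", "nvme_path": path},
        }
    }


def per_test_offload_dir(base_dir: str) -> str:
    """Hypothetical helper: create an `nvme_offload` directory under a per-test base."""
    path = os.path.join(base_dir, "nvme_offload")
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    # Two temporary directories play the role of two tests' tmpdir fixtures.
    with tempfile.TemporaryDirectory() as test_a, tempfile.TemporaryDirectory() as test_b:
        cfg_a = build_nvme_offload_config(per_test_offload_dir(test_a))
        cfg_b = build_nvme_offload_config(per_test_offload_dir(test_b))
        path_a = cfg_a["zero_optimization"]["offload_param"]["nvme_path"]
        path_b = cfg_b["zero_optimization"]["offload_param"]["nvme_path"]
        # The two tests no longer share swap-file storage.
        print(path_a != path_b)
```

Keeping the argument optional means existing callers of such a helper continue to work unchanged, while tests that may run in parallel can opt in to isolated storage.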