Fix full CI test isolation for ZeRO chmod and NVMe quantization tests (#8008)
## Summary
This PR fixes two intermittent full-CI test isolation
[failures](https://github.com/deepspeedai/DeepSpeed/actions/runs/25789145638/job/75749943219)
observed in the scheduled `aws-torch-latest-full` workflow.
- Avoid TCP `env://` rendezvous port collisions in
`TestZeRONonDistributed::test_chmod_exception_handling`.
- Give the NVMe int4 quantization tests per-test offload directories
instead of sharing `~/tmp_offload_dir`.
## Root Cause
- The ZeRO chmod test sets `world_size = 1`, but it disabled the
distributed test harness initialization while still calling
`deepspeed.initialize()`. Under full CI, which runs `pytest-xdist -n 8`,
that call could fall back to TCP `env://` rendezvous and collide with
another worker on the selected `MASTER_PORT`.
- Both NVMe quantization tests used the same `~/tmp_offload_dir`.
When the post-init NVMe test and the quantized-initialization NVMe test
ran concurrently on different xdist workers, one worker could remove or
recreate rank-local swap files while the other worker was still reading
them.
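
The shared-directory hazard described above can be reproduced in isolation with a small, deterministic sketch. It does not use DeepSpeed at all: `setup_offload`, the swap file name, and the `worker_a` / `worker_b` directories are hypothetical stand-ins for whatever the real tests create, and the two "workers" are simulated sequentially rather than with real xdist processes.

```python
import shutil
import tempfile
from pathlib import Path


def setup_offload(dir_path: Path, rank: int = 0) -> Path:
    """Hypothetical stand-in for test setup: recreate the offload
    directory and write one rank-local swap file into it."""
    shutil.rmtree(dir_path, ignore_errors=True)
    dir_path.mkdir(parents=True)
    swap_file = dir_path / f"rank{rank}_param.swp"
    swap_file.write_bytes(b"\x00" * 16)
    return swap_file


base = Path(tempfile.mkdtemp())

# Shared layout: both workers point at the same directory.
shared = base / "tmp_offload_dir"
swap_a = setup_offload(shared)   # worker A starts its test
swap_b = setup_offload(shared)   # worker B recreates the same directory
same_path = swap_a == swap_b     # both workers resolve to one file path
shutil.rmtree(shared)            # worker B finishes and cleans up
shared_file_survived = swap_a.exists()  # worker A's file is gone

# Isolated layout: each worker gets its own directory.
iso_a = setup_offload(base / "worker_a" / "nvme_offload")
iso_b = setup_offload(base / "worker_b" / "nvme_offload")
shutil.rmtree(iso_b.parent)      # worker B cleans up only its own directory
isolated_file_survived = iso_a.exists()  # worker A's file is untouched

shutil.rmtree(base, ignore_errors=True)
```

In the shared layout, worker A's swap file disappears as a side effect of worker B's setup and cleanup. In the isolated layout, the same sequence leaves worker A's file in place.
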
## Changes
- Let `TestZeRONonDistributed` use the existing file-store distributed
test harness initialization.
- Add an optional `nvme_path` argument to the NVMe quantization helpers.
- Pass a `tmpdir`-scoped `nvme_offload` path from each NVMe test.
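
A minimal sketch of the per-test offload path pattern follows, assuming a helper shaped roughly like the ones this PR touches. `build_nvme_offload_config` and `test_nvme_offload_uses_private_dir` are hypothetical names, not the actual helpers in the test suite, and the config fragment is only an illustration of where an NVMe path would typically be supplied.

```python
import os
from typing import Optional


def build_nvme_offload_config(nvme_path: Optional[str] = None) -> dict:
    """Hypothetical helper: build a ZeRO stage 3 config fragment that
    offloads parameters to NVMe. Without an explicit path it falls back
    to the old shared location under the home directory."""
    if nvme_path is None:
        nvme_path = os.path.join(os.path.expanduser("~"), "tmp_offload_dir")
    return {
        "zero_optimization": {
            "stage": 3,
            "offload_param": {"device": "nvme", "nvme_path": nvme_path},
        }
    }


def test_nvme_offload_uses_private_dir(tmpdir):
    """Hypothetical pytest test: `tmpdir` is unique per test, so
    concurrent xdist workers never share an offload directory."""
    nvme_dir = tmpdir.join("nvme_offload")
    config = build_nvme_offload_config(nvme_path=str(nvme_dir))
    offload = config["zero_optimization"]["offload_param"]
    assert offload["nvme_path"].startswith(str(tmpdir))
```

Keeping `nvme_path` optional means existing callers that do not pass a path keep their previous behavior, while the NVMe tests opt in to an isolated directory.
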
The full workflow passed with this PR branch:
https://github.com/deepspeedai/DeepSpeed/actions/runs/25842039450
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>