pytorch
776079b5 - Fix test_file_system_checkpoint_cpu.py temp directory usage (#93302)

Commit
1 year ago
Fix test_file_system_checkpoint_cpu.py temp directory usage (#93302)

Fixes https://github.com/pytorch/pytorch/issues/93245

This failure started happening recently. `tempfile.mkdtemp()` has already created the temporary directory, so removing it with `shutil.rmtree` and then recreating it with `os.makedirs` doesn't make much sense to me. The flaky problem here is that `shutil.rmtree` can sometimes fail to remove the temporary directory, in which case the following `os.makedirs` call raises `FileExistsError`. Here is the error:

```
======================================================================
ERROR [1.814s]: test_load_rowwise_to_colwise_thread_count_2 (__main__.TestDistributedReshardOnLoad)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 539, in wrapper
    self._join_processes(fn)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 765, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 810, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 663, in run_test
    getattr(self, test_name)()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 541, in wrapper
    fn()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 252, in instantiated_test
    test(self, **param_kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 94, in wrapper
    func(self, *args, **kwargs)
  File "/var/lib/jenkins/workspace/test/distributed/checkpoint/test_file_system_checkpoint_cpu.py", line 364, in test_load_rowwise_to_colwise
    os.makedirs(path)
  File "/opt/conda/envs/py_3.8/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/tmp/tmps5rxw4hb'
```

If the temporary directory really needs to be cleaned up, another way would be to remove everything underneath it but leave the directory itself alone.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93302
Approved by: https://github.com/kumpera