Increase tolerance for some distributed tests to 5e-5 (#60462)
Summary:
On A100 GPUs 10 tests fail due to slightly higher deviations.
This fixes those.
Note that rtol is still the default and atol was increased by a factor of 5 (from 1e-5)
The failing tests were:
- test_accumulate_gradients_module
- test_accumulate_gradients_module_with_grad_is_view
- test_ddp_checkpointing_once
- test_ddp_checkpointing_twice
- test_ddp_checkpointing_unused_params
- test_ddp_checkpointing_weight_sharing
- test_nccl_backend_1gpu_module_device_ids_integer_list
- test_nccl_backend_1gpu_module_device_ids_torch_device_list
- test_nccl_backend_single_device_module_device_ids_None
- test_nccl_backend_single_device_module_empty_device_id
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60462
Reviewed By: albanD
Differential Revision: D29366145
Pulled By: zhaojuanmao
fbshipit-source-id: c3e34c007363dfebf75ccb82004a67e4d2e6f3cd