8395fdde - Increase tolerance for some distributed tests to 5e-5 (#60462)

Summary:
On A100 GPUs, 10 tests fail due to slightly higher numerical deviations; this change fixes them. Note that rtol is still the default and atol was increased by a factor of 5 (from 1e-5 to 5e-5).

The failing tests were:
- test_accumulate_gradients_module
- test_accumulate_gradients_module_with_grad_is_view
- test_ddp_checkpointing_once
- test_ddp_checkpointing_twice
- test_ddp_checkpointing_unused_params
- test_ddp_checkpointing_weight_sharing
- test_nccl_backend_1gpu_module_device_ids_integer_list
- test_nccl_backend_1gpu_module_device_ids_torch_device_list
- test_nccl_backend_single_device_module_device_ids_None
- test_nccl_backend_single_device_module_empty_device_id

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60462

Reviewed By: albanD
Differential Revision: D29366145
Pulled By: zhaojuanmao
fbshipit-source-id: c3e34c007363dfebf75ccb82004a67e4d2e6f3cd
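As a rough sketch of what the tolerance bump amounts to (the helper name `assert_grads_close` and the tensor shapes below are hypothetical, not the actual test code from the PR): rtol is left at its default while the absolute tolerance is passed explicitly as 5e-5.

```python
import torch

def assert_grads_close(actual: torch.Tensor, expected: torch.Tensor) -> None:
    # Hypothetical helper illustrating the change: rtol stays at the float32
    # default used by torch.testing.assert_close, while atol is raised 5x,
    # from the 1e-5 default to 5e-5, to absorb the slightly larger
    # deviations observed on A100 GPUs.
    torch.testing.assert_close(actual, expected, rtol=1.3e-6, atol=5e-5)

# Two gradient tensors differing by ~2e-5: this passes with atol=5e-5
# but would typically fail under the old 1e-5 default.
expected = torch.randn(8, 4)
actual = expected + 2e-5
assert_grads_close(actual, expected)
```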