pytorch
8395fdde - Increase tolerance for some distributed tests to 5e-5 (#60462)

Commit

3 years ago

Increase tolerance for some distributed tests to 5e-5 (#60462) Summary: On A100 GPUs 10 tests fail due to slightly higher deviations. This fixes those. Note that rtol is still the default and atol was increased by a factor of 5 (from 1e-5) The failing tests were: - test_accumulate_gradients_module - test_accumulate_gradients_module_with_grad_is_view - test_ddp_checkpointing_once - test_ddp_checkpointing_twice - test_ddp_checkpointing_unused_params - test_ddp_checkpointing_weight_sharing - test_nccl_backend_1gpu_module_device_ids_integer_list - test_nccl_backend_1gpu_module_device_ids_torch_device_list - test_nccl_backend_single_device_module_device_ids_None - test_nccl_backend_single_device_module_empty_device_id Pull Request resolved: https://github.com/pytorch/pytorch/pull/60462 Reviewed By: albanD Differential Revision: D29366145 Pulled By: zhaojuanmao fbshipit-source-id: c3e34c007363dfebf75ccb82004a67e4d2e6f3cd

Author

Flamefire

Committer

facebook-github-bot

Parents

2fa6c762

pytorch 8395fdde - Increase tolerance for some distributed tests to 5e-5 (#60462)

pytorch
8395fdde - Increase tolerance for some distributed tests to 5e-5 (#60462)