pytorch
06ebe2d5 - Add watchdog to TorchElastic agent and trainers (#84081)

Commit
2 years ago
Add watchdog to TorchElastic agent and trainers (#84081)

Summary:
D38604238 (https://github.com/pytorch/pytorch/commit/3b11b80fc3f9f9a0171abb5eb2299835feba8b04) introduced a named pipe based watchdog timer.

This diff uses the named pipe based watchdog timer in the TorchElastic agent and in training worker processes (in the StuckJobDetector class), so that the TorchElastic agent can detect a stuck training process and kill it to create a core dump.

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test
```
```
RemoteExecution session id: reSessionID-0bfcacef-24d1-42bc-a1d3-f3058fc42b2f-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/7318349503394739
    ✓ ListingSuccess: caffe2/test/distributed/elastic/agent/server/test:local_agent_test : 55 tests discovered (22.699)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_barrier_failed_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (47.140)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_homogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (49.198)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_happy_function_c10d (local_elastic_agent_test.LocalElasticAgentTest) (46.387)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_happy_function_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (46.094)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_bipolar_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (106.342)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_homogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (64.888)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_homogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (69.158)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_enabled_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.965)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_elastic_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (79.626)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_function_with_return_value_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (46.113)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_sad_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.487)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_shutdown_called_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (24.358)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_torch_rpc_c10d (local_elastic_agent_test.LocalElasticAgentTest) (48.216)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_homogeneous_c10d (local_elastic_agent_test.LocalElasticAgentTest) (48.433)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_torch_rpc_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (47.029)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_simple_dist_sum_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (44.357)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_check_master_addr_port_override_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (45.176)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_check_nccl_async_error_handling_env_default_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.980)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_simple_dist_sum_c10d (local_elastic_agent_test.LocalElasticAgentTest) (47.151)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_simple_dist_sum_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.614)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_heterogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (69.099)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_enabled_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.367)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_shutdown_called_etcd (local_elastic_agent_test.LocalElasticAgentTest) (22.804)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_elastic_c10d (local_elastic_agent_test.LocalElasticAgentTest) (77.560)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_dummy_compute_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (46.050)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_heterogeneous_c10d (local_elastic_agent_test.LocalElasticAgentTest) (48.088)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_elastic_etcd (local_elastic_agent_test.LocalElasticAgentTest) (77.286)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_fault_tolerance_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (50.670)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_check_master_addr_port_override_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.631)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_heterogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (50.867)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_fault_tolerance_etcd (local_elastic_agent_test.LocalElasticAgentTest) (51.095)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_happy_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.000)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_sad_function_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (45.197)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_homogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.873)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_shutdown_called_c10d (local_elastic_agent_test.LocalElasticAgentTest) (23.160)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_barrier_failed_etcd (local_elastic_agent_test.LocalElasticAgentTest) (43.632)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_torch_rpc_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.536)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_bipolar_function_c10d (local_elastic_agent_test.LocalElasticAgentTest) (89.859)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_fail_etcd (local_elastic_agent_test.LocalElasticAgentTest) (48.277)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_check_nccl_async_error_handling_env_c10d (local_elastic_agent_test.LocalElasticAgentTest) (43.930)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_bipolar_function_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (87.677)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_success_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (48.965)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_fail_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (50.143)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_success_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.781)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_function_with_return_value_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.152)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_barrier_failed_c10d (local_elastic_agent_test.LocalElasticAgentTest) (44.832)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_function_with_return_value_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.281)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_heterogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (74.968)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_disabled_c10d (local_elastic_agent_test.LocalElasticAgentTest) (46.141)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_dummy_compute_c10d (local_elastic_agent_test.LocalElasticAgentTest) (44.960)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_dummy_compute_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.292)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_disabled_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.611)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_check_env_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.939)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_heterogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (47.609)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_sad_function_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.628)
Summary
  Pass: 55
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/7318349503394739
```

-----------

```
buck test caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test
```
```
RemoteExecution session id: reSessionID-607a0028-4095-4dfc-b657-55f0807fe621-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8162774432794818
    ✓ ListingSuccess: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test : 11 tests discovered (39.037)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_thrift_api_called (caffe2.torch.fb.trainer.stuck_detection.tests.collect_quickstack_test.CollectQuickstackTrace) (0.655)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_setup_local_watchdog (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (36.510)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_dont_print_when_job_normal (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (36.727)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_send_watchdog_request_on_batch_callbacks_no_server (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.060)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_quickstack_stuck_job (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.242)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_setup_local_watchdog_disabled (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.243)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_print_stack_trace_when_job_stuck (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.590)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_print_when_stuck (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.590)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_setup_local_watchdog_no_file (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.589)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_signposts_stack_trace_when_job_stuck (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (38.132)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_send_watchdog_request_on_batch_callbacks (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (38.133)
Summary
  Pass: 11
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/8162774432794818
```

Differential Revision: D38930476
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84081
Approved by: https://github.com/d4l3k
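
For readers unfamiliar with the idea, the following is a minimal, standard-library sketch of how a named pipe based watchdog can work in general. It is not the TorchElastic implementation: the names `PipeWatchdog`, `heartbeat`, and `kill_expired`, the `<pid> <deadline>` message format, and the choice of `SIGABRT` are all assumptions made for illustration. It assumes a POSIX system where `os.mkfifo` is available.

```python
import os
import signal
import time


class PipeWatchdog:
    """Agent side (hypothetical): reads deadlines that workers write to a named pipe."""

    def __init__(self, fifo_path):
        self.fifo_path = fifo_path
        os.mkfifo(fifo_path)
        # O_NONBLOCK lets the agent open and poll the pipe before any writer exists.
        self._fd = os.open(fifo_path, os.O_RDONLY | os.O_NONBLOCK)
        self._buf = b""
        self.deadlines = {}  # pid -> absolute expiry time in seconds

    def poll(self):
        """Drain pending messages of the (assumed) form b'<pid> <deadline>\\n'."""
        while True:
            try:
                chunk = os.read(self._fd, 4096)
            except BlockingIOError:
                break  # a writer is connected but has sent nothing new
            if not chunk:
                break  # no writer connected and the pipe is empty
            self._buf += chunk
        # Keep any trailing partial line in the buffer for the next poll.
        *lines, self._buf = self._buf.split(b"\n")
        for line in lines:
            if not line.strip():
                continue
            pid, deadline = line.split()
            self.deadlines[int(pid)] = float(deadline)

    def expired(self, now=None):
        """Return the pids whose most recent deadline has passed."""
        now = time.time() if now is None else now
        return [pid for pid, d in self.deadlines.items() if d <= now]

    def close(self):
        os.close(self._fd)
        os.unlink(self.fifo_path)


def heartbeat(fifo_path, timeout_s, pid=None):
    """Worker side (hypothetical): push a fresh deadline, e.g. once per batch."""
    pid = os.getpid() if pid is None else pid
    fd = os.open(fifo_path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        os.write(fd, f"{pid} {time.time() + timeout_s}\n".encode())
    finally:
        os.close(fd)


def kill_expired(watchdog, sig=signal.SIGABRT):
    """Signal every expired worker; SIGABRT's default action dumps core."""
    for pid in watchdog.expired():
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the worker already exited
        del watchdog.deadlines[pid]
```

In this sketch a worker that keeps calling `heartbeat` keeps moving its own deadline forward, while a worker that hangs stops writing and eventually appears in `expired()`, at which point the agent can signal it. Whether a core file is actually written also depends on system settings such as the core size limit.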