8b13ab93 - Event Logging for NCCL Async Error Handling Process Crash (#47244)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47244

This is an event-logging-based update that should allow us to collect high-quality data about how often the NCCL Async Error Handling mechanism is triggered. It logs an event named `ProcessGroupNCCL.WorkNCCL.handleNCCLGuard`, which is recorded as an entry in the `scuba_caffe2_pytorch_usage_stats` Scuba table. Each entry also carries metadata such as workflow status, entitlement, hostname, and workflow name, which gives us insight into which workloads/domains and machines benefit from async error handling. It also contains the Flow Run ID, which can be used as a join key with the `fblearner_workflow_run_status` Scuba table for additional information such as the final error message.

We can quantify the number of times the async handling code was triggered by querying the `scuba_caffe2_pytorch_usage_stats` table. As a demonstration, I ran the following workflow with this diff patched in: f229675892. Since that workflow causes a desync, the `handleNCCLGuard` event is logged to Scuba shortly afterward. See the filtered table here: https://www.fburl.com/scuba/scuba_caffe2_pytorch_usage_stats/tmp1uvio

There are 4 entries. The workflow uses 3 GPUs, 2 of which hit the desync scenario and are crashed by async error handling. The workflow is made to fail twice before succeeding on the 3rd attempt, hence 4 entries.

ghstack-source-id: 115708632

Test Plan: Ran the quick demo described above. The Scuba entries with the logs can be found here: https://www.fburl.com/scuba/scuba_caffe2_pytorch_usage_stats/tmp1uvio

Reviewed By: jiayisuse

Differential Revision: D24688739

fbshipit-source-id: 7532dfeebc53e291fbe10d28a6e50df6324455b1
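For context, a minimal sketch of how this kind of event logging is typically wired into the async-error path: the guard that rethrows the stored NCCL exception can emit a one-off usage event via the existing `C10_LOG_API_USAGE_ONCE` macro just before the process crashes. The class and member names below mirror `ProcessGroupNCCL::WorkNCCL`, but the body is an illustrative sketch under that assumption, not the exact diff.

```cpp
// Illustrative sketch (not the exact diff): log a usage event from the
// NCCL async-error guard right before the stored exception is rethrown.
#include <c10/util/Logging.h>

#include <exception>
#include <mutex>
#include <utility>

namespace sketch {

class WorkNCCL {
 public:
  // Called by the async error-handling watchdog when this work object
  // holds an exception; rethrowing it crashes the desynced process.
  void handleNCCLGuard() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (exception_) {
      // Internal builds route API-usage events to the
      // scuba_caffe2_pytorch_usage_stats table; the default OSS handler
      // is a no-op, so open-source users are unaffected.
      C10_LOG_API_USAGE_ONCE("ProcessGroupNCCL.WorkNCCL.handleNCCLGuard");
      std::rethrow_exception(exception_);
    }
  }

  // Hypothetical setter used only to make the sketch self-contained.
  void setException(std::exception_ptr e) {
    std::lock_guard<std::mutex> lock(mutex_);
    exception_ = std::move(e);
  }

 private:
  std::mutex mutex_;
  std::exception_ptr exception_;
};

} // namespace sketch
```

If the real change uses these once-per-process semantics, each crashing rank contributes at most one Scuba entry per attempt, which is consistent with the 4 entries observed above (2 desynced GPUs across 2 failed attempts).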