pytorch
bb1424d4 - Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)

Commit

1 year ago

Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)

This reverts commit 314a502eb04c6382e2cc9af0573533efba54109d.

Changes since original PR:

Reland 1
* rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
* make _hooks importable even if !distributed.is_available()
* handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D so that our users can detect collective failures both faster and more easily.

The design is similar to NCCL desync debug in that it minimizes overhead by doing most of the work off the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since PG creation happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches them to the hooks.

Queue notification is, oddly, done using a pipe. This is needed so that Python can abort the thread on shutdown and run it as a background thread. This is not possible with more reasonable choices like a condvar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
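The pipe-based notification described above can be illustrated with a minimal pure-Python sketch. This is not the code from this commit: the class `PipeNotifiedDispatcher` and its methods `register_hook`, `publish`, and `close` are hypothetical names chosen for illustration, and a `collections.deque` stands in for the C++ event queue. It only shows the general pattern of a producer writing a byte to a pipe to wake a single background thread, which then drains one event and calls the registered hooks.

```python
import os
import threading
from collections import deque


class PipeNotifiedDispatcher:
    """Hypothetical sketch of a pipe-notified event dispatcher (not PyTorch code)."""

    def __init__(self):
        self._events = deque()          # stand-in for the C++ event queue
        self._hooks = []
        self._lock = threading.Lock()
        self._read_fd, self._write_fd = os.pipe()
        # Daemon thread: it never keeps the interpreter alive at shutdown.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def register_hook(self, hook):
        with self._lock:
            self._hooks.append(hook)

    def publish(self, event):
        # Enqueue first, then notify, so the reader always finds the event.
        with self._lock:
            self._events.append(event)
        os.write(self._write_fd, b"e")

    def close(self):
        # A distinct byte tells the background thread to stop.
        os.write(self._write_fd, b"q")
        self._thread.join()
        os.close(self._read_fd)
        os.close(self._write_fd)

    def _run(self):
        while True:
            token = os.read(self._read_fd, 1)   # blocks until notified
            if token == b"q":
                return
            with self._lock:
                event = self._events.popleft()
                hooks = list(self._hooks)
            # Hooks run on the background thread, outside the lock.
            for hook in hooks:
                hook(event)


if __name__ == "__main__":
    seen = []
    dispatcher = PipeNotifiedDispatcher()
    dispatcher.register_hook(seen.append)
    dispatcher.publish("collective_start")
    dispatcher.publish("collective_end")
    dispatcher.close()
    print(seen)
```

Because the pipe is a FIFO, the stop byte written by `close` is read only after every earlier notification, so pending events are dispatched before the thread exits in this sketch.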

Author

wconstab

Committer

pytorchmergebot

Parents

dede1e96
