Fix iterator for ncclCommWatchdog. (#32571)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32571
The watchdog thread would erase an element and then call `it--`, implicitly
relying on the `it++` in the for loop to reposition the iterator. However,
`it--` is undefined behavior when the iterator points to begin(). I've
changed the loop to update the iterator correctly without the decrement.
I've also enhanced the watchdog thread to catch and log exceptions.
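
For illustration only, a minimal sketch of the erase-while-iterating pattern
described above. This is not the code from this PR: `Work`, `pruneCompleted`,
and `watchdogPass` are hypothetical names, and `std::list` is an assumed
container. The loop takes the iterator returned by `erase` instead of
decrementing, so erasing the first element is safe, and one pass is wrapped
in try/catch so exceptions are logged rather than escaping the thread.

```cpp
#include <cstddef>
#include <exception>
#include <iostream>
#include <list>

// Hypothetical stand-in for a unit of work tracked by a watchdog.
struct Work {
  int id;
  bool completed;
};

// Erases every completed entry and returns how many were removed.
std::size_t pruneCompleted(std::list<Work>& works) {
  std::size_t removed = 0;
  // No increment in the loop header: the body advances the iterator.
  for (auto it = works.begin(); it != works.end();) {
    if (it->completed) {
      // erase() returns the iterator following the erased element,
      // so no `it--` is needed, even when `it == works.begin()`.
      it = works.erase(it);
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}

// Runs one pass and logs any exception. Returns true if the pass succeeded.
bool watchdogPass(std::list<Work>& works) noexcept {
  try {
    pruneCompleted(works);
    return true;
  } catch (const std::exception& e) {
    std::cerr << "watchdog caught exception: " << e.what() << '\n';
  } catch (...) {
    std::cerr << "watchdog caught unknown exception\n";
  }
  return false;
}
```

In this sketch, pruning a list whose first entry is completed never
decrements an iterator at begin(), which is the case the old `it--` logic
could not handle.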
ghstack-source-id: 97150763
Test Plan: waitforbuildbot
Differential Revision: D19551365
fbshipit-source-id: 426835819ad8d467bccf5846b04d14442a342f78