[RFC] Allow elastic agent to fail fast (#99051)
Summary: Today, on a segfault on a single trainer , we end up keeping the gpu on all ranks blocked for 5 minutes due to elastic agents barrier timeouts
Test Plan: Rely on existing test to validate . Looking to get some feedback on adding UTs
Differential Revision: D44929488
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99051
Approved by: https://github.com/kurman, https://github.com/kiukchung