(torch/elastic) fix scale-down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67749
Fixes: https://github.com/pytorch/pytorch/issues/67742
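For readers without the diff at hand, the idea in the title can be sketched roughly as follows. This is an illustrative sketch, not the actual patch: it assumes the fix amounts to skipping `rdzv_handler.shutdown()` when the agent is terminated by a signal, so that a restarted agent can rejoin the same `rdzv_id`. `SignalException`, `FakeRdzvHandler`, and `launch` are hypothetical stand-ins defined locally here, not the torch/elastic implementations.

```python
import signal


class SignalException(Exception):
    """Hypothetical stand-in: raised when the agent receives a termination signal."""

    def __init__(self, msg: str, sigval: signal.Signals) -> None:
        super().__init__(msg)
        self.sigval = sigval


class FakeRdzvHandler:
    """Hypothetical stand-in for a rendezvous handler; records whether shutdown() ran."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        # Assumption: closing the rendezvous is permanent for this rdzv_id.
        self.closed = True


def launch(run_agent, rdzv_handler) -> None:
    """Run the agent and shut the rendezvous down unless the agent was killed by a signal."""
    shutdown_rdzv = True
    try:
        run_agent()
    except SignalException:
        # Premature agent failure: leave the rendezvous open so a restarted
        # agent can rejoin under the same rdzv_id.
        shutdown_rdzv = False
        raise
    finally:
        if shutdown_rdzv:
            rdzv_handler.shutdown()
```

With this shape, a normal exit or an ordinary exception still closes the rendezvous, while SIGINT/SIGTERM on one agent leaves it open for the remaining and restarted agents.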
Test Plan:
Added unit tests.
Validated manually:
```
# start agent 0
$ torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py
# start agent 1
$ torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py
# kill agent 0 with CTRL+C (SIGINT) or kill -15 (SIGTERM)
# restart agent 0
$ torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py
```
Reviewed By: cbalioglu
Differential Revision: D32129005
fbshipit-source-id: db292268250ef6f1e06f5b4c5bd67124d8dfd325