[c10d] Add a soft error handling mode (#84386)
Add a new value, "2", for the `NCCL_ASYNC_ERROR_HANDLING` environment variable. It selects a "CleanUpOnly" error handling mode.
Compared to `NCCL_ASYNC_ERROR_HANDLING=1`, the "CleanUpOnly" mode only aborts the collectives and NCCL communicators; it does not tear down the process.
Users will be able to query the state of the process group and abort it (both in later PRs), and then re-create the process group if needed.
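
As a rough illustration of the difference between the modes, the sketch below maps the environment variable to a mode and reports what each mode does on error. It is not the c10d implementation: the enum and helper names (`ErrorHandlingMode`, `parse_error_handling_mode`, `aborts_communicators`, `tears_down_process`) are hypothetical, and treating an unset or unrecognized value as "no handling" is an assumption.

```python
import os
from enum import IntEnum


class ErrorHandlingMode(IntEnum):
    # Hypothetical names; only the values 1 and 2 are described in this PR.
    NO_HANDLING = 0    # assumed meaning of an unset or "0" value
    TEAR_DOWN = 1      # abort collectives/communicators and tear down the process
    CLEAN_UP_ONLY = 2  # abort collectives/communicators, keep the process alive


def parse_error_handling_mode(env=None):
    """Read NCCL_ASYNC_ERROR_HANDLING from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    raw = env.get("NCCL_ASYNC_ERROR_HANDLING", "0")
    try:
        return ErrorHandlingMode(int(raw))
    except ValueError:
        # Assumption: non-numeric or unknown values fall back to no handling.
        return ErrorHandlingMode.NO_HANDLING


def aborts_communicators(mode):
    """Both handling modes abort the collectives and NCCL communicators."""
    return mode in (ErrorHandlingMode.TEAR_DOWN, ErrorHandlingMode.CLEAN_UP_ONLY)


def tears_down_process(mode):
    """Only mode 1 also tears down the process."""
    return mode is ErrorHandlingMode.TEAR_DOWN
```

With `NCCL_ASYNC_ERROR_HANDLING=2`, `aborts_communicators` is true and `tears_down_process` is false in this sketch, which is the property that leaves the process alive to re-create the process group.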
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84386
Approved by: https://github.com/rohan-varma