Support destroying and re-creating process groups when some trainers recover after failures (#26912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26912
The group name is used as a prefix in the c10d store, and without a name that is consistent across trainers a process group cannot be initialized.
When a process group doesn't have an explicit name (only the WORLD (default) process group can have one), we use the global _group_counter to generate the name. We need to reset the counter on destruction so that a consistent value is generated when we re-create process groups after some trainers recover from failure.
Test Plan: existing tests passed
Reviewed By: mrshenli
Differential Revision: D17594268
fbshipit-source-id: 17f4d2746584dadaa5d468085d871ff3e95a1c84