Delete DDP hooks in Reducer destructor (#21591)
Summary:
Closes https://github.com/pytorch/pytorch/issues/21344
DDP assigns the original module to the first module replica instead of creating a new one. Then, it creates a new Reducer that adds post hooks to synchronize gradients. However, because every reconstructed DDP instance wraps the same original module, all their reducers add hooks to the same set of variables. This PR deletes DDP hooks from variables in the Reducer destructor, aiming to make DDP failures recoverable.
pietern, kuttas, and I discussed the following solutions:
#### Solution 1
Keep the `add_post_hook` API intact, and do a `dynamic_cast` in `del_post_hook` to check the hook type. If the type matches the Reducer's hook, delete it. As pietern mentioned, this will not work if we create multiple DDP instances from the same original model.
#### Solution 2
Use a counter to generate a unique key for every hook in `Function`, and keep the hooks in a map keyed by it. Return the key to the caller of `add_post_hook`, and require the caller to provide the key when it needs to delete the hook.
Con: this would add extra overhead to `add_post_hook` and every `Function` object.
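A rough sketch of what Solution 2 would look like (names are illustrative, not PyTorch's actual API): a per-`Function` counter hands out unique keys, and hooks live in a map keyed by them; the counter and map are the extra overhead noted above.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical sketch of Solution 2: each registered post hook gets a
// unique counter-based key, so a caller can later delete exactly the
// hook it added.
class Function {
 public:
  uint64_t add_post_hook(std::function<void()> hook) {
    const uint64_t key = next_key_++;      // unique key for this hook
    hooks_.emplace(key, std::move(hook));  // map bookkeeping = extra overhead
    return key;
  }
  // The caller must pass back the key it received from add_post_hook.
  bool del_post_hook(uint64_t key) { return hooks_.erase(key) == 1; }
  std::size_t num_post_hooks() const { return hooks_.size(); }

 private:
  uint64_t next_key_ = 0;
  std::unordered_map<uint64_t, std::function<void()>> hooks_;
};
```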
#### Solution 3 [Current implementation]
kuttas suggested that, instead of generating a unique key, directly using the address of the hook would be better. To avoid accidental dereferencing, have `add_post_hook` return the address as a `uintptr_t`.
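A minimal sketch of this approach (illustrative names and types, not the actual PyTorch implementation): `add_post_hook` returns the hook's own address as an opaque `uintptr_t` key, and the Reducer removes its hooks by key in its destructor, so a rebuilt DDP instance wrapping the same module does not stack duplicate hooks.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an autograd post hook.
struct Hook {
  // ... gradient synchronization callback ...
};

// Hypothetical stand-in for an autograd variable holding post hooks.
struct Variable {
  std::vector<std::unique_ptr<Hook>> post_hooks;

  // Return the hook's address as an opaque key (Solution 3).
  uintptr_t add_post_hook(std::unique_ptr<Hook> hook) {
    const auto key = reinterpret_cast<uintptr_t>(hook.get());
    post_hooks.push_back(std::move(hook));
    return key;
  }

  void del_post_hook(uintptr_t key) {
    for (auto it = post_hooks.begin(); it != post_hooks.end(); ++it) {
      if (reinterpret_cast<uintptr_t>(it->get()) == key) {
        post_hooks.erase(it);
        return;
      }
    }
  }
};

class Reducer {
 public:
  explicit Reducer(std::vector<Variable*> vars) : vars_(std::move(vars)) {
    for (auto* v : vars_) {
      keys_.push_back(v->add_post_hook(std::make_unique<Hook>()));
    }
  }
  // Undo registration so reconstructed DDP instances start clean.
  ~Reducer() {
    for (std::size_t i = 0; i < vars_.size(); ++i) {
      vars_[i]->del_post_hook(keys_[i]);
    }
  }

 private:
  std::vector<Variable*> vars_;
  std::vector<uintptr_t> keys_;
};
```

Because the key is just the pointer value reinterpreted as an integer, it is unique per live hook and costs no extra per-`Function` state, which is the advantage over Solution 2.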
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21591
Differential Revision: D15745706
Pulled By: mrshenli
fbshipit-source-id: e56d2d48de0c65f6667790ab16337eac7f7d8b76