Fix flaky test_backward_node_failure_python_udf (#36969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36969
`test_backward_node_failure_python_udf` was flaky because it used the
RPC framework to signal that rank 0 had finished processing. Since this unit
test kills nodes, listenLoop() has very likely already exited on some of them,
so an RPC announcing rank 0's completion may never be processed on those
nodes. To fix this, we use the c10d store for this notification instead.
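The store-based notification pattern can be sketched roughly as follows. This is not the code from the test. It is a minimal illustration that assumes a key-value store with `set` and a blocking `wait`, which c10d stores provide. `InMemoryStore` is a hypothetical in-process stand-in so the sketch runs without PyTorch or multiple processes, and `DONE_KEY`, `run_rank`, and `simulate` are likewise made-up names.

```python
import threading
from datetime import timedelta


class InMemoryStore:
    """Hypothetical stand-in for a key-value store exposing set() and wait()."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        # Publish the value and wake up every waiter.
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def wait(self, keys, timeout):
        # Block until all keys are present, or raise once the timeout expires.
        with self._cond:
            ready = self._cond.wait_for(
                lambda: all(k in self._data for k in keys),
                timeout=timeout.total_seconds(),
            )
        if not ready:
            raise TimeoutError(f"keys not set in time: {keys}")


DONE_KEY = "rank_0_done"  # hypothetical key name


def run_rank(rank, store, log):
    if rank == 0:
        # Rank 0 does its work, then announces completion through the store
        # rather than by sending a message to each peer.
        log.append("rank 0 finished")
        store.set(DONE_KEY, "True")
    else:
        # Other ranks only need the store to be reachable, not a live
        # message loop on their side.
        store.wait([DONE_KEY], timedelta(seconds=5))
        log.append(f"rank {rank} saw completion")


def simulate(world_size=4):
    # Threads stand in for ranks here; the real test uses separate processes.
    store = InMemoryStore()
    log = []
    threads = [
        threading.Thread(target=run_rank, args=(rank, store, log))
        for rank in reversed(range(world_size))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

The point of the design is that the waiting ranks poll shared state instead of relying on a peer-delivered message, so the notification does not depend on each node's RPC listener still being alive.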
ghstack-source-id: 102549873
Test Plan: waitforbuildbot
Differential Revision: D21147099
fbshipit-source-id: 745273a6cae0debbae131bb4cc7debe9c201bf98