[Distributed] Move the cached all-reduce token to C++ (#4912)
Summary:
All collective communication (cc) ops share a token that introduces control dependencies among them, so that they execute in order. This token is currently cached in the Python layer. This pull request moves the cache to C++, because the upcoming pytorch/pytorch#93173 won't carry the token from Python to C++.
Test Plan:
CI.