CUDA event should only be recorded after NCCL group (#8219)
Summary:
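Record the CUDA event only after the NCCL group has ended. NCCL operations issued inside a group may not be enqueued on the stream until the group ends, so an event recorded inside the group may not cover them, and synchronizing on that event will not work.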
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8219
Reviewed By: pietern
Differential Revision: D13788657
Pulled By: teng-li
fbshipit-source-id: 8c96e9691ed2441d7a685fb7ae8fece906f58daf