[Dist profiling] Fix ProcessGroupNCCL collective profiling (#55204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55204
Implements a fix discussed offline with pritamdamia87: run the profiler's end callbacks only after `CUDAFuture`'s `wrapCallback` has ensured appropriate stream synchronization. Also re-enables the relevant distributed profiling tests that were previously disabled for ProcessGroupNCCL.
Note that the profiling infrastructure has moved toward torch.profiler and CUPTI as the primary way to trace CUDA kernels; supporting distributed collectives there will require further discussion with ilia-cher. In the meantime, this PR improves the usability of torch.autograd.profiler with distributed collectives.
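The core idea of the fix can be illustrated generically: rather than recording the end of an async operation when the launching call returns, attach the end callback to the operation's future so it fires only after completion. The sketch below uses Python's `concurrent.futures` as a stand-in (the real fix operates on `CUDAFuture` in C++ and synchronizes CUDA streams); all names here are illustrative, not PyTorch APIs.

```python
from concurrent.futures import ThreadPoolExecutor
import time

events = []

def collective(duration):
    # Stand-in for an async NCCL collective; a sleep simulates kernel work.
    time.sleep(duration)
    events.append("work_done")

def end_callback(fut):
    # Attached to the future, so it runs only after the work has actually
    # finished -- analogous to running the profiler's end callback after
    # wrapCallback has ensured synchronization.
    events.append("profiler_end")

events.append("profiler_start")  # profiling begins before the async launch
with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(collective, 0.05)
    fut.add_done_callback(end_callback)

print(events)  # profiler_end is ordered after work_done
```

With this structure, the recorded end time reflects when the collective truly finished rather than when the non-blocking launch returned.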
ghstack-source-id: 127357995
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D27491711
fbshipit-source-id: cec7703a4c5d59b5023b0aa8fef4c2e3fb8d37d0