Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda (#48946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48946
Move recordFunctionEndCallback to after the blocking portion of launching the NCCL kernel, and remove addCallback, since it runs the lambda inline anyway and triggers unnecessary CUDA stream logic. To profile CUDA operations such as NCCL kernels accurately, the profiler should be run with use_cuda=True. However, we are currently debugging a deadlock in the use_cuda=True case; the fix is being tracked in #48987.
To check that the tests are no longer flaky, this PR was submitted to ci-all (#48947), and the test was run repeatedly over SSH on the CI machine.
ghstack-source-id: 118330130
Test Plan: CI
Reviewed By: mrzzd
Differential Revision: D25368322
fbshipit-source-id: 7d17036248a3dcd855e58addc383bba64d6bc391