[dist autograd] profile the amount of time spent executing (#35261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35261
Uses the RECORD_FUNCTION macro to profile the time spent in
dist_autograd's backward pass and ensure that it shows up in the profiler
output. Since dist_autograd.backward() is blocking, the RecordFunction can be
scoped around the call directly rather than stuffed into a completion
callback. This does not yet support profiling the RPCs that are created when
gradients are forwarded to other nodes; that can be added in a follow-up
diff.
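
The key point is that a blocking call can be profiled with a simple scope. As an illustrative analogy only (not the actual C++ RECORD_FUNCTION macro or the dist_autograd API), the pattern looks roughly like this, with `record_function` and `events` being hypothetical names:

```python
import time
from contextlib import contextmanager

# Hypothetical stand-in for a scoped profiling event: records how long the
# enclosed block takes, the way RECORD_FUNCTION scopes an event around a
# region of C++ code.
@contextmanager
def record_function(name, events):
    start = time.perf_counter()
    try:
        yield
    finally:
        events.append((name, time.perf_counter() - start))

events = []
# Because the profiled call is blocking, the full duration is captured by
# simply entering and exiting the scope -- no completion callback needed.
with record_function("dist_autograd::backward", events):
    time.sleep(0.01)  # stand-in for the blocking backward pass

name, elapsed = events[0]
```

If the call were asynchronous instead, the end of the event would have to be recorded in a callback attached to the returned future, which is the complication this change avoids.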
ghstack-source-id: 100723408
Test Plan: Added a UT.
Differential Revision: D20611653
fbshipit-source-id: f9718cf488398a1c7b63ac3841bd2f4549082c8a