Reduce overhead when Future invokes callbacks inline (#57638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57638
In RPC there are a few instances of "fastpaths" which do `if (fut.isCompleted()) { do_sth(); } else { fut.addCallback(do_sth); }`. I intend to get rid of them, for reasons I'll clarify later but which in a nutshell have to do with CUDA correctness and readability. Note that dropping the fastpath introduces no change in behavior, because `addCallback` already invokes the callback inline when the future is complete. The only perf concern is therefore that the fastpath avoids constructing and passing around a `std::function`. I don't think this is a significant performance hit. Regardless, this PR preemptively addresses the concern by tweaking `addCallback` (and similar methods) so that they accept raw lambdas and do _not_ wrap them into `std::function`s when they are invoked inline.
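To illustrate the idea, here is a minimal sketch of a templated `addCallback` that only pays for a `std::function` when the callback has to be stored. `ToyFuture` and its members are hypothetical stand-ins written for this example; they are not the actual PyTorch `Future` implementation, which differs in its details.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical, simplified future used only for illustration.
class ToyFuture {
 public:
  bool isCompleted() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return completed_;
  }

  // Marks the future as done and runs every callback stored so far.
  void markCompleted() {
    std::vector<std::function<void()>> pending;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      completed_ = true;
      pending.swap(callbacks_);
    }
    // Run callbacks outside the lock so they may safely touch the future.
    for (auto& cb : pending) {
      cb();
    }
  }

  // Accepts any callable. The std::function wrapper is constructed only
  // on the slow path, where the callable must outlive this call.
  template <typename F>
  void addCallback(F&& callback) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (completed_) {
      lock.unlock();
      // Inline path: call the raw callable directly, no type erasure.
      std::forward<F>(callback)();
      return;
    }
    // Deferred path: type-erase into std::function and store it.
    callbacks_.emplace_back(std::forward<F>(callback));
  }

 private:
  mutable std::mutex mutex_;
  bool completed_ = false;
  std::vector<std::function<void()>> callbacks_;
};
```

With a shape like this, a caller can write `fut.addCallback([&] { do_sth(); });` unconditionally: on an already-completed future the lambda runs immediately as a plain call, which is the cost the hand-written fastpath was trying to achieve.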
ghstack-source-id: 129567067
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28222808
fbshipit-source-id: eb1c7114cf7aca3403cb708f14287cab0907ecfa