Make RequestCallback collect Futures from methods, rather than providing them (#57847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57847
This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.
With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then wait on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.
In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.
If our code leveraged CUDAFuture at its core, thing would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, hence automatically doing the right thing.
I also think that, even without these concerns about CUDA, there are benefits to use Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and thus mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).
In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.
---
Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various method could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than them being "pushed down" from the upper ones.
In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.
ghstack-source-id: 129567062
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28224478
fbshipit-source-id: dbdc66b6458645a4a164c02f00d8618fa64da028