Have FutureNCCL record streams w/ allocator in addCallback (#48496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48496
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
There are two ways to add a callback to a Future: `then` and `addCallback` (with the former deferring to the latter). FutureNCCL only "patched" `then`, which left `addCallback` unsupported. Patching `addCallback` instead covers both, since `then` goes through it.
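To illustrate why patching `addCallback` covers `then` as well, here is a minimal sketch. It is not the actual FutureNCCL or `c10::ivalue::Future` code: `ToyFuture`, `PatchedFuture`, `patchedCalls` and `countPatchedCalls` are hypothetical names, and a counter stands in for the stream-recording work.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a Future base class; NOT the real c10::ivalue::Future.
class ToyFuture {
 public:
  virtual ~ToyFuture() = default;

  // Runs the callback once the future completes (immediately if already done).
  virtual void addCallback(std::function<void()> cb) {
    if (completed_) {
      cb();
    } else {
      callbacks_.push_back(std::move(cb));
    }
  }

  // `then` defers to `addCallback`, so an override of the latter covers both.
  std::shared_ptr<ToyFuture> then(std::function<void()> cb) {
    auto child = std::make_shared<ToyFuture>();
    addCallback([child, cb = std::move(cb)]() {
      cb();
      child->markCompleted();
    });
    return child;
  }

  void markCompleted() {
    completed_ = true;
    auto pending = std::move(callbacks_);
    callbacks_.clear();
    for (auto& cb : pending) {
      cb();
    }
  }

 private:
  bool completed_ = false;
  std::vector<std::function<void()>> callbacks_;
};

// Hypothetical subclass: the counter is a placeholder for device-specific
// work (such as recording streams) done whenever a callback is registered.
class PatchedFuture : public ToyFuture {
 public:
  void addCallback(std::function<void()> cb) override {
    ++patchedCalls;
    ToyFuture::addCallback(std::move(cb));
  }

  int patchedCalls = 0;
};

// Registers one callback through each entry point and reports how many
// times the patched hook ran.
int countPatchedCalls() {
  PatchedFuture f;
  f.addCallback([] {});
  f.then([] {});
  f.markCompleted();
  return f.patchedCalls;
}
```

In this sketch the override runs once for the direct `addCallback` call and once for the call made through `then`, which is the property the change relies on.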
The high-level goal of this change, though, is to remove all CUDA-specific logic from `then` and move it either to `markCompleted` or to a wrapper around the callback. This will take a few more steps to achieve.
ghstack-source-id: 118180031
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177558
fbshipit-source-id: ee0ad24eb2e56494c353db700319858ef9dcf32b