Fix FutureNCCL not recording dataptrs with caching alloc in wait() (#48563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48563
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
The CUDA caching allocator requires us to register every stream in which a DataPtr is used. We already do so when invoking a callback, for which we obtain streams from the ATen pool. However, we didn't do so when the user waits for the Future and then uses the results on their current streams. This was probably fine in most cases, because the outputs of the NCCL ops (which are the tensors we're dealing with here) were user-provided, and thus already registered with some user streams; still, in principle the user could wait on different streams than the ones they used to create the tensors. (If they use the same streams, registering is a no-op.)

More importantly, this change will help us turn FutureNCCL into a more general-purpose class: in RPC, for example, the result tensors are allocated by PyTorch itself, so we need to record their usage on the user's streams with the caching allocator.
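For context, here is a minimal sketch (not the actual implementation) of the kind of recording that wait() needs to perform on the output tensors. The helper name `recordDataPtrsOnCurrentStreams` is hypothetical; the `c10::cuda::CUDACachingAllocator::recordStream` call is the existing public API that registers a DataPtr's use on a stream:

```cpp
#include <vector>

#include <ATen/ATen.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>

// Hypothetical helper: after the user's current streams have been made to
// wait on the NCCL events, tell the caching allocator that each output
// tensor's storage is now used on the current stream of its device, so the
// allocator won't hand that memory back out until work queued on that
// stream has finished.
void recordDataPtrsOnCurrentStreams(const std::vector<at::Tensor>& outputs) {
  for (const auto& tensor : outputs) {
    if (!tensor.is_cuda()) {
      continue;
    }
    c10::cuda::CUDAStream stream =
        c10::cuda::getCurrentCUDAStream(tensor.device().index());
    c10::cuda::CUDACachingAllocator::recordStream(
        tensor.storage().data_ptr(), stream);
  }
}
```

If the user waits and consumes the results on the same streams that produced the tensors, this recording is effectively a no-op; the point is to stay correct when they don't.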
ghstack-source-id: 118180033
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25210338
fbshipit-source-id: e0a4ba157653b74dd84cf5665c992ccce2dea188