Adds stream recording for cross-stream uses of gradients in streaming backward (#60230)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33909.
I _think_ the two recordDataPtrOnStream calls I added are necessary and sufficient. They're the ones that worked for dmitrivainbrand's intricate multistream pipelining in https://github.com/pytorch/pytorch/issues/33909, and I can more or less convince myself they're enough, but it's hard to be sure (and hard to test).
PRing without a test now for visibility. I'll try to come up with one.
input_buffer.cpp needs to compile in both CUDA and CPU-only builds, so I can't call `c10::cuda::CUDACachingAllocator::recordStream` directly. I planned to work around this by adding a binding in VirtualGuardImpl, but https://github.com/pytorch/pytorch/pull/57047 spared me the trouble, thanks lw.
Recording a usage stream on a generic tensor was uglier than I expected, see https://github.com/pytorch/pytorch/issues/60306. It's up to you whether adding a unified way to record streams on a tensor backed by any TensorImpl should block this PR (and if so, whether it should happen in a separate PR or as part of this one).
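For reviewers less familiar with the hazard: this is a minimal Python-level sketch of the cross-stream pattern involved, using the public `Tensor.record_stream` API. It is only an illustration under the assumption that CUDA is available; the actual fix in this PR is the C++ recordDataPtrOnStream calls inside autograd's InputBuffer, not anything shown here.

```python
import torch

def cross_stream_use():
    """Allocate on a side stream, consume on the default stream.

    Without record_stream, the caching allocator may free and reuse the
    tensor's memory as soon as the side stream's work completes, even
    though default-stream kernels reading it are still pending.
    """
    if not torch.cuda.is_available():
        return "cuda unavailable"
    side = torch.cuda.Stream()
    x = torch.ones(1 << 20, device="cuda")
    with torch.cuda.stream(side):
        # y's memory is allocated while `side` is current, so the caching
        # allocator associates its lifetime with `side`.
        y = x * 2
    # Order the default stream after the side stream's producer kernel.
    torch.cuda.current_stream().wait_stream(side)
    # Tell the allocator that y is also in use on the default stream, so
    # its block is not recycled until that usage is done.
    y.record_stream(torch.cuda.current_stream())
    z = y.sum()
    torch.cuda.synchronize()
    return "ok"
```

The PR does the analogous bookkeeping for gradients that autograd accumulates across streams during a streaming backward.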
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60230
Reviewed By: mrshenli
Differential Revision: D29289392
Pulled By: albanD
fbshipit-source-id: 1339d382b7d238a461b082597b3962847b5201fe