[CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833)
Summary:
ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227.
Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and ROCm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6).
The common denominator in the ROCm failures appears to be multi-GPU activity: some [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), and some [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the single process runs autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? I expect the streaming-backward change will also benefit ROCm.
To debug the ROCm failures, I think we should ignore the multiprocess/DDP tests and focus on the single-process cases. The root cause is probably the same, and the single-process cases are simpler.
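For reference, a minimal sketch of the kind of single-process case described above: an autograd graph whose ops span two devices, followed by a backward call. This is an illustration under assumed conditions (two visible GPUs, falling back to CPU otherwise), not the actual failing test, and `cross_device_backward` is a hypothetical helper name.

```python
# Hypothetical minimal repro sketch for the single-process failures:
# an autograd graph whose ops span two devices, then a backward call.
import torch

def cross_device_backward(dev0="cuda:0", dev1="cuda:1"):
    # Fall back to CPU when fewer than two GPUs are present,
    # so the sketch stays runnable anywhere.
    if torch.cuda.device_count() < 2:
        dev0 = dev1 = "cpu"
    a = torch.randn(8, 8, device=dev0, requires_grad=True)
    # .to() across devices is an autograd-visible op, so the backward
    # pass must route gradients from dev1 back to dev0.
    b = a.to(dev1) * 2
    b.sum().backward()
    return a.grad

grad = cross_device_backward()
assert grad is not None and tuple(grad.shape) == (8, 8)
```

On a multi-GPU box the backward pass here exercises cross-device gradient routing, which is the pattern the streaming-backward change touches.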
----------------------------------
Update: the ROCm failures are due to https://github.com/pytorch/pytorch/issues/59750.
https://github.com/pytorch/pytorch/pull/57833/commits/2718a54032d0791ce90a9a95d15150c53727713e is a workaround; it should be updated once https://github.com/pytorch/pytorch/issues/59750 is fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57833
Reviewed By: mruberry
Differential Revision: D28942391
Pulled By: ngimel
fbshipit-source-id: d6047e971c5f1c6386334bf3641402a92f12e2f8